Creating binary features through thresholding

In the last recipe, we looked at transforming our data into the standard normal distribution. Now, we'll talk about another transformation, one that is quite different.

Instead of working with the distribution to standardize it, we'll purposely throw away data; if we have good reason, this can be a very smart move. Often, in what is ostensibly continuous data, there are discontinuities that can be captured with binary features.

Getting ready

Creating binary features and outcomes is a very useful method, but it should be used with caution. Let's use the boston dataset to learn how to turn values into binary outcomes.

First, load the boston dataset:

>>> from sklearn import datasets
>>> boston = datasets.load_boston()
>>> import numpy as np

How to do it...

Similar to scaling, there are two ways to binarize features in scikit-learn:

  • preprocessing.binarize #(a function)
  • preprocessing.Binarizer #(a class)

The boston dataset's target variable is the median value of houses in thousands of dollars. This dataset is good for testing regression and other continuous predictors, but consider a situation where we simply want to predict whether a house's value is more than the overall mean. To do this, we will want to threshold at the mean: if the value is greater than the mean, produce a 1; if it is less, produce a 0:

>>> from sklearn import preprocessing
>>> new_target = preprocessing.binarize(boston.target, 
                   threshold=boston.target.mean())
>>> new_target[:5]
array([ 1.,  0.,  1.,  1.,  1.])

This was easy, but let's check to make sure it worked correctly:

>>> (boston.target[:5] > boston.target.mean()).astype(int)
array([1, 0, 1, 1, 1])

Given the simplicity of the operation in NumPy, it's fair to ask why you would want to use scikit-learn's built-in functionality. Pipelines, covered in the Using Pipelines for multiple preprocessing steps recipe, go a long way toward explaining this; in anticipation of that recipe, let's use the Binarizer class:

>>> binarizer = preprocessing.Binarizer(threshold=boston.target.mean())
>>> new_target = binarizer.fit_transform(boston.target)
>>> new_target[:5]
array([ 1.,  0.,  1.,  1.,  1.])
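As a small preview of that recipe, here is a minimal sketch of how a Binarizer instance can be dropped into a Pipeline with another preprocessing step; the particular pairing of transformers is an illustrative assumption, not code from that recipe:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler, Binarizer
>>> # scale each feature to zero mean, then threshold at 0 so each column
>>> # becomes a per-feature "above the mean" indicator
>>> pipe = Pipeline([
...     ('scale', StandardScaler()),
...     ('binarize', Binarizer(threshold=0))
... ])
>>> binary_features = pipe.fit_transform(boston.data)
>>> binary_features.shape
(506, 13)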

How it works...

Hopefully, this is fairly intuitive: under the hood, scikit-learn creates a conditional mask that is True wherever a value in the array is greater than the threshold. It then sets the array to 1 where the condition is met, and 0 where it is not.
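A rough NumPy equivalent of that logic, as a sketch of the idea rather than scikit-learn's actual source, looks like this:

>>> values = boston.target.copy()
>>> mask = values > boston.target.mean()   # True where the value exceeds the threshold
>>> values[mask] = 1                       # 1 where the condition is met
>>> values[~mask] = 0                      # 0 everywhere else
>>> values[:5]
array([ 1.,  0.,  1.,  1.,  1.])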

There's more...

Let's also learn about sparse matrices and the fit method.

Sparse matrices

Sparse matrices are special in that zeros aren't stored, which saves memory. This creates an issue for the binarizer: with a negative threshold, every unstored zero would have to become a 1, destroying the sparsity. To avoid this, binarizing a sparse matrix requires a threshold that is not less than zero:

>>> from scipy.sparse import coo_matrix
>>> spar = coo_matrix(np.random.binomial(1, .25, 100))
>>> preprocessing.binarize(spar, threshold=-1)
ValueError: Cannot binarize a sparse matrix with threshold < 0
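With a threshold at or above zero, the same call goes through and the result comes back as a sparse matrix, so the unstored zeros never need to be materialized; here is a quick check using the spar matrix from above:

>>> from scipy import sparse
>>> binarized = preprocessing.binarize(spar, threshold=.5)
>>> sparse.issparse(binarized)
True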

The fit method

The fit method exists for the binarizer transformation, but it does not fit anything; it simply returns the object unchanged.
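We can confirm this with the binarizer object created earlier; fit hands back the very same instance, which is what lets it slot into a Pipeline like any other transformer:

>>> binarizer.fit(boston.target) is binarizer
True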
