In the last recipe, we looked at transforming our data into the standard normal distribution. Now, we'll talk about another transformation, one that is quite different.
Instead of working with the distribution to standardize it, we'll purposely throw away information; if we have good reason, this can be a very smart move. Often, in what is ostensibly continuous data, there are discontinuities that can be captured via binary features.
Creating binary features and outcomes is a very useful method, but it should be used with caution. Let's use the boston dataset to learn how to turn values into binary outcomes.
First, load the boston dataset:
>>> from sklearn import datasets
>>> boston = datasets.load_boston()
>>> import numpy as np
Similar to scaling, there are two ways to binarize features in scikit-learn:
preprocessing.binarize #(a function)
preprocessing.Binarizer #(a class)
The boston dataset's target variable is the median value of houses in thousands of dollars. This dataset is good for testing regression and other continuous predictors, but consider a situation where we simply want to predict whether a house's value is greater than the overall mean. To do this, we will use the mean as a threshold. If a value is greater than the mean, produce a 1; if it is less, produce a 0:
>>> from sklearn import preprocessing
>>> new_target = preprocessing.binarize(boston.target,
                 threshold=boston.target.mean())
>>> new_target[:5]
array([ 1., 0., 1., 1., 1.])
This was easy, but let's check to make sure it worked correctly:
>>> (boston.target[:5] > boston.target.mean()).astype(int)
array([1, 0, 1, 1, 1])
Given the simplicity of the operation in NumPy, it's fair to ask why you would want to use scikit-learn's built-in functionality. Pipelines, covered in the Using Pipelines for multiple preprocessing steps recipe, will go a long way toward explaining this; in anticipation, let's use the Binarizer class:
>>> binarizer = preprocessing.Binarizer(threshold=boston.target.mean())
>>> new_target = binarizer.fit_transform(boston.target)
>>> new_target[:5]
array([ 1., 0., 1., 1., 1.])
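To preview why the class form matters, a Binarizer can be dropped directly into a Pipeline alongside other transformers. The following is a minimal sketch with made-up data (the step names and the StandardScaler pairing are illustrative assumptions, not part of this recipe): scaling centers the data at zero, so binarizing at a threshold of 0.0 marks values above the mean.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer, StandardScaler

# Illustrative pipeline: center/scale the data, then flag values above
# the (now zero) mean. Step names "scale" and "binarize" are arbitrary.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("binarize", Binarizer(threshold=0.0)),
])

# Toy column of values 0..9; after scaling, 5..9 sit above the mean.
X = np.arange(10, dtype=float).reshape(-1, 1)
out = pipe.fit_transform(X)
```

Because both steps share the fit/transform interface, the whole chain can be fit and applied as a single estimator.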
Hopefully, this is pretty obvious; but under the hood, scikit-learn creates a conditional mask that is True wherever the value in the array is greater than the threshold. It then updates the array to 1 where the condition is met, and 0 where it is not.
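The mask-and-update logic described above can be sketched in plain NumPy. The values below mirror the first five boston targets, and 22.5 stands in for the overall mean (which is close to that value); both are illustrative, not computed from the dataset here.

```python
import numpy as np

# Stand-in data: roughly the first five boston target values.
target = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
threshold = 22.5  # stand-in for boston.target.mean()

# Build the boolean mask, then write 1.0 where it holds, 0.0 elsewhere.
mask = target > threshold
binarized = np.where(mask, 1.0, 0.0)
```

This reproduces the array([ 1., 0., 1., 1., 1.]) result from the earlier binarize call.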
Let's also learn about sparse matrices and the fit method.
Sparse matrices are special in that zeros aren't stored; this is done to save memory. This creates an issue for the binarizer, so to combat it, a special condition of the binarizer for sparse matrices is that the threshold cannot be less than zero:
>>> from scipy.sparse import coo
>>> spar = coo.coo_matrix(np.random.binomial(1, .25, 100))
>>> preprocessing.binarize(spar, threshold=-1)
ValueError: Cannot binarize a sparse matrix with threshold < 0
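With a non-negative threshold, binarizing a sparse matrix works fine, since only the stored (nonzero) values need to be compared. A small sketch (the count data here is made up for illustration):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import binarize

# Illustrative sparse row of small counts; zeros are not stored.
rng = np.random.default_rng(0)
spar = sparse.csr_matrix(rng.binomial(3, 0.25, size=(1, 100)))

# A threshold >= 0 is allowed: stored values above 1 become 1,
# everything else (including the unstored zeros) stays 0.
result = binarize(spar, threshold=1)
```

Because values at or below the threshold map to 0, the result stays sparse and no stored zeros need to be materialized.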