Binarizing label features

In this recipe, we'll look at another way of working with categorical variables. When only one or two categories of a feature are important, it can be wise to avoid the extra dimensionality that encoding every one of several categories would create.

Getting ready

Instead of encoding the categorical variables with OneHotEncoder, we can use LabelBinarizer, which combines thresholding with the handling of categorical variables.

To show how this works, load the iris dataset:

>>> from sklearn import datasets as d
>>> iris = d.load_iris()
>>> target = iris.target

How to do it...

Import the LabelBinarizer class and create an object:

>>> from sklearn.preprocessing import LabelBinarizer
>>> label_binarizer = LabelBinarizer()

Now, simply transform the target outcomes to the new feature space:

>>> new_target = label_binarizer.fit_transform(target)

Let's look at new_target and the label_binarizer object to get a feel for what happened:

>>> new_target.shape
(150, 3)
>>> new_target[:5]
array([[1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0]])

>>> new_target[-5:]
array([[0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1]])

>>> label_binarizer.classes_
array([0, 1, 2])
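
Each column of new_target lines up with one entry of classes_. The following is a minimal sketch, not part of the original recipe, that rebuilds the same indicator matrix with plain NumPy broadcasting and then maps the rows back to labels with inverse_transform. The variable names manual and recovered are chosen here for illustration only:

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import LabelBinarizer

target = datasets.load_iris().target

label_binarizer = LabelBinarizer()
new_target = label_binarizer.fit_transform(target)

# Column j holds a 1 exactly where the label equals classes_[j].
# Broadcasting an (N, 1) column against the (C,) classes gives an (N, C) matrix.
manual = (target[:, np.newaxis] == label_binarizer.classes_).astype(int)
print(np.array_equal(manual, new_target))

# inverse_transform maps every row back to its original label.
recovered = label_binarizer.inverse_transform(new_target)
print(np.array_equal(recovered, target))
```

Both comparisons should agree for a multiclass target such as iris, where every row contains exactly one positive entry.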

How it works...

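The iris target has a cardinality of 3; that is, it has three unique values. LabelBinarizer converts the N x 1 vector into an N x C matrix, where C is the cardinality of the original vector, with one column per class. (A target with only two classes is the exception: it is encoded as a single column.) Note that the handling of unseen values depends on the scikit-learn version: newer releases may silently return a row filled with the negative label, but in the version used here, once the object has been fit, transforming a value that was not present during fitting throws an error: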

>>> label_binarizer.transform([4])
[...]
ValueError: classes [0 1 2] mismatch with the labels [4] found in the data
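
One way to avoid being surprised by unseen values is to compare the incoming labels against classes_ before calling transform. The following is a minimal sketch, not part of the original recipe; the incoming array is made-up data for illustration, and dropping the unknown labels is only one possible policy:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

label_binarizer = LabelBinarizer().fit([0, 1, 2])

# Hypothetical incoming labels; 4 was never seen during fit.
incoming = np.array([0, 4, 2])

# Flag every label that is missing from the fitted classes_.
unseen = ~np.isin(incoming, label_binarizer.classes_)
if unseen.any():
    print("Unseen labels:", np.unique(incoming[unseen]))

# Transform only the labels the binarizer already knows about.
encoded = label_binarizer.transform(incoming[~unseen])
print(encoded)
```

Because the check does not rely on transform raising an exception, it should behave the same way whether or not the installed version rejects unseen labels.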

There's more...

One and zero do not have to represent the positive and negative instances of the target value. For example, if we want positive values to be represented by 1,000 and negative values by -1,000, we simply specify this when we create label_binarizer:

>>> label_binarizer = LabelBinarizer(neg_label=-1000, pos_label=1000)
>>> label_binarizer.fit_transform(target)[:5]
array([[ 1000, -1000, -1000],
       [ 1000, -1000, -1000],
       [ 1000, -1000, -1000],
       [ 1000, -1000, -1000],
       [ 1000, -1000, -1000]])

Tip

The positive and negative values must be integers, and the negative value must be strictly less than the positive value.
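
A quick way to probe which label pairs the binarizer accepts is to try fitting with them. The following is a minimal sketch under the assumption that scikit-learn reports an invalid pair with a ValueError, either when the object is created or when it is fit; labels_accepted is a hypothetical helper written for this illustration, not part of scikit-learn:

```python
from sklearn.preprocessing import LabelBinarizer


def labels_accepted(neg_label, pos_label):
    """Return True if a binarizer with these labels can be created and fit."""
    try:
        # Depending on the version, the check may run in __init__ or in fit,
        # so both calls sit inside the try block.
        LabelBinarizer(neg_label=neg_label, pos_label=pos_label).fit([0, 1, 2])
    except ValueError:
        return False
    return True


print(labels_accepted(-1000, 1000))  # negative label below positive label
print(labels_accepted(1000, -1000))  # swapped pair
```

The first call uses the same pair as the example above; the second swaps the two values to show that their order matters.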
