In this recipe, we'll look at working with categorical variables in a different way. When only one or two categories of a feature are important, it can be wise to avoid the extra dimensions that encoding every category would create.
Instead of encoding categorical variables with OneHotEncoder, we can use LabelBinarizer. This is a combination of thresholding and working with categorical variables.
To show how this works, load the iris dataset:
>>> from sklearn import datasets as d
>>> iris = d.load_iris()
>>> target = iris.target
Import the LabelBinarizer class and create an object:
>>> from sklearn.preprocessing import LabelBinarizer
>>> label_binarizer = LabelBinarizer()
Now, simply transform the target outcomes to the new feature space:
>>> new_target = label_binarizer.fit_transform(target)
Let's look at new_target and the label_binarizer object to get a feel for what happened:
>>> new_target.shape
(150, 3)
>>> new_target[:5]
array([[1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0]])
>>> new_target[-5:]
array([[0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1]])
>>> label_binarizer.classes_
array([0, 1, 2])
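To make the link between the output columns and classes_ concrete, here is a minimal sketch. It assumes a small, hypothetical array of string labels rather than the iris target, and compares the binarizer's output with a manual comparison against classes_ before mapping the rows back with inverse_transform:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy labels, used only for illustration.
labels = np.array(["setosa", "versicolor", "virginica", "setosa"])

binarizer = LabelBinarizer()
encoded = binarizer.fit_transform(labels)

# Column j of the encoded matrix corresponds to binarizer.classes_[j],
# so comparing every label against every class gives the same matrix.
manual = (labels[:, None] == binarizer.classes_[None, :]).astype(int)

# inverse_transform maps each encoded row back to its original label.
decoded = binarizer.inverse_transform(encoded)

# With only two classes, the result is a single column rather than two.
binary = LabelBinarizer().fit_transform(["no", "yes", "no"])
```

Here encoded and manual hold the same values, decoded reproduces labels, and binary has the shape (3, 1).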
The iris target has a cardinality of 3, that is, it has three unique values. LabelBinarizer converts the N x 1 vector into an N x C matrix, where C is the cardinality of the N x 1 vector. It is important to note that once the object has been fit, introducing unseen values in the transformation throws an error in the scikit-learn version used for this recipe (other releases may instead return a row containing only the negative label):
>>> label_binarizer.transform([4])
[...]
ValueError: classes [0 1 2] mismatch with the labels [4] found in the data
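Whether or not transform raises on unseen values, it can help to check incoming data against classes_ before transforming. The following is a minimal sketch; find_unseen is a hypothetical helper written for this illustration, not part of scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Fit on the same three classes as the iris target.
binarizer = LabelBinarizer().fit([0, 1, 2])

def find_unseen(values, fitted):
    """Return the values that were not present when `fitted` was fit.

    Hypothetical helper for illustration; it only reads `fitted.classes_`.
    """
    values = np.asarray(values)
    return values[~np.isin(values, fitted.classes_)]

# 4 and 7 were never seen during fit, so they are reported.
unseen = find_unseen([0, 4, 2, 7], binarizer)
```

A caller could then decide to drop, remap, or reject the reported values before calling transform.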
One and zero do not have to represent the positive and negative instances of the target value. For example, if we want positive values to be represented by 1,000 and negative values by -1,000, we simply specify this when we create label_binarizer:
>>> label_binarizer = LabelBinarizer(neg_label=-1000, pos_label=1000)
>>> label_binarizer.fit_transform(target)[:5]
array([[ 1000, -1000, -1000],
       [ 1000, -1000, -1000],
       [ 1000, -1000, -1000],
       [ 1000, -1000, -1000],
       [ 1000, -1000, -1000]])
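One way to read neg_label and pos_label is as a relabeling of the default 0/1 output. The sketch below assumes a small, hypothetical target array and checks that the custom-labeled result matches the default result with 1 replaced by 1000 and 0 replaced by -1000:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy target, used only for illustration.
target = np.array([0, 1, 2, 1])

# Default encoding uses 0 for negative and 1 for positive.
default = LabelBinarizer().fit_transform(target)

# Custom encoding uses -1000 for negative and 1000 for positive.
custom = LabelBinarizer(neg_label=-1000, pos_label=1000).fit_transform(target)

# Relabel the default output by hand for comparison.
rescaled = np.where(default == 1, 1000, -1000)
```

The arrays custom and rescaled contain the same values, so the choice of labels changes only how positives and negatives are written, not which entries are positive.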