Chapter 3. Feature Extraction and Preprocessing

The examples discussed in the previous chapter used simple numeric explanatory variables, such as the diameter of a pizza. Many machine learning problems require learning from observations of categorical variables, text, or images. In this chapter, you will learn basic techniques for preprocessing data and creating feature representations of these observations. These techniques can be used with the regression models discussed in Chapter 2, Linear Regression, as well as the models we will discuss in subsequent chapters.

Extracting features from categorical variables

Many machine learning problems have categorical, or nominal, rather than continuous features. For example, an application that predicts a job's salary based on its description might use categorical features such as the job's location. Categorical variables are commonly encoded using one-of-K or one-hot encoding, in which the explanatory variable is encoded using one binary feature for each of the variable's possible values.

For example, let's assume that our model has a city explanatory variable that can take one of three values: New York, San Francisco, or Chapel Hill. One-hot encoding represents this explanatory variable using one binary feature for each of the three possible cities.

In scikit-learn, the DictVectorizer class can be used to one-hot encode categorical features:

>>> from sklearn.feature_extraction import DictVectorizer
>>> onehot_encoder = DictVectorizer()
>>> instances = [
...     {'city': 'New York'},
...     {'city': 'San Francisco'},
...     {'city': 'Chapel Hill'}
... ]
>>> print(onehot_encoder.fit_transform(instances).toarray())
[[ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]]

Note that the resulting features will not necessarily appear in the feature vector in the order they were encountered. In the first training example, the city feature's value is New York; the second element of each feature vector corresponds to the New York value, and it is set to 1 for the first instance. It may seem intuitive to represent the values of a categorical explanatory variable with a single integer feature, but this would encode artificial information. With an integer encoding, the feature vectors for the previous example would have only one dimension: New York could be represented by 0, San Francisco by 1, and Chapel Hill by 2. This representation would impose an order on the values of the variable that does not exist in the real world; there is no natural ordering of cities.
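To see exactly how the columns are ordered, you can inspect the feature_names_ attribute of a fitted DictVectorizer, which lists the column names of the output matrix. The following sketch (using sparse=False so that fit_transform returns a dense array directly) shows that the columns are sorted alphabetically by feature name rather than by order of appearance:

```python
from sklearn.feature_extraction import DictVectorizer

instances = [
    {'city': 'New York'},
    {'city': 'San Francisco'},
    {'city': 'Chapel Hill'},
]

# sparse=False returns a dense NumPy array instead of a sparse matrix
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(instances)

# Column names are sorted alphabetically, not in order of appearance
print(vectorizer.feature_names_)
# ['city=Chapel Hill', 'city=New York', 'city=San Francisco']

print(X)
# [[ 0.  1.  0.]
#  [ 0.  0.  1.]
#  [ 1.  0.  0.]]
```

Because New York sorts after Chapel Hill, it occupies the second column, which is why the first instance's vector is [0, 1, 0].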
