Chapter 4. Advanced Features

In the previous chapters we studied several algorithms for very different tasks, from classification and regression to clustering and dimensionality reduction. We showed how to apply these algorithms to predict results when faced with new data. That is what machine learning is all about. In this last chapter, we want to present some important concepts and methods that you should take into account if you want to do real-world machine learning.

  • In real-world problems, data usually does not come already expressed as attribute/float value pairs; it arrives in more complex structures, or is not structured at all. We will learn feature extraction techniques that allow us to derive scikit-learn features from such data.
  • From the initial set of available features, not all of them will be useful for our algorithms to learn from; in fact, some of them may even degrade performance. We will address the problem of selecting the most adequate feature set, a process known as feature selection.
  • Finally, as we have seen in the examples in this book, many machine learning algorithms have parameters that must be set before they can be used. To do that, we will review model selection techniques; that is, methods to select the most promising hyperparameters for our algorithms.

All these steps are crucial in order to obtain decent results when working with machine learning applications.

Feature extraction

The usual scenario for learning tasks such as those presented in this book includes a list of instances (represented as feature/value pairs) and a special feature (the target class) that we want to predict for future instances based on the values of the remaining features. However, the source data does not usually come in this format. We have to extract what we think are potentially useful features and convert them to our learning format. This process is called feature extraction or feature engineering, and it is an often underestimated but very important and time-consuming phase in most real-world machine learning tasks. We can identify two different steps in this task:

  • Obtain features: This step involves processing the source data and extracting the learning instances, usually in the form of feature/value pairs where the value can be an integer or float value, a string, a categorical value, and so on. The method used for extraction depends heavily on how the data is presented. For example, we can have a set of pictures and generate an integer-valued feature for each pixel, indicating its color level, as we did in the face recognition example in Chapter 2, Supervised Learning. Since this is a very task-dependent job, we will not delve into details and assume we already have this setting for our examples.
  • Convert features: Most scikit-learn algorithms assume as input a set of instances represented as a list of float-valued features. How to obtain these features will be the main subject of this section.

We can, as we did in Chapter 2, Supervised Learning, build ad hoc procedures to convert the source data. There are, however, tools that can help us obtain a suitable representation. The Python package pandas (http://pandas.pydata.org/), for example, provides data structures and tools for data analysis. It aims to provide features similar to those of R, the popular language and environment for statistical computing. We will use pandas to import the Titanic data we presented in Chapter 2, Supervised Learning, and convert it to the scikit-learn format.

Let's start by importing the original titanic.csv data into a pandas DataFrame data structure (a DataFrame is essentially a two-dimensional labeled data structure where columns can potentially hold different data types and each row represents an instance). As usual, we first import the numpy and pyplot packages.

>>> %pylab inline
>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt

Then we import the Titanic data with pandas.

>>> titanic = pd.read_csv('data/titanic.csv')
>>> print titanic
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1313 entries, 0 to 1312
Data columns (total 11 columns):
row.names    1313  non-null values
pclass       1313  non-null values
survived     1313  non-null values
name         1313  non-null values
age          633  non-null values
embarked     821  non-null values
home.dest    754  non-null values
room         77  non-null values
ticket       69  non-null values
boat         347  non-null values
sex          1313  non-null values
dtypes: float64(1), int64(2), object(8)

You can see that each CSV column has a corresponding feature in the DataFrame, and that the feature type is inferred from the available data. We can inspect some features to see what they look like.

>>> print titanic.head()[['pclass', 'survived', 'age', 'embarked', 
    'boat', 'sex']]
pclass  survived      age     embarked   boat     sex
0    1st         1  29.0000  Southampton      2  female
1    1st         0   2.0000  Southampton    NaN  female
2    1st         0  30.0000  Southampton  (135)    male
3    1st         0  25.0000  Southampton    NaN  female
4    1st         1   0.9167  Southampton     11    male

The main difficulty we face now is that scikit-learn methods expect real numbers as feature values. In Chapter 2, Supervised Learning, we used the LabelEncoder and OneHotEncoder preprocessing methods to manually convert certain categorical features into 1-of-K values (generating a new feature for each possible value, valued 1 if the original feature had the corresponding value and 0 otherwise). This time, we will use a similar scikit-learn class, DictVectorizer, which automatically builds these features from the different original feature values; a small standalone example is shown below. We will then wrap it in a method that encodes a set of columns in a single step.
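To get a feel for what DictVectorizer does on its own, consider the following minimal sketch (the two toy dictionaries are made up for illustration and are not part of the Titanic data); each distinct string value becomes a new binary feature:

>>> from sklearn.feature_extraction import DictVectorizer
>>> toy = [{'pclass': '1st', 'embarked': 'Southampton'},
    {'pclass': '3rd', 'embarked': 'Cherbourg'}]
>>> vec = DictVectorizer()
>>> print vec.fit_transform(toy).toarray()
[[ 0.  1.  1.  0.]
 [ 1.  0.  0.  1.]]
>>> print vec.get_feature_names()
['embarked=Cherbourg', 'embarked=Southampton', 'pclass=1st', 'pclass=3rd']

The one_hot_dataframe method below applies the same idea to several DataFrame columns at once.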

>>> from sklearn import feature_extraction
>>> def one_hot_dataframe(data, cols, replace=False):
>>>     # Map each row to a dictionary containing only the columns to encode
>>>     vec = feature_extraction.DictVectorizer()
>>>     mkdict = lambda row: dict((col, row[col]) for col in cols)
>>>     # Each distinct value in those columns becomes a new binary column
>>>     vecData = pd.DataFrame(vec.fit_transform(
>>>         data[cols].apply(mkdict, axis=1)).toarray())
>>>     vecData.columns = vec.get_feature_names()
>>>     vecData.index = data.index
>>>     if replace:
>>>         # Drop the original columns and join the new binary ones
>>>         data = data.drop(cols, axis=1)
>>>         data = data.join(vecData)
>>>     return (data, vecData)

The one_hot_dataframe method (based on the script at https://gist.github.com/kljensen/5452382) takes a pandas DataFrame data structure and a list of columns, and encodes each column into the necessary 1-of-K features. If the replace parameter is True, it also substitutes the original columns with the new set. Let's see it applied to the categorical pclass, embarked, and sex features (titanic_n contains only the newly created columns):

>>> titanic,titanic_n = one_hot_dataframe(titanic, ['pclass', 
    'embarked', 'sex'], replace=True)
>>> titanic.describe()
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, count to max
Data columns (total 12 columns):
row.names               8  non-null values
survived                8  non-null values
age                     8  non-null values
embarked                8  non-null values
embarked=Cherbourg      8  non-null values
embarked=Queenstown     8  non-null values
embarked=Southampton    8  non-null values
pclass=1st              8  non-null values
pclass=2nd              8  non-null values
pclass=3rd              8  non-null values
sex=female              8  non-null values
sex=male                8  non-null values
dtypes: float64(12)

The pclass attribute has been converted into three features, pclass=1st, pclass=2nd, and pclass=3rd, and similarly for the other two attributes. Note that the embarked feature has not disappeared. This is because the original embarked attribute included NaN values, indicating a missing value; in those cases, every feature derived from embarked will be valued 0, but the original feature whose value is NaN remains, indicating that the feature is missing for certain instances. Next, we encode the remaining categorical attributes:

>>> titanic, titanic_n = one_hot_dataframe(titanic, ['home.dest', 
    'room', 'ticket', 'boat'], replace=True)
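Attributes such as home.dest, room, ticket, and boat take many distinct values, so this step adds a large number of binary columns. If we want to check the size of the resulting feature space (the exact count depends on the data), something like the following would do:

>>> # Number of columns after one-hot encoding all categorical attributes
>>> print len(titanic.columns)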

We also have to deal with missing values, since the DecisionTreeClassifier we plan to use does not accept them as input. pandas allows us to replace them with a fixed value using the fillna method. We will use the mean age for the age feature, and 0 for the remaining attributes with missing values.

>>> mean = titanic['age'].mean()
>>> titanic['age'].fillna(mean, inplace=True)
>>> titanic.fillna(0, inplace=True)
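As a quick sanity check (not part of the original pipeline, but harmless), we can verify that no missing values remain before training:

>>> # Total count of remaining NaN values; it should print 0
>>> print titanic.isnull().sum().sum()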

Now, all of our features (except for name) are in a suitable format. We are ready to build the training and test sets, as usual.

>>> from sklearn.cross_validation import train_test_split
>>> titanic_target = titanic['survived']
>>> titanic_data = titanic.drop(['name', 'row.names', 'survived'],
    axis=1)
>>> X_train, X_test, y_train, y_test = train_test_split(titanic_data,
    titanic_target, test_size=0.25, random_state=33)

We decided to simply drop the name attribute, since we do not expect it to be informative about the survival status (it has a different value for almost every instance, so we cannot generalize from it). We also specified the survived feature as the target class, and consequently eliminated it from the training data.

Let's see how a decision tree works with the current feature set.

>>> from sklearn import tree
>>> dt = tree.DecisionTreeClassifier(criterion='entropy')
>>> dt = dt.fit(X_train, y_train)
>>> from sklearn import metrics
>>> y_pred = dt.predict(X_test)
>>> print "Accuracy:{0:.3f}".format(metrics.accuracy_score(y_test,
    y_pred)), "
"
Accuracy:0.839
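A single train/test split gives us only one estimate of the classifier's performance. As we did in previous chapters, we could also measure accuracy with cross-validation; a short sketch, reusing the titanic_data and titanic_target structures defined above:

>>> from sklearn.cross_validation import cross_val_score
>>> # Five-fold cross-validation over the whole dataset
>>> scores = cross_val_score(dt, titanic_data.values,
    titanic_target.values, cv=5)
>>> print "Mean accuracy: {0:.3f}".format(scores.mean())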