In the previous chapters, we studied several algorithms for very different tasks, from classification and regression to clustering and dimensionality reduction, and we showed how to apply these algorithms to predict results when faced with new data. That is what machine learning is all about. In this last chapter, we want to present some important concepts and methods you should take into account if you want to do real-world machine learning.
All of the steps involved, from preparing the features to selecting and evaluating models, are crucial to obtaining decent results when working with machine learning applications.
The usual scenario for learning tasks such as those presented in this book includes a list of instances (represented as feature/value pairs) and a special feature (the target class) that we want to predict for future instances based on the values of the remaining features. However, the source data does not usually come in this format. We have to extract what we think are potentially useful features and convert them to our learning format. This process is called feature extraction or feature engineering, and it is an often underestimated but very important and time-consuming phase in most real-world machine learning tasks. We can identify two different steps in this task: obtaining a suitable representation of the source data, and then selecting the most adequate features for learning. We start with the first.
We can, as we did in Chapter 2, Supervised Learning, build ad hoc procedures to convert the source data. There are, however, tools that can help us to obtain a suitable representation. The Python package pandas (http://pandas.pydata.org/), for example, provides data structures and tools for data analysis. It aims to provide similar features to those of R, the popular language and environment for statistical computing. We will use pandas to import the Titanic data we presented in Chapter 2, Supervised Learning, and convert them to the scikit-learn format.
Let's start by importing the original titanic.csv data into a pandas DataFrame data structure (a DataFrame is essentially a two-dimensional labeled data structure whose columns can hold different data types and where each row represents an instance). As usual, we first import the numpy and pyplot packages.
>>> %pylab inline
>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
Then we import the Titanic data with pandas.
>>> titanic = pd.read_csv('data/titanic.csv')
>>> print titanic
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1313 entries, 0 to 1312
Data columns (total 11 columns):
row.names    1313  non-null values
pclass       1313  non-null values
survived     1313  non-null values
name         1313  non-null values
age           633  non-null values
embarked      821  non-null values
home.dest     754  non-null values
room           77  non-null values
ticket         69  non-null values
boat          347  non-null values
sex          1313  non-null values
dtypes: float64(1), int64(2), object(8)
You can see that each csv column has a corresponding feature in the DataFrame, and that the feature type is induced from the available data. We can inspect some features to see what they look like.
>>> print titanic.head()[['pclass', 'survived', 'age', 'embarked',
...     'boat', 'sex']]
  pclass  survived      age     embarked   boat     sex
0    1st         1  29.0000  Southampton      2  female
1    1st         0   2.0000  Southampton    NaN  female
2    1st         0  30.0000  Southampton  (135)    male
3    1st         0  25.0000  Southampton    NaN  female
4    1st         1   0.9167  Southampton     11    male
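pandas can also report the inferred column types and the number of non-null values per column, which is handy for spotting incomplete features such as age or boat before doing any conversion. A quick check (output omitted; it mirrors the summary printed above):

>>> print titanic.dtypes
>>> print titanic.count()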
The main difficulty we have now is that scikit-learn methods expect real numbers as feature values. In Chapter 2, Supervised Learning, we used the LabelEncoder and OneHotEncoder preprocessing methods to manually convert certain categorical features into 1-of-K values (generating a new feature for each possible value, valued 1 if the original feature had the corresponding value and 0 otherwise). This time, we will use a similar scikit-learn method, DictVectorizer, which automatically builds these features from the different original feature values. Moreover, we will program a method to encode a set of columns in a single step.
>>> from sklearn import feature_extraction
>>> def one_hot_dataframe(data, cols, replace=False):
...     vec = feature_extraction.DictVectorizer()
...     mkdict = lambda row: dict((col, row[col]) for col in cols)
...     vecData = pd.DataFrame(vec.fit_transform(
...         data[cols].apply(mkdict, axis=1)).toarray())
...     vecData.columns = vec.get_feature_names()
...     vecData.index = data.index
...     if replace:
...         data = data.drop(cols, axis=1)
...         data = data.join(vecData)
...     return (data, vecData)
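To make the 1-of-K expansion concrete before applying it to the Titanic columns, here is a minimal sketch of what DictVectorizer does, using a toy list of dictionaries invented for illustration (not part of the Titanic data); it reuses the feature_extraction module imported above:

>>> toy = [{'pclass': '1st', 'sex': 'female'},
...        {'pclass': '3rd', 'sex': 'male'}]
>>> toy_vec = feature_extraction.DictVectorizer()
>>> print toy_vec.fit_transform(toy).toarray()
[[ 1.  0.  1.  0.]
 [ 0.  1.  0.  1.]]
>>> print toy_vec.get_feature_names()
['pclass=1st', 'pclass=3rd', 'sex=female', 'sex=male']

Each distinct string value becomes its own binary column; this is exactly what one_hot_dataframe does for every requested column at once.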
The one_hot_dataframe method (based on the script at https://gist.github.com/kljensen/5452382) takes a pandas DataFrame data structure and a list of columns, and encodes each column into the necessary 1-of-K features. If the replace parameter is True, it will also substitute the original columns with the new set. Let's see it applied to the categorical pclass, embarked, and sex features (titanic_n contains only the newly created columns):
>>> titanic, titanic_n = one_hot_dataframe(titanic,
...     ['pclass', 'embarked', 'sex'], replace=True)
>>> titanic.describe()
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, count to max
Data columns (total 12 columns):
row.names               8  non-null values
survived                8  non-null values
age                     8  non-null values
embarked                8  non-null values
embarked=Cherbourg      8  non-null values
embarked=Queenstown     8  non-null values
embarked=Southampton    8  non-null values
pclass=1st              8  non-null values
pclass=2nd              8  non-null values
pclass=3rd              8  non-null values
sex=female              8  non-null values
sex=male                8  non-null values
dtypes: float64(12)
The pclass attribute has been converted into three features, pclass=1st, pclass=2nd, and pclass=3rd, and similarly for the other two attributes. Note that the embarked feature has not disappeared. This is because the original embarked attribute included NaN values, indicating a missing value; in those cases, every feature derived from embarked is valued 0, and the original feature whose value is NaN remains, indicating that the feature is missing for certain instances. Next, we encode the remaining categorical attributes:
>>> titanic, titanic_n = one_hot_dataframe(titanic,
...     ['home.dest', 'room', 'ticket', 'boat'], replace=True)
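Since attributes such as home.dest, ticket, and boat take many distinct values, this second encoding adds one binary column per value and considerably widens the dataset. We can check the resulting number of columns (the exact figure depends on the distinct values present in the file):

>>> print len(titanic.columns)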
We also have to deal with missing values, since the DecisionTreeClassifier we plan to use does not accept them as input. pandas allows us to replace them with a fixed value using the fillna method. We will use the mean age for the age feature, and 0 for the remaining missing attributes.
>>> mean = titanic['age'].mean()
>>> titanic['age'].fillna(mean, inplace=True)
>>> titanic.fillna(0, inplace=True)
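If everything worked, no missing values should remain. As a quick sanity check, isnull() returns a Boolean DataFrame, so summing it twice yields the total count of missing entries, which should now be zero:

>>> print titanic.isnull().sum().sum()
0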
Now, all of our features (except for name) are in a suitable format. We are ready to build the training and test sets, as usual.
>>> from sklearn.cross_validation import train_test_split
>>> titanic_target = titanic['survived']
>>> titanic_data = titanic.drop(['name', 'row.names', 'survived'],
...     axis=1)
>>> X_train, X_test, y_train, y_test = train_test_split(titanic_data,
...     titanic_target, test_size=0.25, random_state=33)
We decided to simply drop the name attribute, since we do not expect it to be informative about the survival status (it has a different value for each instance, so we cannot generalize from it). We also specified the survived feature as the target class, and consequently eliminated it from the training vector.
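As an aside, free-text attributes such as name can sometimes be mined for categorical signal instead of being dropped; for instance, the honorific (Miss, Mrs, Master, and so on) does repeat across instances. We do not pursue this here, but a hypothetical sketch, assuming names follow the 'surname, title given-names' format, could look as follows:

>>> import re
>>> def get_title(name):
...     # Capture the word right after the comma, e.g. 'Miss' in
...     # 'Allen, Miss Elisabeth Walton' (name format assumed)
...     match = re.search(', (\w+)', name)
...     return match.group(1) if match else 'Unknown'
>>> print titanic['name'].apply(get_title).head()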
Let's see how a decision tree works with the current feature set.
>>> from sklearn import tree
>>> dt = tree.DecisionTreeClassifier(criterion='entropy')
>>> dt = dt.fit(X_train, y_train)
>>> from sklearn import metrics
>>> y_pred = dt.predict(X_test)
>>> print "Accuracy:{0:.3f}".format(
...     metrics.accuracy_score(y_test, y_pred))
Accuracy:0.839
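Since this section is about feature engineering, it is worth checking which of the engineered features the tree actually relies on. A quick sketch (output omitted, as it varies with the split) pairing the learned feature_importances_ attribute with the column names of titanic_data:

>>> importances = zip(dt.feature_importances_, titanic_data.columns)
>>> for value, name in sorted(importances, reverse=True)[:5]:
...     print "{0}: {1:.3f}".format(name, value)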