We will take a brief tour of some well-known supervised learning algorithms and see how we can apply them to the Titanic survival prediction problem described earlier.
Before we start our tour of the machine learning algorithms, we need to know a little bit about the Patsy library. We will make use of Patsy to design features that will be used in conjunction with scikit-learn. Patsy is a package for creating what are known as design matrices. These design matrices are transformations of the features in our input data. The transformations are specified by expressions known as formulas, which correspond to a specification of what features we wish the machine learning program to utilize in learning.
A simple example of this is as follows:
Suppose that we want a linear regression of y against the variables x, a, and b, and the interaction between a and b; then, we can specify the model as follows:

    import patsy as pts
    pts.dmatrices("y ~ x + a + b + a:b", data)

In the preceding code, the formula is specified by the expression y ~ x + a + b + a:b.
For further reference, look at: http://patsy.readthedocs.org/en/latest/overview.html
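To make the formula concrete, here is a minimal sketch that builds the design matrices for a small DataFrame. The data values are hypothetical and chosen only for illustration; return_type="dataframe" is passed so that the generated column names are visible.

```python
import pandas as pd
import patsy as pts

# Hypothetical toy data; only the column names match the formula above.
data = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0],
    "x": [0.5, 1.5, 2.5, 3.5],
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 0.0, 1.0, 3.0],
})

# dmatrices returns the response matrix (left of ~) and the predictor matrix.
y_mat, X_mat = pts.dmatrices("y ~ x + a + b + a:b", data,
                             return_type="dataframe")

print(list(X_mat.columns))
print(X_mat["a:b"].tolist())
```

Patsy adds an Intercept column automatically, and the a:b column is the element-wise product of a and b.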
In this section, we will introduce boilerplate code for implementing the algorithms that follow by using Patsy and scikit-learn. The reason for doing this is that most of the code for these algorithms is repeatable.
In the following sections, the workings of the algorithms will be described and the code specific to each algorithm will be provided as attachments to the chapter.
Assuming that the working directory is ~/devel/Titanic, we have:

    In [17]: %cd ~/devel/Titanic
    /home/youruser/devel/sandbox/Learning/Kaggle/Titanic

    In [18]: import matplotlib.pyplot as plt
             import pandas as pd
             import numpy as np
             import patsy as pt

    In [19]: train_df = pd.read_csv('csv/train.csv', header=0)
             test_df = pd.read_csv('csv/test.csv', header=0)

We then specify the formulas that will be passed to Patsy:

    In [21]: formula1 = 'C(Pclass) + C(Sex) + Fare'
             formula2 = 'C(Pclass) + C(Sex)'
             formula3 = 'C(Sex)'
             formula4 = 'C(Pclass) + C(Sex) + Age + SibSp + Parch'
             formula5 = 'C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked)'
             formula6 = 'C(Pclass) + C(Sex) + Age + SibSp + C(Embarked)'
             formula7 = 'C(Pclass) + C(Sex) + SibSp + Parch + C(Embarked)'
             formula8 = 'C(Pclass) + C(Sex) + SibSp + Parch + C(Embarked)'

    In [23]: formula_map = {'PClass_Sex_Fare' : formula1,
                            'PClass_Sex' : formula2,
                            'Sex' : formula3,
                            'PClass_Sex_Age_Sibsp_Parch' : formula4,
                            'PClass_Sex_Age_Sibsp_Parch_Embarked' : formula5,
                            'PClass_Sex_Embarked' : formula6,
                            'PClass_Sex_Age_Parch_Embarked' : formula7,
                            'PClass_Sex_SibSp_Parch_Embarked' : formula8
                           }

We will define a function that helps us handle missing values. The following function finds the cells within the DataFrame that have null values, obtains the set of similar passengers, and sets the null value to the mean value of that feature for the set of similar passengers. Similar passengers are defined as those having the same gender and passenger class as the passengers with the null feature value.

    In [24]: def fill_null_vals(df,col_name):
                 null_passengers=df[df[col_name].isnull()]
                 passenger_id_list = null_passengers['PassengerId'].tolist()
                 df_filled=df.copy()
                 for pass_id in passenger_id_list:
                     idx=df[df['PassengerId']==pass_id].index[0]
                     similar_passengers = df[(df['Sex']== null_passengers['Sex'][idx]) &
                                             (df['Pclass']==null_passengers['Pclass'][idx])]
                     mean_val = np.mean(similar_passengers[col_name].dropna())
                     df_filled.loc[idx,col_name]=mean_val
                 return df_filled

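The same group-mean imputation can also be sketched with pandas groupby and transform, which avoids the explicit loop over passengers. This is an alternative formulation rather than the function used in this chapter; the name fill_null_vals_grouped and the toy DataFrame are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_null_vals_grouped(df, col_name, group_cols=("Sex", "Pclass")):
    """Return a copy of df with nulls in col_name replaced by the group mean."""
    df_filled = df.copy()
    # transform("mean") skips NaN and returns one value per row, aligned to df.
    group_means = df.groupby(list(group_cols))[col_name].transform("mean")
    df_filled[col_name] = df[col_name].fillna(group_means)
    return df_filled

# Hypothetical toy data for illustration only.
toy_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Sex": ["male", "male", "female", "female", "male"],
    "Pclass": [1, 1, 2, 2, 1],
    "Age": [20.0, np.nan, 30.0, np.nan, 40.0],
})
toy_filled = fill_null_vals_grouped(toy_df, "Age")
print(toy_filled["Age"].tolist())
```

Each missing Age is replaced by the mean Age of the passengers who share the same Sex and Pclass, and the input DataFrame is left unchanged.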
Here, we create filled versions of our training and test DataFrames. The fitted scikit-learn model will generate predictions on the test DataFrame, and this output will be submitted to Kaggle for evaluation:

    In [28]: train_df_filled=fill_null_vals(train_df,'Fare')
             train_df_filled=fill_null_vals(train_df_filled,'Age')
             assert len(train_df_filled)==len(train_df)
             test_df_filled=fill_null_vals(test_df,'Fare')
             test_df_filled=fill_null_vals(test_df_filled,'Age')
             assert len(test_df_filled)==len(test_df)

Here is the actual implementation of the call to scikit-learn to learn from the training data by fitting a model and then generate predictions on the test dataset. Note that even though this is boilerplate code, for the purpose of illustration, an actual call is made to a specific algorithm, in this case, DecisionTreeClassifier.
The output data is written to files with descriptive names, for example, csv/dt_PClass_Sex_Age_Sibsp_Parch_1.csv and csv/dt_PClass_Sex_Fare_1.csv.

    In [29]: from sklearn import metrics,svm, tree
             for formula_name, formula in formula_map.items():
                 print("name=%s formula=%s" % (formula_name,formula))
                 # Build the label vector and the feature matrix from the formula
                 y_train,X_train = pt.dmatrices('Survived ~ ' + formula,
                                                train_df_filled,return_type='dataframe')
                 y_train = np.ravel(y_train)
                 model = tree.DecisionTreeClassifier(criterion='entropy',
                                                     max_depth=3,min_samples_leaf=5)
                 print("About to fit...")
                 dt_model = model.fit(X_train, y_train)
                 print("Training score:%s" % dt_model.score(X_train,y_train))
                 # Build the test feature matrix with the same formula and predict
                 X_test=pt.dmatrix(formula,test_df_filled)
                 predicted=dt_model.predict(X_test)
                 print("predicted:%s" % predicted[:5])
                 assert len(predicted)==len(test_df)
                 pred_results = pd.Series(predicted,name='Survived')
                 dt_results = pd.concat([test_df['PassengerId'], pred_results],axis=1)
                 dt_results.Survived = dt_results.Survived.astype(int)
                 results_file = 'csv/dt_%s_1.csv' % (formula_name)
                 print("output file: %s " % results_file)
                 dt_results.to_csv(results_file,index=False)

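The preceding code follows a standard recipe, and the synopsis is as follows:
- Read in the training and test datasets.
- Fill in any missing values for the features we wish to consider in both datasets.
- Define formulas for the various feature combinations for which we wish to generate machine learning models in Patsy.
- For each formula, perform the following steps:
  - Call Patsy to create design matrices for our training feature set and training label set (designated by X_train and y_train).
  - Instantiate the appropriate scikit-learn classifier. In this case, we use DecisionTreeClassifier.
  - Fit the model by calling the fit(..) method.
  - Call Patsy to create a design matrix (X_test) for our predicted output via a call to patsy.dmatrix(..).
  - Predict on the X_test design matrix, and save the results in the variable predicted.
  - Write the predictions to an output file, which will be submitted to Kaggle.

We will consider the following supervised learning algorithms: logistic regression, support vector machines, decision trees, and random forests.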
In logistic regression, we attempt to predict the outcome of a categorical (that is, discrete-valued) dependent variable on the basis of one or more input predictor variables.
Logistic regression can be thought of as the equivalent of applying linear regression to discrete or categorical variables. However, in the case of binary logistic regression (which applies to the Titanic problem), the function that we're trying to fit is not a linear one, as we're trying to predict an outcome that can take only two values: 0 and 1. Using a linear function for our regression doesn't make sense, as the outcome can only be 0 or 1, whereas a linear function is unbounded and takes every value in between. Ideally, what we need to model the regression of a binary-valued output is some sort of step function between the values 0 and 1. However, such a function is discontinuous and hence not differentiable at the step, so an approximation with nicer properties is used instead: the logistic function. The logistic function takes values between 0 and 1, flattens out towards the extreme values of 0 and 1, and can be used as a good approximation for the regression of categorical variables.
The formal definition of the logistic regression function is as follows:
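In its standard form, written here with a steepness parameter a (an assumption chosen to match the parameter a discussed alongside the graph), the logistic function is:

```latex
f(x) = \frac{1}{1 + e^{-a x}}
```

For large positive values of ax, f(x) approaches 1; for large negative values, it approaches 0; and f(0) = 0.5 for any a.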
The following graph is a good illustration as to why the logistic function is suitable for binary logistic regression:
We can see that as we increase the value of our parameter a, the curve gets steeper and comes closer to the 0-to-1 step function that we wish to model. A simple application of the preceding function would be to set the output value to 0 if f(x) < 0.5, and to 1 otherwise.
The code for plotting the function is included in plot_logistic.py.
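As a minimal sketch, independent of plot_logistic.py (which is not reproduced here), the following evaluates the standard logistic function for a few values of a and applies the 0.5 threshold. The function names logistic and to_binary and the sample points are hypothetical.

```python
import numpy as np

def logistic(x, a=1.0):
    """Standard logistic function 1 / (1 + exp(-a * x))."""
    return 1.0 / (1.0 + np.exp(-a * np.asarray(x, dtype=float)))

def to_binary(x, a=1.0, threshold=0.5):
    """Map f(x) to 0 when f(x) < threshold and to 1 otherwise."""
    return (logistic(x, a) >= threshold).astype(int)

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for a in (1, 5, 25):
    # Larger a pushes the values away from 0.5 and towards 0 or 1.
    print(a, np.round(logistic(xs, a), 3))

print(to_binary(xs))
```

As a grows, the values on either side of x = 0 move closer to 0 and 1, which is the step-like behaviour described above.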
A more detailed examination of logistic regression may be found at http://en.wikipedia.org/wiki/Logit and http://logisticregressionanalysis.com/86-what-is-logistic-regression.
In applying logistic regression to the Titanic problem, we wish to predict a binary outcome, that is, whether a passenger survived or not.
We adapted the boilerplate code to use the sklearn.linear_model.LogisticRegression class of scikit-learn.
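The adaptation amounts to swapping the estimator in the boilerplate. The following is a minimal sketch on hypothetical toy data that reuses the Titanic column names; it is not the contents of run_logistic_regression_titanic.py, and the default LogisticRegression settings are an assumption.

```python
import numpy as np
import pandas as pd
import patsy as pt
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with the same column names as the Titanic files.
toy_train = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
    "Sex": ["female", "female", "female", "male",
            "male", "male", "female", "male"],
    "Pclass": [1, 2, 3, 1, 2, 3, 1, 3],
})
toy_test = pd.DataFrame({
    "Sex": ["female", "male", "female"],
    "Pclass": [1, 2, 3],
})

formula = "C(Pclass) + C(Sex)"
y_train, X_train = pt.dmatrices("Survived ~ " + formula, toy_train,
                                return_type="dataframe")
y_train = np.ravel(y_train)

# Only the estimator changes relative to the decision tree boilerplate.
model = LogisticRegression()
lr_model = model.fit(X_train, y_train)

X_test = pt.dmatrix(formula, toy_test, return_type="dataframe")
predicted = lr_model.predict(X_test).astype(int)
print(predicted)
```

The toy test frame contains every category level, so both design matrices have the same columns; with real data, this is worth checking before calling predict.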
Upon submitting our data to Kaggle, the following results were obtained:
Formula | Kaggle Score |
---|---|
C(Pclass) + C(Sex) + Fare | 0.76077 |
C(Pclass) + C(Sex) | 0.76555 |
C(Sex) | 0.76555 |
C(Pclass) + C(Sex) + Age + SibSp + Parch | 0.74641 |
C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked) | 0.75598 |
The code implementing logistic regression can be found in the run_logistic_regression_titanic.py file.
Support vector machine (SVM) is a powerful supervised learning algorithm used for classification and regression. It is a discriminative classifier: it draws a boundary between clusters or classes of data, so that new points can be classified on the basis of the side of the boundary on which they fall.
SVMs do not just find a boundary line; they also try to determine margins for the boundary on either side. The SVM algorithm tries to find the boundary with the largest possible margin around it.
Support vectors are the points that define the largest margin around the boundary; if these points were removed, a larger margin could possibly be found.
Hence the name support: these points support the margin around the boundary line. This is illustrated in the following diagram:
For more information on this, refer to http://winfwiki.wi-fom.de/images/c/cf/Support_vector_2.png.
To use the SVM algorithm for classification, we specify one of the following three kernels: linear, poly, and rbf (also known as radial basis functions).
Then, we import the support vector classifier (SVC):
from sklearn import svm
We then instantiate an SVM classifier, fit the model, and predict as follows:

    model = svm.SVC(kernel=kernel)
    svm_model = model.fit(X_train, y_train)
    X_test = pt.dmatrix(formula, test_df_filled)
    . . .

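To see the three kernels side by side, here is a minimal sketch that fits svm.SVC on hypothetical synthetic data (two noisy clusters) instead of the Titanic design matrices. It reports the training accuracy and the number of support vectors for each kernel; the data and the scores dictionary are assumptions made for illustration.

```python
import numpy as np
from sklearn import svm

# Hypothetical synthetic data: two noisy clusters in two dimensions.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(1.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

scores = {}
for kernel in ("linear", "poly", "rbf"):
    model = svm.SVC(kernel=kernel)
    svm_model = model.fit(X, y)
    scores[kernel] = svm_model.score(X, y)
    # support_vectors_ holds the training points that define the margin.
    print(kernel, scores[kernel], len(svm_model.support_vectors_))
```

Training accuracy on synthetic data says nothing about the Kaggle scores that follow; the sketch only shows how the kernel argument is varied.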
Upon submitting our data to Kaggle, the following results were obtained:
Formula | Kernel Type | Kaggle Score |
---|---|---|
C(Pclass) + C(Sex) + Fare | poly | 0.71292 |
C(Pclass) + C(Sex) | poly | 0.76555 |
C(Sex) | poly | 0.76555 |
C(Pclass) + C(Sex) + Age + SibSp + Parch | poly | 0.75598 |
C(Pclass) + C(Sex) + Age + Parch + C(Embarked) | poly | 0.77512 |
C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked) | poly | 0.79426 |
C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked) | rbf | 0.7512 |
The code can be seen in its entirety in the following file: run_svm_titanic.py.
Here, we see that the SVM with a kernel type of poly (polynomial) and the combination of the Pclass, Sex, Age, SibSp, Parch, and Embarked features produces the best results when submitted to Kaggle (0.79426). Surprisingly, it seems as if the embarkation point (Embarked) and whether the passenger travelled alone or with family members (SibSp + Parch) do have a material effect on a passenger's chances of survival.
The latter effect was probably due to the women-and-children-first policy on the Titanic.
The basic idea behind decision trees is to use the training dataset to create a tree of decisions in order to make a prediction.
The algorithm recursively splits the training dataset into subsets on the basis of the value of a single feature. Each split corresponds to a node in the decision tree. The splitting process continues until every subset is pure, that is, until all of its elements belong to a single class. This always works except in cases where there are duplicate training examples that fall into different classes. In this case, the majority class wins.
The end result is a rule set for making predictions on the test dataset.
Decision trees encode a sequence of binary choices in a process that mimics how a human might classify things, but they decide which question is most useful at each step by using an information criterion, such as entropy.
An example of this would be if you wished to determine whether an animal x is a mammal, bird, fish, reptile, or amphibian; in this case, we would ask the following questions:
- Does x have fur?
  - Yes: x is a mammal
  - No: Does x have feathers?
    - Yes: x is a bird
    - No: Does x have scales?
      - Yes: Does x have gills?
        - Yes: x is a fish
        - No: x is a reptile
      - No: x is an amphibian
This generates a decision tree that looks similar to the following:
The binary splitting of questions at each node is the essence of a decision tree algorithm. A major drawback of decision trees is that they can overfit the data.
They are so flexible that, given a large depth, they can memorize the inputs, which leads to poor results when they are used to classify unseen data.
One way to address this is to use multiple decision trees; this is known as using an ensemble estimator. An example of an ensemble estimator is the random forest algorithm, which we will address next.
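The information criterion mentioned earlier can be illustrated with entropy, which is the criterion passed to DecisionTreeClassifier in this chapter. The following is a minimal sketch of how a candidate split is scored by its information gain; the function names and the toy labels are hypothetical, and scikit-learn's internal implementation differs in detail.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, mask):
    """Entropy reduction obtained by splitting labels with a boolean mask."""
    labels = np.asarray(labels)
    mask = np.asarray(mask, dtype=bool)
    n = len(labels)
    parts = (labels[mask], labels[~mask])
    # Weight each non-empty subset's entropy by its share of the samples.
    weighted = sum(len(part) / n * entropy(part) for part in parts if len(part))
    return entropy(labels) - weighted

# Hypothetical toy labels: 1 = survived, 0 = did not survive.
survived = np.array([1, 1, 1, 0, 0, 0, 0, 1])
is_female = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=bool)

print(entropy(survived))
print(information_gain(survived, is_female))
```

At each node, the question with the largest information gain is the most useful one to ask next.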
To use a decision tree in scikit-learn, we import the tree module:
from sklearn import tree
We then instantiate a decision tree classifier, fit the model, and predict as follows:

    model = tree.DecisionTreeClassifier(criterion='entropy',
                                        max_depth=3,min_samples_leaf=5)
    dt_model = model.fit(X_train, y_train)
    X_test = pt.dmatrix(formula, test_df_filled)
    #. . .

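To see the rule set that a fitted tree encodes, the animal example from earlier can be sketched as follows. The yes/no feature encoding and the one-example-per-class dataset are hypothetical, and the learned tree may ask the questions in a different order from the hand-written one.

```python
from sklearn import tree

feature_names = ["has_fur", "has_feathers", "has_scales", "has_gills"]
# One hypothetical example per class, encoded as yes (1) / no (0) answers.
X = [
    [1, 0, 0, 0],  # mammal
    [0, 1, 0, 0],  # bird
    [0, 0, 1, 1],  # fish
    [0, 0, 1, 0],  # reptile
    [0, 0, 0, 0],  # amphibian
]
y = ["mammal", "bird", "fish", "reptile", "amphibian"]

model = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(X, y)

# export_text renders the learned splits as nested rules.
print(tree.export_text(model, feature_names=feature_names))
predicted = list(model.predict(X))
```

Because no depth limit is set and every example is distinct, the tree reproduces the training labels exactly, which is also the memorization behaviour that leads to overfitting on larger datasets.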
Upon submitting our data to Kaggle, the following results are obtained:
Formula | Kaggle Score |
---|---|
C(Pclass) + C(Sex) + Fare | 0.77033 |
C(Pclass) + C(Sex) | 0.76555 |
C(Sex) | 0.76555 |
C(Pclass) + C(Sex) + Age + SibSp + Parch | 0.76555 |
C(Pclass) + C(Sex) + Age + Parch + C(Embarked) | 0.78947 |
C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked) | 0.79426 |
The random forest, like the decision tree, is an example of a non-parametric model. Random forests are based on decision trees. The decision boundary is learned from the data itself; it doesn't have to be a line, a polynomial, or a radial basis function. The random forest model builds upon the decision tree concept by producing a large number of decision trees, that is, a forest. Each tree is grown from a random sample of the data and a random subset of the features, and the predictions of the individual trees are combined to produce a classification model that is stronger than any single tree.
To use a random forest in scikit-learn, we import the RandomForestClassifier class from the sklearn.ensemble module:

    from sklearn.ensemble import RandomForestClassifier

We then instantiate a random forest classifier, fit the model, and predict as follows:

    model = RandomForestClassifier(n_estimators=num_estimators,
                                   random_state=0)
    rf_model = model.fit(X_train, y_train)
    X_test = pt.dmatrix(formula, test_df_filled)
    . . .

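The effect of the n_estimators argument can be sketched on hypothetical synthetic data as follows. The data, the chosen forest sizes, and the forest_sizes dictionary are assumptions made for illustration and are unrelated to the Kaggle submissions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data: two noisy clusters in four dimensions.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 4)),
               rng.normal(1.0, 1.0, size=(100, 4))])
y = np.array([0] * 100 + [1] * 100)

forest_sizes = {}
for num_estimators in (10, 100):
    model = RandomForestClassifier(n_estimators=num_estimators,
                                   random_state=0)
    rf_model = model.fit(X, y)
    # estimators_ holds the individual fitted decision trees.
    forest_sizes[num_estimators] = len(rf_model.estimators_)
    print(num_estimators, rf_model.score(X, y))
```

Each fitted forest contains exactly n_estimators trees, so larger values trade longer training time for a more stable combined prediction.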
Upon submitting our data to Kaggle (Formula: C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked)), the following results are obtained:
Number of estimators (num_estimators) | Kaggle Score |
---|---|
10 | 0.74163 |
100 | 0.76077 |
1000 | 0.76077 |
10000 | 0.77990 |
100000 | 0.77990 |