In this section, we will use pandas to do some analysis and preprocessing of the data before submitting it as input to scikit-learn.
To begin our preprocessing, let us read the training dataset into a pandas DataFrame and examine the first few rows:
In [2]: import pandas as pd
        import numpy as np
        # For read_csv, pass header=0 when you know row 0 is the header row
        train_df = pd.read_csv('csv/train.csv', header=0)
In [3]: train_df.head(3)
The output is as follows:
Thus, we can see the various features: PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked. One question springs to mind immediately: which of these features are likely to influence whether a passenger survived?
It should seem obvious that PassengerId, Ticket, and Name should not influence survivability, since they are identifier variables. We will skip these in our analysis.
One issue that we have to deal with in datasets for machine learning is how to handle missing values in the training set.
Let's visually identify where we have missing values in our feature set.
For that, we can make use of an equivalent of the missmap function in R, written by Tom Augspurger. The next graphic shows how much data is missing for each of the features in a visually intuitive manner:
For more information and the code used to generate this data, see the following: http://bit.ly/1C0a24U.
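At its core, such a missingness map is just a plot of the boolean mask returned by isnull(). The following is a minimal text-based sketch of that idea; the tiny DataFrame here is synthetic stand-in data (the linked code works on the real train_df):

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for train_df, with gaps in Age and Cabin
df = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare':  [7.25, 71.28, 7.92, 8.05],
})

# The boolean mask of missing values is all a missmap really displays
mask = df.isnull()

# Print an ASCII missingness map: '#' marks a missing value
print(' '.join(df.columns))
for _, row in mask.iterrows():
    print(''.join('#' if m else '.' for m in row))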
We can also calculate how much data is missing for each of the features:
In [83]: missing_perc = train_df.apply(
             lambda x: 100 * (1 - x.count() / (1.0 * len(x))))
In [85]: # Note: Series.order was later replaced by Series.sort_values
         sorted_missing_perc = missing_perc.order(ascending=False)
         sorted_missing_perc
Out[85]: Cabin          77.104377
         Age            19.865320
         Embarked        0.224467
         Fare            0.000000
         Ticket          0.000000
         Parch           0.000000
         SibSp           0.000000
         Sex             0.000000
         Name            0.000000
         Pclass          0.000000
         Survived        0.000000
         PassengerId     0.000000
         dtype: float64
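In modern pandas, the same percentages can be computed more directly with isnull().mean(), and sort_values replaces the removed Series.order. A sketch on a small synthetic frame standing in for train_df:

```python
import pandas as pd
import numpy as np

# Hypothetical sample standing in for train_df
train_df = pd.DataFrame({
    'Cabin':    [np.nan, 'C85', np.nan, np.nan],
    'Age':      [22.0, 38.0, np.nan, 35.0],
    'Embarked': ['S', 'C', 'S', 'S'],
})

# isnull().mean() gives the fraction of missing values per column
missing_perc = 100 * train_df.isnull().mean()

# sort_values is the modern replacement for the deprecated Series.order
sorted_missing_perc = missing_perc.sort_values(ascending=False)
print(sorted_missing_perc)
```

On the real training set this reproduces the percentages shown above.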
Thus, we can see that most of the Cabin data is missing (77%), while around 20% of the Age data is missing. We then decide to drop the Cabin data from our learning feature set, as the data is too sparse to be of much use.
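Dropping the column is a one-liner with DataFrame.drop; the frame below is a small synthetic stand-in for train_df:

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for train_df
train_df = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Age':      [22.0, 38.0, 26.0],
    'Cabin':    [np.nan, 'C85', np.nan],
})

# Drop the sparse Cabin column from the feature set (axis=1 means columns)
train_df = train_df.drop('Cabin', axis=1)
print(train_df.columns.tolist())
```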
Let us do a further breakdown of the various features that we would like to examine. For categorical/discrete features, we use bar plots; for continuous-valued features, we use histograms:
In [137]: import numpy as np
          import matplotlib.pyplot as plt

          bar_width = 0.1
          categories_map = {
              'Pclass':   {'First': 1, 'Second': 2, 'Third': 3},
              'Sex':      {'Female': 'female', 'Male': 'male'},
              'Survived': {'Perished': 0, 'Survived': 1},
              'Embarked': {'Cherbourg': 'C', 'Queenstown': 'Q',
                           'Southampton': 'S'},
              'SibSp':    {str(x): x for x in [0, 1, 2, 3, 4, 5, 8]},
              'Parch':    {str(x): x for x in range(7)}
          }
          # 6 bar plots for the categorical features + 2 histograms
          fig, ax = plt.subplots(8, figsize=(10, 12))
          cIdx = 0
          keyorder = ['Survived', 'Sex', 'Pclass', 'Embarked',
                      'SibSp', 'Parch']
          for category_key, category_items in sorted(
                  categories_map.items(),
                  key=lambda i: keyorder.index(i[0])):
              num_bars = len(category_items)
              index = np.arange(num_bars)
              idx = 0
              for cat_name, cat_val in sorted(category_items.items()):
                  ax[cIdx].bar(idx,
                               len(train_df[train_df[category_key] == cat_val]),
                               label=cat_name, color=np.random.rand(3))
                  idx += 1
              ax[cIdx].set_title('%s Breakdown' % category_key)
              xlabels = sorted(category_items.keys())
              ax[cIdx].set_xticks(index + bar_width)
              ax[cIdx].set_xticklabels(xlabels)
              ax[cIdx].set_ylabel('Count')
              cIdx += 1
          for hcat in ['Age', 'Fare']:
              ax[cIdx].hist(train_df[hcat].dropna(),
                            color=np.random.rand(3))
              ax[cIdx].set_title('%s Breakdown' % hcat)
              ax[cIdx].set_ylabel('Frequency')
              cIdx += 1
          fig.subplots_adjust(hspace=0.8)
          plt.show()
From the data and illustration in the preceding figure, we can observe the following:
These observations might lead us to dig deeper and investigate whether there is some correlation between chances of survival and gender, as well as fare class, particularly if we take into account the fact that the Titanic had a women-and-children-first policy (http://en.wikipedia.org/wiki/Women_and_children_first) and that it was carrying fewer lifeboats (20) than it was designed to carry (32).
In light of this, let us further examine the relationships between survival and some of these features. We start with gender:
In [85]: from collections import OrderedDict

         num_passengers = len(train_df)
         num_men = len(train_df[train_df['Sex'] == 'male'])
         men_survived = train_df[(train_df['Survived'] == 1) &
                                 (train_df['Sex'] == 'male')]
         num_men_survived = len(men_survived)
         num_men_perished = num_men - num_men_survived
         num_women = num_passengers - num_men
         women_survived = train_df[(train_df['Survived'] == 1) &
                                   (train_df['Sex'] == 'female')]
         num_women_survived = len(women_survived)
         num_women_perished = num_women - num_women_survived
         gender_survival_dict = OrderedDict()
         gender_survival_dict['Survived'] = {'Men': num_men_survived,
                                             'Women': num_women_survived}
         gender_survival_dict['Perished'] = {'Men': num_men_perished,
                                             'Women': num_women_perished}
         gender_survival_dict['Survival Rate'] = {
             'Men': round(100.0 * num_men_survived / num_men, 2),
             'Women': round(100.0 * num_women_survived / num_women, 2)}
         pd.DataFrame(gender_survival_dict)
Out[85]:
Gender | Survived | Perished | Survival Rate
---|---|---|---
Men | 109 | 468 | 18.89
Women | 233 | 81 | 74.20
We now illustrate this data in a bar chart using the following command:
In [76]: # Code to display survival by gender
         fig = plt.figure()
         ax = fig.add_subplot(111)
         perished_data = [num_men_perished, num_women_perished]
         survived_data = [num_men_survived, num_women_survived]
         N = 2
         ind = np.arange(N)   # the y locations for the groups
         width = 0.35
         survived_rects = ax.barh(ind, survived_data, width,
                                  color='green')
         perished_rects = ax.barh(ind + width, perished_data, width,
                                  color='red')
         ax.set_xlabel('Count')
         ax.set_title('Count of Survival by Gender')
         yTickMarks = ['Men', 'Women']
         ax.set_yticks(ind + width)
         ytickNames = ax.set_yticklabels(yTickMarks)
         plt.setp(ytickNames, rotation=45, fontsize=10)
         # Add a legend
         ax.legend((survived_rects[0], perished_rects[0]),
                   ('Survived', 'Perished'))
         plt.show()
The preceding code produces the following bar graph:
From the preceding plot, we can see that a majority of the women survived (74%), while most of the men perished (only 19% survived).
This leads us to the conclusion that the gender of the passenger may be a contributing factor to whether a passenger survived or not.
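The per-gender bookkeeping above can also be condensed with groupby. The sketch below uses a tiny synthetic frame in place of the real train_df, so the numbers are illustrative only; since Survived is coded 0/1, the group mean is exactly the survival rate:

```python
import pandas as pd

# Hypothetical sample standing in for train_df
train_df = pd.DataFrame({
    'Sex':      ['male', 'female', 'male', 'female', 'male'],
    'Survived': [0, 1, 0, 1, 1],
})

# Mean of a 0/1 column per group = survival rate per gender
survival_rate = 100 * train_df.groupby('Sex')['Survived'].mean()
print(survival_rate.round(2))
```

On the real training set this reproduces the 18.89% / 74.20% figures in the table above.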
Next, let us look at passenger class. First, we generate the survived and perished data for each of the three passenger classes, as well as survival rates and show them in a table:
In [86]: from collections import OrderedDict

         num_passengers = len(train_df)
         num_class1 = len(train_df[train_df['Pclass'] == 1])
         class1_survived = train_df[(train_df['Survived'] == 1) &
                                    (train_df['Pclass'] == 1)]
         num_class1_survived = len(class1_survived)
         num_class1_perished = num_class1 - num_class1_survived
         num_class2 = len(train_df[train_df['Pclass'] == 2])
         class2_survived = train_df[(train_df['Survived'] == 1) &
                                    (train_df['Pclass'] == 2)]
         num_class2_survived = len(class2_survived)
         num_class2_perished = num_class2 - num_class2_survived
         num_class3 = num_passengers - num_class1 - num_class2
         class3_survived = train_df[(train_df['Survived'] == 1) &
                                    (train_df['Pclass'] == 3)]
         num_class3_survived = len(class3_survived)
         num_class3_perished = num_class3 - num_class3_survived
         pclass_survival_dict = OrderedDict()
         pclass_survival_dict['Survived'] = {
             '1st Class': num_class1_survived,
             '2nd Class': num_class2_survived,
             '3rd Class': num_class3_survived}
         pclass_survival_dict['Perished'] = {
             '1st Class': num_class1_perished,
             '2nd Class': num_class2_perished,
             '3rd Class': num_class3_perished}
         pclass_survival_dict['Survival Rate'] = {
             '1st Class': round(100.0 * num_class1_survived / num_class1, 2),
             '2nd Class': round(100.0 * num_class2_survived / num_class2, 2),
             '3rd Class': round(100.0 * num_class3_survived / num_class3, 2)}
         pd.DataFrame(pclass_survival_dict)
Out[86]:
Passenger Class | Survived | Perished | Survival Rate
---|---|---|---
First Class | 136 | 80 | 62.96
Second Class | 87 | 97 | 47.28
Third Class | 119 | 372 | 24.24
We can then plot the data using matplotlib in a similar manner to that for the survivor count by gender described earlier:
In [186]: fig = plt.figure()
          ax = fig.add_subplot(111)
          perished_data = [num_class1_perished, num_class2_perished,
                           num_class3_perished]
          survived_data = [num_class1_survived, num_class2_survived,
                           num_class3_survived]
          N = 3
          ind = np.arange(N)   # the y locations for the groups
          width = 0.35
          survived_rects = ax.barh(ind, survived_data, width,
                                   color='blue')
          perished_rects = ax.barh(ind + width, perished_data, width,
                                   color='red')
          ax.set_xlabel('Count')
          ax.set_title('Survivor Count by Passenger class')
          yTickMarks = ['1st Class', '2nd Class', '3rd Class']
          ax.set_yticks(ind + width)
          ytickNames = ax.set_yticklabels(yTickMarks)
          plt.setp(ytickNames, rotation=45, fontsize=10)
          # Add a legend
          ax.legend((survived_rects[0], perished_rects[0]),
                    ('Survived', 'Perished'), loc=10)
          plt.show()
This produces the following bar plot:
It seems clear from the preceding data and illustration that the higher the passenger fare class, the greater one's chances of survival.
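As with gender, the per-class rates can be computed in one groupby line; the small frame below is a synthetic stand-in for train_df, so the printed rates are illustrative only:

```python
import pandas as pd

# Hypothetical sample standing in for train_df
train_df = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1],
})

# Per-class mean of the 0/1 Survived column = survival rate per class
rate_by_class = 100 * train_df.groupby('Pclass')['Survived'].mean()
print(rate_by_class.round(2))
```

On the real training set this reproduces the 62.96% / 47.28% / 24.24% rates in the table above.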
Given that both gender and fare class seem to influence the chances of a passenger's survival, let's see what happens when we combine these two features and plot a combination of both. For this, we shall use the crosstab function in pandas.
In [173]: survival_counts = pd.crosstab([train_df.Pclass, train_df.Sex],
                                        train_df.Survived.astype(bool))
          survival_counts
Out[173]:
Survived       False  True
Pclass Sex
1      female      3    91
       male       77    45
2      female      6    70
       male       91    17
3      female     72    72
       male      300    47
Let us now display this data using matplotlib. First, let's do some relabeling for display purposes:
In [183]: survival_counts.index = survival_counts.index.set_levels(
              [['1st', '2nd', '3rd'], ['Women', 'Men']])
In [184]: survival_counts.columns = ['Perished', 'Survived']
Now, we plot the data using the plot function of a pandas DataFrame:
In [185]: fig = plt.figure()
          ax = fig.add_subplot(111)
          ax.set_xlabel('Count')
          ax.set_title('Survivor Count by Passenger class, Gender')
          survival_counts.plot(kind='barh', ax=ax, width=0.75,
                               color=['red', 'black'], xlim=(0, 400))
Out[185]: <matplotlib.axes._subplots.AxesSubplot at 0x7f714b187e90>
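crosstab can also produce rates rather than raw counts: passing normalize='index' (available in pandas 0.18.1 and later) divides each row by its total. A sketch on a tiny synthetic frame standing in for train_df:

```python
import pandas as pd

# Hypothetical sample standing in for train_df
train_df = pd.DataFrame({
    'Pclass':   [1, 1, 3, 3],
    'Sex':      ['female', 'male', 'female', 'male'],
    'Survived': [1, 0, 1, 0],
})

# normalize='index' turns counts into per-(class, gender) fractions
rates = pd.crosstab([train_df.Pclass, train_df.Sex],
                    train_df.Survived.astype(bool),
                    normalize='index')
print(rates)
```

On the real training set, this makes the combined gender/class effect (e.g. the near-total survival of first-class women, 91 of 94) directly comparable across groups.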