Data handling and visualization

In this section, we are going to do some data preprocessing and analysis. Data exploration and analysis is one of the most important steps in applying machine learning, and arguably the most important one: it is where you get to know your data, which will accompany you throughout the training process. Knowing your data also lets you narrow down the set of candidate algorithms and check which one fits your data best.

Let's start off by importing the necessary packages for our implementation:

import matplotlib.pyplot as plt
%matplotlib inline

from statsmodels.nonparametric.kde import KDEUnivariate
from statsmodels.nonparametric import smoothers_lowess
from pandas import Series, DataFrame
from patsy import dmatrices
from sklearn import datasets, svm

import numpy as np
import pandas as pd
import statsmodels.api as sm

from scipy import stats
# stats.chisqprob was removed in newer SciPy versions; restore it here for
# compatibility with statsmodels' significance tests
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

Let’s read the Titanic passengers and crew data using Pandas:

titanic_data = pd.read_csv("data/titanic_train.csv")

Next up, let's check the dimensions of our dataset and see how many examples we have and how many explanatory features are describing our dataset:

titanic_data.shape

Output:
(891, 12)

So, we have a total of 891 observations (data samples, or passenger/crew records) and 12 explanatory features describing each record:

list(titanic_data)

Output:
['PassengerId',
'Survived',
'Pclass',
'Name',
'Sex',
'Age',
'SibSp',
'Parch',
'Ticket',
'Fare',
'Cabin',
'Embarked']

Let's see the data of some samples/observations:

titanic_data[500:510]

Output:

Figure 7: Samples from the Titanic dataset

Now, we have a Pandas DataFrame that holds the information of 891 passengers that we need to analyze. The columns of the DataFrame represent the explanatory features about each passenger/crew, like name, sex, or age.

Some of these explanatory features are complete without any missing values, such as the Survived feature, which has 891 entries. Other explanatory features contain missing values, such as the Age feature, which has only 714 entries. Any missing value in the DataFrame is represented as NaN.
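The per-column missing counts can be checked directly with isnull().sum(). A minimal sketch on a toy DataFrame (the column names mirror the Titanic data, but the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a few Titanic columns (illustrative values only)
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
})

# Number of NaN entries in each column
print(toy.isnull().sum())
```

Running the same call on titanic_data shows at a glance which features are complete and which are sparse.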

If you explore all of the dataset features, you will find that the Ticket and Cabin features have many missing values (NaNs), so they won't add much value to our analysis. To handle this, we will drop them from the DataFrame.

Use the following line of code to drop the Ticket and Cabin features entirely from the DataFrame:

titanic_data = titanic_data.drop(['Ticket','Cabin'], axis=1)

There are a lot of reasons to have such missing values in our dataset. But in order to preserve the integrity of the dataset, we need to handle such missing values. In this specific problem, we will choose to drop them.
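Dropping is not the only way to handle missing values; a common alternative, which we do not use in this chapter, is imputation, for example filling missing ages with the median of the observed ages. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Toy Age column with one missing entry (illustrative values only)
toy_age = pd.Series([22.0, np.nan, 26.0, 30.0], name='Age')

# Median imputation: replace NaNs with the median of the observed values
filled = toy_age.fillna(toy_age.median())
print(filled.isnull().sum())  # 0
```

Imputation preserves all observations at the cost of injecting assumed values, so the right choice depends on how much data you can afford to lose.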

Use the following line of code in order to remove all NaN values from all the remaining features:

titanic_data = titanic_data.dropna()

Now, we have a sort of complete dataset that we can use to do our analysis. If you decided to just delete all the NaNs without first dropping the Ticket and Cabin features, you would find that most of the dataset is removed, because the .dropna() method removes an observation from the DataFrame even if it has only one NaN in one of the features.
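This behavior is easy to see on a toy DataFrame: .dropna() removes every row containing at least one NaN, so dropping a sparse column first preserves far more rows (the values below are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0],
    'Cabin': [np.nan, np.nan, 'C85'],  # sparse column, mostly NaN
})

# Dropping NaN rows directly keeps only rows complete in every column
print(len(toy.dropna()))                          # 1 row survives

# Dropping the sparse column first keeps every row with a known Age
print(len(toy.drop(['Cabin'], axis=1).dropna()))  # 2 rows survive
```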

Let’s do some data visualization to see the distribution of some features and understand the relationship between the explanatory features:

# declaring graph parameters
fig = plt.figure(figsize=(18,6))
alpha_scatterplot = 0.3
alpha_bar_chart = 0.55
# Defining a grid of subplots to contain all the figures
ax1 = plt.subplot2grid((2,3),(0,0))
# Add the first bar plot which represents the count of people who survived vs not survived.
titanic_data.Survived.value_counts().plot(kind='bar', alpha=alpha_bar_chart)
# Adding margins to the plot
ax1.set_xlim(-1, 2)
# Adding bar plot title
plt.title("Distribution of Survival, (1 = Survived)")
plt.subplot2grid((2,3),(0,1))
plt.scatter(titanic_data.Survived, titanic_data.Age, alpha=alpha_scatterplot)
# Setting the value of the y label (age)
plt.ylabel("Age")
# formatting the grid
plt.grid(visible=True, which='major', axis='y')
plt.title("Survival by Age, (1 = Survived)")
ax3 = plt.subplot2grid((2,3),(0,2))
titanic_data.Pclass.value_counts().plot(kind="barh", alpha=alpha_bar_chart)
ax3.set_ylim(-1, len(titanic_data.Pclass.value_counts()))
plt.title("Class Distribution")
plt.subplot2grid((2,3),(1,0), colspan=2)
# plotting kernel density estimates of the age distribution within each passenger class
titanic_data.Age[titanic_data.Pclass == 1].plot(kind='kde')
titanic_data.Age[titanic_data.Pclass == 2].plot(kind='kde')
titanic_data.Age[titanic_data.Pclass == 3].plot(kind='kde')
# Adding x label (age) to the plot
plt.xlabel("Age")
plt.title("Age Distribution within classes")
# Add legend to the plot.
plt.legend(('1st Class', '2nd Class','3rd Class'),loc='best')
ax5 = plt.subplot2grid((2,3),(1,2))
titanic_data.Embarked.value_counts().plot(kind='bar', alpha=alpha_bar_chart)
ax5.set_xlim(-1, len(titanic_data.Embarked.value_counts()))
plt.title("Passengers per boarding location")

Figure 8: Basic visualizations for the Titanic data samples

As we mentioned, the purpose of this analysis is to predict whether a specific passenger will survive the tragedy based on the available features, such as traveling class (called Pclass in the data), Sex, Age, and Fare. So, let's see if we can get a better visual understanding of the passengers who survived and died.
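Alongside the plots, a quick numeric view of survival against a feature such as Sex can be obtained with pd.crosstab, which tabulates raw counts per combination. A sketch on a toy DataFrame (made-up rows, since the layout is what matters here):

```python
import pandas as pd

toy = pd.DataFrame({
    'Sex':      ['male', 'female', 'female', 'male', 'female'],
    'Survived': [0, 1, 1, 0, 0],
})

# Rows: gender; columns: survival outcome; cells: raw counts
counts = pd.crosstab(toy['Sex'], toy['Survived'])
print(counts)
```

The same call on titanic_data gives the exact numbers behind the survival bar plots that follow.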

First, let's draw a bar plot to see the number of observations in each class (survived/died):

fig, ax = plt.subplots(figsize=(6,4))
titanic_data.Survived.value_counts().plot(kind='barh', color="blue", alpha=.65)
ax.set_ylim(-1, len(titanic_data.Survived.value_counts()))
plt.title("Breakdown of survivals (0 = Died, 1 = Survived)")

Figure 9: Survival breakdown

Let's get some more understanding of the data by breaking down the previous graph by gender:

fig = plt.figure(figsize=(18,6))
#Plotting gender based analysis for the survivals.
male = titanic_data.Survived[titanic_data.Sex == 'male'].value_counts().sort_index()
female = titanic_data.Survived[titanic_data.Sex == 'female'].value_counts().sort_index()
ax1 = fig.add_subplot(121)
male.plot(kind='barh',label='Male', alpha=0.55)
female.plot(kind='barh', color='#FA2379',label='Female', alpha=0.55)
plt.title("Gender analysis of survivals (raw value counts)"); plt.legend(loc='best')
ax1.set_ylim(-1, 2)
ax2 = fig.add_subplot(122)
(male/float(male.sum())).plot(kind='barh',label='Male', alpha=0.55)
(female/float(female.sum())).plot(kind='barh', color='#FA2379',label='Female', alpha=0.55)
plt.title("Gender analysis of survivals (proportion within each gender)"); plt.legend(loc='best')
ax2.set_ylim(-1, 2)

Figure 10: Further breakdown for the Titanic data by the gender feature

Now, we have more information about the two possible classes (survived and died). The exploration and visualization step is necessary because it gives you more insight into the structure of the data and helps you choose a suitable learning algorithm for your problem. As you can see, we started with very basic plots and then increased the complexity of the plots to discover more about the data we were working with.
