From Disaster to Decision – Titanic Example Revisited

In Lesson 1, From Data to Decisions – Getting Started with TensorFlow, we saw a minimal analysis of the Titanic dataset. Now it's our turn to do some analytics on top of that data. Let's look at what kinds of people survived the disaster.

We have enough data, but how can we do the predictive modeling so that we can draw some fairly straightforward conclusions from it? For example, being a woman, being in first class, and being a child were all factors that could boost a passenger's chances of survival during this disaster.

Using a brute-force approach, such as if-else statements with some sort of weighted scoring system, you could write a program to predict whether a given passenger would survive the disaster. However, writing such a program in Python does not make much sense: it would be very tedious to write, difficult to generalize, and would require extensive fine-tuning for each variable and sample (that is, each passenger).
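
For illustration only, here is a minimal, hedged sketch of such a hand-written scoring rule; the weights and the threshold are made up and are not learned from the data:

def brute_force_survival(passenger):
    # Hypothetical hand-tuned weights and threshold; purely illustrative.
    score = 0.0
    if passenger.get('Sex') == 'female':
        score += 2.0
    if passenger.get('Pclass') == 1:
        score += 1.5
    if passenger.get('Age') is not None and passenger['Age'] < 10:
        score += 1.0
    return 1 if score >= 2.0 else 0   # 1 = survived, 0 = did not survive

print(brute_force_survival({'Sex': 'female', 'Pclass': 3, 'Age': 29}))  # prints 1

Every new rule or passenger would require revisiting these weights by hand, which is exactly what a learned model avoids.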


Figure 10: A regression algorithm is meant to produce continuous output

At this point, you might be wondering what the basic difference between a classification and a regression problem is. A regression algorithm is meant to produce continuous output, and its input is allowed to be either discrete or continuous. In contrast, a classification algorithm is meant to produce discrete output from discrete or continuous input. This distinction is important to know because discrete-valued outputs are handled better by classification, which will be discussed in upcoming sections (a toy sketch follows Figure 11):


Figure 11: A classification algorithm is meant to produce discrete output
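
For intuition, here is a minimal, hedged sketch contrasting the two kinds of output on made-up numbers (not the actual Titanic data), using scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data: one feature (age), two different targets (values are invented).
age = np.array([[5.0], [22.0], [38.0], [54.0], [71.0]])
fare = np.array([8.0, 7.2, 71.3, 52.0, 34.7])   # continuous target -> regression
survived = np.array([1, 0, 1, 1, 0])            # discrete target   -> classification

print(LinearRegression().fit(age, fare).predict([[30.0]]))       # any real number, e.g. a fare estimate
print(LogisticRegression().fit(age, survived).predict([[30.0]])) # one of the labels {0, 1}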

In this section, we will see how we can develop several predictive models for Titanic survival prediction and do some analytics with them. In particular, we will discuss logistic regression, linear SVM, and random forest. We start with logistic regression; then we move to SVM, since the number of features is not that large; finally, we will see how we can improve the performance using random forest. However, before diving in too deeply, a short exploratory analysis of the dataset is required.

An Exploratory Analysis of the Titanic Dataset

We will see how the variables contribute to survival. At first, we need to import the required packages:

import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import shutil

Now, let's load the data and check what features are available to us:

train = pd.read_csv(os.path.join('input', 'train.csv'))
test = pd.read_csv(os.path.join('input', 'test.csv'))
print("Information about the data")
print(train.info())
>>> 
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object

So, the training dataset has 12 columns and 891 rows altogether. Also, the Age, Cabin, and Embarked columns have null or missing values. We will take care of the null values in the feature engineering section, but for the time being, let's see how many have survived:

print("How many have survived?")
print(train.Survived.value_counts(normalize=True))
count_plot = sns.countplot(train.Survived)
count_plot.get_figure().savefig("survived_count.png")
>>>

How many have survived?

0    0.616162
1    0.383838

So, approximately 62% of passengers died and only 38% managed to survive, as shown in the following figure:


Figure 12: Survived versus dead from the Titanic training set

Now, what is the relationship between a passenger's title and the rate of survival? First, let's see the counts for each title:

train['Name_Title'] = train['Name'].apply(lambda x: x.split(',')[1]).apply(lambda x: x.split()[0])
print('Title count')
print(train['Name_Title'].value_counts())
print('Survived by title')
print(train['Survived'].groupby(train['Name_Title']).mean())
>>> 	
Title      count
Mr.          517
Miss.        182
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Mlle.          2
Col.           2
Major.         2
Sir.           1
Jonkheer.      1
Lady.          1
Capt.          1
the            1
Don.           1
Ms.            1
Mme.           1

As you may remember from the movie (that is, Titanic, 1997), people from the higher classes had a better chance of surviving, so you may assume that the title could be an important factor in survival, too. Another interesting observation is that people with longer names have a higher probability of survival. This is mostly because passengers with longer names were married ladies, whose husbands or family members probably helped them to survive:

train['Name_Len'] = train['Name'].apply(lambda x: len(x))
print('Survived by name length')
print(train['Survived'].groupby(pd.qcut(train['Name_Len'],5)).mean())
>>>
Survived by name length 
(11.999, 19.0]    0.220588
(19.0, 23.0]      0.301282
(23.0, 27.0]      0.319797
(27.0, 32.0]      0.442424
(32.0, 82.0]      0.674556

Women and children had a higher chance of survival, since they were the first to be evacuated from the sinking ship:

print('Survived by sex')
print(train['Survived'].groupby(train['Sex']).mean())
>>> 
Survived by sex
Sex
female    0.742038
male      0.188908

Cabin has the most nulls (almost 700), but we can still extract information from it, such as the first letter of each cabin. We can then see how the cabin letters are associated with the survival rate:

train['Cabin_Letter'] = train['Cabin'].apply(lambda x: str(x)[0])
print('Survived by Cabin_Letter')
print(train['Survived'].groupby(train['Cabin_Letter']).mean())
>>>
Survived by Cabin_Letter
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
T    0.000000
n    0.299854

Finally, it also seems that people who embarked at Cherbourg had a roughly 20% higher survival rate than those who embarked at the other locations. This is very likely due to the high percentage of upper-class passengers from that location:

print('Survived by Embarked')
print(train['Survived'].groupby(train['Embarked']).mean())
count_plot = sns.countplot(train['Embarked'], hue=train['Pclass'])
count_plot.get_figure().savefig("survived_count_by_embarked.png")

>>> 
Survived by Embarked
C    0.553571
Q    0.389610
S    0.336957

Graphically, the preceding result can be seen as follows:

An Exploratory Analysis of the Titanic Dataset

Figure 13: Survived by embarked

Thus, there were several important factors in people's survival. This means we need to consider these factors while developing our predictive models.

We will train several binary classifiers, since this is a binary classification problem with two classes, that is, 0 and 1, using the training set, and will use the test set for making survival predictions.

But before we do that, let's do some feature engineering, since you have seen that there are some missing or null values. We will either impute them or drop the corresponding entries from the training and test sets. Moreover, we cannot use our datasets directly; we need to prepare them so that they can be fed to our machine learning models.

Feature Engineering

Since we are considering the length of the passenger's name as an important feature, it is better to compute the corresponding length and extract only the title, and then remove the name itself:

def create_name_feat(train, test):
    for i in [train, test]:
        i['Name_Len'] = i['Name'].apply(lambda x: len(x))
        i['Name_Title'] = i['Name'].apply(lambda x: x.split(',')[1]).apply(lambda x: x.split()[0])
        del i['Name']
    return train, test

There are 177 null values for Age, and those rows have a roughly 10% lower survival rate than the non-null rows. Therefore, before imputing values for the nulls, we include an Age_Null_Flag column, just to make sure we can account for this characteristic of the data:

def age_impute(train, test):
    for i in [train, test]:
        i['Age_Null_Flag'] = i['Age'].apply(lambda x: 1 if pd.isnull(x) else 0)
        data = train.groupby(['Name_Title', 'Pclass'])['Age']
        i['Age'] = data.transform(lambda x: x.fillna(x.mean()))
    return train, test

We impute the null age values with the mean age of passengers with the same title and class. This adds some extra bias to the dataset, but for the sake of a better predictive model, we have to sacrifice something.

Then we combine the SibSp and Parch columns to create a family size feature and break it into three levels:

def fam_size(train, test):
    for i in [train, test]:
        i['Fam_Size'] = np.where((i['SibSp']+i['Parch']) == 0, 'One',
                                 np.where((i['SibSp']+i['Parch']) <= 3, 'Small', 'Big'))
        del i['SibSp']
        del i['Parch']
    return train, test

We use the Ticket column to create Ticket_Letr, which indicates the first letter of each ticket, and Ticket_Len, which indicates the length of the Ticket field:

def ticket_grouped(train, test):
    for i in [train, test]:
        i['Ticket_Letr'] = i['Ticket'].apply(lambda x: str(x)[0])
        i['Ticket_Letr'] = i['Ticket_Letr'].apply(lambda x: str(x))
        i['Ticket_Letr'] = np.where((i['Ticket_Letr']).isin(['1', '2', '3', 'S', 'P', 'C', 'A']),
                                    i['Ticket_Letr'],
                                    np.where((i['Ticket_Letr']).isin(['W', '4', '7', '6', 'L', '5', '8']),'Low_ticket', 'Other_ticket'))
        i['Ticket_Len'] = i['Ticket'].apply(lambda x: len(x))
        del i['Ticket']
    return train, test

We also need to extract the first letter of the Cabin column:

def cabin(train, test):
    for i in [train, test]:
        i['Cabin_Letter'] = i['Cabin'].apply(lambda x: str(x)[0])
        del i['Cabin']
    return train, test

Fill the null values in the Embarked column with the most commonly occurring value, which is 'S':

def embarked_impute(train, test):
    for i in [train, test]:
        i['Embarked'] = i['Embarked'].fillna('S')
    return train, test

We now need to convert our categorical columns. The predictive models we will be creating require numerical values for string variables, so the dummies() function below applies one-hot encoding to the string columns:

def dummies(train, test,
            columns = ['Pclass', 'Sex', 'Embarked', 'Ticket_Letr', 'Cabin_Letter', 'Name_Title', 'Fam_Size']):
    for column in columns:
        train[column] = train[column].apply(lambda x: str(x))
        test[column] = test[column].apply(lambda x: str(x))
        good_cols = [column+'_'+i for i in train[column].unique() if i in test[column].unique()]
        train = pd.concat((train, pd.get_dummies(train[column], prefix=column)[good_cols]), axis=1)
        test = pd.concat((test, pd.get_dummies(test[column], prefix=column)[good_cols]), axis=1)
        del train[column]
        del test[column]
    return train, test

Now that we have the numerical features, we finally need to create a separate column for the target values:

def PrepareTarget(data):
    return np.array(data.Survived, dtype='int8').reshape(-1, 1)

We have seen the data and its characteristics and done some feature engineering to construct the best features for the linear models. The next task is to build the predictive models and make a prediction on the test set. Let's start with the logistic regression.

Logistic Regression for Survival Prediction

Logistic regression is one of the most widely used classifiers for predicting a binary response. It is a linear machine learning method, and its loss function in the formulation is given by the logistic loss:

L(w; x, y) := log(1 + exp(-y * wᵀx))

For a binary classification problem, the algorithm outputs a binary logistic regression model such that, for a given new data point, denoted by x, the model makes predictions by applying the logistic function:

f(z) = 1 / (1 + e^(-z))

In the preceding equation, z = wᵀx, and if f(wᵀx) > 0.5, the outcome is positive; otherwise, it is negative. Note that the raw output of the logistic regression model, f(z), has a probabilistic interpretation.
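
As a quick numeric illustration (the weight vector and the data point below are made up), the prediction rule looks like this in code:

import numpy as np

w = np.array([0.4, -0.2])      # hypothetical learned weight vector
x = np.array([3.0, 1.0])       # a new data point
z = np.dot(w, x)               # z = w^T x = 1.0
f = 1.0 / (1.0 + np.exp(-z))   # logistic function, approximately 0.73
print("positive" if f > 0.5 else "negative")  # positive, since f > 0.5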

If you now compare logistic regression with its predecessor, linear regression, the former provides higher accuracy for classification. Moreover, it offers a flexible way to regularize a model for custom adjustment, and the model outputs can be interpreted as probability measures. Most importantly, whereas linear regression can predict only continuous values, logistic regression can be generalized to predict discrete ones. From now on, we will often be using the TensorFlow contrib API, so let's have a quick look at it.

Using TensorFlow Contrib

tf.contrib.learn is a high-level API for machine learning with TensorFlow. It supports the following estimators:

  • tf.contrib.learn.BaseEstimator
  • tf.contrib.learn.Estimator
  • tf.contrib.learn.Trainable
  • tf.contrib.learn.Evaluable
  • tf.contrib.learn.KMeansClustering
  • tf.contrib.learn.ModeKeys
  • tf.contrib.learn.ModelFnOps
  • tf.contrib.learn.MetricSpec
  • tf.contrib.learn.PredictionKey
  • tf.contrib.learn.DNNClassifier
  • tf.contrib.learn.DNNRegressor
  • tf.contrib.learn.DNNLinearCombinedRegressor
  • tf.contrib.learn.DNNLinearCombinedClassifier
  • tf.contrib.learn.LinearClassifier
  • tf.contrib.learn.LinearRegressor
  • tf.contrib.learn.LogisticRegressor

Thus, rather than developing logistic regression from scratch, we will use an estimator from the TensorFlow contrib package. If you create your own estimator, the constructor accepts two high-level parameters for model configuration, model_fn and params:

nn = tf.contrib.learn.Estimator(model_fn=model_fn, params=model_params)

Note that the model_fn() function contains all the TensorFlow logic needed to support training, evaluation, and prediction; it is the only part you need to implement yourself.

Now, when the main() method is invoked, model_params, which contains the learning rate, is used to instantiate the Estimator. You can define model_params as follows:

model_params = {"learning_rate": LEARNING_RATE}
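
To make this concrete, here is a minimal, hedged sketch of what a custom model_fn() might look like, following the contrib Estimator interface; the toy single-layer linear model and the "results" prediction key are illustrative assumptions, not the logistic regression model we build below:

def model_fn(features, targets, mode, params):
    # A toy one-layer linear model; purely illustrative.
    output_layer = tf.contrib.layers.linear(features, 1)
    predictions = tf.reshape(output_layer, [-1])
    loss = tf.losses.mean_squared_error(targets, predictions)
    train_op = tf.contrib.layers.optimize_loss(
        loss=loss,
        global_step=tf.contrib.framework.get_global_step(),
        learning_rate=params["learning_rate"],
        optimizer="SGD")
    return tf.contrib.learn.ModelFnOps(
        mode=mode,
        predictions={"results": predictions},
        loss=loss,
        train_op=train_op)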

Note

For more information on the TensorFlow contrib, interested readers can refer to this URL at https://www.tensorflow.org/extend/estimators

Well, so far we have acquired enough background knowledge to create an LR model with TensorFlow on our dataset. It's time to implement it:

  1. Import required packages and modules:
    import os
    import shutil
    import random
    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    from feature import *
    import tensorflow as tf
    from tensorflow.contrib.learn.python.learn.estimators import estimator
    from tensorflow.contrib import learn
  2. Loading and preparing the dataset.

    At first, we load both the datasets:

    random.seed(12345) # For the reproducibility 
    train = pd.read_csv(os.path.join('input', 'train.csv'))
    test = pd.read_csv(os.path.join('input', 'test.csv'))

    Let's do some feature engineering. We will invoke the functions we defined in the feature engineering section; they are provided as a separate Python script named feature.py:

    train, test = create_name_feat(train, test)
    train, test = age_impute(train, test)
    train, test = cabin(train, test)
    train, test = embarked_impute(train, test)
    train, test = fam_size(train, test)
    test['Fare'].fillna(train['Fare'].mean(), inplace=True)
    train, test = ticket_grouped(train, test)

    Note that the sequence of the preceding invocations is important for keeping the training and test sets consistent. Now, we also need to create numerical values for the categorical variables using the dummies() function we defined earlier:

    train, test = dummies(train, test, columns=['Pclass', 'Sex', 'Embarked', 'Ticket_Letr', 'Cabin_Letter', 'Name_Title', 'Fam_Size'])

    We need to prepare the training and test set:

    TEST = True
    if TEST:
        train, test = train_test_split(train, test_size=0.25, random_state=10)
        train = train.sort_values('PassengerId')
        test = test.sort_values('PassengerId')
    
    x_train = train.iloc[:, 1:]
    x_test = test.iloc[:, 1:]

    We then convert the training and test set into a NumPy array since so far we have kept them in Pandas DataFrame format:

    x_train = np.array(x_train.iloc[:, 1:], dtype='float32')
    if TEST:
        x_test = np.array(x_test.iloc[:, 1:], dtype='float32')
    else:
        x_test = np.array(x_test, dtype='float32')

    Let's prepare the target column for prediction:

    y_train = PrepareTarget(train)

    We also need to know the feature count to build the LR estimator:

    feature_count = x_train.shape[1]
    
  3. Preparing the LR estimator.

    We build the LR estimator using the LinearClassifier estimator. Since this is a binary classification problem, we specify two classes:

    def build_lr_estimator(model_dir, feature_count):
        return estimator.SKCompat(learn.LinearClassifier(
            feature_columns=[tf.contrib.layers.real_valued_column("", dimension=feature_count)],
            n_classes=2, model_dir=model_dir))
  4. Training the model.

    Here, we train the above LR estimator for 1,000 steps. The fit() method does the training on the features, that is, x_train, and the labels, that is, y_train, and the predict() method computes predictions on the test features, x_test:

    print("Training...")
    try:
        shutil.rmtree('lr/')
    except OSError:
        pass
    lr = build_lr_estimator('lr/', feature_count)
    lr.fit(x_train, y_train, steps=1000)
    lr_pred = lr.predict(x_test)
    lr_pred = lr_pred['classes']
  5. Model evaluation.

    We will evaluate the model by examining several classification performance metrics, such as precision, recall, F1 score, and the confusion matrix:

    if TEST:
        target_names = ['Not Survived', 'Survived']
        print("Logistic Regression Report")
        print(classification_report(test['Survived'], lr_pred, target_names=target_names))
        print("Logistic Regression Confusion Matrix")
    
    >>>
    Logistic Regression Report
                  precision    recall  f1-score   support
    Not Survived       0.90      0.88      0.89       147
    Survived           0.78      0.80      0.79        76
    avg / total        0.86      0.86      0.86       223

    Since we trained the LR model with NumPy data, we wrap the resulting confusion matrix in a pandas DataFrame so that we can label its rows and columns:

    cm = confusion_matrix(test['Survived'], lr_pred)
    df_cm = pd.DataFrame(cm, index=[i for i in ['Not Survived', 'Survived']],
                         columns=[i for i in ['Not Survived', 'Survived']])
    print(df_cm)
    
    >>> 
    Logistic Regression Confusion Matrix
                  Not Survived  Survived
    Not Survived           130        17
    Survived               15         61

    Now, let's collect the predictions into a solution DataFrame and see the predicted counts:

    sol = pd.DataFrame()
    sol['PassengerId'] = test['PassengerId']
    sol['Survived'] = pd.Series(lr_pred.reshape(-1)).map({True:1, False:0}).values
    print("Predicted Counts")
    print(sol.Survived.value_counts())

    >>>
    Predicted Counts
    0    145
    1     78

    Since seeing the count graphically is helpful, let's draw it:

    plt.suptitle("Predicted Survived LR")
    count_plot = sns.countplot(sol.Survived)
    count_plot.get_figure().savefig("survived_count_lr_prd.png")
    
    >>>

    The output is as follows:


    Figure 14: Survival prediction using logistic regression with TensorFlow

So, the accuracy we achieved with the LR model is 86%, which is not bad at all. Still, it can be improved with better predictive models. In the next section, we will try to do that using a linear SVM for survival prediction.

Linear SVM for Survival Prediction

The linear SVM is one of the most widely used and standard methods for large-scale classification tasks. Both binary and multiclass classification problems can be solved with SVMs; the loss function in the formulation is given by the hinge loss:

L(w; x, y) := max{0, 1 - y * wᵀx}

Usually, linear SVMs are trained with L2 regularization. Eventually, the linear SVM algorithm outputs an SVM model that can be used to predict the label of unknown data.

Suppose you have an unknown data point, x. The SVM model makes predictions based on the value of wᵀx. The outcome can be either positive or negative. More specifically, if wᵀx ≥ 0, then the predicted value is positive; otherwise, it is negative.
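
As a quick numeric illustration (the weight vector, data point, and label below are made up), the decision rule and the hinge loss can be computed as follows:

import numpy as np

w = np.array([0.8, -0.5])             # hypothetical learned weight vector
x = np.array([1.0, 2.0])              # an unknown data point
y = -1                                # its true label, encoded as -1/+1

score = np.dot(w, x)                  # w^T x = -0.2
prediction = 1 if score >= 0 else -1  # predicted label: -1 here
hinge_loss = max(0.0, 1 - y * score)  # max{0, 1 - y * w^T x} = 0.8
print(score, prediction, hinge_loss)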

The current version of the TensorFlow contrib package supports only the linear SVM, and TensorFlow uses SDCAOptimizer for the underlying optimization. If you build an SVM model of your own, you need to consider performance and convergence tuning. Fortunately, you can pass the num_loss_partitions parameter to SDCAOptimizer, and you need to set it with respect to the number of concurrent train ops per worker.

If you set num_loss_partitions larger than or equal to this value, convergence is guaranteed, but the overall training becomes slower as num_loss_partitions increases. On the other hand, if you set it to a smaller value, the optimizer is more aggressive in reducing the global loss, but convergence is not guaranteed.
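
As a hedged sketch (the parameter values and the feature_columns list are illustrative assumptions only), num_loss_partitions and the regularization strength can be passed directly to the contrib SVM estimator, which forwards them to the underlying SDCAOptimizer:

svm_model = svm.SVM(example_id_column="PassengerId",
                    feature_columns=feature_columns,  # assumed: a list of real-valued columns, built as shown later
                    l2_regularization=1.0,            # illustrative value
                    num_loss_partitions=1,            # >= concurrent train ops per worker for guaranteed convergence
                    model_dir="svm/")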

Note

For more on the implemented contrib packages, interested readers should refer to this URL at https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/learn/python/learn/estimators.

Well, so far we have acquired enough background knowledge for creating an SVM model, now it's time to implement it:

  1. Import the required packages and modules:
    import os
    import shutil
    import random
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    from feature import *
    import tensorflow as tf
    from tensorflow.contrib.learn.python.learn.estimators import svm
  2. Dataset preparation for building SVM model:

    Now, the data preparation for building an SVM model is more or less the same as for the LR model, except that we need to convert PassengerId to a string, which is required by the SVM estimator:

    train['PassengerId'] = train['PassengerId'].astype(str)
    test['PassengerId'] = test['PassengerId'].astype(str)
  3. Creating a dictionary of continuous feature columns for the SVM.

    Note

    To feed the data to the SVM model, we further need to create a dictionary mapping from each continuous feature column name (k) to the values of that column stored in a constant Tensor. For more information on this issue, refer to this issue on TensorFlow GitHub repository at https://github.com/tensorflow/tensorflow/issues/9505.

    I have written two functions, one for the features and labels and one for the features only. Let's see what the first one looks like:

    def train_input_fn():
        continuous_cols = {k: tf.expand_dims(tf.constant(train[k].values), 1)
                           for k in list(train) if k not in ['Survived', 'PassengerId']}
        id_col = {'PassengerId' : tf.constant(train['PassengerId'].values)}
        feature_cols = continuous_cols.copy()
        feature_cols.update(id_col)
        label = tf.constant(train["Survived"].values)
        return feature_cols, label

    The preceding function creates a dictionary mapping for each continuous feature column and another for the PassengerId column, and then merges them into one. Since the Survived column is our target, I converted it into a constant tensor as the label. Finally, the function returns both the feature columns and the label.

    Now, the second method does almost the same trick except that it returns only the feature columns as follows:

    def predict_input_fn():
        continuous_cols = {k: tf.expand_dims(tf.constant(test[k].values), 1)
                           for k in list(test) if k not in ['Survived', 'PassengerId']}
        id_col = {'PassengerId' : tf.constant(test['PassengerId'].values)}
        feature_cols = continuous_cols.copy()
        feature_cols.update(id_col)
        return feature_cols
  4. Training the SVM model.

    Now we will train for 10,000 steps over the real-valued columns only. Finally, we create a prediction list containing all the predicted values:

    svm_model = svm.SVM(example_id_column="PassengerId",
                        feature_columns=[tf.contrib.layers.real_valued_column(k) for k in list(train)
                                         if k not in ['Survived', 'PassengerId']], 
                        model_dir="svm/")
    svm_model.fit(input_fn=train_input_fn, steps=10000)
    svm_pred = list(svm_model.predict_classes(input_fn=predict_input_fn))
  5. Evaluation of the model:
    target_names = ['Not Survived', 'Survived']
    print("SVM Report")
    print(classification_report(test['Survived'], svm_pred, target_names=target_names))
    >>>
    SVM Report
                  precision    recall  f1-score   support
    Not Survived       0.94      0.72      0.82       117
    Survived           0.63      0.92      0.75        62
    avg / total        0.84      0.79      0.79       179

    Thus, using the SVM, the accuracy is only 79%, which is lower than that of the LR model. Now, similar to the LR model, let's draw and observe the confusion matrix:

    print("SVM Confusion Matrix")
    cm = confusion_matrix(test['Survived'], svm_pred)
    df_cm = pd.DataFrame(cm, index=[i for i in ['Not Survived', 'Survived']],
                            columns=[i for i in ['Not Survived', 'Survived']])
    print(df_cm)
    >>> 
    SVM Confusion Matrix
                  Not Survived  Survived
    Not Survived            84        33
    Survived                 5        57

    Then, let's draw the count plot to see the ratio visually:

    sol = pd.DataFrame()
    sol['PassengerId'] = test['PassengerId']
    sol['Survived'] = pd.Series(svm_pred).values
    sns.plt.suptitle("Titanic Survival prediction using SVM with TensorFlow")
    count_plot = sns.countplot(sol.Survived)

    The output is as follows:


    Figure 15: Survival prediction using linear SVM with TensorFlow

    Now, the count:

    print("Predicted Counts")
    print(sol.Survived.value_counts())
    
    >>> 
    Predicted Counts
    1    90
    0    89

Ensemble Method for Survival Prediction – Random Forest

One of the most widely used machine learning techniques is ensemble methods, which are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions. In this section, we will mainly focus on random forest, which is built by combining hundreds of decision trees.
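
As a minimal, hedged sketch of the voting idea (the 0/1 predictions below are invented, not produced by real classifiers), an unweighted majority vote can be computed like this:

import numpy as np

# Invented 0/1 predictions from three classifiers for five passengers.
preds = np.array([[0, 1, 1, 0, 1],
                  [0, 1, 0, 0, 1],
                  [1, 1, 1, 0, 0]])

# Unweighted majority vote: predict survival when more than half of the classifiers agree.
ensemble = (preds.mean(axis=0) > 0.5).astype(int)
print(ensemble)  # [0 1 1 0 1]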

Decision trees (DTs) are a supervised learning technique for solving classification and regression tasks. A DT model learns simple decision rules inferred from the data features, using a tree-like graph to represent the course of action. Each branch of a decision tree represents a possible decision, occurrence, or reaction in terms of statistical probability:


Figure 16: A sample decision tree on the admission test dataset using the rattle package of R

Compared to LR or SVM, DTs are far more robust classification algorithms. The tree infers the predicted labels or classes by splitting the training data on the available features so as to produce a good generalization. Most interestingly, the algorithm can handle both binary and multiclass classification problems.

For instance, the decision tree in Figure 16 learns from admission data using a set of if...else decision rules. The dataset contains a record for each student who applied for admission, say, to an American university. Each record contains the Graduate Record Exam score, the CGPA, and the rank column. Now we have to predict which students are competent based on these three features (variables).

DTs can be utilized to solve this kind of problem after training the DT model and pruning unwanted branches of the tree. In general, a deeper tree implies more complex decision rules and a better-fitted model.
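
As a minimal, hedged sketch (synthetic data, not the admission dataset from the figure), scikit-learn's DecisionTreeClassifier shows how limiting the depth bounds the complexity of the learned rules:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 3)                             # three made-up features, e.g. GRE, CGPA, rank
y = (X[:, 0] + 0.5 * X[:, 1] > 0.9).astype(int)  # synthetic "competent" label

shallow = DecisionTreeClassifier(max_depth=2).fit(X, y)    # few, simple decision rules
deep = DecisionTreeClassifier(max_depth=None).fit(X, y)    # grows until the leaves are pure
print(shallow.score(X, y), deep.score(X, y))               # the deeper tree fits the training data at least as well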

Note

If you would like to draw the above figure, just use my R script and execute on RStudio and feed the admission dataset. The script and the dataset can be found in my GitHub repository at https://github.com/rezacsedu/AdmissionUsingDecisionTree.

Well, so far we have acquired enough background knowledge for creating a Random Forest (RF) model; now it's time to implement it.

  1. Import the required packages and modules:
    import os
    import shutil
    import random
    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    from feature import *
    import tensorflow as tf
    from tensorflow.contrib.learn.python.learn.estimators import estimator
    from tensorflow.contrib.tensor_forest.client import random_forest
    from tensorflow.contrib.tensor_forest.python import tensor_forest
  2. Dataset preparation for building an RF model.

    Now, the data preparation for building an RF model is more or less the same as an LR model. So please refer to the logistic regression section.

  3. Building a random forest estimator.

    The following function builds a random forest estimator. It creates 1,000 trees, each with a maximum of 1,000 nodes, and requires at least 10 samples to split a node. Since it's a binary classification problem, I set the number of classes to 2:

    def build_rf_estimator(model_dir, feature_count):
        params = tensor_forest.ForestHParams(
            num_classes=2,
            num_features=feature_count,
            num_trees=1000,
            max_nodes=1000,
            min_split_samples=10)
        graph_builder_class = tensor_forest.RandomForestGraphs
        return estimator.SKCompat(random_forest.TensorForestEstimator(
            params, graph_builder_class=graph_builder_class,
            model_dir=model_dir))
  4. Training the RF model.

    Here, we train the above RF estimator. The fit() method does the training on the features, that is, x_train, and the labels, that is, y_train, and the predict() method computes predictions on the test features, x_test:

    rf = build_rf_estimator('rf/', feature_count)
    rf.fit(x_train, y_train, batch_size=100)
    rf_pred = rf.predict(x_test)
    rf_pred = rf_pred['classes']
  5. Evaluating the model.

    Now let's evaluate the performance of the RF model:

    target_names = ['Not Survived', 'Survived']
    print("RandomForest Report")
    print(classification_report(test['Survived'], rf_pred, target_names=target_names))
    
    >>>
    RandomForest Report
                  precision    recall  f1-score   support
    Not Survived       0.92      0.85      0.88       117
    Survived           0.76      0.85      0.80        62
    avg / total        0.86      0.85      0.86       179

    Thus, using RF, the accuracy is about 86%, which is higher than that of the SVM model and comparable to the LR model. Well, similar to the LR and SVM models, we'll draw and observe the confusion matrix:

        print("Random Forest Confusion Matrix")
        cm = confusion_matrix(test['Survived'], rf_pred)
        df_cm = pd.DataFrame(cm, index=[i for i in ['Not Survived', 'Survived']],
                             columns=[i for i in ['Not Survived', 'Survived']])
        print(df_cm)
    >>> 
    Random Forest Confusion Matrix
                  Not Survived  Survived
    Not Survived           100        17
    Survived                 9        53

    Then, let's draw the count plot to see the ratio visually:

    sol = pd.DataFrame()
    sol['PassengerId'] = test['PassengerId']
    sol['Survived'] = pd.Series(rf_pred).values
    plt.suptitle("Titanic Survival prediction using RF with TensorFlow")
    count_plot = sns.countplot(sol.Survived)

    The output is as follows:


    Figure 17: Titanic survival prediction using random forest with TensorFlow

    Now, the count for each one:

    print("Predicted Counts")
    print(sol.Survived.value_counts())
    >>>  Predicted Counts
    -------------------------
    0   109
    1    70

A Comparative Analysis

From the classification reports, we can see that random forest has the best overall performance. The reason may be that it works better with categorical features than the other two methods. Also, since it uses implicit feature selection, overfitting is reduced significantly. Logistic regression provides convenient probability scores for its predictions; however, it does not perform well when the feature space is too large, that is, it does not handle a large number of categorical features/variables well, and it relies solely on transformations to capture non-linear effects.

Finally, using an SVM we can handle a large feature space with non-linear feature interactions without relying on the entire dataset. However, it does not perform very well with a large number of observations, and it can sometimes be tricky to find an appropriate kernel.
