In Lesson 1, From Data to Decisions – Getting Started with TensorFlow, we saw a minimal data analysis of the Titanic dataset. Now it's our turn to do some analytics on top of the data. Let's look at what kinds of people survived the disaster.
We have enough data, but how can we do the predictive modeling so that we can draw some fairly straightforward conclusions from it? For example, being a woman, being in first class, and being a child were all factors that could boost a passenger's chances of survival during this disaster.
Using a brute-force approach, such as if-else statements with some sort of weighted scoring system, you could write a program to predict whether a given passenger would survive the disaster. However, writing such a program in Python does not make much sense. It would be very tedious to write, difficult to generalize, and would require extensive fine-tuning for each variable and sample (that is, each passenger).
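To make the drawback concrete, here is a minimal sketch of such a hand-written scorer. The function name, the weights, and the threshold are all hypothetical guesses chosen for illustration; none of them come from the dataset.

```python
# Hypothetical hand-tuned rules: the weights and the threshold are
# illustrative guesses, not values derived from the Titanic data.
def rule_based_survival(sex, pclass, age):
    score = 0.0
    if sex == "female":
        score += 2.0
    if pclass == 1:
        score += 1.0
    elif pclass == 2:
        score += 0.5
    if age is not None and age < 12:
        score += 1.5
    # Every new variable needs another branch and another hand-picked weight.
    return 1 if score >= 2.0 else 0


predictions = [
    rule_based_survival("female", 1, 29.0),
    rule_based_survival("male", 3, 22.0),
    rule_based_survival("male", 2, 8.0),
]
print(predictions)  # [1, 0, 1]
```

Each additional feature multiplies the number of branches and weights to tune by hand, which is exactly the work a learned model takes over.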
At this point, you might be unsure about the basic difference between a classification and a regression problem. A regression algorithm is meant to produce continuous output, and its input can be either discrete or continuous. In contrast, a classification algorithm is meant to produce discrete output from discrete or continuous input. This distinction is important because discrete-valued outputs are handled better by classification, which will be discussed in upcoming sections.
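The difference is easiest to see in the type of value each kind of model returns. The two functions below are made-up stand-ins for trained models (the linear rule and the threshold are assumptions), shown only to contrast a continuous output with a discrete one.

```python
def regress_fare(pclass):
    # Regression-style output: a continuous number (made-up linear rule).
    return 100.0 - 30.0 * pclass


def classify_survival(score, threshold=0.5):
    # Classification-style output: one label from a fixed, discrete set.
    return 1 if score > threshold else 0


continuous_output = regress_fare(2)        # 40.0
discrete_output = classify_survival(0.73)  # 1
```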
In this section, we will see how we could develop several predictive models for Titanic survival prediction and do some analytics using them. In particular, we will discuss logistic regression, random forest, and linear SVM. We start with logistic regression. Then we go with SVM since the number of features is not that large. Finally, we will see how we could improve the performance using Random Forests. However, before diving in too deeply, a short exploratory analysis of the dataset is required.
We will see how the variables contribute to survival. First, we need to import the required packages:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import shutil
Now, let's load the data and check what the features available to us are:
train = pd.read_csv(os.path.join('input', 'train.csv'))
test = pd.read_csv(os.path.join('input', 'test.csv'))
print("Information about the data")
print(train.info())
>>>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
So, the training dataset has 12 columns and 891 rows altogether. Also, the Age, Cabin, and Embarked columns have null or missing values. We will take care of the null values in the feature engineering section, but for the time being, let's see how many have survived:
print("How many have survived?")
print(train.Survived.value_counts(normalize=True))
# Pass the column as x= so that the call also works with recent seaborn releases
count_plot = sns.countplot(x=train.Survived)
count_plot.get_figure().savefig("survived_count.png")
>>>
How many have survived?
0    0.616162
1    0.383838
So, approximately 62% of passengers died and only 38% managed to survive, as shown in the following figure:
Now, what is the relationship between a passenger's social standing and the rate of survival? The title in each name is a good proxy for it, so first we should see the counts for each title:
train['Name_Title'] = train['Name'].apply(lambda x: x.split(',')[1]).apply(lambda x: x.split()[0])
print('Title count')
print(train['Name_Title'].value_counts())
print('Survived by title')
print(train['Survived'].groupby(train['Name_Title']).mean())
>>>
Title count
Mr.          517
Miss.        182
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Mlle.          2
Col.           2
Major.         2
Sir.           1
Jonkheer.      1
Lady.          1
Capt.          1
the            1
Don.           1
Ms.            1
Mme.           1
As you may remember from the movie (Titanic, 1997), people from higher classes had better chances of surviving. So, you may assume that the title could be an important factor in survival, too. Another interesting observation is that people with longer names have a higher probability of survival. This is because most of the people with longer names are married ladies, whose husbands or family members probably helped them to survive:
train['Name_Len'] = train['Name'].apply(lambda x: len(x))
print('Survived by name length')
print(train['Survived'].groupby(pd.qcut(train['Name_Len'], 5)).mean())
>>>
Survived by name length
(11.999, 19.0]    0.220588
(19.0, 23.0]      0.301282
(23.0, 27.0]      0.319797
(27.0, 32.0]      0.442424
(32.0, 82.0]      0.674556
Women and children had a higher chance of surviving, since they were the first to be evacuated from the ship:
print('Survived by sex')
print(train['Survived'].groupby(train['Sex']).mean())
>>>
Survived by sex
Sex
female    0.742038
male      0.188908
Cabin has the most nulls (almost 700), but we can still extract information from it, such as the first letter of each cabin. We can see that most of the cabin letters are associated with a higher survival rate than that of passengers with no cabin recorded (the n group):
train['Cabin_Letter'] = train['Cabin'].apply(lambda x: str(x)[0])
print('Survived by Cabin_Letter')
print(train['Survived'].groupby(train['Cabin_Letter']).mean())
>>>
Survived by Cabin_Letter
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
T    0.000000
n    0.299854
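The n row in the preceding output is not a real cabin deck. A missing cabin is stored as a floating-point NaN, and str(x)[0] turns it into the first character of the string 'nan'. The short sketch below reproduces this with a plain Python list standing in for the Cabin column (the cabin values are made up):

```python
# A plain list stands in for the Cabin column; NaN marks a missing cabin.
cabins = ["C85", float("nan"), "E46", float("nan"), "G6"]

# Same expression as in the lambda above: stringify, then keep the first character.
cabin_letters = [str(c)[0] for c in cabins]
print(cabin_letters)  # ['C', 'n', 'E', 'n', 'G']
```

So the n category effectively acts as a "cabin unknown" flag, which is why it is still informative.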
Finally, it also seems that people who embarked at Cherbourg had a survival rate roughly 20 percentage points higher than those who embarked at the other locations. This is very likely due to the high percentage of upper-class passengers from that location:
print('Survived by Embarked')
print(train['Survived'].groupby(train['Embarked']).mean())
# Pass the column as x= so that the call also works with recent seaborn releases
count_plot = sns.countplot(x=train['Embarked'], hue=train['Pclass'])
count_plot.get_figure().savefig("survived_count_by_embarked.png")
>>>
Survived by Embarked
C    0.553571
Q    0.389610
S    0.336957
Graphically, the preceding result can be seen as follows:
Thus, several factors were important for people's survival. This means we need to consider them while developing our predictive models.
We will train several binary classifiers on the training set, since this is a binary classification problem with two classes, that is, 0 and 1, and will use the test set for making survival predictions.
But, before we even do that, let's do some feature engineering, since you have seen that there are some missing or null values. We will either impute them or drop the entries from the training and test sets. Moreover, we cannot use our datasets directly, but need to prepare them so that they can be fed to our machine learning models.
Since we are considering the length of the passenger's name as an important feature, it is better to remove the name itself, compute the corresponding length, and extract only the title:
def create_name_feat(train, test):
    for i in [train, test]:
        i['Name_Len'] = i['Name'].apply(lambda x: len(x))
        i['Name_Title'] = i['Name'].apply(lambda x: x.split(',')[1]).apply(lambda x: x.split()[0])
        del i['Name']
    return train, test
There are 177 null values for Age, and those passengers have a 10% lower survival rate than the non-nulls. Therefore, before imputing values for the nulls, we include an Age_Null_Flag column, just to make sure we can account for this characteristic of the data:
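def age_impute(train, test):
    for i in [train, test]:
        i['Age_Null_Flag'] = i['Age'].apply(lambda x: 1 if pd.isnull(x) else 0)
        # Group the dataset being processed (not always train) by title and class,
        # so that the imputed ages stay aligned with that dataset's own rows
        data = i.groupby(['Name_Title', 'Pclass'])['Age']
        i['Age'] = data.transform(lambda x: x.fillna(x.mean()))
    return train, test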
We are imputing the null age values with the mean age of the passengers who share the same title and class. This will add some extra bias to the dataset. But, for the betterment of our predictive model, we will have to sacrifice something.
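The age_impute() function fills each missing age with the mean age of the group defined by title and class. To see what that means without pandas, here is a small dependency-free sketch of the same idea. The five rows are made-up sample data, and None plays the role of a null value:

```python
from collections import defaultdict

# Made-up rows; None marks a missing age.
rows = [
    {"Name_Title": "Mr.", "Pclass": 3, "Age": 20.0},
    {"Name_Title": "Mr.", "Pclass": 3, "Age": None},
    {"Name_Title": "Mr.", "Pclass": 3, "Age": 30.0},
    {"Name_Title": "Miss.", "Pclass": 1, "Age": 18.0},
    {"Name_Title": "Miss.", "Pclass": 1, "Age": None},
]

# Collect the known ages of every (title, class) group and average them.
known = defaultdict(list)
for r in rows:
    if r["Age"] is not None:
        known[(r["Name_Title"], r["Pclass"])].append(r["Age"])
group_mean = {key: sum(ages) / len(ages) for key, ages in known.items()}

# Flag the nulls first, then replace them with the mean of their group.
for r in rows:
    r["Age_Null_Flag"] = 1 if r["Age"] is None else 0
    if r["Age"] is None:
        r["Age"] = group_mean[(r["Name_Title"], r["Pclass"])]

ages = [r["Age"] for r in rows]
flags = [r["Age_Null_Flag"] for r in rows]
print(ages)   # [20.0, 25.0, 30.0, 18.0, 18.0]
print(flags)  # [0, 1, 0, 0, 1]
```

Setting the flag before filling matters: once the age is imputed, the information that it was originally missing would otherwise be lost.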
Then we combine the SibSp and Parch columns to get the family size and break it into three levels:
def fam_size(train, test):
    for i in [train, test]:
        i['Fam_Size'] = np.where((i['SibSp'] + i['Parch']) == 0, 'One',
                                 np.where((i['SibSp'] + i['Parch']) <= 3, 'Small', 'Big'))
        del i['SibSp']
        del i['Parch']
    return train, test
We use the Ticket column to create Ticket_Letr, which indicates the first letter of each ticket, and Ticket_Len, which indicates the length of the Ticket field:
def ticket_grouped(train, test):
    for i in [train, test]:
        i['Ticket_Letr'] = i['Ticket'].apply(lambda x: str(x)[0])
        i['Ticket_Letr'] = i['Ticket_Letr'].apply(lambda x: str(x))
        i['Ticket_Letr'] = np.where(
            (i['Ticket_Letr']).isin(['1', '2', '3', 'S', 'P', 'C', 'A']),
            i['Ticket_Letr'],
            np.where((i['Ticket_Letr']).isin(['W', '4', '7', '6', 'L', '5', '8']),
                     'Low_ticket', 'Other_ticket'))
        i['Ticket_Len'] = i['Ticket'].apply(lambda x: len(x))
        del i['Ticket']
    return train, test
We also need to extract the first letter of the Cabin column:
def cabin(train, test):
    for i in [train, test]:
        i['Cabin_Letter'] = i['Cabin'].apply(lambda x: str(x)[0])
        del i['Cabin']
    return train, test
Fill the null values in the Embarked column with the most commonly occurring value, which is 'S':
def embarked_impute(train, test):
    for i in [train, test]:
        i['Embarked'] = i['Embarked'].fillna('S')
    return train, test
We now need to convert our categorical columns, because the predictive models that we will be creating need numerical values in place of string variables. The dummies() function below applies one-hot encoding to the string variables:
def dummies(train, test, columns=['Pclass', 'Sex', 'Embarked', 'Ticket_Letr',
                                  'Cabin_Letter', 'Name_Title', 'Fam_Size']):
    for column in columns:
        train[column] = train[column].apply(lambda x: str(x))
        test[column] = test[column].apply(lambda x: str(x))
        # Keep only the categories that occur in both the training and the test set
        good_cols = [column + '_' + i for i in train[column].unique()
                     if i in test[column].unique()]
        train = pd.concat((train, pd.get_dummies(train[column], prefix=column)[good_cols]), axis=1)
        test = pd.concat((test, pd.get_dummies(test[column], prefix=column)[good_cols]), axis=1)
        del train[column]
        del test[column]
    return train, test
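A detail of dummies() that is easy to miss is that it keeps only the categories that occur in both the training and the test set, so both end up with identical columns. The following dependency-free sketch illustrates that idea on a single column. The helper name one_hot_common and the sample values are hypothetical and are not part of feature.py:

```python
def one_hot_common(train_values, test_values, prefix):
    # Keep only categories present in both sets, in order of first
    # appearance in the training values.
    common = []
    for value in train_values:
        if value in test_values and value not in common:
            common.append(value)
    names = [prefix + "_" + value for value in common]

    def encode(values):
        # One row per value, one 0/1 column per common category.
        return [[1 if value == c else 0 for c in common] for value in values]

    return names, encode(train_values), encode(test_values)


names, train_enc, test_enc = one_hot_common(
    ["S", "C", "Q", "S"], ["C", "S", "C"], "Embarked")
print(names)      # ['Embarked_S', 'Embarked_C']
print(train_enc)  # [[1, 0], [0, 1], [0, 0], [1, 0]]
print(test_enc)   # [[0, 1], [1, 0], [0, 1]]
```

Note that the training row with the value Q is encoded as all zeros, because Q never occurs in the test values of this toy example.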
Now that we have the numerical features, we finally need to create a separate array for the target values (labels):
def PrepareTarget(data):
    return np.array(data.Survived, dtype='int8').reshape(-1, 1)
We have seen the data and its characteristics and done some feature engineering to construct the best features for the linear models. The next task is to build the predictive models and make a prediction on the test set. Let's start with the logistic regression.
Logistic regression is one of the most widely used classifiers to predict a binary response. It is a linear machine learning method whose loss function is given by the logistic loss, where $\mathbf{w}$ is the weight vector, $\mathbf{x}$ is a data point, and $y \in \{-1, +1\}$ is its label:
$$L(\mathbf{w}; \mathbf{x}, y) = \log\left(1 + \exp\left(-y\,\mathbf{w}^T\mathbf{x}\right)\right)$$
For a binary classification problem, the algorithm outputs a binary logistic regression model such that, for a given new data point, denoted by $\mathbf{x}$, the model makes predictions by applying the logistic function:
$$f(z) = \frac{1}{1 + e^{-z}}$$
In the preceding equation, $z = \mathbf{w}^T\mathbf{x}$, and if $f(\mathbf{w}^T\mathbf{x}) > 0.5$, the outcome is positive; otherwise, it is negative. Note that the raw output of the logistic regression model, $f(z)$, has a probabilistic interpretation.
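The following is a minimal plain-Python sketch of these two ingredients, assuming the standard formulation in which the labels are -1 or +1, the score is z = w·x, and the decision threshold is 0.5. The weights and the data point are made-up numbers:

```python
import math


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def logistic(z):
    # Squashes any real-valued score into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def logistic_loss(w, x, y):
    # y is assumed to be -1 or +1 in this formulation.
    return math.log(1.0 + math.exp(-y * dot(w, x)))


def predict(w, x, threshold=0.5):
    # Positive (1) when the logistic output exceeds the threshold.
    return 1 if logistic(dot(w, x)) > threshold else 0


w = [0.8, -0.5]  # made-up weights
x = [2.0, 1.0]   # made-up data point, so z = 1.1
print(predict(w, x))                                       # 1
print(logistic_loss(w, x, 1) < logistic_loss(w, x, -1))    # True
```

The loss is small when the sign of the score agrees with the label and grows when it disagrees, which is what training tries to minimize.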
Well, if you now compare logistic regression with its predecessor, linear regression, the former gives you higher accuracy on classification tasks. Moreover, it offers a flexible way to regularize a model, and the model responses are measures of probability. Most importantly, whereas linear regression can predict only continuous values, logistic regression can be used to predict discrete values. From now on, we will often be using the TensorFlow contrib API. So let's have a quick look at it.
The contrib package provides a high-level API for learning with TensorFlow. Its tf.contrib.learn module exposes the following estimators and related classes:
tf.contrib.learn.BaseEstimator
tf.contrib.learn.Estimator
tf.contrib.learn.Trainable
tf.contrib.learn.Evaluable
tf.contrib.learn.KMeansClustering
tf.contrib.learn.ModeKeys
tf.contrib.learn.ModelFnOps
tf.contrib.learn.MetricSpec
tf.contrib.learn.PredictionKey
tf.contrib.learn.DNNClassifier
tf.contrib.learn.DNNRegressor
tf.contrib.learn.DNNLinearCombinedRegressor
tf.contrib.learn.DNNLinearCombinedClassifier
tf.contrib.learn.LinearClassifier
tf.contrib.learn.LinearRegressor
tf.contrib.learn.LogisticRegressor
Thus, instead of developing logistic regression from scratch, we will use an estimator from the TensorFlow contrib package. Even when we create our own estimator from scratch, the constructor accepts just two high-level parameters for model configuration, model_fn and params:
nn = tf.contrib.learn.Estimator(model_fn=model_fn, params=model_params)
Note that the model_fn() function contains all the TensorFlow logic needed to support training, evaluation, and prediction. Thus, you only need to implement the functionality that uses it.
Now, upon invoking the main() method, model_params, which contains the learning rate, is used to instantiate the Estimator. You can define model_params as follows:
model_params = {"learning_rate": LEARNING_RATE}
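To get a feel for how a model function and a parameter dictionary work together, consider the following framework-free analogy. It is not TensorFlow code: TinyEstimator and the signature of this model_fn are hypothetical and much simpler than the real contrib API. It only shows the division of labor, where the estimator drives the loop and the model function holds the model logic.

```python
# Not TensorFlow code: TinyEstimator is a hypothetical stand-in that only
# illustrates how a model function and a params dictionary fit together.
class TinyEstimator:
    def __init__(self, model_fn, params):
        self.model_fn = model_fn
        self.params = params
        self.weight = 0.0

    def fit(self, xs, ys, steps):
        # The estimator owns the training loop; the model logic lives in model_fn.
        for _ in range(steps):
            for x, y in zip(xs, ys):
                self.weight = self.model_fn(x, y, "train", self.weight, self.params)
        return self

    def predict(self, xs):
        return [self.model_fn(x, None, "infer", self.weight, self.params) for x in xs]


def model_fn(x, y, mode, weight, params):
    prediction = weight * x
    if mode == "infer":
        return prediction
    # One gradient-descent step on the squared error 0.5 * (prediction - y) ** 2.
    gradient = (prediction - y) * x
    return weight - params["learning_rate"] * gradient


model_params = {"learning_rate": 0.1}
nn = TinyEstimator(model_fn=model_fn, params=model_params)
nn.fit([1.0, 2.0], [2.0, 4.0], steps=50)
print(round(nn.weight, 3))  # 2.0
```

Changing the learning rate only requires a different params dictionary; neither the estimator nor the model function has to be edited.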
For more information on the TensorFlow contrib, interested readers can refer to this URL at https://www.tensorflow.org/extend/estimators
Well, so far we have acquired enough background knowledge to create an LR model with TensorFlow with our dataset. It's time to implement it:
import os
import shutil
import random
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from feature import *
import tensorflow as tf
from tensorflow.contrib.learn.python.learn.estimators import estimator
from tensorflow.contrib import learn
At first, we load both the datasets:
random.seed(12345)  # For reproducibility
train = pd.read_csv(os.path.join('input', 'train.csv'))
test = pd.read_csv(os.path.join('input', 'test.csv'))
Let's do some feature engineering. We will invoke the functions we defined in the feature engineering section, which are provided as a separate Python script named feature.py:
train, test = create_name_feat(train, test)
train, test = age_impute(train, test)
train, test = cabin(train, test)
train, test = embarked_impute(train, test)
train, test = fam_size(train, test)
# Fill missing test fares with the mean fare of the training set
test['Fare'] = test['Fare'].fillna(train['Fare'].mean())
train, test = ticket_grouped(train, test)
Note that the order of the above invocations is important to keep the training and test sets consistent. Now, we also need to create numerical values for the categorical variables using the dummies() function that we defined earlier:
train, test = dummies(train, test, columns=['Pclass', 'Sex', 'Embarked', 'Ticket_Letr', 'Cabin_Letter', 'Name_Title', 'Fam_Size'])
We need to prepare the training and test set:
TEST = True
if TEST:
    train, test = train_test_split(train, test_size=0.25, random_state=10)
    train = train.sort_values('PassengerId')
    test = test.sort_values('PassengerId')
# Drop the first column (PassengerId)
x_train = train.iloc[:, 1:]
x_test = test.iloc[:, 1:]
We then convert the training and test sets into NumPy arrays, since so far we have kept them in pandas DataFrame format:
x_train = np.array(x_train.iloc[:, 1:], dtype='float32')
if TEST:
    x_test = np.array(x_test.iloc[:, 1:], dtype='float32')
else:
    x_test = np.array(x_test, dtype='float32')
Let's prepare the target column for prediction:
y_train = PrepareTarget(train)
We also need to know the feature count to build the LR estimator:
feature_count = x_train.shape[1]
We build the LR estimator. We will utilize the LinearClassifier estimator for it. Since this is a binary classification problem, we provide two classes:
def build_lr_estimator(model_dir, feature_count):
    return estimator.SKCompat(learn.LinearClassifier(
        feature_columns=[tf.contrib.layers.real_valued_column("", dimension=feature_count)],
        n_classes=2,
        model_dir=model_dir))
Here, we train the above LR estimator for 1,000 iterations. The fit() method trains the model on the training features, that is, x_train, and the labels, that is, y_train, and the predict() method then computes the predictions for the test features, that is, x_test:
print("Training...")
try:
    shutil.rmtree('lr/')
except OSError:
    pass
lr = build_lr_estimator('lr/', feature_count)
lr.fit(x_train, y_train, steps=1000)
lr_pred = lr.predict(x_test)
lr_pred = lr_pred['classes']
We will evaluate the model by looking at several classification performance metrics, such as precision, recall, f1 score, and the confusion matrix:
if TEST:
    target_names = ['Not Survived', 'Survived']
    print("Logistic Regression Report")
    print(classification_report(test['Survived'], lr_pred, target_names=target_names))
    print("Logistic Regression Confusion Matrix")
>>>
Logistic Regression Report
              precision    recall  f1-score   support
Not Survived       0.90      0.88      0.89       147
    Survived       0.78      0.80      0.79        76
---------------------------------------------------------
 avg / total       0.86      0.86      0.86       223
Since we trained the LR model with NumPy data, we now need to convert the result back to a pandas DataFrame for confusion matrix creation:
cm = confusion_matrix(test['Survived'], lr_pred)
df_cm = pd.DataFrame(cm, index=['Not Survived', 'Survived'],
                     columns=['Not Survived', 'Survived'])
print(df_cm)
>>>
Logistic Regression Confusion Matrix
              Not Survived  Survived
Not Survived           130        17
Survived                15        61
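The numbers in the classification report can be recomputed by hand from this confusion matrix, which is a useful sanity check. The sketch below uses the four counts printed above and the usual definitions of precision, recall, and f1 score. Rows are the actual classes and columns are the predicted ones:

```python
# Counts taken from the confusion matrix printed above
# (rows: actual class, columns: predicted class).
cm = [[130, 17],   # actual Not Survived
      [15, 61]]    # actual Survived


def precision(cm, k):
    # Of everything predicted as class k, how much really is class k?
    return cm[k][k] / sum(row[k] for row in cm)


def recall(cm, k):
    # Of everything that really is class k, how much was found?
    return cm[k][k] / sum(cm[k])


def f1(cm, k):
    p, r = precision(cm, k), recall(cm, k)
    return 2 * p * r / (p + r)


accuracy = (cm[0][0] + cm[1][1]) / sum(sum(row) for row in cm)

for k, name in enumerate(["Not Survived", "Survived"]):
    print(name, round(precision(cm, k), 2), round(recall(cm, k), 2), round(f1(cm, k), 2))
print("accuracy", round(accuracy, 2))
```

The rounded values match the report: 0.90, 0.88, and 0.89 for Not Survived, 0.78, 0.80, and 0.79 for Survived, and an accuracy of 0.86.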
Now, let's collect the predictions in a DataFrame and see the count:
sol = pd.DataFrame()
sol['PassengerId'] = test['PassengerId']
sol['Survived'] = pd.Series(lr_pred.reshape(-1)).map({True: 1, False: 0}).values
print("Predicted Counts")
print(sol.Survived.value_counts())
>>>
Predicted Counts
0    145
1     78
Since seeing the count graphically is helpful, let's draw it:
# sns.plt is not available in recent seaborn releases, so use matplotlib's plt directly
plt.suptitle("Predicted Survived LR")
count_plot = sns.countplot(x=sol.Survived)
count_plot.get_figure().savefig("survived_count_lr_prd.png")
The output is as follows:
So, the accuracy we achieved with the LR model is 86% which is not that bad at all. But it can still be improved with better predictive models. In the next section, we will try to do that using linear SVM for survival prediction.
The linear SVM is one of the most widely used and standard methods for large-scale classification tasks. Both multiclass and binary classification problems can be solved using SVM, with the loss function given by the hinge loss, where $\mathbf{w}$ is the weight vector, $\mathbf{x}$ is a data point, and $y \in \{-1, +1\}$ is its label:
$$L(\mathbf{w}; \mathbf{x}, y) = \max\left\{0,\ 1 - y\,\mathbf{w}^T\mathbf{x}\right\}$$
Usually, linear SVMs are trained with L2 regularization. Eventually, the linear SVM algorithm outputs an SVM model that can be used to predict the label of unknown data.
Suppose you have an unknown data point, $\mathbf{x}$. The SVM model makes predictions based on the value of $\mathbf{w}^T\mathbf{x}$. The outcome can be either positive or negative. More specifically, if $\mathbf{w}^T\mathbf{x} \geq 0$, then the predicted value is positive; otherwise, it is negative.
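As a quick illustration, the hinge loss and the sign-based prediction can be sketched in a few lines of plain Python. This assumes the standard formulation with labels -1 and +1 and a decision boundary at zero. The weights and data points are made-up numbers:

```python
def margin(w, x):
    # Raw SVM score: the dot product of weights and features.
    return sum(wi * xi for wi, xi in zip(w, x))


def hinge_loss(w, x, y):
    # y is assumed to be -1 or +1; the loss is zero once y * score reaches 1.
    return max(0.0, 1.0 - y * margin(w, x))


def svm_predict(w, x):
    # Positive when the score is non-negative, negative otherwise.
    return 1 if margin(w, x) >= 0 else -1


w = [1.0, -2.0]  # made-up weights
print(hinge_loss(w, [3.0, 0.5], 1))   # score 2.0  -> loss 0.0
print(hinge_loss(w, [1.0, 0.25], 1))  # score 0.5  -> loss 0.5
print(hinge_loss(w, [0.0, 1.0], 1))   # score -2.0 -> loss 3.0
print(svm_predict(w, [0.0, 1.0]))     # -1
```

The second point is classified correctly but still incurs a loss, because its score lies inside the margin. This is the main difference from simply counting misclassifications.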
The current version of the TensorFlow contrib package supports only the linear SVM. TensorFlow uses SDCAOptimizer for the underlying optimization. If you want to build an SVM model of your own, you need to consider performance and convergence tuning issues. Fortunately, you can pass the num_loss_partitions parameter to SDCAOptimizer. According to the TensorFlow documentation, its reference value is the number of concurrent train ops per worker multiplied by the number of workers.
If you set num_loss_partitions larger than or equal to this value, convergence is guaranteed, but the overall training becomes slower as num_loss_partitions increases. On the other hand, if you set it to a smaller value, the optimizer is more aggressive in reducing the global loss, but convergence is not guaranteed.
For more on the implemented contrib packages, interested readers should refer to this URL at https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/learn/python/learn/estimators.
Well, so far we have acquired enough background knowledge for creating an SVM model. Now it's time to implement it:
import os
import shutil
import random
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from feature import *
import tensorflow as tf
from tensorflow.contrib.learn.python.learn.estimators import svm
Now, the data preparation for building an SVM model is more or less the same as for an LR model, except that we need to convert PassengerId to a string, which is required for the SVM:
train['PassengerId'] = train['PassengerId'].astype(str)
test['PassengerId'] = test['PassengerId'].astype(str)
To feed the data to the SVM model, we further need to create a dictionary mapping from each continuous feature column name (k) to the values of that column stored in a constant Tensor. For more information, refer to the related discussion in the TensorFlow GitHub repository at https://github.com/tensorflow/tensorflow/issues/9505.
I have written two input functions, one for training and one for prediction. Let's see what the first one looks like:
def train_input_fn():
    continuous_cols = {k: tf.expand_dims(tf.constant(train[k].values), 1)
                       for k in list(train) if k not in ['Survived', 'PassengerId']}
    id_col = {'PassengerId': tf.constant(train['PassengerId'].values)}
    feature_cols = continuous_cols.copy()
    feature_cols.update(id_col)
    label = tf.constant(train["Survived"].values)
    return feature_cols, label
The preceding function creates a dictionary mapping for each continuous feature column and then another for the PassengerId column. Then I merged them into one. Since we want to use the Survived column as the labels, I converted the label column into a constant tensor. Finally, the function returns both the feature columns and the label.
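If the shape of the returned dictionary is hard to picture, the following sketch builds the same structure with plain Python lists in place of tensors. The function toy_input_fn and the two records are hypothetical. The nested single-element lists mimic what tf.expand_dims(..., 1) does to a one-dimensional tensor, namely turning it into a column vector:

```python
# Two made-up passengers; PassengerId is already a string, as the SVM requires.
records = [
    {"PassengerId": "1", "Survived": 0, "Age": 22.0, "Fare": 7.25},
    {"PassengerId": "2", "Survived": 1, "Age": 38.0, "Fare": 71.28},
]


def toy_input_fn(records):
    skip = {"Survived", "PassengerId"}
    columns = [k for k in records[0] if k not in skip]
    # Each continuous column becomes a column vector: one single-element
    # row per passenger, giving a shape of (n, 1) instead of (n,).
    feature_cols = {k: [[r[k]] for r in records] for k in columns}
    # The example ID column stays one-dimensional.
    feature_cols["PassengerId"] = [r["PassengerId"] for r in records]
    labels = [r["Survived"] for r in records]
    return feature_cols, labels


feature_cols, labels = toy_input_fn(records)
print(feature_cols["Age"])          # [[22.0], [38.0]]
print(feature_cols["PassengerId"])  # ['1', '2']
print(labels)                       # [0, 1]
```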
Now, the second method does almost the same trick except that it returns only the feature columns as follows:
def predict_input_fn():
    continuous_cols = {k: tf.expand_dims(tf.constant(test[k].values), 1)
                       for k in list(test) if k not in ['Survived', 'PassengerId']}
    id_col = {'PassengerId': tf.constant(test['PassengerId'].values)}
    feature_cols = continuous_cols.copy()
    feature_cols.update(id_col)
    return feature_cols
Now we will run the training for 10,000 steps over the real-valued columns only. Finally, we create a prediction list containing all the predicted values and print the classification report:
svm_model = svm.SVM(
    example_id_column="PassengerId",
    feature_columns=[tf.contrib.layers.real_valued_column(k) for k in list(train)
                     if k not in ['Survived', 'PassengerId']],
    model_dir="svm/")
svm_model.fit(input_fn=train_input_fn, steps=10000)
svm_pred = list(svm_model.predict_classes(input_fn=predict_input_fn))
target_names = ['Not Survived', 'Survived']
print("SVM Report")
print(classification_report(test['Survived'], svm_pred, target_names=target_names))
>>>
SVM Report
              precision    recall  f1-score   support
Not Survived       0.94      0.72      0.82       117
    Survived       0.63      0.92      0.75        62
--------------------------------------------------------
 avg / total       0.84      0.79      0.79       179
Thus, using SVM, the accuracy is only 79%, which is lower than that of the LR model. As with the LR model, let's draw and observe the confusion matrix:
print("SVM Confusion Matrix")
cm = confusion_matrix(test['Survived'], svm_pred)
df_cm = pd.DataFrame(cm, index=['Not Survived', 'Survived'],
                     columns=['Not Survived', 'Survived'])
print(df_cm)
>>>
SVM Confusion Matrix
              Not Survived  Survived
Not Survived            84        33
Survived                 5        57
Then, let's draw the count plot to see the ratio visually:
sol = pd.DataFrame()
sol['PassengerId'] = test['PassengerId']
sol['Survived'] = pd.Series(svm_pred).values
# sns.plt is not available in recent seaborn releases, so use matplotlib's plt directly
plt.suptitle("Titanic Survival prediction using SVM with TensorFlow")
count_plot = sns.countplot(x=sol.Survived)
The output is as follows:
Now, the count:
print("Predicted Counts")
print(sol.Survived.value_counts())
>>>
Predicted Counts
1    90
0    89
Ensemble methods are among the most widely used machine learning techniques. They are learning algorithms that construct a set of classifiers, which can then be used to classify new data points by taking a weighted vote of their predictions. In this section, we will mainly focus on the random forest, which can be built by combining hundreds of decision trees.
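The weighted vote itself is simple to sketch. In the following example, each classifier contributes its predicted label together with a weight, and the label with the largest total weight wins. The function name and the weights are illustrative assumptions. A random forest typically gives every tree the same weight, which corresponds to the first and last calls:

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    # Sum the weights behind each predicted label and return the heaviest label.
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


print(weighted_vote([1, 0, 1], [1.0, 1.0, 1.0]))  # 1 (simple majority)
print(weighted_vote([1, 0, 0], [3.0, 1.0, 1.0]))  # 1 (one strong voter wins)
print(weighted_vote([1, 0, 0], [1.0, 1.0, 1.0]))  # 0
```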
Decision trees (DTs) are a supervised learning technique for solving classification and regression tasks. A DT model learns simple decision rules that are inferred from the data features, using a tree-like graph to represent the course of action. Each branch of a decision tree represents a possible decision, occurrence, or reaction in terms of statistical probability:
Compared to LR or SVM, DTs are far more robust classification algorithms. The tree infers predicted labels or classes by splitting the training data on the available features so as to produce good generalization. Most interestingly, the algorithm can handle both binary and multiclass classification problems.
For instance, the decision tree in figure 16 learns a set of if...else decision rules from the admission data. The dataset contains the record of each student who applied for admission, say to an American university. Each record contains the graduate record exam score, the CGPA score, and the rank. Now we will have to predict who is competent based on these three features (variables).
DTs can be utilized to solve this kind of problem after training the DT model and pruning the unwanted branches of the tree. In general, the deeper the tree, the more complex the decision rules and the more closely fitted the model.
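To show how a single if...else rule can be learned from data, the sketch below finds the best threshold for one feature, which amounts to a decision tree of depth one. It assumes Gini impurity as the splitting criterion, which is a common choice but is not specified in the text. The exam scores and admission labels are made up:

```python
def gini(labels):
    # Gini impurity of a set of binary labels (0 means pure).
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    # Try every observed value as a threshold and keep the one with the
    # lowest size-weighted impurity of the two resulting groups.
    best = None
    for threshold in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        if not right:
            continue  # a split must leave samples on both sides
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or weighted < best[1]:
            best = (threshold, weighted)
    return best


# Made-up admission data: exam score and whether the student was admitted.
gre = [290, 300, 310, 320, 330, 340]
admitted = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(gre, admitted)
print(threshold, impurity)  # 310 0.0
```

The learned rule reads "if the score is at most 310, predict not admitted, else predict admitted". A deeper tree repeats this search inside each of the two groups, which is how the rules become more complex with depth.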
If you would like to draw the above figure, just run my R script in RStudio and feed it the admission dataset. The script and the dataset can be found in my GitHub repository at https://github.com/rezacsedu/AdmissionUsingDecisionTree.
Well, so far we have acquired enough background knowledge for creating a Random Forest (RF) model. Now it's time to implement it:
import os
import shutil
import random
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from feature import *
import tensorflow as tf
from tensorflow.contrib.learn.python.learn.estimators import estimator
from tensorflow.contrib.tensor_forest.client import random_forest
from tensorflow.contrib.tensor_forest.python import tensor_forest
Now, the data preparation for building an RF model is more or less the same as an LR model. So please refer to the logistic regression section.
The following function builds a random forest estimator. It creates 1,000 trees with a maximum of 1,000 nodes and a minimum of 10 samples required to split a node. Since it's a binary classification problem, I set the number of classes to 2:
def build_rf_estimator(model_dir, feature_count):
    params = tensor_forest.ForestHParams(
        num_classes=2,
        num_features=feature_count,
        num_trees=1000,
        max_nodes=1000,
        min_split_samples=10)
    graph_builder_class = tensor_forest.RandomForestGraphs
    return estimator.SKCompat(random_forest.TensorForestEstimator(
        params, graph_builder_class=graph_builder_class, model_dir=model_dir))
Here, we train the above RF estimator. The fit() method trains the model on the training features, that is, x_train, and the labels, that is, y_train, and the predict() method then computes the predictions for the test features, that is, x_test:
rf = build_rf_estimator('rf/', feature_count)
rf.fit(x_train, y_train, batch_size=100)
rf_pred = rf.predict(x_test)
rf_pred = rf_pred['classes']
Now let's evaluate the performance of the RF model:
target_names = ['Not Survived', 'Survived']
print("RandomForest Report")
print(classification_report(test['Survived'], rf_pred, target_names=target_names))
>>>
RandomForest Report
              precision    recall  f1-score   support
------------------------------------------------------
Not Survived       0.92      0.85      0.88       117
    Survived       0.76      0.85      0.80        62
------------------------------------------------------
 avg / total       0.86      0.85      0.86       179
Thus, using RF, the accuracy is about 85% (153 of the 179 test passengers are classified correctly), which is clearly higher than that of the SVM model and comparable to that of the LR model. As with the LR and SVM models, we'll draw and observe the confusion matrix:
print("Random Forest Confusion Matrix")
cm = confusion_matrix(test['Survived'], rf_pred)
df_cm = pd.DataFrame(cm, index=['Not Survived', 'Survived'],
                     columns=['Not Survived', 'Survived'])
print(df_cm)
>>>
Random Forest Confusion Matrix
              Not Survived  Survived
-----------------------------------------------------
Not Survived           100        17
Survived                 9        53
Then, let's draw the count plot to see the ratio visually:
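sol = pd.DataFrame()
sol['PassengerId'] = test['PassengerId']
# Use the random forest predictions here (not svm_pred)
sol['Survived'] = pd.Series(rf_pred.reshape(-1)).values
# sns.plt is not available in recent seaborn releases, so use matplotlib's plt directly
plt.suptitle("Titanic Survival prediction using RF with TensorFlow")
count_plot = sns.countplot(x=sol.Survived)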
The output is as follows:
Now, the count for each one:
print("Predicted Counts")
print(sol.Survived.value_counts())
>>>
Predicted Counts
-------------------------
0    109
1     70
From the classification reports, we can see that random forest has the best overall performance. The reason for this may be that it works better with categorical features than the other two methods. Also, since it uses implicit feature selection, overfitting was reduced significantly. Logistic regression provides a convenient probability score for observations. However, it doesn't perform well when the feature space is too large, that is, it doesn't handle a large number of categorical features/variables well. It also relies solely on transformations for non-linear features.
Finally, using SVM we can handle a large feature space with non-linear feature interactions without relying on the entire dataset. However, it does not work very well with a large number of observations. Moreover, it can sometimes be tricky to find an appropriate kernel.