Data analysis – supervised machine learning

The purpose of this analysis is to predict the survivors. So, the outcome will be survived or not, which is a binary classification problem; in it, you have only two possible classes.

There are lots of learning algorithms that we can use for binary classification problems. Logistic regression is one of them. As explained by Wikipedia:

In statistics, logistic regression or logit regression is a type of regression analysis used for predicting the outcome of a categorical dependent variable (a dependent variable that can take on a limited number of values, whose magnitudes are not meaningful but whose ordering of magnitudes may or may not be meaningful) based on one or more predictor variables. That is, it is used in estimating empirical values of the parameters in a qualitative response model. The probabilities describing the possible outcomes of a single trial are modeled, as a function of the explanatory (predictor) variables, using a logistic function. Frequently (and subsequently in this article) "logistic regression" is used to refer specifically to the problem in which the dependent variable is binary—that is, the number of available categories is two—and problems with more than two categories are referred to as multinomial logistic regression or, if the multiple categories are ordered, as ordered logistic regression. Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables, which are usually (but not necessarily) continuous, by using probability scores as the predicted values of the dependent variable.[1] As such it treats the same set of problems as does probit regression using similar techniques.

In order to use logistic regression, we need to create a formula that tells our model the type of features/inputs we're giving it:

# model formula
# here the ~ sign is an = sign, and the features of our dataset
# are written as a formula to predict survived. The C() lets our
# regression know that those variables are categorical.
# Ref: http://patsy.readthedocs.org/en/latest/formulas.html
formula = 'Survived ~ C(Pclass) + C(Sex) + Age + SibSp + C(Embarked)'
# create a results dictionary to hold our regression results for easy analysis later
results = {}
# create a regression friendly dataframe using patsy's dmatrices function
y,x = dmatrices(formula, data=titanic_data, return_type='dataframe')
# instantiate our model
model = sm.Logit(y,x)
# fit our model to the training data
res = model.fit()
# save the result for outputing predictions later
results['Logit'] = [res, formula]
res.summary()
Output:
Optimization terminated successfully.
 Current function value: 0.444388
 Iterations 6

Figure 11: Logistic regression results

Now, let's plot the prediction of our model versus actual ones and also the residuals, which is the difference between the actual and predicted values of the target variable:

# Plot Predictions Vs Actual
plt.figure(figsize=(18,4));
plt.subplot(121, axisbg="#DBDBDB")
# generate predictions from our fitted model
ypred = res.predict(x)
plt.plot(x.index, ypred, 'bo', x.index, y, 'mo', alpha=.25);
plt.grid(color='white', linestyle='dashed')
plt.title('Logit predictions, Blue: 
Fitted/predicted values: Red');
# Residuals
ax2 = plt.subplot(122, axisbg="#DBDBDB")
plt.plot(res.resid_dev, 'r-')
plt.grid(color='white', linestyle='dashed')
ax2.set_xlim(-1, len(res.resid_dev))
plt.title('Logit Residuals');

Figure 12: Understanding the logit regression model

Now, we have built our logistic regression model, and prior to that, we have done some analysis and exploration of the dataset. The preceding example shows you the general pipelines for building a machine learning solution.

Most of the time, practitioners fall into some technical pitfalls because they lack experience of understanding the concepts of machine learning. For example, someone might get an accuracy of 99% over the test set, and then without doing any investigation of the distribution of classes in the data (such as how many samples are negative and how many samples are positive), they deploy the model.

To highlight some of these concepts and differentiate between different kinds of errors that you need to be aware of and which ones you should really care about, we'll move on to the next section.

Table of Contents for Data analysis&#xA0;&#x2013; supervised machine learning

Create new playlist

Sign In

Sign Up

Table of Contents for
Data analysis – supervised machine learning