Logistic regression

As we have seen earlier, one problem with linear regression is that it tends to underfit the data. Ordinary least squares gives the lowest mean-squared error among unbiased estimators, but an underfit model still does not give the best predictions. One way to reduce the mean-squared error further is to add some bias to the estimator.

Logistic regression is one way to fit models to data that have true or false responses. Linear regression does not predict probabilities directly, because its output is not confined to the range from 0 to 1, whereas logistic regression does. In addition, its predicted probabilities tend to be better calibrated than those from Naive Bayes.
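To make the contrast concrete, here is a minimal sketch that fits both models to the same binary labels. The one-feature dataset is synthetic and invented purely for illustration; only the scikit-learn estimators themselves come from the library.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic one-feature data (an assumption for illustration only):
# the label is 1 when the feature is positive, 0 otherwise.
x = np.linspace(-5, 5, 50).reshape(-1, 1)
y = (x.ravel() > 0).astype(int)

linear = LinearRegression().fit(x, y)
logistic = LogisticRegression().fit(x, y)

# Evaluate both models slightly outside the training range.
x_new = np.array([[-8.0], [8.0]])
linear_pred = linear.predict(x_new)
logistic_prob = logistic.predict_proba(x_new)[:, 1]

print("linear:  ", linear_pred)    # not constrained to [0, 1]
print("logistic:", logistic_prob)  # stays between 0 and 1
```

The linear model extrapolates along a straight line, so its output drops below 0 and rises above 1, while the logistic model's output can still be read as a probability.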

For this discussion, we keep our focus on a binary response and encode true as 1 and false as 0. The logistic regression model assumes that the log-odds of the response, that is, the logarithm of P(x) / (1 - P(x)), where P(x) is the probability that y is 1, can be expressed as a linear combination z of the n input variables of x with coefficients w_0, ..., w_n. Solving for P(x) gives the following equation:

$$P(x) = \frac{1}{1 + e^{-z}}, \qquad z = w_0 + w_1 x_1 + \dots + w_n x_n$$

As the inverse of a logarithmic function is an exponential function, the expression on the right-hand side is a sigmoid of the linear combination z of the variables of x. Because e^(-z) is always positive, the denominator 1 + e^(-z) is always greater than 1 (it equals 2 when z is 0). The value of P(x) is therefore strictly greater than 0 and less than 1. The following code plots the sigmoid for three different scalings of x:

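import numpy as np
import matplotlib.pyplot as plt

# Evaluate the sigmoid on a grid for three different scalings of x.
x = np.linspace(-10, 10, 100)
y1 = 1.0 / (1.0 + np.exp(-x))
y2 = 1.0 / (1.0 + np.exp(-x / 2))
y3 = 1.0 / (1.0 + np.exp(-x / 10))

plt.title("Sigmoid Functions vs LineSpace")
plt.plot(x, y1, label="1 / (1 + exp(-x))")
plt.plot(x, y2, label="1 / (1 + exp(-x/2))")
plt.plot(x, y3, label="1 / (1 + exp(-x/10))")
plt.legend(loc="upper left")
plt.show()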
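The bounds on the sigmoid, and its relationship to the log-odds, can also be checked numerically. The following is a small sketch; the helper names `sigmoid` and `logit` are chosen here for illustration and are not part of the original listing.

```python
import numpy as np

def sigmoid(z):
    """Map any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds of a probability p; the inverse of sigmoid."""
    return np.log(p / (1.0 - p))

z = np.linspace(-10, 10, 9)
p = sigmoid(z)

# The denominator 1 + exp(-z) is always greater than 1 ...
assert np.all(1.0 + np.exp(-z) > 1.0)
# ... so every probability lies strictly between 0 and 1.
assert np.all((p > 0.0) & (p < 1.0))
# Applying the log-odds recovers the original linear score.
assert np.allclose(logit(p), z)
print(sigmoid(0.0))  # 0.5
```

At z = 0 the denominator is exactly 2, which is why the curve crosses 0.5 at the origin.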

The following image shows a standard sigmoid function:

[Figure: standard sigmoid function]

The following example shows the probability of happy versus sad:

[Figure: probability of happy and sad]

Kaggle hosts machine learning competitions and usually provides the training and test data for them. A while ago, one Kaggle contest asked participants to predict the survivors of the Titanic based on the real passenger data. The titanic_train.csv and titanic_test.csv files are for training and testing purposes respectively. The following code, a modified version of the code from the author who won the contest, uses the linear_model package from scikit-learn, which includes logistic regression:

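import numpy as np
import pandas as pd
import sklearn.linear_model as lm
# sklearn.cross_validation was removed in scikit-learn 0.20;
# sklearn.model_selection provides train_test_split instead.
import sklearn.model_selection as cv
import matplotlib.pyplot as plt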

train = pd.read_csv('/Users/myhome/titanic_train.csv')
test = pd.read_csv('/Users/myhome/titanic_test.csv')

data = train[['Sex', 'Age', 'Pclass', 'Survived']].copy()
data['Sex'] = data['Sex'] == 'female'
data = data.dropna()

data_np = data.astype(np.int32).values
X = data_np[:,:-1]
y = data_np[:,-1]

female = X[:,0] == 1
survived = y == 1

# This vector contains the age of the passengers.
age = X[:,1]
# We compute a few histograms.
bins_ = np.arange(0, 121, 5)
S = {'male': np.histogram(age[survived & ~female], bins=bins_)[0],
     'female': np.histogram(age[survived & female], bins=bins_)[0]}
D = {'male': np.histogram(age[~survived & ~female], bins=bins_)[0],
     'female': np.histogram(age[~survived & female], bins=bins_)[0]}
bins = bins_[:-1]
for i, sex, color in zip((0, 1), ('male', 'female'), ('#3345d0', '#cc3dc0')):
    plt.subplot(121 + i)
    # Stack the survivor counts on top of the death counts for each age bin.
    plt.bar(bins, S[sex], bottom=D[sex], color=color,
            width=5, align='edge', label='Survived')
    plt.bar(bins, D[sex], color='#aaaaff', width=5, align='edge',
            label='Died', alpha=0.4)
    plt.xlim(0, 80)

    plt.title(sex + " Survived")
    plt.xlabel("Age (years)")
    plt.legend()

(X_train, X_test, y_train, y_test) = cv.train_test_split(X, y, test_size=.05)
print(X_train, y_train)

# Logistic Regression from linear_model
logreg = lm.LogisticRegression()
logreg.fit(X_train, y_train)
y_predicted = logreg.predict(X_test)

# Draw the comparison in a new figure so it does not overwrite the histograms.
plt.figure()
plt.imshow(np.vstack((y_test, y_predicted)),
           interpolation='none', cmap='bone')
plt.xticks([])
plt.yticks([])
plt.title("Actual and predicted survival outcomes on the test set")
plt.show()

The following plot shows the male and female survivors of the Titanic by age:

[Figure: male and female Titanic survivors by age]
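Beyond comparing predictions visually, a fitted model can be inspected numerically. The sketch below shows one possible way to do this with `score`, `predict_proba`, and `coef_`. Because the Titanic CSV files are not reproduced here, it uses a synthetic stand-in with the same three feature columns; the data and the label-generating rule are made up for illustration and say nothing about the real passengers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features (NOT the real data):
# columns are sex (1 = female), age in years, and passenger class.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.integers(0, 2, n),
    rng.integers(1, 80, n),
    rng.integers(1, 4, n),
])
# Made-up rule used only to generate labels for this sketch.
score = 2.5 * X[:, 0] - 0.02 * X[:, 1] - 0.8 * X[:, 2] + 1.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-score))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

accuracy = logreg.score(X_test, y_test)   # fraction of correct labels
proba = logreg.predict_proba(X_test)      # columns: P(y=0), P(y=1)
print("accuracy:", accuracy)
print("coefficients (sex, age, class):", logreg.coef_[0])
print("first five P(y=1):", proba[:5, 1])
```

The sign of each coefficient indicates whether a feature raises or lowers the predicted log-odds, and `predict_proba` exposes the probabilities that `predict` thresholds at 0.5.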

We have seen that scikit-learn has a good collection of functions for machine learning. It also comes with a few standard datasets, for example, the iris dataset and the digits dataset for classification, and the Boston house prices dataset for regression (the latter was removed in scikit-learn 1.2). Machine learning is about learning the properties of data and applying these properties to a new dataset.
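As a brief illustration of learning from one part of a dataset and applying the result to unseen samples, the following sketch loads the bundled iris dataset and fits a logistic regression classifier. The split ratio, `random_state`, and `max_iter` values are arbitrary choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target   # 150 samples, 4 features, 3 classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Learn on one part of the data, then apply the model to unseen samples.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(iris.target_names)
print("test accuracy:", test_accuracy)
```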

