As we have seen earlier, linear regression gives us the lowest mean-squared error among unbiased estimators, yet it tends to underfit the data, and an underfit model will not give the best predictions. One way to reduce the mean-squared error is to add some bias to our estimator.
Logistic regression is one way to fit models for data that have true or false responses. The output of a linear regression is not confined to the interval between 0 and 1, so it cannot be read directly as a probability, whereas logistic regression predicts probabilities directly. In addition, its predicted probabilities tend to be better calibrated than those from Naive Bayes.
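The contrast between the two models can be sketched on a toy binary response. The data below is made up purely for illustration (y is 1 whenever x is positive), and the variable names are assumptions, not part of the original discussion. The sketch fits both models with scikit-learn and queries them outside the training range.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy binary response (illustrative only): y is 1 when x is positive.
x = np.linspace(-5, 5, 50).reshape(-1, 1)
y = (x.ravel() > 0).astype(int)

linear = LinearRegression().fit(x, y)
logistic = LogisticRegression().fit(x, y)

# Query points, two of them beyond the training range.
x_new = np.array([[-10.0], [0.0], [10.0]])

linear_pred = linear.predict(x_new)                  # unbounded real values
logistic_prob = logistic.predict_proba(x_new)[:, 1]  # estimated P(y = 1)

print("linear  :", linear_pred)
print("logistic:", logistic_prob)
```

On this toy data the linear model returns values below 0 and above 1 at the extreme points, while the logistic model's outputs stay between 0 and 1 and can be read as probabilities.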
For this discussion, we keep our focus on the binary response and set the value 1 to true and 0 to false. The logistic regression model assumes that the log of the odds of a true response (the logit) can be expressed as a linear combination of the n input variables of x, as shown in the following equation:

$$\log\frac{P(x)}{1 - P(x)} = b_0 + \sum_{j=1}^{n} b_j x_j = z$$

As the inverse of a logarithmic function is an exponential function, solving for P(x) gives the following:

$$P(x) = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}$$

The expression on the right-hand side is the sigmoid of z, the linear combination of the variables of x. Because e^{-z} is always positive, the denominator is always greater than 1 (it equals 2 when z is 0). The value of P(x) is therefore strictly greater than 0 and less than 1, as shown in the following code:
    import numpy as np
    import matplotlib.pyplot as plt

    # Evaluate three sigmoids of decreasing steepness on the same grid
    x = np.linspace(-10, 10, 100)
    y1 = 1.0 / (1.0 + np.exp(-x))
    y2 = 1.0 / (1.0 + np.exp(-x / 2))
    y3 = 1.0 / (1.0 + np.exp(-x / 10))

    # Plot the three curves in red, green, and blue
    plt.title("Sigmoid Functions vs LineSpace")
    plt.plot(x, y1, 'r-', lw=2)
    plt.plot(x, y2, 'g-', lw=2)
    plt.plot(x, y3, 'b-', lw=2)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()
Figure: A standard sigmoid function.
Figure: An example showing the probabilities of happy and sad.
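The relationship between the log of the odds and the sigmoid can also be checked numerically. The following is a minimal sketch; the helper names sigmoid and logit are chosen here for illustration and do not come from the text. It confirms that every sigmoid value lies strictly between 0 and 1 and that taking the log of the odds recovers the original linear value z.

```python
import numpy as np

def sigmoid(z):
    """Map any real z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log of the odds p / (1 - p); the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

z = np.linspace(-10, 10, 9)   # sample values of the linear combination
p = sigmoid(z)                # probabilities in (0, 1)
recovered = logit(p)          # should reproduce z

print(p)
print(np.allclose(recovered, z))
```

At z = 0 the denominator 1 + e^{-z} equals 2, so the sigmoid returns exactly 0.5, the point at which both outcomes are equally likely.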
Kaggle hosts machine learning competitions and usually provides the training and test data. A while ago, predicting the survivors of the Titanic was contested on Kaggle, based on the real data. The titanic_train.csv and titanic_test.csv files are for training and testing purposes, respectively. The following code uses the linear_model package from scikit-learn, which includes logistic regression, and is a modified version of the code by the author who won the contest:
    import numpy as np
    import pandas as pd
    import sklearn.linear_model as lm
    # sklearn.cross_validation was removed from scikit-learn;
    # train_test_split now lives in sklearn.model_selection
    import sklearn.model_selection as cv
    import matplotlib.pyplot as plt

    train = pd.read_csv('/Users/myhome/titanic_train.csv')
    test = pd.read_csv('/Users/myhome/titanic_test.csv')

    # Peek at the Pclass, Sex, Age, and Survived columns
    print(train[train.columns[[2, 4, 5, 1]]].head())

    # Keep three features and the response; encode Sex as female = True
    data = train[['Sex', 'Age', 'Pclass', 'Survived']].copy()
    data['Sex'] = data['Sex'] == 'female'
    data = data.dropna()

    data_np = data.astype(np.int32).values
    X = data_np[:, :-1]
    y = data_np[:, -1]

    female = X[:, 0] == 1
    survived = y == 1

    # This vector contains the age of the passengers.
    age = X[:, 1]

    # We compute a few histograms.
    bins_ = np.arange(0, 121, 5)
    S = {'male': np.histogram(age[survived & ~female], bins=bins_)[0],
         'female': np.histogram(age[survived & female], bins=bins_)[0]}
    D = {'male': np.histogram(age[~survived & ~female], bins=bins_)[0],
         'female': np.histogram(age[~survived & female], bins=bins_)[0]}
    bins = bins_[:-1]

    plt.figure(figsize=(15, 8))
    for i, sex, color in zip((0, 1), ('male', 'female'), ('#3345d0', '#cc3dc0')):
        plt.subplot(121 + i)
        # align='edge' places the left edge of each bar at the start of its age bin
        plt.bar(bins, S[sex], bottom=D[sex], color=color,
                width=5, align='edge', label='Survived')
        plt.bar(bins, D[sex], color='#aaaaff',
                width=5, align='edge', label='Died', alpha=0.4)
        plt.xlim(0, 80)
        plt.grid(None)
        plt.title(sex + " Survived")
        plt.xlabel("Age (years)")
        plt.legend()

    (X_train, X_test, y_train, y_test) = cv.train_test_split(X, y, test_size=.05)
    print(X_train, y_train)

    # Logistic Regression from linear_model
    logreg = lm.LogisticRegression()
    logreg.fit(X_train, y_train)
    y_predicted = logreg.predict(X_test)

    # Top row: actual outcomes; bottom row: predicted outcomes
    plt.figure(figsize=(15, 8))
    plt.imshow(np.vstack((y_test, y_predicted)),
               interpolation='none', cmap='bone')
    plt.xticks([])
    plt.yticks([])
    plt.title("Actual and predicted survival outcomes on the test set")
    plt.show()
Figure: Male and female survivors of the Titanic by age.
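Beyond comparing actual and predicted outcomes visually, a fitted logistic regression model can be scored and asked for probabilities. The following is a minimal sketch of how that might look. Because the Titanic files are not bundled here, it uses a small synthetic stand-in with the same three columns (sex, age, and class); the generating coefficients are made up, so the printed numbers say nothing about the real passengers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Titanic data); all values are hypothetical.
rng = np.random.default_rng(0)
n = 300
sex = rng.integers(0, 2, n)      # 1 = female, 0 = male
age = rng.integers(1, 80, n)     # age in years
pclass = rng.integers(1, 4, n)   # passenger class 1, 2, or 3

# Made-up coefficients used only to generate a plausible binary response.
z = 2.5 * sex - 0.02 * age - 0.8 * pclass + 1.0
p = 1.0 / (1.0 + np.exp(-z))
y = (rng.random(n) < p).astype(int)

X = np.column_stack((sex, age, pclass))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

accuracy = logreg.score(X_test, y_test)   # fraction of correct predictions
proba = logreg.predict_proba(X_test)      # columns: P(y = 0), P(y = 1)

print("accuracy:", accuracy)
print("coefficients (sex, age, pclass):", logreg.coef_[0])
print("first five P(y = 1):", proba[:5, 1])
```

The same calls (score, predict_proba, and the coef_ attribute) should apply to the logreg object fitted on the Titanic data, where the sign of each coefficient indicates whether the corresponding variable raises or lowers the estimated log odds of survival.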
We have seen that scikit-learn has a good collection of functions for machine learning. It also comes with a few standard datasets, for example, the iris and digits datasets for classification and the Boston house prices dataset for regression. Machine learning is about learning the properties of data and applying these properties to new data.
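As a brief illustration, the bundled classification datasets can be loaded through the sklearn.datasets module. This sketch loads only the iris and digits datasets and inspects their shapes; the variable names are chosen here for illustration.

```python
from sklearn import datasets

# Each loader returns an object holding the feature matrix (data)
# and the response vector (target).
iris = datasets.load_iris()
digits = datasets.load_digits()

print(iris.data.shape, iris.target.shape)      # 150 samples, 4 features
print(digits.data.shape, digits.target.shape)  # 1797 samples, 64 features
print(iris.target_names)                       # names of the three iris classes
```

The data and target attributes can be passed directly to the fit method of an estimator such as LogisticRegression.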