Logistic regression is a statistical technique used to predict a binary outcome, for example, purchase/no-purchase.
For this recipe, we are going to use the Heart dataset from An Introduction to Statistical Learning with Applications in R.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/ISL/Heart.csv'
heart = pd.read_csv(data_file,
                    sep=',',
                    header=0,
                    index_col=0,
                    parse_dates=True,
                    on_bad_lines='warn',  # pandas 1.3+; replaces error_bad_lines=False, warn_bad_lines=True
                    skip_blank_lines=True,
                    low_memory=False)
heart.head()
heart.dtypes
heart.shape
Convert the ChestPain column to a numeric value:
t2 = pd.Series({'asymptomatic' : 1, 'nonanginal' : 2, 'nontypical' : 3, 'typical': 4})
heart['ChestPain'] = heart['ChestPain'].map(t2)
heart.head()
Convert the Thal column to a numeric value:
t = pd.Series({'fixed' : 1, 'normal' : 2, 'reversible' : 3})
heart['Thal'] = heart['Thal'].map(t)
heart.head()
Convert the AHD column to a numeric value:
t = pd.Series({'No' : 0, 'Yes' : 1})
heart['AHD'] = heart['AHD'].map(t)
heart.head()
Fill in the missing values with 0:
heart.fillna(0, inplace=True)
heart.head()
heart.shape
heart_data = heart.iloc[:, 0:13].values
heart_targets = heart['AHD'].values
Import linear_model from scikit-learn and build the model:
from sklearn import linear_model
logClassifier = linear_model.LogisticRegression(C=1, random_state=111)
# model_selection replaces the cross_validation module, which was removed in scikit-learn 0.20
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(heart_data, heart_targets, test_size=0.20, random_state=111)
Estimate the accuracy of the model using cross_val_score:
scores = model_selection.cross_val_score(logClassifier, heart_data, heart_targets, cv=12)
scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
logClassifier.fit(X_train, y_train)
predicted = logClassifier.predict(X_test)
from sklearn import metrics
metrics.accuracy_score(y_test, predicted)
metrics.confusion_matrix(y_test, predicted)
The first thing we need to do is import all the Python libraries that we'll need. The last line of code, %matplotlib inline, is required only if you are running the code in IPython Notebook:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Next, we define a variable for the full path to our data file. It's recommended to do this so that if the location of your data file changes, you have to update only one line of code.
data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/ISL/Heart.csv'
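Once you have the data file variable, use the read_csv() function provided by Pandas to create a DataFrame from the CSV file. The on_bad_lines='warn' argument (pandas 1.3 and later) skips malformed rows and prints a warning for each one; older releases expressed the same behavior with error_bad_lines=False and warn_bad_lines=True:
heart = pd.read_csv(data_file,
                    sep=',',
                    header=0,
                    index_col=0,
                    parse_dates=True,
                    on_bad_lines='warn',
                    skip_blank_lines=True,
                    low_memory=False)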
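To see what skipping malformed rows looks like in isolation, here is a minimal sketch that reads a small in-memory CSV rather than Heart.csv. It assumes pandas 1.3 or later, where the on_bad_lines parameter handles malformed rows (the role error_bad_lines and warn_bad_lines play in older releases). The sample rows and the csv_text and sample names are made up for illustration, and the index column is left out to keep the example small.

```python
import io

import pandas as pd

# Hypothetical sample data: the third line has one field too many.
csv_text = "Age,Sex,AHD\n63,1,No\n67,1,Yes,extra\n41,0,No\n"

# on_bad_lines='skip' drops the malformed row silently;
# on_bad_lines='warn' drops it as well but also emits a warning.
sample = pd.read_csv(io.StringIO(csv_text), sep=',', header=0, on_bad_lines='skip')

print(sample.shape)  # rows that survived, columns
print(sample)
```

Only the two well-formed rows should remain, so comparing the row count with the number of lines in the file is a quick way to spot how many rows were dropped.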
If using IPython Notebook, use the head() method to view the top five rows of the DataFrame. This helps to ensure that the data is imported correctly:
heart.head()
Next we use the shape attribute to find out the number of rows and columns in the DataFrame:
heart.shape
In order to feed the data to our model, we need to convert all text data to integers. We start by converting the ChestPain column to a numeric value.
We first create a Pandas Series object that assigns a numeric value to each unique value in the ChestPain column:
t2 = pd.Series({'asymptomatic' : 1, 'nonanginal' : 2, 'nontypical' : 3, 'typical': 4})
Next we use the map() method that Pandas provides on a Series, which goes through the ChestPain column of the DataFrame and replaces each text value with the corresponding integer value from the series we created. To verify the results, we use the head() method:
heart['ChestPain'] = heart['ChestPain'].map(t2)
heart.head()
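One detail worth knowing about this kind of mapping is what happens to labels that are not in the lookup series. The sketch below uses a made-up four-row column (chest_pain and codes are illustrative names, not part of the recipe) with one unexpected label to show the behavior.

```python
import pandas as pd

# Hypothetical miniature of the ChestPain column, with one unexpected label.
chest_pain = pd.Series(['typical', 'asymptomatic', 'nonanginal', 'unknown'])
codes = pd.Series({'asymptomatic': 1, 'nonanginal': 2, 'nontypical': 3, 'typical': 4})

# Each label is looked up in the index of `codes`.
mapped = chest_pain.map(codes)

print(mapped.tolist())            # labels missing from the mapping become NaN
print(int(mapped.isna().sum()))   # number of rows that could not be mapped
```

Because unmapped labels turn into NaN rather than raising an error, counting the NaN values after a mapping is a cheap check that the lookup covered every category.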
We then use the same method to convert the Thal column to a numeric value:
t = pd.Series({'fixed' : 1, 'normal' : 2, 'reversible' : 3})
heart['Thal'] = heart['Thal'].map(t)
heart.head()
After that we convert the AHD column to a numeric value as well:
t = pd.Series({'No' : 0, 'Yes' : 1})
heart['AHD'] = heart['AHD'].map(t)
heart.head()
With all of our columns converted, we next fill in the missing values with 0:
heart.fillna(0, inplace=True)
heart.head()
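Before filling, it can help to check how many values are actually missing, since every filled cell becomes an ordinary 0 that the model treats like any other value. This is a small sketch on a made-up frame; the column names are borrowed from the recipe, but the rows and the frame, missing_per_column, and filled names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; in the recipe the DataFrame is called `heart`.
frame = pd.DataFrame({'Ca': [0.0, np.nan, 2.0], 'Thal': [2.0, 3.0, np.nan]})

missing_per_column = frame.isna().sum()   # count NaN values in each column
filled = frame.fillna(0)                  # returns a new frame; `frame` is unchanged

print(missing_per_column.to_dict())
print(int(filled.isna().sum().sum()))     # NaN values left after filling
```

Note that a 0 in a column such as Thal falls outside the 1 to 3 codes assigned earlier, so it effectively acts as an extra category.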
We now have a full DataFrame containing only numeric values. Let's check the shape of the DataFrame:
heart.shape
Next, we need to create two arrays for our model to use: one with the input features (the first 13 columns) and one with the outcomes (the AHD column). The outcome is what our model will predict:
heart_data = heart.iloc[:, 0:13].values
heart_targets = heart['AHD'].values
After that, we import the linear_model module from scikit-learn, and create an instance of the model. We pass in two arguments:
C: The regularization parameter; it is used to prevent overfitting, which is when the model describes random errors or noise instead of the underlying relationship in the data. C balances our goals of fitting the model to the training set well and keeping our parameters small, which keeps our model simple. In scikit-learn, C is the inverse of the regularization strength, so smaller values regularize more strongly.
random_state: The seed of the pseudo-random number generator to use when shuffling the data:
from sklearn import linear_model
logClassifier = linear_model.LogisticRegression(C=1, random_state=111)
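To get a feel for what C does, we can fit the same model with a few values of C and compare the size of the learned coefficients. This is only a sketch: it uses synthetic data from make_classification as a stand-in for the heart data, and the three C values and the max_iter setting are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the heart data (an assumption, not the recipe's dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=111)

coef_norms = {}
for C in (0.01, 1, 100):
    model = LogisticRegression(C=C, random_state=111, max_iter=1000)
    model.fit(X, y)
    # Length of the learned weight vector; stronger regularization shrinks it.
    coef_norms[C] = float(np.linalg.norm(model.coef_))

print(coef_norms)
```

The smallest C should give the smallest coefficients, which is the "keeping our parameters small" side of the trade-off described above.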
Next, we implement cross validation which is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. In other words, we want to ensure, as much as possible, that our model will work with any new data and not just the dataset we're using to train it.
We import the model_selection module, which replaced cross_validation in scikit-learn 0.18 (the old module was removed in 0.20), and use the train_test_split() function, which splits arrays or matrices into random train and test subsets. This allows us to perform our validation. The parameters we use are:
arrays: The data, in this recipe the heart_data and heart_targets arrays
test_size: The proportion of the data to hold out for testing; the rest is used for training
random_state: The pseudo-random number generator state used for random sampling:
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(heart_data, heart_targets, test_size=0.20, random_state=111)
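To make the effect of test_size concrete, the sketch below splits placeholder arrays and prints the resulting shapes. It assumes a current scikit-learn release, where train_test_split() lives in sklearn.model_selection, and it assumes 303 rows and 13 feature columns for the heart data; the array contents are meaningless fillers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays: 303 rows and 13 features are assumptions about Heart.csv.
X = np.zeros((303, 13))
y = np.arange(303) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=111)

print(X_train.shape, X_test.shape)
```

With 303 rows, a test_size of 0.20 holds out 61 rows for testing and leaves 242 for training; 61 is also the total of the confusion matrix counts reported later in this recipe.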
After that we fit the model to the training data, which produces our trained model:
logClassifier.fit(X_train, y_train)
Next we estimate the accuracy of the model on our dataset using cross_val_score(). This function splits the data, fits a fresh copy of the model, and computes the score 12 consecutive times with different splits each time:
from sklearn import model_selection
scores = model_selection.cross_val_score(logClassifier, heart_data, heart_targets, cv=12)
scores
Once we have the scores, we print out the mean accuracy score and twice the standard deviation:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.83 (+/- 0.15)
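The cross-validation step can be tried on its own with synthetic data. This is a sketch under a few assumptions: a current scikit-learn release (cross_val_score() imported from sklearn.model_selection), a make_classification dataset sized like the heart data, and a raised max_iter so the solver converges. The scores it prints will not match the recipe's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for heart_data and heart_targets (an assumption).
X, y = make_classification(n_samples=303, n_features=13, random_state=111)

clf = LogisticRegression(C=1, random_state=111, max_iter=1000)
scores = cross_val_score(clf, X, y, cv=12)   # one accuracy score per fold

print(len(scores))
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# cross_val_score fits clones of the estimator, so clf itself is still unfitted.
print(hasattr(clf, "coef_"))
```

Since the classifier passed in is never fitted by cross_val_score(), the separate call to fit() on the training split is still needed before calling predict().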
Next we run the test data through the classifier to create the predictions, and then display them:
predicted = logClassifier.predict(X_test)
predicted
After that, we import the metrics module and evaluate the accuracy of the model:
from sklearn import metrics
metrics.accuracy_score(y_test, predicted)
80 percent accuracy isn't all that great, especially when attempting to predict a medical diagnosis. It does, however, match closely to the mean accuracy score that we computed earlier. The mean accuracy score was literally that—the mean of the accuracy scores of our 12 cross validation runs. The accuracy score here is the accuracy of the model when predicting against our test data.
There are a number of potential reasons for the accuracy being so low, one being that we don't have much data to work with. Regardless, we then create a confusion matrix to view the predictions:
metrics.confusion_matrix(y_test, predicted)
A confusion matrix tabulates the predictions that the model made on the test data against the actual outcomes. You can read it as follows:
Confusion Matrix | |
---|---|
True Positives (24) | False Positives (7) |
False Negatives (5) | True Negatives (25) |
So 24 of our model's 31 positive (1) predictions and 25 of its 30 negative (0) predictions were correct.
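When reading the raw output of confusion_matrix(), keep in mind that scikit-learn puts the actual classes on the rows and the predicted classes on the columns, so for labels 0 and 1 the array is laid out as [[TN, FP], [FN, TP]]. The sketch below shows this on six made-up labels, then applies the usual formulas to the counts from the table above. The summarize() helper and the toy labels are illustrative assumptions, not part of the recipe.

```python
from sklearn import metrics

# Toy labels (hypothetical) to show the layout of the returned array.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

cm = metrics.confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()   # row-major order: TN, FP, FN, TP
print(cm.tolist())


def summarize(tp, fp, fn, tn):
    """Accuracy, precision, and recall from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        'accuracy': (tp + tn) / total,
        'precision': tp / (tp + fp),   # share of positive predictions that were right
        'recall': tp / (tp + fn),      # share of actual positives that were found
    }


# Counts taken from the table in this recipe.
recipe = summarize(tp=24, fp=7, fn=5, tn=25)
print({name: round(value, 3) for name, value in recipe.items()})
```

The accuracy computed from the table's counts is 49 out of 61, or about 80 percent, which agrees with the accuracy score discussed earlier.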