Creating a predictive model using logistic regression

Logistic regression is a statistical technique used to predict a binary outcome, for example, purchase/no-purchase.
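
As a rough illustration of the idea (not part of the original recipe), logistic regression passes a weighted sum of the inputs through the logistic (sigmoid) function to get a probability between 0 and 1, and then applies a threshold. The weights, bias, and input below are made-up values chosen only for this sketch:

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one observation (illustrative values only)
weights = np.array([0.8, -0.5])
bias = 0.1
x = np.array([2.0, 1.0])

probability = sigmoid(np.dot(weights, x) + bias)  # estimated P(outcome = 1)
prediction = int(probability >= 0.5)              # threshold at 0.5
print(probability, prediction)
```

In the recipe itself, scikit-learn estimates the weights from the data, so none of this has to be written by hand.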

For this recipe, we are going to use the Heart dataset from An Introduction to Statistical Learning with Applications in R.

How to do it…

  1. First, import the Python libraries that you need:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
  2. Next, define a variable for the heart data file, import the data, and view the top five rows:
    data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/ISL/Heart.csv'
    heart = pd.read_csv(data_file,
                        sep=',',
                        header=0,
                        index_col=0,
                        parse_dates=True,
                        on_bad_lines='warn',  # pandas 1.3+; replaces error_bad_lines/warn_bad_lines
                        skip_blank_lines=True,
                        low_memory=False
                        )
    heart.head()
  3. After that, get a full list of the columns and data types in the DataFrame:
    heart.dtypes
  4. Find out the number of rows and columns in the DataFrame:
    heart.shape
  5. Convert the ChestPain column to a numeric value:
    t2 = pd.Series({'asymptomatic' : 1,
                    'nonanginal' : 2,
                    'nontypical' : 3,
                    'typical': 4})
    heart['ChestPain'] = heart['ChestPain'].map(t2)
    heart.head()
  6. Convert the Thal column to a numeric value:
    t = pd.Series({'fixed' : 1,
                   'normal' : 2,
                   'reversible' : 3})
    heart['Thal'] = heart['Thal'].map(t)
    heart.head()
  7. Convert the AHD column to a numeric value:
    t = pd.Series({'No' : 0,
                   'Yes' : 1})
    heart['AHD'] = heart['AHD'].map(t)
    heart.head()
  8. Fill in the missing values with 0:
    heart.fillna(0, inplace=True)
    heart.head()
  9. Get the current shape of the DataFrame:
    heart.shape
  10. Create two matrices for our model to use: one with the data and one with the outcomes:
    heart_data = heart.iloc[:,0:13].values
    heart_targets = heart['AHD'].values
  11. Import the model class from scikit-learn and build the model:
    from sklearn import linear_model
    logClassifier = linear_model.LogisticRegression(C=1, random_state=111)
  12. Implement cross validation for our model:
    # sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
    from sklearn import model_selection
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        heart_data,
        heart_targets,
        test_size=0.20,
        random_state=111)
  13. Estimate the accuracy of the model on our dataset using cross_val_score:
    scores = model_selection.cross_val_score(logClassifier, heart_data, heart_targets, cv=12)
    scores
  14. Show the mean accuracy score and twice the standard deviation:
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
  15. Fit the data to the model:
    logClassifier.fit(X_train, y_train)
  16. Run the test data through the classifier to get the predictions:
    predicted = logClassifier.predict(X_test)
  17. Import the metrics module and evaluate the accuracy of the model:
    from sklearn import metrics
    metrics.accuracy_score(y_test, predicted)
  18. Finally, view the confusion matrix:
    metrics.confusion_matrix(y_test, predicted)

How it works…

The first thing that we need to do is import all the Python libraries that we'll need. The last line of code, %matplotlib inline, is required only if you are running the code in IPython Notebook:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Next, we define a variable for the full path to our data file. It's recommended to do this so that if the location of your data file changes, you have to update only one line of code.

data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/ISL/Heart.csv'

Once you have the data file variable, use the read_csv() function provided by Pandas to create a DataFrame from the CSV file.

heart = pd.read_csv(data_file,
                    sep=',',
                    header=0,
                    index_col=0,
                    parse_dates=True,
                    on_bad_lines='warn',  # pandas 1.3+; replaces error_bad_lines/warn_bad_lines
                    skip_blank_lines=True,
                    low_memory=False
                    )

If using IPython Notebook, use the head() function to view the top five rows of the DataFrame. This helps to ensure that the data is imported correctly:

heart.head()

Next, we use the shape attribute to find out the number of rows and columns in the DataFrame:

heart.shape

In order to feed the data to our model, we need to convert all text data to integers. We start by converting the ChestPain column to a numeric value.

We first create a Pandas Series object, and assign numeric values to each unique value in the ChestPain column:

t2 = pd.Series({'asymptomatic' : 1,
                'nonanginal' : 2,
                'nontypical' : 3,
                'typical': 4})

Next, we use the map() method of the pandas Series, which goes through the ChestPain column of the DataFrame and replaces each text value with the corresponding integer value from the Series we created. To verify the results, we use the head() function:

heart['ChestPain'] = heart['ChestPain'].map(t2)
heart.head()
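
The mapping step can be sketched on a toy column, assuming nothing about the real file beyond the four labels used above. The 'unknown' entry is a hypothetical label added here to show that any value missing from the mapping becomes NaN:

```python
import pandas as pd

# Toy column standing in for heart['ChestPain']; 'unknown' is deliberately unmapped
chest_pain = pd.Series(['typical', 'asymptomatic', 'unknown', 'nonanginal'])

mapping = {'asymptomatic': 1, 'nonanginal': 2, 'nontypical': 3, 'typical': 4}

encoded = chest_pain.map(mapping)   # Series.map also accepts a plain dict
print(encoded)

# Labels that are not in the mapping come back as NaN, so count them
unmapped_count = encoded.isna().sum()
print(unmapped_count)
```

Checking the NaN count after each conversion is a cheap way to catch a misspelled or unexpected label before the data reaches the model.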

We then use the same method to convert the Thal column to a numeric value:

t = pd.Series({'fixed' : 1,
               'normal' : 2,
               'reversible' : 3})
heart['Thal'] = heart['Thal'].map(t)
heart.head()

After that we convert the AHD column to a numeric value as well:

t = pd.Series({'No' : 0,
               'Yes' : 1})
heart['AHD'] = heart['AHD'].map(t)
heart.head()

With all of our columns converted, we next fill in the missing values with 0:

heart.fillna(0, inplace=True)
heart.head()
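
To see what the fill step changes, the following sketch counts missing values before and after on a small made-up DataFrame. The column names are borrowed from the dataset, but the values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame with missing entries (values are made up)
df = pd.DataFrame({'Ca': [0.0, np.nan, 2.0],
                   'Thal': [2.0, 3.0, np.nan]})

missing_before = df.isna().sum()    # missing values per column
filled = df.fillna(0)               # returns a new DataFrame; df is unchanged
missing_after = filled.isna().sum()

print(missing_before)
print(missing_after)
```

Unlike the recipe, this sketch does not pass inplace=True, so the original frame stays available for comparison.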

We now have a full DataFrame containing only numeric values. Let's check the shape of the DataFrame:

heart.shape

Next, we need to create two arrays for our model to use: one with the feature data and one with the outcomes. The outcome is what our model will predict:

heart_data = heart.iloc[:,0:13].values
heart_targets = heart['AHD'].values
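
The slicing can be sketched on a three-row toy DataFrame (made-up values, with only two feature columns instead of thirteen) to show what shapes come out:

```python
import pandas as pd

# Toy frame: two feature columns followed by the outcome column (made-up values)
df = pd.DataFrame({'Age': [63, 67, 41],
                   'Chol': [233, 286, 204],
                   'AHD': [0, 1, 0]})

features = df.iloc[:, 0:2].values   # all rows, columns 0 and 1 (end index is exclusive)
targets = df['AHD'].values          # 1-D array of outcomes

print(features.shape, targets.shape)
```

The same pattern applies to the recipe's data: the slice 0:13 selects the first thirteen columns and leaves the outcome column out of the feature array.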

After that, we import the linear_model module from scikit-learn, and create an instance of the LogisticRegression model. We pass in two arguments:

  • C: The inverse of the regularization strength; smaller values mean stronger regularization. Regularization is used to prevent overfitting, which is when the model describes random errors or noise instead of the underlying relationship in the data. C sets the balance between fitting the model well to the training set and keeping the parameters small, which keeps the model simple.
  • random_state: The seed of the pseudo random number generator to use when shuffling the data:
    from sklearn import linear_model
    logClassifier = linear_model.LogisticRegression(C=1, random_state=111)
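
The effect of C can be sketched by fitting the same model twice on synthetic data and comparing the size of the learned coefficients. The data comes from make_classification rather than the heart dataset, and the two C values and max_iter setting are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the heart features (not the real dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=111)

# Small C = strong regularization, large C = weak regularization
strong = LogisticRegression(C=0.01, random_state=111, max_iter=1000).fit(X, y)
weak = LogisticRegression(C=100, random_state=111, max_iter=1000).fit(X, y)

norm_strong = np.linalg.norm(strong.coef_)
norm_weak = np.linalg.norm(weak.coef_)
print(norm_strong, norm_weak)
```

The coefficients fitted with the small C should come out noticeably smaller in magnitude, which is the "keeping our parameters small" half of the trade-off.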

Next, we implement cross validation, a model validation technique for assessing how well the results of a statistical analysis will generalize to an independent dataset. In other words, we want to ensure, as much as possible, that our model will work with any new data and not just the dataset we're using to train it.

We import the model_selection module (called cross_validation in scikit-learn releases before 0.18, and removed under that name in 0.20) and use the train_test_split() function, which splits arrays or matrices into random train and test subsets. This allows us to perform our validation. The parameters we use are:

  • arrays: The data, in this recipe: the heart_data and heart_targets arrays
  • test_size: The proportion of the data to hold out for testing (0.20 here); the rest is used for training
  • random_state: The pseudo-random number generator state used for random sampling:
    from sklearn import model_selection
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        heart_data,
        heart_targets,
        test_size=0.20,
        random_state=111)
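
A minimal sketch of the split on placeholder arrays, assuming a scikit-learn release where train_test_split lives in sklearn.model_selection. The 100-row arrays are invented so that the resulting sizes are easy to check:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 feature columns, alternating 0/1 outcomes
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=111)

# 20 percent of the rows are held out for testing, the rest are for training
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Because random_state is fixed, rerunning the split returns the same rows in each subset.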

After that, we fit the model to the training data, which produces our trained model:

logClassifier.fit(X_train, y_train)

Next, we estimate the accuracy of the model on our dataset using cross_val_score. This function splits the data, fits the model, and computes the score 12 consecutive times with different splits each time:

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn import model_selection
scores = model_selection.cross_val_score(logClassifier, heart_data, heart_targets, cv=12)
scores

Once we have the scores, we print out the mean accuracy score and twice the standard deviation:

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.83 (+/- 0.15)
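
The scoring step can be sketched end to end on synthetic data, assuming a scikit-learn release that provides sklearn.model_selection. The data is generated with make_classification, so the printed accuracy will not match the figure above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the heart data (not the real dataset)
X, y = make_classification(n_samples=240, n_features=5, random_state=111)
clf = LogisticRegression(C=1, random_state=111, max_iter=1000)

# cv=12 yields one accuracy value per fold, 12 in total
scores = cross_val_score(clf, X, y, cv=12)
print(len(scores))
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

The spread across folds gives a rough sense of how much the accuracy depends on which rows happen to land in the held-out fold.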

Next we run the test data through the classifier to create the predictions, and then display them.

predicted = logClassifier.predict(X_test)
predicted
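
For a closer look at what predict() returns, this sketch fits a classifier on synthetic data and compares the hard labels with the class probabilities from predict_proba(). The data and the choice to inspect five rows are assumptions made for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the heart data (not the real dataset)
X, y = make_classification(n_samples=100, n_features=5, random_state=111)
clf = LogisticRegression(C=1, random_state=111, max_iter=1000).fit(X, y)

labels = clf.predict(X[:5])               # hard 0/1 predictions
probabilities = clf.predict_proba(X[:5])  # one column per class; each row sums to 1

print(labels)
print(probabilities)
```

Looking at the probabilities can show how confident the model is about each prediction, which the 0/1 labels alone hide.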

After that, we import the metrics module and evaluate the accuracy of the model:

from sklearn import metrics
metrics.accuracy_score(y_test, predicted)

80 percent accuracy isn't all that great, especially when attempting to predict a medical diagnosis. It does, however, match closely to the mean accuracy score that we computed earlier. The mean accuracy score was literally that—the mean of the accuracy scores of our 12 cross validation runs. The accuracy score here is the accuracy of the model when predicting against our test data.
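
As a small worked example of what accuracy_score computes, the sketch below uses eight made-up labels, unrelated to the recipe's test set. Accuracy is the fraction of predictions that match the true labels:

```python
from sklearn import metrics

# Made-up labels: 6 of the 8 predictions match the true values
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = metrics.accuracy_score(y_true, y_pred)
print(accuracy)  # 6 / 8 = 0.75
```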

There are a number of potential reasons for the accuracy being so low, one being that we don't have much data to work with. Regardless, we then create a confusion matrix to view the predictions:

metrics.confusion_matrix(y_test, predicted)

A confusion matrix shows the predictions that the model made on the test data. You read it as follows:

  • The diagonal from the top-left corner to the bottom-right corner holds the number of correct predictions for each row
  • A number off the diagonal is the count of errors for that row
  • The column of an off-diagonal number corresponds to the incorrect prediction

Another way to read it is as follows:

Confusion Matrix

  True Positives (24)      False Positives (7)
  False Negatives (5)      True Negatives (25)

So our model has correctly predicted 24 out of 31 positive (1) predictions, and 25 out of 30 negative (0) predictions.
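
To check how the four cells are laid out, here is a small sketch with made-up labels (not the recipe's test set). In scikit-learn's confusion_matrix, rows follow the true labels and columns follow the predicted labels, with class 0 first, so for 0/1 labels the top-left cell counts true negatives:

```python
from sklearn import metrics

# Made-up labels: 1 = positive, 0 = negative
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

matrix = metrics.confusion_matrix(y_true, y_pred)
print(matrix)

# For 0/1 labels, ravel() flattens the matrix row by row: tn, fp, fn, tp
tn, fp, fn, tp = matrix.ravel()
print(tn, fp, fn, tp)
```

Unpacking the cells by name like this avoids having to remember which corner of the matrix holds which count.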
