Creating a predictive model using a random forest

A random forest is an ensemble (a group) of decision trees whose individual predictions are combined, by majority vote for classification, into a single prediction.

For this recipe, we are going to use the Heart dataset from An Introduction to Statistical Learning with Applications in R.

How to do it…

  1. First, import the Python libraries that you need:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
  2. Next, define a variable for the heart data file, import the data, and view the top five rows:
    data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/ISL/Heart.csv'
    heart = pd.read_csv(data_file,
                        sep=',',
                        header=0,
                        index_col=0,
                        skip_blank_lines=True,
                        low_memory=False
                        )
    heart.head()
  3. After that, get a full list of the columns and data types in the DataFrame:
    heart.dtypes
  4. Find out the number of rows and columns in the DataFrame:
    heart.shape
  5. As in the Logistic Regression recipe, we convert all the non-numeric values to numeric values, fill in missing values with 0, and view the results:
    t2 = pd.Series({'asymptomatic' : 1,
                    'nonanginal' : 2,
                    'nontypical' : 3,
                    'typical': 4})
    heart['ChestPain'] = heart['ChestPain'].map(t2)
    t = pd.Series({'fixed' : 1,
                   'normal' : 2,
                   'reversible' : 3})
    heart['Thal'] = heart['Thal'].map(t)
    t = pd.Series({'No' : 0,
                   'Yes' : 1})
    heart['AHD'] = heart['AHD'].map(t)
    heart.fillna(0, inplace=True)
    heart.head()
  6. Import the random forest classifier from scikit-learn:
    from sklearn.ensemble import RandomForestClassifier
  7. Create an instance of a random forest model:
    rfClassifier = RandomForestClassifier(n_estimators=100)
  8. Fit the training data to the AHD labels, and create the decision trees (X_train and y_train are the training split created in the Logistic Regression recipe):
    rfClassifier = rfClassifier.fit(X_train, y_train)
  9. Take the same decision trees, and run them on the test data:
    predicted = rfClassifier.predict(X_test)
    predicted
  10. Estimate the accuracy of the model using 12-fold cross-validation (heart_data and heart_targets are the feature matrix and label vector from the Logistic Regression recipe):
    from sklearn.model_selection import cross_val_score
    scores = cross_val_score(rfClassifier, heart_data, heart_targets, cv=12)
    scores
  11. Show the mean accuracy score and the standard deviation:
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
  12. Assess the model using the accuracy_score:
    from sklearn import metrics
    metrics.accuracy_score(y_test, predicted)
  13. Show the confusion matrix:
    metrics.confusion_matrix(y_test, predicted)

How it works…

The first thing we need to do is import all the Python libraries that we'll need. The last line of code—%matplotlib inline—is required only if you are running the code in IPython Notebook:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Next we define a variable for the full path to our data file. It's recommended to do this so that if the location of your data file changes, you have to update only one line of code:

data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/ISL/Heart.csv'

Once you have the data file variable, use the read_csv() function provided by Pandas to create a DataFrame from the CSV file:

heart = pd.read_csv(data_file,
                    sep=',',
                    header=0,
                    index_col=0,
                    skip_blank_lines=True,
                    low_memory=False
                    )

If using IPython Notebook, use the head() function to view the top five rows of the DataFrame, so you can check that the data was imported correctly:

heart.head()

After we verify that the data is imported, we clean the data by replacing all the text values with numeric values and filling any empty values with 0:

t2 = pd.Series({'asymptomatic' : 1,
                'nonanginal' : 2,
                'nontypical' : 3,
                'typical': 4})
heart['ChestPain'] = heart['ChestPain'].map(t2)
t = pd.Series({'fixed' : 1,
               'normal' : 2,
               'reversible' : 3})
heart['Thal'] = heart['Thal'].map(t)
t = pd.Series({'No' : 0,
               'Yes' : 1})
heart['AHD'] = heart['AHD'].map(t)
heart.fillna(0, inplace=True)
heart.head()
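The fitting and evaluation steps that follow reuse the X_train, X_test, y_train, and y_test variables (and heart_data / heart_targets) from the Logistic Regression recipe. If you are running this recipe on its own, a minimal sketch of creating them might look like the following; the toy rows here are placeholders standing in for the real rows from Heart.csv:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned heart DataFrame (the real rows come from Heart.csv)
heart = pd.DataFrame({'Age':       [63, 67, 67, 37, 41, 56],
                      'ChestPain': [4, 1, 1, 2, 3, 3],
                      'AHD':       [0, 1, 1, 0, 0, 0]})

heart_data = heart.drop('AHD', axis=1)    # feature matrix: every column except the label
heart_targets = heart['AHD']              # label vector: heart disease yes/no

# Hold out a third of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    heart_data, heart_targets, test_size=0.33, random_state=42)
```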

Next we import the RandomForestClassifier class from scikit-learn's ensemble module:

from sklearn.ensemble import RandomForestClassifier

After that, we create an instance of the RandomForestClassifier. We keep all the default arguments except n_estimators, which sets the number of trees in the forest:

rfClassifier = RandomForestClassifier(n_estimators=100)

Next we fit the training data to the AHD labels, and create the decision trees:

rfClassifier = rfClassifier.fit(X_train, y_train)
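A fitted random forest also exposes a feature_importances_ array, which is handy for seeing which columns drive the predictions. A minimal sketch on synthetic data (the real call would use the fitted rfClassifier and the heart columns):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                 # three toy features
y = (X[:, 0] > 0.5).astype(int)      # only the first feature determines the label

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance score per feature; the scores sum to 1,
# and the first feature should dominate
print(clf.feature_importances_)
```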

After that, we use the classifier on the test data to create the predictions, and then display them:

predicted = rfClassifier.predict(X_test)
predicted

As in the previous recipe, we estimate the accuracy of the model using 12-fold cross-validation, and print out the mean accuracy score and standard deviation. heart_data and heart_targets are the feature matrix and label vector from the Logistic Regression recipe:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(rfClassifier, heart_data, heart_targets, cv=12)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
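The same cross-validation pattern can be run end to end on a self-contained example; here scikit-learn's bundled iris dataset stands in for the heart features and labels:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# cv=12 splits the data into 12 folds, training on 11 and scoring on the 12th
scores = cross_val_score(clf, iris.data, iris.target, cv=12)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```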

Next we display the accuracy score for the model when compared to the outcomes in the test data:

from sklearn import metrics

metrics.accuracy_score(y_test, predicted)

Lastly, we create the confusion matrix, comparing the test data against the predictions created by the model:

metrics.confusion_matrix(y_test, predicted)
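To read the matrix: rows are the true classes and columns are the predicted classes, so the diagonal holds the correct predictions. A small worked example:

```python
from sklearn import metrics

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# Row 0: of the true 0s, 2 were predicted 0 and 1 was predicted 1
# Row 1: of the true 1s, 1 was predicted 0 and 2 were predicted 1
print(metrics.confusion_matrix(y_true, y_pred))
print(metrics.accuracy_score(y_true, y_pred))   # 4 correct out of 6
```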

If you ran the Logistic Regression recipe, you'll notice that the accuracy score for the random forest is even lower than that of the logistic regression model.
