A random forest is an ensemble (a group) of decision trees whose individual predictions are combined, by majority vote for classification, into a single prediction value.
For this recipe, we are going to use the Heart dataset from An Introduction to Statistical Learning with Applications in R.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/ISL/Heart.csv'
heart = pd.read_csv(data_file,
                    sep=',',
                    header=0,
                    index_col=0,
                    parse_dates=True,
                    tupleize_cols=False,
                    error_bad_lines=False,
                    warn_bad_lines=True,
                    skip_blank_lines=True,
                    low_memory=False)
heart.head()
heart.dtypes
heart.shape
t2 = pd.Series({'asymptomatic': 1, 'nonanginal': 2, 'nontypical': 3, 'typical': 4})
heart['ChestPain'] = heart['ChestPain'].map(t2)
t = pd.Series({'fixed': 1, 'normal': 2, 'reversible': 3})
heart['Thal'] = heart['Thal'].map(t)
t = pd.Series({'No': 0, 'Yes': 1})
heart['AHD'] = heart['AHD'].map(t)
heart.fillna(0, inplace=True)
heart.head()
from sklearn.ensemble import RandomForestClassifier
rfClassifier = RandomForestClassifier(n_estimators = 100)
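The listing never shows how X_train, y_train, X_test, and y_test (or heart_data and heart_targets, used in the cross-validation step below) are built. A minimal sketch of that missing step, assuming AHD is the target column, the remaining columns are the features, and an assumed 75/25 split; the exact split the author used is not shown here:
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in scikit-learn >= 0.20

# Assumed feature matrix and target vector; the names match the cross-validation step below
heart_targets = heart['AHD']
heart_data = heart.drop('AHD', axis=1)

# Assumed 75/25 train/test split with a fixed seed for repeatability
X_train, X_test, y_train, y_test = train_test_split(heart_data,
                                                    heart_targets,
                                                    test_size=0.25,
                                                    random_state=1)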
Fit the training data to the AHD labels, and create the decision trees:
rfClassifier = rfClassifier.fit(X_train, y_train)
predicted = rfClassifier.predict(X_test)
predicted
from sklearn import cross_validation
scores = cross_validation.cross_val_score(rfClassifier, heart_data, heart_targets, cv=12)
scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Display the accuracy score for the model when compared to the outcomes in the test data:
from sklearn import metrics
metrics.accuracy_score(y_test, predicted)
metrics.confusion_matrix(y_test, predicted)
The first thing we need to do is import all the Python libraries that we'll need. The last line of code, %matplotlib inline, is required only if you are running the code in IPython Notebook:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
Next we define a variable for the full path to our data file. It's recommended to do this so that if the location of your data file changes, you have to update only one line of code:
data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/ISL/Heart.csv'
Once you have the data file variable, use the read_csv() function provided by Pandas to create a DataFrame from the CSV file:
heart = pd.read_csv(data_file,
                    sep=',',
                    header=0,
                    index_col=0,
                    parse_dates=True,
                    tupleize_cols=False,
                    error_bad_lines=False,
                    warn_bad_lines=True,
                    skip_blank_lines=True,
                    low_memory=False)
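Several of these read_csv arguments (tupleize_cols, error_bad_lines, warn_bad_lines) come from the pandas version the book was written against and have since been deprecated or removed. On a current pandas release, a minimal equivalent call (an assumption, not the author's original code) would be:
heart = pd.read_csv(data_file, header=0, index_col=0)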
If using IPython Notebook, use the head() function to view the top five rows of the DataFrame. This lets you verify that the data was imported correctly:
heart.head()
After we verify that the data is imported, we clean the data by replacing all the text values with numeric values, and filling any empty values with 0.
t2 = pd.Series({'asymptomatic': 1, 'nonanginal': 2, 'nontypical': 3, 'typical': 4})
heart['ChestPain'] = heart['ChestPain'].map(t2)
t = pd.Series({'fixed': 1, 'normal': 2, 'reversible': 3})
heart['Thal'] = heart['Thal'].map(t)
t = pd.Series({'No': 0, 'Yes': 1})
heart['AHD'] = heart['AHD'].map(t)
heart.fillna(0, inplace=True)
heart.head()
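Note that map() returns NaN for any value not found in the mapping Series, which is why the fillna(0) step matters. A quick sanity check (a hypothetical addition, not part of the original recipe) to confirm the converted columns are now numeric:
heart[['ChestPain', 'Thal', 'AHD']].dtypes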
Next we import the RandomForestClassifier class from scikit-learn's ensemble module so that we can create a new instance of it:
from sklearn.ensemble import RandomForestClassifier
After that, we create an instance of the RandomForestClassifier. We will use all the default arguments; however, we will specify n_estimators, which is the number of trees in the forest:
rfClassifier = RandomForestClassifier(n_estimators = 100)
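n_estimators is the only argument the recipe changes. If you want repeatable results or faster training, RandomForestClassifier also accepts random_state and n_jobs; a hypothetical variant, not part of the original recipe:
rfClassifier = RandomForestClassifier(n_estimators=100,
                                      random_state=1,  # fix the seed so repeated runs give the same forest
                                      n_jobs=-1)       # build trees on all available CPU cores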
Next we fit the training data to the AHD labels, and create the decision trees:
rfClassifier = rfClassifier.fit(X_train, y_train)
After that, we use the classifier on the test data to create the predictions, and then display them:
predicted = rfClassifier.predict(X_test)
predicted
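predict() returns the hard class label for each test row. A random forest can also report the fraction of trees voting for each class through predict_proba(); a hypothetical peek at the first few test rows, not part of the original recipe:
rfClassifier.predict_proba(X_test)[:5]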
As in the previous recipe, we estimate the accuracy of the model using 12-fold cross-validation, and print out the mean accuracy score and standard deviation:
from sklearn import cross_validation
scores = cross_validation.cross_val_score(rfClassifier, heart_data, heart_targets, cv=12)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
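The cross_validation module was removed from scikit-learn in version 0.20; on a current install the same function lives in model_selection. An equivalent snippet (an update to the book's code, not the author's original):
from sklearn.model_selection import cross_val_score
scores = cross_val_score(rfClassifier, heart_data, heart_targets, cv=12)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))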
Next we display the accuracy score for the model when compared to the outcomes in the test data:
from sklearn import metrics
metrics.accuracy_score(y_test, predicted)
Lastly, we create the confusion matrix, comparing the test data against the predictions created by the model. In scikit-learn's layout, the rows are the actual classes and the columns are the predicted classes, so the counts on the diagonal are correct predictions:
metrics.confusion_matrix(y_test, predicted)
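Because AHD was mapped to 0 (No) and 1 (Yes) earlier, the raw matrix can be labeled for readability; a hypothetical wrapper using pandas, not part of the original recipe:
cm = metrics.confusion_matrix(y_test, predicted)
pd.DataFrame(cm,
             index=['actual: no AHD', 'actual: AHD'],
             columns=['predicted: no AHD', 'predicted: AHD'])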
If you ran the Logistic Regression recipe, you'll notice that the accuracy score for the random forest is even lower than that of the logistic regression model.