A random forest is an ensemble (a group) of decision trees whose individual predictions are combined, by majority vote for classification, into a single prediction value.
For this recipe, we are going to use the Heart dataset from An Introduction to Statistical Learning with Applications in R.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/ISL/Heart.csv'
heart = pd.read_csv(data_file,
                    sep=',',
                    header=0,
                    index_col=0,
                    parse_dates=True,
                    tupleize_cols=False,
                    error_bad_lines=False,
                    warn_bad_lines=True,
                    skip_blank_lines=True,
                    low_memory=False)
heart.head()
heart.dtypes
heart.shape
t2 = pd.Series({'asymptomatic': 1, 'nonanginal': 2, 'nontypical': 3, 'typical': 4})
heart['ChestPain'] = heart['ChestPain'].map(t2)
t = pd.Series({'fixed': 1, 'normal': 2, 'reversible': 3})
heart['Thal'] = heart['Thal'].map(t)
t = pd.Series({'No': 0, 'Yes': 1})
heart['AHD'] = heart['AHD'].map(t)
heart.fillna(0, inplace=True)
heart.head()
from sklearn.ensemble import RandomForestClassifier
rfClassifier = RandomForestClassifier(n_estimators = 100)
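The listing never shows how X_train, y_train, X_test, and y_test (or heart_data and heart_targets, used in the cross-validation step below) are built. A minimal sketch of that missing step, assuming AHD is the target column, the remaining columns are the features, and an assumed 75/25 split; the exact split the author used is not shown here:
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in scikit-learn >= 0.20

# Assumed feature matrix and target vector; the names match the cross-validation step below
heart_targets = heart['AHD']
heart_data = heart.drop('AHD', axis=1)

# Assumed 75/25 train/test split with a fixed seed for repeatability
X_train, X_test, y_train, y_test = train_test_split(heart_data,
                                                    heart_targets,
                                                    test_size=0.25,
                                                    random_state=1)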
Fit the training data to the AHD labels, and create the decision trees:
rfClassifier = rfClassifier.fit(X_train, y_train)
predicted = rfClassifier.predict(X_test)
predicted
from sklearn import cross_validation
scores = cross_validation.cross_val_score(rfClassifier, heart_data, heart_targets, cv=12)
scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Display the accuracy score for the model when compared to the outcomes in the test data:
from sklearn import metrics
metrics.accuracy_score(y_test, predicted)
metrics.confusion_matrix(y_test, predicted)
The first thing we need to do is import all the Python libraries that we'll need. The last line of code, %matplotlib inline, is required only if you are running the code in IPython Notebook:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
Next we define a variable for the full path to our data file. It's recommended to do this so that if the location of your data file changes, you have to update only one line of code:
data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/ISL/Heart.csv'
Once you have the data file variable, use the read_csv() function provided by Pandas to create a DataFrame from the CSV file:
heart = pd.read_csv(data_file,
                    sep=',',
                    header=0,
                    index_col=0,
                    parse_dates=True,
                    tupleize_cols=False,
                    error_bad_lines=False,
                    warn_bad_lines=True,
                    skip_blank_lines=True,
                    low_memory=False)
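Several of these read_csv arguments (tupleize_cols, error_bad_lines, warn_bad_lines) come from the pandas version the book was written against and have since been deprecated or removed. On a current pandas release, a minimal equivalent call (an assumption, not the author's original code) would be:
heart = pd.read_csv(data_file, header=0, index_col=0)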
If using IPython Notebook, use the head() function to view the top five rows of the DataFrame. This lets you verify that the data was imported correctly:
heart.head()
After we verify that the data is imported, we clean the data by replacing all the text values with numeric values, and filling any empty values with 0.
t2 = pd.Series({'asymptomatic': 1, 'nonanginal': 2, 'nontypical': 3, 'typical': 4})
heart['ChestPain'] = heart['ChestPain'].map(t2)
t = pd.Series({'fixed': 1, 'normal': 2, 'reversible': 3})
heart['Thal'] = heart['Thal'].map(t)
t = pd.Series({'No': 0, 'Yes': 1})
heart['AHD'] = heart['AHD'].map(t)
heart.fillna(0, inplace=True)
heart.head()
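Note that map() returns NaN for any value not found in the mapping Series, which is why the fillna(0) step matters. A quick sanity check (a hypothetical addition, not part of the original recipe) to confirm the converted columns are now numeric:
heart[['ChestPain', 'Thal', 'AHD']].dtypes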
Next we import the RandomForestClassifier class from scikit-learn's ensemble module so that we can create a new instance of it:
from sklearn.ensemble import RandomForestClassifier
After that, we create an instance of the RandomForestClassifier. We will use all the default arguments; however, we will specify n_estimators, which is the number of trees in the forest:
rfClassifier = RandomForestClassifier(n_estimators = 100)
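n_estimators is the only argument the recipe changes. If you want repeatable results or faster training, RandomForestClassifier also accepts random_state and n_jobs; a hypothetical variant, not part of the original recipe:
rfClassifier = RandomForestClassifier(n_estimators=100,
                                      random_state=1,  # fix the seed so repeated runs give the same forest
                                      n_jobs=-1)       # build trees on all available CPU cores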
Next we fit the training data to the AHD labels, and create the decision trees:
rfClassifier = rfClassifier.fit(X_train, y_train)
After that, we use the classifier on the test data to create the predictions, and then display them:
predicted = rfClassifier.predict(X_test)
predicted
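predict() returns the hard class label for each test row. A random forest can also report the fraction of trees voting for each class through predict_proba(); a hypothetical peek at the first few test rows, not part of the original recipe:
rfClassifier.predict_proba(X_test)[:5]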
As in the previous recipe, we estimate the accuracy of the model using 12-fold cross-validation, and print out the mean accuracy score and standard deviation:
from sklearn import cross_validation
scores = cross_validation.cross_val_score(rfClassifier, heart_data, heart_targets, cv=12)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
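The cross_validation module was removed from scikit-learn in version 0.20; on a current install the same function lives in model_selection. An equivalent snippet (an update to the book's code, not the author's original):
from sklearn.model_selection import cross_val_score
scores = cross_val_score(rfClassifier, heart_data, heart_targets, cv=12)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))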
Next we display the accuracy score for the model when compared to the outcomes in the test data:
from sklearn import metrics
metrics.accuracy_score(y_test, predicted)
Lastly, we create the confusion matrix, comparing the test data against the predictions created by the model. In scikit-learn's layout, the rows are the actual classes and the columns are the predicted classes, so the counts on the diagonal are correct predictions:
metrics.confusion_matrix(y_test, predicted)
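Because AHD was mapped to 0 (No) and 1 (Yes) earlier, the raw matrix can be labeled for readability; a hypothetical wrapper using pandas, not part of the original recipe:
cm = metrics.confusion_matrix(y_test, predicted)
pd.DataFrame(cm,
             index=['actual: no AHD', 'actual: AHD'],
             columns=['predicted: no AHD', 'predicted: AHD'])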
If you ran the Logistic Regression recipe, you'll notice that the accuracy score for the random forest is even lower than that of the logistic regression model.