Tuning a random forest model

In the previous recipe, we reviewed how to use the random forest classifier. In this recipe, we'll walk through how to tune its performance by tuning its parameters.

Getting ready

In order to tune a random forest model, we'll need to first create a dataset that's a little more difficult to predict. Then, we'll alter the parameters and do some preprocessing to fit the dataset better.

So, let's create the dataset first:

>>> from sklearn import datasets
>>> X, y = datasets.make_classification(n_samples=10000, 
                                        n_features=20, 
                                        n_informative=15, 
                                        flip_y=.5, weights=[.2, .8])

How to do it…

In this recipe, we will do the following:

  1. Create a training and test set. We won't just sail through this recipe like we did in the previous one; it's an empty exercise to tune a model without evaluating it against a held-out test set.
  2. Fit a baseline random forest to evaluate how well we do with a naive algorithm.
  3. Alter some parameters in a systematic way, and then observe what happens to the fit.

Ok, start an interpreter and import NumPy:

>>> import numpy as np
>>> training = np.random.choice([True, False], p=[.8, .2], 
                                size=y.shape)

>>> from sklearn.ensemble import RandomForestClassifier

>>> rf = RandomForestClassifier()
>>> rf.fit(X[training], y[training])

>>> preds = rf.predict(X[~training])

>>> print "Accuracy:\t", (preds == y[~training]).mean()
Accuracy: 0.652239557121

I'm going to cheat a little bit and introduce one of the model evaluation metrics we will talk about later in the book. Accuracy is a good first metric, but using a confusion matrix will help us understand what's going on.
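As a quick preview of what that looks like, here is confusion_matrix on a tiny, made-up pair of label arrays (rows are the true classes, columns are the predictions; the numbers are purely illustrative):

>>> from sklearn.metrics import confusion_matrix
>>> confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1])
array([[1, 1],
       [0, 2]])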

Let's iterate through the recommended choices for max_features and see what it does to the fit. We'll also iterate through a couple of floats, which are the fraction of the features that will be used. Use the following commands to do so:

>>> from sklearn.metrics import confusion_matrix

>>> max_feature_params = ['auto', 'sqrt', 'log2', .01, .5, .99]

>>> confusion_matrixes = {}

>>> for max_feature in max_feature_params:
       rf = RandomForestClassifier(max_features=max_feature)
       rf.fit(X[training], y[training])
       confusion_matrixes[max_feature] = confusion_matrix(y[~training],
                                         rf.predict(X[~training])).ravel()

Since I used the ravel method, our 2D confusion matrices are now 1D.
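For example, on the toy matrix shown earlier, ravel flattens the 2 x 2 array into four counts in row-major order (true negatives, false positives, false negatives, true positives):

>>> confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1]).ravel()
array([1, 1, 0, 2])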

Now, import pandas and look at the confusion matrix we just created:

>>> import pandas as pd

>>> confusion_df = pd.DataFrame(confusion_matrixes)

>>> import itertools
>>> from matplotlib import pyplot as plt
>>> f, ax = plt.subplots(figsize=(7, 5))

>>> confusion_df.plot(kind='bar', ax=ax)

>>> ax.legend(loc='best')

>>> ax.set_title("Guessed vs Correct (i, j) where i is the guess and j is the actual.")

>>> ax.grid()

>>> ax.set_xticklabels([str((i, j)) for i, j in 
                       list(itertools.product(range(2), range(2)))]);
>>> ax.set_xlabel("Guessed vs Correct")
>>> ax.set_ylabel("Correct")

The following is the output:

[Figure: bar chart of confusion matrix counts for each max_features choice]

While we didn't see any real difference in performance, this is a fairly simple process to go through for your own projects. Let's try it on the choice of n_estimators, but use raw accuracy this time; with more than a few options, a confusion-matrix plot would become very cloudy and difficult to read.

Since we're using the confusion matrix, we can get the accuracy from the trace of the confusion matrix divided by the overall sum:

>>> n_estimator_params = range(1, 20)

>>> confusion_matrixes = {}

>>> accuracy = lambda x: np.trace(x) / np.sum(x, dtype=float)

>>> for n_estimator in n_estimator_params:
       rf = RandomForestClassifier(n_estimators=n_estimator)
       rf.fit(X[training], y[training])
       confusion_matrixes[n_estimator] = confusion_matrix(y[~training], 
                                         rf.predict(X[~training]))
       # here's where we'll update the confusion matrix with the 
       # accuracy operation we talked about
       confusion_matrixes[n_estimator] = accuracy(confusion_matrixes[n_estimator])
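As a quick check that the trace-over-sum formula really is plain accuracy, here it is on a small made-up matrix (85 correct predictions out of 100):

>>> cm = np.array([[50, 10],
                   [ 5, 35]])
>>> np.trace(cm) / np.sum(cm, dtype=float)   # (50 + 35) / 100.0 = 0.85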

>>> accuracy_series = pd.Series(confusion_matrixes) 

>>> import itertools
>>> from matplotlib import pyplot as plt

>>> f, ax = plt.subplots(figsize=(7, 5))

>>> accuracy_series.plot(kind='bar', ax=ax, color='k', alpha=.75)
>>> ax.grid()

>>> ax.set_title("Accuracy by Number of Estimators")
>>> ax.set_ylim(0, 1) # we want the full scope
>>> ax.set_ylabel("Accuracy")
>>> ax.set_xlabel("Number of Estimators")

The following is the output:

[Figure: bar chart of accuracy by number of estimators]

Notice how accuracy improves for the most part. There is certainly some randomness in the individual values, but the trend is up and to the right. In the following How it works... section, we'll talk about the association between random forests and bootstrapping, and why more estimators are generally better.

How it works…

Bootstrapping is a nice technique for augmenting the other parts of modeling. The case often used to introduce bootstrapping is attaching a standard error to a median. Here, we estimate the outcome over and over on bootstrap samples and aggregate those estimates up to class probabilities.
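To make that concrete, here is a minimal bootstrap sketch for the standard error of a median; the data is synthetic and just for illustration:

>>> import numpy as np
>>> data = np.random.normal(size=1000)
>>> medians = [np.median(np.random.choice(data, size=len(data),
                                           replace=True))
               for _ in range(500)]
>>> np.std(medians)   # bootstrap estimate of the median's standard error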

So, by simply increasing the number of estimators, we increase the number of bootstrap subsamples being aggregated, which leads to faster overall convergence.
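Relatedly, if you want to watch this convergence without holding out a test set, the forest can score itself on the out-of-bag samples left out of each bootstrap draw. This is a side note rather than part of the recipe; oob_score is a standard RandomForestClassifier option:

>>> rf = RandomForestClassifier(n_estimators=50, oob_score=True)
>>> rf.fit(X[training], y[training])
>>> rf.oob_score_   # accuracy estimated from the out-of-bag samples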

There's more…

We might want to speed up the training process. I alluded to this process earlier, but we can set n_jobs to the number of trees we want to train at the same time. This should roughly be the number of cores on the machine:

>>> rf = RandomForestClassifier(n_jobs=4, verbose=True)
>>> rf.fit(X, y)
[Parallel(n_jobs=4)]: Done  1 out of  4 | elapsed:  0.3s remaining: 0.9s 
[Parallel(n_jobs=4)]: Done  4 out of  4 | elapsed:  0.3s finished

This will also predict in parallel (verbosely):

>>> rf.predict(X)
[Parallel(n_jobs=4)]: Done  1 out of  4 | elapsed:  0.0s remaining:    0.0s 
[Parallel(n_jobs=4)]: Done  4 out of  4 | elapsed:  0.0s finished 

array([1, 1, 0, ..., 1, 1, 1])