Using many Decision Trees – random forests

In this recipe, we'll use random forests for classification tasks. Random forests are used because they're very robust to overfitting and perform well in a wide variety of situations.

Getting ready

We'll explore this more in the How it works... section of this recipe, but in short, random forests work by constructing many Decision Trees, each trained on a random subset of the data, and then taking a vote of the class that each tree predicts. This idea is very powerful in machine learning. If we recognize that a single trained classifier might only be 60 percent accurate, we can train many classifiers that are each right more often than not and then combine their votes into a stronger learner.
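
To get a feel for why voting helps, here is a minimal sketch (not part of the recipe's scikit-learn code) that simulates a set of hypothetical, independent classifiers that are each only 60 percent accurate and checks how often their majority vote is right:

>>> import numpy as np
>>> np.random.seed(0)
>>> n_samples, n_learners = 1000, 101
>>> # each column is one weak learner; an entry is True when that learner is correct
>>> correct = np.random.rand(n_samples, n_learners) < 0.6
>>> correct[:, 0].mean()                              # accuracy of a single learner, about 0.6
>>> (correct.sum(axis=1) > n_learners / 2).mean()     # accuracy of the majority vote

The majority vote is right far more often than any single learner. Real trees are correlated with each other, so the improvement in practice is smaller than in this idealized simulation, but the intuition carries over.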

How to do it…

The mechanics of training a random forest classifier are very simple with scikit-learn. In this section, we'll do the following:

  1. Create a sample dataset to practice with.
  2. Train a basic random forest object.
  3. Take a look at some of the attributes of a trained object.

In the next recipe, we'll look at how to tune the random forest classifier. Let's start by importing datasets:

>>> from sklearn import datasets

Then, create the dataset with 1,000 samples:

>>> X, y = datasets.make_classification(1000)

Now that we have the data, we can create a classifier object and train it:

>>> from sklearn.ensemble import RandomForestClassifier

>>> rf = RandomForestClassifier()

>>> rf.fit(X, y)

The first thing we want to do is see how well we fit the training data. We can use the predict method for these predictions:

>>> print "Accuracy:	", (y == rf.predict(X)).mean()
Accuracy:   0.993

>>> print "Total Correct:	", (y == rf.predict(X)).sum()
Total Correct:   993

Now, let's look at some attributes and methods.

First, we'll look at some of the useful attributes; in this case, since we used the defaults, they'll be the object defaults (a short snippet after this list shows how to set a few of them explicitly):

  • rf.criterion: This is the criterion for how the splits are determined. The default is gini.
  • rf.bootstrap: A Boolean that indicates whether we used bootstrap samples when training the random forest.
  • rf.n_jobs: The number of jobs used to train and predict. If you want to use all of the processors, set this to -1. Keep in mind that if your dataset isn't very big, using multiple jobs often adds more overhead than it saves because the data has to be serialized and moved between processes.
  • rf.max_features: This denotes the number of features to consider when making the best split. This will come in handy during the tuning process.
  • rf.compute_importances: This helps us decide whether to compute the importance of the features. See the There's more... section of this recipe for information on how to use this.
  • rf.max_depth: This denotes how deep each tree can go.
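
As a quick illustration, these options can also be passed directly to the constructor instead of being left at their defaults. The particular values below are placeholders to show the syntax, not tuning recommendations (tuning is the subject of the next recipe):

>>> rf_tuned = RandomForestClassifier(n_estimators=50, criterion='entropy',
                                      max_features='sqrt', max_depth=5,
                                      bootstrap=True, n_jobs=-1)
>>> rf_tuned.fit(X, y)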

There are more attributes to note; check out the official documentation for more details.

The predict method isn't the only useful one. We can also get the probabilities of each class for individual samples. This can be useful for understanding the uncertainty in each prediction. For instance, we can predict the probability of each sample belonging to the various classes:

>>> probs = rf.predict_proba(X)

>>> import pandas as pd

>>> probs_df = pd.DataFrame(probs, columns=['0', '1'])
>>> probs_df['was_correct'] = rf.predict(X) == y

>>> import matplotlib.pyplot as plt

>>> f, ax = plt.subplots(figsize=(7, 5))

>>> probs_df.groupby('0').was_correct.mean().plot(kind='bar', ax=ax)
>>> ax.set_title("Accuracy at 0 class probability")
>>> ax.set_ylabel("% Correct")
>>> ax.set_xlabel("% trees for 0")

The following is the output:

[Figure: a bar plot titled "Accuracy at 0 class probability", showing % Correct against % trees for 0]

How it works…

Random forest works by using a predetermined number of weak Decision Trees and by training each one of these trees on a bootstrap sample of the data; this is what the bootstrap parameter controls, and it is critical in avoiding overfitting. Each tree makes its own prediction, and the forest's final output is the following:

  • The class with the most votes, if we use classification trees
  • The average of the trees' outputs, if we use regression trees

There are, of course, performance considerations, which we'll cover in the next recipe, but for the purposes of understanding how random forests work, we train a bunch of average trees and get a fairly good classifier as a result.
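
To make the voting concrete, the fitted forest exposes its individual trees through the estimators_ attribute, so we can hand-roll a vote and compare it with the forest's own predictions. This is only an illustrative sketch; recent versions of scikit-learn actually average the trees' predicted class probabilities rather than counting hard votes, but the two usually agree:

>>> import numpy as np
>>> tree_preds = np.array([tree.predict(X) for tree in rf.estimators_])
>>> # majority vote across trees; ties go to class 0 in this simple sketch
>>> vote_preds = (tree_preds.mean(axis=0) > 0.5).astype(int)
>>> (vote_preds == rf.predict(X)).mean()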

There's more…

Feature importance is a good by-product of random forests. It often helps to answer the question: if we have 10 features, which of them are most important in determining the true class of a data point? The real-world applications are hopefully easy to see. For example, when trying to detect fraudulent transactions, we probably want to know whether there are certain signals that can be used to figure out a transaction's class more quickly.

In older versions of scikit-learn, we had to ask for the feature importance when we created the object by passing compute_importances=True. In scikit-learn 0.15 this only produces a warning that it is not required, and in later versions the argument is removed entirely; the importances are always computed and exposed through the feature_importances_ attribute of a fitted forest:

>>> rf = RandomForestClassifier()
>>> rf.fit(X, y)
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> ax.bar(range(len(rf.feature_importances_)), 
           rf.feature_importances_)
>>> ax.set_title("Feature Importances")

The following is the output:

[Figure: a bar plot titled "Feature Importances", with one bar per feature]

As we can see, certain features are much more important than others when determining if the outcome was of class 0 or class 1.
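
If we want a ranking rather than a picture, one quick way (shown here just as an illustration) is to sort the importances; the indices are simply the column positions in the X array that make_classification generated:

>>> import numpy as np
>>> order = np.argsort(rf.feature_importances_)[::-1]
>>> # the first few entries are the indices of the most important features
>>> order[:5]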
