Understanding and implementing random forests

Random forest is a predictive algorithm that belongs to the family of ensemble learning algorithms. An ensemble learning algorithm combines several independent models (similar or different) to solve a particular prediction problem. The final result is calculated from the results of all these independent models and is typically better than the result of any single one of them.

There are two kinds of ensemble algorithms, as follows:

  • Averaging methods: Several similar independent models are created (in the case of decision trees, this can mean trees with different depths, trees involving certain variables and not others, and so on), and the final prediction is the average of the predictions of all the models.
  • Boosting methods: The goal here is to reduce the bias of the combined estimator by building it sequentially from the base estimators. A powerful model is created from several weak models.

Random forest, as the name implies, is a collection of classification or regression trees. The random forest algorithm creates trees at random and then averages their predictions (random forest is an averaging method of ensemble learning).

Random forest is an easy-to-use algorithm for both classification and regression problems and doesn't come with all the prerequisites that other algorithms have. Random forest is sometimes called the Leatherman of algorithms because it can be used to model almost any kind of dataset and still get a decent result.

Random forest doesn't need a separate cross-validation step. Instead, it uses something called bagging. Suppose we have n observations in our training dataset T and m variables in the dataset, and we decide to grow S trees in our forest. Each tree is grown from its own training dataset, so there are S training datasets. Each of these is created by randomly sampling n observations with replacement (n times). Hence, a dataset can contain duplicate observations, and some observations may not appear in a given dataset at all. These datasets are called bootstrap samples or simply bags. The observations that are not part of a bag are the out-of-bag (OOB) observations for that bag or sample.
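To make the idea of a bag concrete, here is a minimal sketch (not the book's code) of drawing one bootstrap sample and finding its out-of-bag observations with NumPy; the variable names are illustrative, and n is set to the 506 observations used later in this chapter:

import numpy as np

n = 506                                         # number of observations in the training data
rng = np.random.default_rng(42)

bag_idx = rng.integers(0, n, size=n)            # one bootstrap sample: n row indices drawn with replacement
oob_idx = np.setdiff1d(np.arange(n), bag_idx)   # rows never drawn into this bag are out of the bag

len(np.unique(bag_idx))                         # roughly 63% of the rows appear in the bag
len(oob_idx)                                    # the remaining ~37% are the OOB observations for this bag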

The random forest algorithm

The following is a stepwise algorithm for a random forest:

  1. Take a random sample of size n with replacement from the observations.
  2. Take a random sample of the predictor variables without replacement.
  3. Construct a regression tree using the predictors chosen in step 2. Let it grow as much as it can; do not prune the tree.
  4. Pass the out-of-bag observations for this bootstrap sample through the current tree. Store the value or class assigned to each observation by this process.
  5. Repeat steps 1 to 4 the specified number of times (this is the number of trees one wants in the forest).
  6. The final predicted value for an observation is the average of the predicted values for that observation over all the trees. In the case of a classifier, the final class is decided by a majority vote; that is, the class predicted by the maximum number of trees becomes the final prediction for that observation.
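To make the steps concrete, here is a rough sketch of the whole procedure using scikit-learn's DecisionTreeRegressor as the base learner. It is a simplified illustration of the steps as listed (for instance, it samples a fixed set of predictors per tree, whereas scikit-learn's own implementation samples predictors at every split), and the function name and defaults are invented for this example:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forest_predict_oob(X, y, n_trees=10, n_predictors=5, seed=0):
    """Rough sketch of steps 1-6: bagging, per-tree predictor sampling, OOB averaging."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n, m = X.shape
    rng = np.random.default_rng(seed)
    oob_sum = np.zeros(n)      # running sum of OOB predictions per observation
    oob_count = np.zeros(n)    # number of trees for which each observation was OOB

    for _ in range(n_trees):
        bag = rng.integers(0, n, size=n)                 # step 1: sample n rows with replacement
        cols = rng.choice(m, size=min(n_predictors, m),  # step 2: sample predictors without replacement
                          replace=False)
        tree = DecisionTreeRegressor()                   # step 3: grow fully, no pruning
        tree.fit(X[bag][:, cols], y[bag])
        oob = np.setdiff1d(np.arange(n), bag)            # step 4: out-of-bag observations for this bag
        if len(oob) > 0:
            oob_sum[oob] += tree.predict(X[oob][:, cols])
            oob_count[oob] += 1                          # step 5: repeat for n_trees trees

    return oob_sum / np.maximum(oob_count, 1)            # step 6: average the per-tree predictions

Calling forest_predict_oob(X, Y) on the same data would return out-of-bag predictions analogous to the oob_prediction_ attribute used in the next section.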

Implementing a random forest using Python

Let us fit a random forest on the same dataset and see whether there is some improvement in the error rate of the prediction:

from sklearn.ensemble import RandomForestRegressor

# Fit a random forest with 10 trees, keeping the out-of-bag predictions
rf = RandomForestRegressor(n_jobs=2, oob_score=True, n_estimators=10)
rf.fit(X, Y)

Each parameter of RandomForestRegressor has its own significance. The n_jobs parameter controls the parallelization of the computation and signifies the number of jobs running in parallel for both fit and predict. The oob_score parameter is a Boolean; setting it to True means that out-of-bag samples are used to estimate the generalization error of the model (and the out-of-bag predictions are stored). The n_estimators parameter specifies the number of trees our random forest will have. It has been chosen to be 10 just for illustrative purposes. One can try a higher number and see whether it improves the error rate or not.
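For example, one way to experiment with a larger forest is shown below; the name rf_big and the choice of 100 trees are just illustrative:

# A possible experiment (not from the book): refit with more trees and compare the OOB score
rf_big = RandomForestRegressor(n_jobs=2, oob_score=True, n_estimators=100)
rf_big.fit(X, Y)
rf_big.oob_score_   # typically improves or stabilizes as more trees are added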

The predicted values can be obtained using the oob_prediction_ attribute of the random forest:

rf.oob_prediction_

Let us now make the predictions a part of the data frame and have a look at it:

data['rf_pred'] = rf.oob_prediction_
cols = ['rf_pred', 'medv']
data[cols].head()

The output of the preceding code snippet looks as follows:


Fig. 8.16: Comparing the actual and predicted values of the target variable

The next step is to calculate the mean squared error of the prediction. For the regression tree, we specified the cross-validation scoring method to be the mean squared error; hence, we were able to obtain the mean squared error for the regression tree from the cross-validation score. In the case of random forest, as we noted earlier, cross-validation is not needed. So, to calculate the mean squared error, we can use the OOB predicted values and the actual values, as follows:

data['rf_pred'] = rf.oob_prediction_
data['err'] = (data['rf_pred'] - data['medv'])**2   # squared error for each observation
sum(data['err'])/506                                # divide by the 506 observations to get the MSE

The mean squared error comes out to be 16.823, which is less than the 20.10 obtained from the regression tree with cross-validation.
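The same quantity can also be computed without hardcoding the number of observations, for instance with scikit-learn's mean_squared_error (a minor variation on the snippet above):

from sklearn.metrics import mean_squared_error

# MSE between the actual target and the out-of-bag predictions
mean_squared_error(data['medv'], rf.oob_prediction_)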

Another attribute of the random forest regressor is oob_score_, which is similar to the coefficient of determination (R2) used in linear regression.

The oob_score_ for a random forest can be obtained by writing the following one-liner:

rf.oob_score_

The oob_score_ for this random forest comes out to be 0.83.
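Because oob_score_ is an R2 computed on the out-of-bag predictions, the same number can be reproduced as a quick sanity check with r2_score:

from sklearn.metrics import r2_score

# Should reproduce rf.oob_score_, since oob_score_ is an R2 on the OOB predictions
r2_score(data['medv'], rf.oob_prediction_)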

Why do random forests work?

Random forests do a better job of making predictions because they average the outputs of an ensemble of trees, which reduces the variance of the prediction. Also, taking a random sample of the predictors for each tree makes the trees less correlated with one another (as they do not necessarily use the same predictors, even when they are grown from similar datasets), and averaging weakly correlated trees reduces the variance further.
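The variance-reduction argument can be made precise with a standard result about averaging correlated estimators (this formula is not from the book). If each of the $B$ trees has prediction variance $\sigma^2$ and every pair of trees has correlation $\rho$, the variance of their average is

\[
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
\]

As $B$ grows, the second term vanishes, so the remaining variance is governed by the correlation $\rho$ between the trees; randomly sampling the predictors lowers $\rho$, which is exactly why it helps.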

Random forest is one of the algorithms in which all the variables of a dataset are used well. In most machine learning algorithms, we select a handful of variables that are the most important for an optimal prediction. In a random forest, however, because the variables are selected at random, and because the final outputs of a tree are calculated at local partitions where a variable that is not important globally can still be significant, each variable is utilized closer to its full potential. Thus, the entire dataset is used more thoroughly, which helps reduce the bias arising from dependence on only a few of the predictors.

Important parameters for random forests

The following are some of the important parameters for random forests that help in fine-tuning the results of the random forest models:

  • Node size: The trees in a random forest can have very few observations in their leaf nodes, unlike standalone decision or regression trees, because the trees in a random forest are allowed to grow without pruning. The goal is to reduce the bias as much as possible. The node size can be specified via the min_samples_leaf parameter of RandomForestRegressor.
  • Number of trees: The number of trees in a random forest is generally set to a fairly large number, around 500; it also depends on the number of observations and columns in the dataset. This can be specified via the n_estimators parameter of RandomForestRegressor.
  • Number of predictors sampled: This is an important tuning parameter that determines how independent, and hence how decorrelated, the trees are. Generally, it is kept small, often between 2 and 5 predictors. In scikit-learn, this corresponds to the max_features parameter of RandomForestRegressor. A short illustration of all three parameters follows this list.
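The following sketch shows how these three parameters map onto scikit-learn's RandomForestRegressor; the specific values are arbitrary illustrations, not recommendations:

rf_tuned = RandomForestRegressor(
    min_samples_leaf=5,    # node size: minimum number of observations in a leaf
    n_estimators=500,      # number of trees in the forest
    max_features=4,        # number of predictors sampled at each split
    oob_score=True,
    n_jobs=2,
)
rf_tuned.fit(X, Y)
rf_tuned.oob_score_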