Random forest is a predictive algorithm that belongs to the family of ensemble learning algorithms. An ensemble learning algorithm combines several independent models (of the same or different types) to solve a particular prediction problem. The final result is calculated from the results of all these independent models and is typically better than the result of any single one of them.
There are two kinds of ensemble algorithms, as follows:
- Averaging methods: Several independent models are built and their predictions are averaged (or decided by majority vote). Bagging and random forests fall into this category.
- Boosting methods: Models are built sequentially, and each new model tries to correct the errors of the models built before it. AdaBoost and gradient boosting fall into this category.
Random forest, as the name implies, is a collection of classification or regression trees. A random forest algorithm grows each tree on a random sample of the data and then averages the predictions of these trees (random forest is an averaging method of ensemble learning).
Random forest is an easy-to-use algorithm for both classification and regression problems and doesn't come with all the prerequisites that other algorithms have. Random forest is sometimes called the Leatherman of algorithms because one can use it to model almost any kind of dataset and get a decent result.
Random forest doesn't need cross-validation. Instead, it uses something called bagging. Suppose we have n observations in our training dataset T and m variables in the dataset, and we decide to grow S trees in our forest. Each tree is grown from a separate training dataset, so there are S training datasets in all. Each training dataset is created by randomly sampling n observations with replacement (n times). A dataset created this way can therefore contain duplicate observations, and some observations from T might be left out of it altogether. These datasets are called bootstrap samples or simply bags. For a given bag, the observations that are not part of it are called the out-of-bag observations for that bag or sample.
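As a quick illustration of how a bag and its out-of-bag observations relate, the following short sketch (the variable names are our own) draws one bootstrap sample with NumPy:

import numpy as np

rng = np.random.RandomState(0)
n = 10                                  # observations in the training dataset
bag = rng.randint(0, n, size=n)         # sample n indices with replacement
oob = np.setdiff1d(np.arange(n), bag)   # observations that never made it into the bag

print("bag:", bag)   # duplicates are possible
print("oob:", oob)   # on average, about a third of the observations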
The following is a stepwise algorithm for a random forest (a minimal sketch of these steps appears right after the list):
1. Draw a bootstrap sample of n observations (with replacement) from the training dataset.
2. Grow a tree on this sample. At each node, pick a random subset of the m variables, find the best split among only those variables, and split the node accordingly. Grow the tree until a stopping criterion (such as a minimum leaf size) is reached.
3. Repeat steps 1 and 2 until S trees have been grown.
4. To predict for a new observation, pass it through all S trees and average the S predictions (for regression) or take a majority vote (for classification).
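To make these steps concrete, here is a minimal sketch built on scikit-learn's DecisionTreeRegressor. It is a simplification for illustration only: the function names are our own, it assumes X and y are NumPy arrays, and the real RandomForestRegressor performs the per-split variable sampling internally via its max_features parameter.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, n_trees=10, seed=0):
    # Steps 1-3: grow each tree on its own bootstrap sample
    rng = np.random.RandomState(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.randint(0, n, size=n)                    # bootstrap sample of n rows
        tree = DecisionTreeRegressor(max_features='sqrt')  # random subset of variables per split
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    # Step 4: average the predictions of all trees (regression)
    return np.mean([t.predict(X) for t in trees], axis=0)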
Let us fit a random forest on the same dataset and see whether the prediction error improves:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_jobs=2, oob_score=True, n_estimators=10)
rf.fit(X, Y)
The parameters passed to RandomForestRegressor each have their significance. The n_jobs parameter controls the parallelization of the computation and signifies the number of jobs running in parallel for both fit and predict. The oob_score parameter is a binary flag; setting it to True means that the model will use out-of-bag sampling to score its predictions. The n_estimators parameter specifies the number of trees our random forest will have. It has been chosen to be 10 just for illustrative purposes; one can try a higher number and see whether it improves the error rate, as in the sketch below.
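For example, one could compare out-of-bag scores for a few values of n_estimators with a loop like the following (an illustrative sketch that reuses X and Y from the fit above):

for n_trees in [10, 50, 100, 500]:
    rf_n = RandomForestRegressor(n_jobs=2, oob_score=True, n_estimators=n_trees)
    rf_n.fit(X, Y)
    print(n_trees, round(rf_n.oob_score_, 3))   # a higher OOB score indicates a better fit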
The predicted values can be obtained using the oob_prediction_ attribute of the fitted random forest:

rf.oob_prediction_

Let us now make the predictions a part of the data frame and have a look at it:

data['rf_pred'] = rf.oob_prediction_
cols = ['rf_pred', 'medv']
data[cols].head()
The output of the preceding code snippet shows the first five rows of the predicted (rf_pred) and actual (medv) values side by side.
The next step is to calculate the mean squared error of the prediction. For the regression tree, we specified the cross-validation scoring method to be mean squared error; hence, we were able to obtain a mean squared error for the regression tree from the cross-validation score. In the case of a random forest, as we noted earlier, cross-validation is not needed. So, to calculate the mean squared error, we can use the out-of-bag predicted values and the actual values as follows:
data['rf_pred'] = rf.oob_prediction_
data['err'] = (data['rf_pred'] - data['medv'])**2
sum(data['err'])/506    # 506 is the number of observations in the dataset
The mean squared error comes out to be 16.823, which is less than the 20.10 obtained from the regression tree with cross-validation.
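The same quantity can also be computed with scikit-learn's mean_squared_error helper, which avoids hard-coding the number of observations:

from sklearn.metrics import mean_squared_error

mean_squared_error(data['medv'], data['rf_pred'])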
Another attribute of the random forest regressor is oob_score_, which is similar to the coefficient of determination (R²) used in linear regression. The oob_score_ for a random forest can be obtained by writing the following one-liner:

rf.oob_score_

The oob_score_ for this random forest comes out at 0.83.
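Since oob_score_ is an R²-style measure, it is related to the out-of-bag mean squared error by R² = 1 - MSE / Var(medv). Using the err column created above, the following one-liner should reproduce the score (up to rounding):

print(1 - data['err'].mean() / data['medv'].var(ddof=0))   # should match rf.oob_score_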
Random forests do a better job of making predictions because they average the outputs of an ensemble of trees, which reduces the variance of the prediction. Also, because each tree considers a random sample of the predictors, the trees are largely decorrelated from one another (they do not necessarily use the same predictors, even when grown on similar datasets), which makes the averaging more effective.
Random forest is one of the algorithms in which all the variables of a dataset are put to use. In most machine learning workflows, we select a handful of variables that appear most important for prediction and discard the rest. In a random forest, by contrast, variables are selected at random for each split, and the final outputs of a tree are calculated in local partitions of the data, where a variable that is unimportant globally might become significant. Each variable is thus utilized to its full potential, and the data as a whole is used more effectively. This helps reduce the bias that can arise from depending on only a few predictors.
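A convenient by-product of this broad use of the variables is that a fitted forest reports how much each predictor contributed to the splits, through the feature_importances_ attribute. For example:

import numpy as np

importances = rf.feature_importances_   # one score per predictor; the scores sum to 1
order = np.argsort(importances)[::-1]   # indices of predictors, most important first
print(order)
print(importances[order])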
The following are some of the important parameters that help in fine-tuning the results of random forest models (an illustrative tuning sketch follows the list):
- min_samples_leaf: A parameter of RandomForestRegressor that sets the minimum number of samples required at a leaf node; larger values produce smaller, more regularized trees.
- n_estimators: A parameter of RandomForestRegressor that sets the number of trees in the forest.
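As an illustrative sketch of such fine-tuning (the grid values are our own choices), one can compare out-of-bag scores over a small grid of these two parameters:

best = None
for n_trees in [50, 100, 200]:
    for leaf in [1, 3, 5]:
        rf_t = RandomForestRegressor(n_estimators=n_trees, min_samples_leaf=leaf,
                                     oob_score=True, n_jobs=2)
        rf_t.fit(X, Y)
        if best is None or rf_t.oob_score_ > best[0]:
            best = (rf_t.oob_score_, n_trees, leaf)
print("best (OOB score, n_estimators, min_samples_leaf):", best)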