Random forest regression

In this section, we will focus on the prostate data again before moving on to the breast cancer and Pima Indian diabetes sets. We will use the randomForest package. The general syntax to create a random forest object is to call the randomForest() function and specify the formula and dataset as the two primary arguments. Recall that, for regression, the default number of variables sampled at each split is p/3, and for classification it is the square root of p, where p is the number of predictor variables in the data frame. For datasets with a larger p, you can tune the mtry parameter, which determines how many of the p predictors are sampled at each split. Because p is less than 10 in these examples, we will forgo this procedure. When you do want to optimize mtry for a larger p, you can utilize the caret package or the tuneRF() function in randomForest.
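If you do want to tune mtry on a dataset with more predictors, one option is tuneRF(), which searches over candidate mtry values using the out-of-bag error. The following is a minimal sketch only; the starting value, step factor, and improvement threshold are illustrative assumptions rather than values used in this chapter, and the output is not shown:

  > set.seed(123)
> tuneRF(x = pros.train[, names(pros.train) != "lpsa"],
         y = pros.train$lpsa,
         mtryStart = 2, ntreeTry = 100,
         stepFactor = 2, improve = 0.01)

With that noted, let's build our forest and examine the results, as follows: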

  > set.seed(123)
> rf.pros <- randomForest(lpsa ~ ., data = pros.train)
> rf.pros
Call:
randomForest(formula = lpsa ~ ., data = pros.train)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 2
Mean of squared residuals: 0.6792314
% Var explained: 52.73

Calling the rf.pros object shows us that the random forest generated 500 different trees (the default) and sampled two variables at each split. The result is an MSE of 0.68 and nearly 53 percent of the variance explained. Let's see if we can improve on the default number of trees. Too many trees can lead to overfitting; naturally, how many is too many depends on the data. Two things can help here: the first is a plot of rf.pros, and the other is to ask for the minimum MSE:

  > plot(rf.pros)

The output of the preceding command is as follows:

This plot shows the MSE by the number of trees in the model. You can see that, as trees are added, there is a significant improvement in MSE early on, and then the error flattens out just before 100 trees have been built in the forest.

We can identify the specific number of trees that minimizes the MSE with the which.min() function, as follows:

  > which.min(rf.pros$mse)
[1] 75
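
If you want to see where this minimum falls on the error curve, you can redraw the plot and add a vertical reference line at that number of trees; this is a small illustrative addition rather than part of the original output:

  > plot(rf.pros)
> abline(v = which.min(rf.pros$mse), col = "red", lty = 2)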

We can try 75 trees in the random forest by just specifying ntree=75 in the model syntax:

  > set.seed(123)
> rf.pros.2 <- randomForest(lpsa ~ ., data = pros.train, ntree = 75)

> rf.pros.2
Call:
randomForest(formula = lpsa ~ ., data = pros.train, ntree = 75)
Type of random forest: regression
Number of trees: 75
No. of variables tried at each split: 2
Mean of squared residuals: 0.6632513
% Var explained: 53.85

You can see that the MSE and variance explained have both improved slightly. Let's look at another plot before testing the model. If we are combining the results of 75 different trees that are built using bootstrapped samples and only two random predictors per split, we will need a way to determine the drivers of the outcome. One tree alone cannot be used to paint this picture, but you can produce a variable importance plot and a corresponding list. The y-axis lists the variables in descending order of importance, and the x-axis shows each variable's importance measure; with the default settings used here, that measure is the total decrease in node impurity from splits on the variable (for regression, the reduction in the residual sum of squares). Note that, for classification problems, the corresponding measure is the decrease in the Gini index. The function is varImpPlot():

  > varImpPlot(rf.pros.2, scale = T, 
main = "Variable Importance Plot - PSA Score")

The output of the preceding command is as follows:

Consistent with the single tree, lcavol is the most important variable and lweight is the second-most important variable. If you want to examine the raw numbers, use the importance() function, as follows:

  > importance(rf.pros.2)
IncNodePurity
lcavol 24.108641
lweight 15.721079
age 6.363778
lbph 8.842343
svi 9.501436
lcp 9.900339
gleason 0.000000
pgg45 8.088635
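
The importance() output above reports IncNodePurity because the forest was grown with the default importance setting. If you would rather see the permutation-based %IncMSE values, the forest has to be grown with importance = TRUE; a minimal sketch, with a hypothetical object name and output not shown, would be:

  > set.seed(123)
> rf.pros.imp <- randomForest(lpsa ~ ., data = pros.train,
                              ntree = 75, importance = TRUE)
> importance(rf.pros.imp, type = 1) # type = 1 returns %IncMSE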

Now, it is time to see how it did on the test data:

  > rf.pros.test <- predict(rf.pros.2, newdata = pros.test)
> rf.resid <- rf.pros.test - pros.test$lpsa # calculate residuals
> mean(rf.resid^2)
[1] 0.5136894

The MSE is still higher than the 0.44 that we achieved with the LASSO in Chapter 4, Advanced Feature Selection in Linear Models, and is no better than that of a single tree.
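
As a further check, you could score the original 500-tree forest on the test set in the same way and compare the two test MSEs; the object name below is just for illustration, and the output is not shown:

  > rf.pros.test.full <- predict(rf.pros, newdata = pros.test)
> mean((rf.pros.test.full - pros.test$lpsa)^2)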
