Random forest classification

Perhaps you were disappointed with the performance of the random forest regression model, but the true power of the technique shows in classification problems. Let's get started with the breast cancer diagnosis data. The procedure is nearly the same as it was for the regression problem:

  > set.seed(123)
> rf.biop <- randomForest(class ~ ., data = biop.train)
> rf.biop
Call:
 randomForest(formula = class ~ ., data = biop.train)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of error rate: 3.16%
Confusion matrix:
          benign malignant class.error
benign       294         8  0.02649007
malignant      7       165  0.04069767

The OOB error rate is 3.16%. Again, this is with all 500 trees factored into the analysis. Let's plot the error against the number of trees:

  > plot(rf.biop)

The output of the preceding command is as follows:

The plot shows that the error reaches its minimum once quite a few trees have been grown. Let's now pull the exact number using which.min() again. The one difference from before is that we need to specify column 1, which holds the overall OOB error rate; there are additional columns with the error rate for each class label, but we will not need them in this example. Also, mse is no longer available for a classification forest; err.rate is used instead, as follows:

  > which.min(rf.biop$err.rate[, 1])
[1] 19
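
If you want to see the per-class columns mentioned above, you can peek at the error-rate matrix directly. This is an optional check rather than part of the original walkthrough; the columns are the OOB estimate plus one column per class label:

  > colnames(rf.biop$err.rate)   # "OOB", "benign", "malignant"
> head(rf.biop$err.rate)       # error rates after each additional tree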

Only 19 trees are needed to optimize the model accuracy. Let's try this and see how it performs:

  > set.seed(123)
> rf.biop.2 <- randomForest(class ~ ., data = biop.train, ntree = 19)

> print(rf.biop.2)
Call:
 randomForest(formula = class ~ ., data = biop.train, ntree = 19)
               Type of random forest: classification
                     Number of trees: 19
No. of variables tried at each split: 3

        OOB estimate of error rate: 2.95%
Confusion matrix:
          benign malignant class.error
benign       294         8  0.02649007
malignant      6       166  0.03488372

> rf.biop.test <- predict(rf.biop.2, newdata = biop.test, type = "response")

> table(rf.biop.test, biop.test$class)
rf.biop.test benign malignant
   benign       139         0
   malignant      3        67
> (139 + 67) / 209
[1] 0.9856459
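
Because the discussion below hinges on which kind of mistake the model makes, here is a minimal base-R sketch, an optional aside that treats malignant as the positive class and assumes the factor levels are named benign and malignant as in the output above; it pulls accuracy, sensitivity, and specificity straight from that table:

  > tab <- table(rf.biop.test, biop.test$class)   # rows = predicted, columns = actual
> sum(diag(tab)) / sum(tab)                                # overall accuracy
> tab["malignant", "malignant"] / sum(tab[, "malignant"])  # sensitivity: malignant cases caught
> tab["benign", "benign"] / sum(tab[, "benign"])           # specificity: benign cases kept benign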

Well, how about that? The train set error is below 3 percent, and the model performs even better on the test set, where only three observations out of 209 were misclassified and none of them were false negatives, that is, no malignant case was missed. Recall that the best so far was logistic regression with 97.6 percent accuracy, so this appears to be our best performer yet on the breast cancer data. Before moving on, let's have a look at the variable importance plot:

  > varImpPlot(rf.biop.2)

The output of the preceding command is as follows:

The importance in the preceding plot is each variable's contribution to the mean decrease in the Gini index. This is rather different from the splits of the single tree. Remember that the full tree split first on size (consistent with the random forest ranking), then on nuclei, and then on thickness. This shows how powerful building random forests can be as a technique, not only for predictive ability but also for feature selection.
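
If you would rather work with the numbers behind the plot, the importance() function from the randomForest package returns the mean decrease in Gini for each predictor. A minimal sketch, with the sorting step added here only for readability:

  > imp <- importance(rf.biop.2)   # one column: MeanDecreaseGini
> imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]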

Moving on to the tougher challenge of the Pima Indian diabetes model, we will first need to prepare the data in the following way:

  > data(Pima.tr)
> data(Pima.te)
> pima <- rbind(Pima.tr, Pima.te)
> set.seed(502)
> ind <- sample(2, nrow(pima), replace = TRUE, prob = c(0.7, 0.3))
> pima.train <- pima[ind == 1, ]
> pima.test <- pima[ind == 2, ]
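
As an optional sanity check, not part of the original walkthrough, you can confirm the split sizes and the class balance before fitting anything:

  > dim(pima.train)          # rows and columns in the training set
> dim(pima.test)           # rows and columns in the test set
> table(pima.train$type)   # class balance of the outcome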

Now we will move on to the building of the model, as follows:

  > set.seed(321)
> rf.pima <- randomForest(type ~ ., data = pima.train)
> rf.pima
Call:
 randomForest(formula = type ~ ., data = pima.train)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 20%
Confusion matrix:
     No Yes class.error
No  233  29   0.1106870
Yes  48  75   0.3902439

We get a 20 percent misclassification error rate, which is no better than what we achieved before on the train set. Let's see whether optimizing the tree size can improve things dramatically:

  > which.min(rf.pima$err.rate[, 1])
[1] 80
> set.seed(321)
> rf.pima.2 <- randomForest(type ~ ., data = pima.train, ntree = 80)
> print(rf.pima.2)
Call:
 randomForest(formula = type ~ ., data = pima.train, ntree = 80)
               Type of random forest: classification
                     Number of trees: 80
No. of variables tried at each split: 2

        OOB estimate of error rate: 19.48%
Confusion matrix:
     No Yes class.error
No  230  32   0.1221374
Yes  43  80   0.3495935

At 80 trees in the forest, there is only minimal improvement in the OOB error. Can random forest live up to the hype on the test data? Let's find out:

  > rf.pima.test <- predict(rf.pima.2, newdata = pima.test, type = "response")

> table(rf.pima.test, pima.test$type)
rf.pima.test No Yes
         No  75  21
         Yes 18  33
> (75+33)/147
[1] 0.7346939

Well, we get only 73 percent accuracy on the test data, which is inferior to what we achieved using the SVM.

While random forest disappointed on the diabetes data, it proved to be the best classifier so far for the breast cancer diagnosis. Finally, we will move on to gradient boosting.
