Feature Selection with Random Forests

So far, we've looked at several feature selection techniques, such as regularization, best subsets, and recursive feature elimination. I now want to introduce an effective feature selection method for classification problems that uses random forests via the Boruta package. A paper is available that provides details on how the method identifies all relevant features:

Kursa, M. and Rudnicki, W. (2010), Feature Selection with the Boruta Package, Journal of Statistical Software, 36(11), 1-13

What I will do here is provide an overview of the algorithm and then apply it to a wide dataset. This will not serve as a separate business case but as a template for applying the methodology. I have found it to be highly effective, but be advised that it can be computationally intensive. That may seem to defeat the purpose, but it effectively eliminates unimportant features, allowing you to focus on building a simpler, more efficient, and more insightful model. It is time well spent.

At a high level, the algorithm creates shadow attributes by copying all the inputs and shuffling the order of their observations to decorrelate them from the response. Then, a random forest model is built on all the inputs, and a Z-score of the mean accuracy loss is computed for each feature, including the shadow ones. Features with Z-scores significantly higher than the maximum Z-score among the shadow attributes are deemed important, and those with significantly lower Z-scores are deemed unimportant; a minimal sketch of this shadow-attribute idea appears after the list below. The shadow attributes and the features whose importance has been decided are then removed, and the process repeats until every feature has been assigned an importance value. You can also specify the maximum number of random forest iterations. After the algorithm completes, each of the original features will be labeled as confirmed, tentative, or rejected. You must decide whether or not to include the tentative features in further modeling. Depending on your situation, you have some options:

  • Change the random seed, rerun the methodology multiple (k) times, and select only those features that are confirmed in all k runs
  • Divide your training data into k folds, run separate iterations on each fold, and select those features that are confirmed for all k folds

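To make the shadow-attribute idea concrete, here is a minimal, single-pass sketch of the mechanism using the randomForest package. It illustrates the concept rather than the Boruta internals (Boruta repeats the comparison over many forests, works with Z-scores, and removes decided features between iterations), and my.data and my.labels are hypothetical placeholders for a data frame of inputs and a factor of class labels:

  > library(randomForest)
> set.seed(123)
> # my.data: data frame of inputs; my.labels: factor of class labels
> shadows <- as.data.frame(lapply(my.data, sample)) # permuted copies
> names(shadows) <- paste0("shadow_", names(my.data))
> rf <- randomForest(x = cbind(my.data, shadows), y = my.labels,
                     importance = TRUE)
> imp <- importance(rf, type = 1) # mean decrease in accuracy (scaled by default)
> max.shadow <- max(imp[grepl("^shadow_", rownames(imp)), 1])
> rownames(imp)[imp[, 1] > max.shadow] # features beating the best shadow
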
Note that all of this can be done with just a few lines of code. Let's have a look at the code, applying it to the Sonar data from the mlbench package. It consists of 208 observations, 60 numerical input features, and one vector of class labels. The label is a factor with two levels: R if the sonar object is a rock and M if it is a mine. The first thing to do is load the data and do a quick (very quick) data exploration:

  > data(Sonar, package="mlbench")
> dim(Sonar)
[1] 208 61
> table(Sonar$Class)
  M   R
111  97

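One more check is worth doing before running the algorithm: to the best of my knowledge, Boruta stops with an error if the data contain missing values, so it is worth confirming that the dataset is complete:

  > sum(is.na(Sonar)) # should be zero; Boruta cannot process NAs
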
To run the algorithm, you just need to load the Boruta package and pass a formula to the Boruta() function. Keep in mind that the labels must be stored as a factor, or the algorithm will not work. If you want to track the progress of the algorithm, specify doTrace = 1. Also, don't forget to set the random seed:

  > library(Boruta)
> set.seed(1)
> feature.selection <- Boruta(Class ~ ., data = Sonar, doTrace = 1)

As mentioned previously, this can be computationally intensive. Here is how long it took on my old-fashioned laptop:

  > feature.selection$timeTaken
Time difference of 25.78468 secs

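Before tabulating the decisions, you may want a visual summary. Boruta objects come with a plot method that draws a boxplot of each feature's importance alongside the shadow attributes; a quick sketch (the extra graphical parameters are only an assumption to keep 60+ axis labels readable):

  > plot(feature.selection, las = 2, cex.axis = 0.5)
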
A simple table will provide the counts of the final importance decisions. We see that we could safely eliminate roughly half of the features:

  > table(feature.selection$finalDecision)
Tentative Confirmed  Rejected
       12        31        17

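If you would rather let the package resolve the 12 tentative features instead of using the manual options listed earlier, Boruta provides the TentativeRoughFix() function, which applies a simpler test based on each tentative feature's median importance; a quick sketch, reusing the fit from above:

  > final.selection <- TentativeRoughFix(feature.selection)
> table(final.selection$finalDecision)
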
Using these results, it is simple to create a new dataframe with our selected features. We start by using the getSelectedAttributes() function to capture the feature names. In this example, let's select only those that are confirmed. If we wanted to include both confirmed and tentative features, we would just specify withTentative = TRUE in the function:

  > fNames <- getSelectedAttributes(feature.selection) # withTentative = TRUE
> fNames
[1] "V1" "V4" "V5" "V9" "V10" "V11" "V12" "V13" "V15" "V16"
[11] "V17" "V18" "V19" "V20" "V21" "V22" "V23" "V27" "V28" "V31"
[21] "V35" "V36" "V37" "V44" "V45" "V46" "V47" "V48" "V49" "V51"
[31] "V52"

Using the feature names, we create our subset of the Sonar data:

  > Sonar.features <- Sonar[, fNames]
> dim(Sonar.features)
[1] 208 31

There you have it! The Sonar.features dataframe includes all the confirmed features from the Boruta algorithm. It can now be subjected to further meaningful data exploration and analysis. A few lines of code and some patience while the algorithm does its job can significantly improve your modeling efforts and insight generation.

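Finally, if you want to harden the selection against seed sensitivity (the first of the two options listed earlier in this section), a minimal sketch might look like the following; the number of runs, the seeds, and the stable.features name are arbitrary choices for illustration:

  > seeds <- c(1, 2, 3)
> run.boruta <- function(s) {
    set.seed(s)
    getSelectedAttributes(Boruta(Class ~ ., data = Sonar))
  }
> confirmed.list <- lapply(seeds, run.boruta)
> stable.features <- Reduce(intersect, confirmed.list) # confirmed in every run
> Sonar.stable <- Sonar[, stable.features]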