Feature selection

Feature selection is the process of ranking variables, or features, by their importance: we train a predictive model on them and then determine which variables were most relevant to that model. While each model often has its own set of important features, here we will use a random forest model to identify which variables are likely to be important in general for classification-based predictions.

We perform feature selection for several reasons, which include:

  • Removing redundant or irrelevant features without too much information loss
  • Preventing models from overfitting due to too many features
  • Reducing the variance of the model contributed by excess features
  • Reducing the training and convergence time of models
  • Building models that are simple and easy to interpret

We will be using a recursive feature elimination algorithm for feature selection, with a predictive model for evaluation: we repeatedly construct machine learning models with different feature subsets across iterations, eliminating irrelevant or redundant features at each step, and look for the feature subset that yields maximum accuracy and minimum error. Since this is an iterative process that follows the principle of the popular greedy hill-climbing algorithm, an exhaustive search with a guaranteed global optimum is generally not possible; depending on the starting point, we may end up at a local optimum with a subset of features that differs from the subset obtained in a different run. However, if we run the procedure several times using cross-validation, most of the features in the resulting subsets will usually be consistent. For model evaluation we will use the random forest algorithm, which we will explain in more detail later on. For now, just remember that it is an ensemble learning algorithm that uses several decision trees at each stage of its training process. Because it introduces some randomness at each stage, it tends to reduce variance and overfitting at the cost of a small increase in the model's bias.
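To make the greedy elimination idea concrete, the following is a minimal, illustrative sketch of the core loop: rank the features with a random forest, drop the weakest one, and repeat. The rfe.sketch function and its arguments are our own illustration, not part of this chapter's code; the rfe() function from the caret package, which we use shortly, automates this process and adds cross-validation on top:

# illustrative only: greedy backward elimination using a random forest
# to rank features; features is a data frame of predictors and
# labels is a factor of class labels
library(randomForest)

rfe.sketch <- function(features, labels, min.size = 1) {
  results <- list()
  while (ncol(features) >= min.size) {
    model <- randomForest(x = features, y = labels, importance = TRUE)
    # estimate accuracy for this subset size from the out-of-bag error
    acc <- 1 - tail(model$err.rate[, "OOB"], 1)
    results[[ncol(features)]] <- list(vars = colnames(features),
                                      accuracy = acc)
    if (ncol(features) == min.size) break
    # greedily drop the least important feature and repeat
    imp <- importance(model, type = 1)[, 1]
    features <- features[, -which.min(imp), drop = FALSE]
  }
  results
}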

The code for this section is in the feature_selection.R file. We will first load the necessary libraries. If you do not have them installed, install them as we did in the previous chapters:

> library(caret)  # feature selection algorithm
> library(randomForest) # random forest algorithm
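If either package is missing, it can be installed from CRAN first:

> install.packages(c("caret", "randomForest"))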

Now we define the utility function for feature selection, which uses recursive feature elimination with random forests for model evaluation, in the following code snippet. Remember to run it in the R console so that it is loaded into memory for later use:

run.feature.selection <- function(num.iters=20, feature.vars, class.var){
  set.seed(10)
  # feature subset sizes to evaluate, from 1 to 10 variables
  variable.sizes <- 1:10
  # configure recursive feature elimination with random forest
  # evaluation functions and cross-validation
  control <- rfeControl(functions = rfFuncs, method = "cv",
                        verbose = FALSE, returnResamp = "all",
                        number = num.iters)
  # run recursive feature elimination on the supplied features
  results.rfe <- rfe(x = feature.vars, y = class.var,
                     sizes = variable.sizes,
                     rfeControl = control)
  return(results.rfe)
}

By default, the preceding code uses cross-validation, where the data is split into training and test sets. In each iteration, recursive feature elimination takes place, and the model is trained and tested for accuracy and error on the test set. The data partitions keep changing randomly in every iteration to prevent overfitting and ultimately give a generalized estimate of how the model would perform on unseen data. If you observe, our function runs 20 iterations of cross-validation by default. Remember that, in our case, we always train on the training data, which is internally partitioned for cross-validation by the function. The feature.vars argument indicates all the independent feature variables, which we access in the training dataset using the train.data[,-1] subset; class.var indicates the class variable to be predicted, which we access using train.data[,1].
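For instance, assuming the class label occupies the first column of train.data, as it does in our dataset, you can inspect the two pieces as follows:

# class.var: the dependent class variable (first column)
head(train.data[, 1])
# feature.vars: all the independent feature variables
head(train.data[, -1])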

Note

We do not touch the test data at all because we will be using it only for predictions and model evaluations. Therefore, we would not want to influence the model by using that data since it would lead to incorrect evaluations.

We now run the algorithm on the training data using our defined function, with the following code. It may take some time to run, so be patient if R takes a while to return the results:

rfe.results <- run.feature.selection(feature.vars=train.data[,-1], 
                                     class.var=train.data[,1])
# view results
rfe.results

On viewing the results, we get the following output:

(Screenshot: feature selection results from rfe.results)

From the output, you can see that it found a total of 10 features that were the most important out of the 20, and it returns the top five features by default. You can explore this result variable further and see all the variables with their importance values by using the varImp(rfe.results) command in the R console. The variables and their importance values may differ for you because, if you remember, the training and test data partitions are created randomly, so do not panic if you see values different from those shown here. However, based on our observations, the top five features will usually remain consistent. We will now start building predictive models using different machine learning algorithms for the next stage of our analytics pipeline. However, do remember that since the training and test sets are randomly chosen, your sets might give slightly different results from what we depict here when we performed these experiments.
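As a quick illustration, assuming the rfe.results object from the earlier run is still in memory, you can inspect the full ranking and the final selection like this:

# importance values for all the ranked variables
varImp(rfe.results)
# names of the variables in the final selected subset
predictors(rfe.results)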
