Modeling using random forests

Random forests, also known as random decision forests, are a machine learning method from the family of ensemble learning algorithms, used for both regression and classification tasks. A random forest is nothing but a collection, or ensemble, of decision trees, hence the name.

The working of the algorithm can be briefly described as follows. Each tree in the ensemble is built from a bootstrap sample, that is, a sample drawn with replacement from the training dataset. During the construction of each tree, instead of choosing the best split among all the features, the best split is chosen from a random subset of the features at each split point. This introduction of randomness increases the bias of the model slightly but decreases its variance greatly, which prevents overfitting, a serious concern in the case of decision trees. Overall, this yields much better performing, generalized models. We will now start our analytics pipeline by loading the necessary dependencies:

> library(randomForest) #rf model 
> library(caret) # feature selection
> library(e1071) # model tuning
> library(ROCR) # model evaluation
> source("performance_plot_utils.R") # plot curves
> ## separate feature and class variables
> test.feature.vars <- test.data[,-1]
> test.class.var <- test.data[,1]
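
The preceding code assumes that train.data and test.data already exist in the workspace from the earlier data preparation steps, with credit.rating as the first column. If you are starting from a fresh session, the following is a minimal sketch of how such a split could be recreated, assuming credit.df holds the full prepared dataset (the 60/40 proportion is an illustrative assumption):

# hypothetical recreation of a 60/40 train-test split;
# credit.df is assumed to hold the full prepared dataset
indexes <- sample(1:nrow(credit.df), size=0.6*nrow(credit.df))
train.data <- credit.df[indexes, ]
test.data <- credit.df[-indexes, ]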

Next, we will build the initial training model with all the features as follows:

> formula.init <- "credit.rating ~ ."
> formula.init <- as.formula(formula.init)
> rf.model <- randomForest(formula.init, data = train.data, 
                           importance=T, proximity=T)

You can view the model details by using the following code:

> print(rf.model)

Output:

[Output: random forest model summary]

This gives us information about the out-of-bag error (OOBE), which is around 23%, the confusion matrix calculated on the training data, and the number of variables used at each split.
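
These quantities can also be retrieved programmatically from the fitted model object instead of reading them off the printed summary; a quick sketch using the standard components of a randomForest object:

# OOB error rate of the full forest (last row of the error matrix)
print(rf.model$err.rate[rf.model$ntree, "OOB"])
# confusion matrix based on the OOB predictions
print(rf.model$confusion)
# number of variables sampled at each split
print(rf.model$mtry)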

Next, we will perform predictions using this model on the test data and evaluate them:

> rf.predictions <- predict(rf.model, test.feature.vars, 
                            type="class")
> confusionMatrix(data=rf.predictions, reference=test.class.var, 
                  positive="1")

The following output shows that we get an overall accuracy of 73%, a sensitivity of 91%, and a specificity of 36%:

[Output: confusion matrix and statistics for the initial model]

The initial model yields quite decent results: a fair number of customers with bad credit ratings are classified as bad, and most customers with good ratings are classified as good.
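
To make these metrics concrete, sensitivity and specificity can also be computed by hand from the raw confusion matrix; a small sketch, assuming the class labels are the factor levels 0 (bad) and 1 (good) with 1 as the positive class:

# cross-tabulate predictions against actual class labels
cm <- table(predicted=rf.predictions, actual=test.class.var)
# sensitivity: actual positives (1) correctly predicted as positive
print(cm["1", "1"] / sum(cm[, "1"]))
# specificity: actual negatives (0) correctly predicted as negative
print(cm["0", "0"] / sum(cm[, "0"]))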

We will now build a new model with the top five features from the feature selection section, where we had used the random forest algorithm itself for getting the best features. The following code snippet builds the new model:

formula.new <- "credit.rating ~ account.balance + savings +
                                credit.amount +  
                                credit.duration.months + 
                                previous.credit.payment.status"
formula.new <- as.formula(formula.new)
rf.model.new <- randomForest(formula.new, data = train.data, 
                         importance=T, proximity=T)
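
As an aside, since we built the initial model with importance=T, you can verify such a feature ranking directly from it using randomForest's built-in importance measures:

# importance scores (mean decrease in accuracy and Gini impurity)
print(importance(rf.model))
# dot chart of the same importance measures
varImpPlot(rf.model)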

We now make predictions with this model on the test data and evaluate its performance as follows:

> rf.predictions.new <- predict(rf.model.new, test.feature.vars, 
                                 type="class")
> confusionMatrix(data=rf.predictions.new, 
                  reference=test.class.var, positive="1")

This gives us the following confusion matrix as the output with the other essential performance metrics:

[Output: confusion matrix and statistics for the new model]

We get a slightly decreased accuracy of 71%, which is expected because we have eliminated many features, but the specificity has now increased to 42%, which indicates that the model classifies more bad instances correctly as bad. Sensitivity has decreased slightly to 84%. We will now use grid search to perform hyperparameter tuning on this model, as follows, to see if we can improve the performance further. The parameters of interest here include ntree, the number of trees; nodesize, the minimum size of terminal nodes; and mtry, the number of variables sampled randomly at each split.

nodesize.vals <- c(2, 3, 4, 5)
ntree.vals <- c(200, 500, 1000, 2000)
tuning.results <- tune.randomForest(formula.new, 
                             data = train.data,
                             mtry=3, 
                             nodesize=nodesize.vals,
                             ntree=ntree.vals)
print(tuning.results)

Output:

[Output: grid search tuning results]
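
Besides the printed summary, the tune object returned by tune.randomForest also stores the complete grid of results, so you can inspect the error for every parameter combination and not just the best one:

# best (nodesize, ntree) combination found by the grid search
print(tuning.results$best.parameters)
# cross-validated error for every combination evaluated
print(tuning.results$performances)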

We now get the best model from the preceding grid search, perform predictions on the test data, and evaluate its performance with the following code snippet:

> rf.model.best <- tuning.results$best.model
> rf.predictions.best <- predict(rf.model.best, test.feature.vars, 
                                 type="class")
> confusionMatrix(data=rf.predictions.best,
                  reference=test.class.var, positive="1")

We can make several observations from the following output. Performance has improved only negligibly: the overall accuracy remains the same at 71% and specificity at 42%, while sensitivity has increased slightly from 84% to 85%:

[Output: confusion matrix and statistics for the tuned model]

We now plot some performance curves for this model, as follows:

> rf.predictions.best <- predict(rf.model.best, test.feature.vars, type="prob")
> rf.prediction.values <- rf.predictions.best[,2]
> predictions <- prediction(rf.prediction.values, test.class.var)
> par(mfrow=c(1,2))
> plot.roc.curve(predictions, title.text="RF ROC Curve")
> plot.pr.curve(predictions, title.text="RF Precision/Recall Curve")

We observe that the total AUC is about 0.7, which is much better than the baseline AUC of 0.5 denoted by the red line in the following plot:

[Figure: RF ROC curve and precision/recall curve]
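
If you want the exact AUC value rather than reading it off the plot, ROCR can compute it from the same prediction object:

# area under the ROC curve for the tuned model
auc <- performance(predictions, measure="auc")
print(auc@y.values[[1]])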

The last algorithm we will explore is neural networks, and we will build our models with them in the following section.
