Modeling using decision trees

Decision trees are another family of supervised machine learning algorithms. Like the algorithms we have seen before, they can be used for both classification and regression, and they are often referred to as CART, which stands for Classification and Regression Trees. Decision trees are used widely in decision support systems, business intelligence, and operations research.

Decision trees are mainly used for making decisions that are most useful in reaching some objective and for designing a strategy based on those decisions. At its core, a decision tree is just a flowchart with a number of nodes and conditional edges. Each non-leaf node represents a conditional test on one of the features, and each edge represents an outcome of that test. Each leaf node represents a class label, which is where the final prediction is made. The paths from the root to the leaf nodes give us all the classification rules. Decision trees are easy to represent, construct, and understand. However, the drawback is that they are very prone to overfitting, and these models often do not generalize well. We will follow an analytics pipeline similar to the previous ones to build some models based on decision trees.
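As a minimal illustration of this flowchart structure (using the built-in iris dataset rather than our credit data), fitting a small tree and printing it shows one line per node: each split is a conditional test on a feature, and each line marked with * is a leaf carrying a class label:

```r
# Minimal sketch on the built-in iris dataset (not our credit data):
# each non-leaf node is a conditional test on a feature, and each
# node marked with * in the printout is a leaf holding a class label.
library(rpart)

toy.tree <- rpart(Species ~ ., data = iris, method = "class")
print(toy.tree)  # root-to-leaf paths are the classification rules
```

Reading any root-to-leaf path in this printout gives you one complete classification rule.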

We start with loading the necessary dependencies and test data features:

> library(rpart) # tree models 
> library(caret) # feature selection
> library(rpart.plot) # plot dtree
> library(ROCR) # model evaluation
> library(e1071) # tuning model
> source("performance_plot_utils.R") # plotting curves
> ## separate feature and class variables
> test.feature.vars <- test.data[,-1]
> test.class.var <- test.data[,1]

Now we will build an initial model with all the features as follows:

> formula.init <- "credit.rating ~ ."
> formula.init <- as.formula(formula.init)
> dt.model <- rpart(formula=formula.init, method="class",
+                   data=train.data,
+                   control=rpart.control(minsplit=20, cp=0.05))
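In the control argument above, minsplit=20 means a node must contain at least 20 observations before a split is attempted, and cp=0.05 discards any split that does not improve the overall fit by at least that complexity factor. A small sketch on the built-in iris data (illustrative values, not tuned for our problem) shows how these settings govern tree size:

```r
library(rpart)

# Looser settings allow many splits; stricter ones prune aggressively.
# iris is used here only for illustration.
loose  <- rpart(Species ~ ., data = iris, method = "class",
                control = rpart.control(minsplit = 5, cp = 0.001))
strict <- rpart(Species ~ ., data = iris, method = "class",
                control = rpart.control(minsplit = 20, cp = 0.05))

# Each row of $frame is one node, so this compares the tree sizes:
c(loose = nrow(loose$frame), strict = nrow(strict$frame))
```

Larger minsplit and cp values therefore act as a built-in guard against the overfitting tendency mentioned earlier.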

We predict and evaluate the model on the test data with the following code:

> dt.predictions <- predict(dt.model, test.feature.vars,
+                           type="class")
> confusionMatrix(data=dt.predictions, reference=test.class.var,
+                 positive="1")

From the output, we see that the model accuracy is around 68% and sensitivity is 92%, which is excellent, but specificity is only 18%, which we should try to improve.
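As a reminder of how confusionMatrix derives these numbers, with class 1 (the good ratings) as the positive class, sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP). The counts below are illustrative stand-ins, not the exact cells from our run:

```r
# Illustrative 2x2 counts (positive class = 1, the good ratings);
# these are made-up numbers, not our model's actual output.
TP <- 255; FN <- 15    # good ratings classified right / wrong
TN <- 23;  FP <- 107   # bad ratings classified right / wrong

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)  # true positive rate
specificity <- TN / (TN + FP)  # true negative rate
round(c(accuracy, sensitivity, specificity), 2)
```

With skewed data like ours, a high accuracy can coexist with a very poor specificity, which is exactly the pattern we see here.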


We will now try feature selection to improve the model. We use the following code to train the model and rank the features by their importance:

> formula.init <- "credit.rating ~ ."
> formula.init <- as.formula(formula.init)
> control <- trainControl(method="repeatedcv", number=10, repeats=2)
> model <- train(formula.init, data=train.data, method="rpart", 
+                trControl=control)
> importance <- varImp(model, scale=FALSE)
> plot(importance)

This gives us a plot showing the importance of the different features.


If you observe closely, the decision tree does not use all the features in the model construction and the top five features are the same as those we obtained earlier when we talked about feature selection. We will now build a model using these features as follows:

> formula.new <- "credit.rating ~ account.balance + savings +
+                 credit.amount + credit.duration.months +
+                 previous.credit.payment.status"
> formula.new <- as.formula(formula.new)
> dt.model.new <- rpart(formula=formula.new, method="class",data=train.data, 
+                   control = rpart.control(minsplit=20, cp=0.05),
+                   parms = list(prior = c(0.7, 0.3)))

We now make predictions on the test data and evaluate it, as follows:

> dt.predictions.new <- predict(dt.model.new, test.feature.vars,
+                               type="class")
> confusionMatrix(data=dt.predictions.new,
+                 reference=test.class.var, positive="1")

This prints the confusion matrix along with several other metrics.


You can now see that the overall model accuracy has decreased a bit, to 62%. However, our bad credit rating prediction has improved: we correctly predict 100 of the 130 bad credit rating customers, which is excellent! Consequently, specificity has jumped to 77% while sensitivity has dropped to 55%, but we still classify a substantial number of good credit rating customers as good. Though this model is a bit aggressive, it is a reasonable one: while we deny credit loans to more customers who could default on their payments, we also make sure a reasonable number of good customers get their credit loans approved.

The reason we obtained these results is that we built the model with a parameter called prior, as you can see in the modeling code earlier. The prior lets us assign weights to the different classes of the class variable. Our dataset was highly skewed, with 700 customers with a good credit rating and only 300 with a bad one, so while training the model we can use the prior to specify the importance of each class and thereby adjust the cost of misclassifying each class. In our model, we gave more importance to the bad credit rating customers.
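A quick sketch of how the prior vector lines up with the classes: the weights follow the sorted factor levels of the class variable, so with levels 0 and 1, the first entry applies to class 0, the bad ratings. The data below is synthetic, standing in for the credit dataset:

```r
library(rpart)

# Synthetic, skewed two-class data standing in for the credit set:
# class 0 ("bad") is the minority, class 1 ("good") the majority.
set.seed(42)
x <- rnorm(1000)
y <- factor(ifelse(x + rnorm(1000) > -0.5, 1, 0))

# The prior entries follow levels(y) = c("0", "1"), so c(0.7, 0.3)
# up-weights class 0 and makes misclassifying it costlier.
fit <- rpart(y ~ x, method = "class",
             parms = list(prior = c(0.7, 0.3)))
levels(y)  # confirms the order the prior is matched against
```

Checking levels() this way before setting a prior avoids accidentally weighting the wrong class.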

You can reverse the priors and give more importance to the good-rating customers by using the parameter as prior = c(0.3, 0.7) instead, which changes the confusion matrix accordingly.


You can clearly see that, since we gave more importance to the good credit ratings, sensitivity jumps to 92% and specificity drops to 18%. This gives you a lot of flexibility in your modeling, depending on what you want to achieve.

To view the model, we can use the following code snippet:

> dt.model.best <- dt.model.new
> print(dt.model.best)


To visualize the preceding tree, you can use the following:

> par(mfrow=c(1,1))
> prp(dt.model.best, type=1, extra=3, varlen=0, faclen=0)

This plots the tree, and we can see that, with these priors, the only feature used out of the five is account.balance; all the other features are ignored. You can try to optimize the model further with hyperparameter tuning, for example by exploring the tune.rpart function from the e1071 package.
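A possible sketch of such tuning with tune.rpart, shown on the built-in iris data so it is self-contained (for our problem you would substitute credit.rating ~ . and train.data; the grid values are illustrative, not recommendations):

```r
library(e1071)
library(rpart)

# Cross-validated grid search over minsplit and cp; iris is used
# here only for illustration, and the grid values are arbitrary.
dt.tune <- tune.rpart(Species ~ ., data = iris,
                      minsplit = c(5, 10, 20),
                      cp = c(0.01, 0.05, 0.1))
print(dt.tune)  # reports the best parameter pair and its CV error
dt.tune$best.parameters
```

The best parameters found this way can then be fed back into rpart.control for a final fit.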


We finish our analysis by plotting some metric evaluation curves as follows:

> dt.predictions.best <- predict(dt.model.best, test.feature.vars,
+                                type="prob")
> dt.prediction.values <- dt.predictions.best[,2]
> predictions <- prediction(dt.prediction.values, test.class.var)
> par(mfrow=c(1,2))
> plot.roc.curve(predictions, title.text="DT ROC Curve")
> plot.pr.curve(predictions, title.text="DT Precision/Recall Curve")

The AUC is around 0.66, which is not the best, but definitely better than the baseline denoted by the red line in the ROC plot.
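To read off the AUC as a number rather than from the plot, the same kind of ROCR prediction object can be passed to performance(). The synthetic scores below are just for a self-contained illustration; with our model you would pass dt.prediction.values and test.class.var instead:

```r
library(ROCR)

# Synthetic scores loosely correlated with synthetic labels,
# standing in for dt.prediction.values and test.class.var.
set.seed(7)
labels <- sample(c(0, 1), 200, replace = TRUE)
scores <- labels * 0.4 + runif(200)

pred <- prediction(scores, labels)
auc  <- performance(pred, measure = "auc")@y.values[[1]]
round(auc, 3)  # a single number summarizing the ROC curve
```

An AUC of 0.5 corresponds to the random baseline, so any value meaningfully above that indicates the model carries real signal.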


Based on our business requirements, this model is quite fair. We will discuss model comparison later on in this chapter. We will now use random forests to build our next set of predictive models.
