Model selection

What are we to make of all this? We have the confusion matrices from our models to guide us, but we can get a little more sophisticated when it comes to selecting a classification model. An effective tool for comparing classification models is the Receiver Operating Characteristic (ROC) chart. Very simply, ROC is a technique for visualizing, organizing, and selecting classifiers based on their performance (Fawcett, 2006). On the ROC chart, the y-axis is the True Positive Rate (TPR) and the x-axis is the False Positive Rate (FPR). The calculations are quite simple (a quick way to compute both rates from a confusion matrix is sketched after the list):

  • TPR = Positives correctly classified / total positives
  • FPR = Negatives incorrectly classified / total negatives
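
As an illustration of these two rates, both can be read straight off a confusion matrix. The following sketch assumes that you have already produced predicted class labels for the test set (here called test$predict, derived from a 0.5 probability cutoff; substitute the column name you actually used) and that the second level of class is the positive case:

> cm = table(test$class, test$predict)  #rows are the actual classes, columns the predicted classes

> TPR = cm[2, 2] / sum(cm[2, ])  #positives correctly classified / total positives

> FPR = cm[1, 2] / sum(cm[1, ])  #negatives incorrectly classified / total negatives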

Plotting the ROC results generates a curve, which in turn allows you to produce the Area Under the Curve (AUC). The AUC is an effective indicator of performance: it can be shown that the AUC is equal to the probability that an observer, presented with a randomly chosen pair of cases in which one is positive and one is negative, will correctly identify the positive case (Hanley & McNeil, 1982). In our case, we will simply substitute our algorithms for the observer and evaluate accordingly.
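
To make that interpretation concrete, you can approximate the AUC by drawing random positive/negative pairs and checking how often the positive case receives the higher predicted probability. The following is only a rough sketch; it assumes the full model's probabilities and the actual labels live in test$prob and test$class, with the positive class labeled malignant (adjust the names and labels if yours differ):

> set.seed(123)  #for reproducible sampling

> pos = test$prob[test$class == "malignant"]  #probabilities for the positive cases

> neg = test$prob[test$class == "benign"]  #probabilities for the negative cases

> mean(sample(pos, 10000, replace = TRUE) > sample(neg, 10000, replace = TRUE))

The proportion returned should land close to the AUC that we will compute with ROCR at the end of this section.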

To create an ROC chart in R, you can use the ROCR package. I think it is a great package that allows you to build a chart in just three lines of code. The package also has an excellent companion website with examples and a presentation that can be found at http://rocr.bioinf.mpi-sb.mpg.de/.
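
If the ROCR package is not yet installed on your machine, install it once from CRAN before loading it:

> install.packages("ROCR")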

What I want to show is three different curves on our ROC chart: the full model, the reduced model that used BIC to select the features, and a bad model. This so-called bad model will include just one predictive feature and will provide an effective contrast to our other two models. Therefore, let's load the ROCR package and build this poorly performing model, which we will call bad.fit. For simplicity, it is fit on the test data using only the thick feature, as follows:

> library(ROCR)

> bad.fit = glm(class~thick, family=binomial, data=test)

> test$bad.probs = predict(bad.fit, type="response") #save probabilities

It is now possible to build the ROC chart with three lines of code per model using the test dataset. We will first create an object that pairs the predicted probabilities with the actual class labels. Next, we will use this object to create another object with the calculated TPR and FPR. Then, we will build the chart with the plot() function. Let's get started with the model that uses all of the features or, as I call it, the full model. This was the initial model that we built back in the Logistic regression model section of this chapter:

> pred.full = prediction(test$prob, test$class)

The following is the performance object with the TPR and FPR:

> perf.full = performance(pred.full, "tpr", "fpr")

The following plot command with the title of ROC and col=1 will color the line black:

> plot(perf.full, main="ROC", col=1)

The output of the preceding command is the ROC chart titled ROC, with the full model's curve drawn in black.

As stated previously, the curve represents TPR on the y-axis and FPR on the x-axis. With a perfect classifier that produces no false positives, the curve rises vertically along the y-axis at an FPR of 0.0 before running across the top of the chart. If a model is no better than chance, the curve runs diagonally from the lower left corner to the upper right one; we will add this diagonal to the chart as a reference line once all three curves are plotted. As a reminder, the full model misclassified five observations: three false positives and two false negatives. We can now add the other models for comparison using similar code, starting with the model built using BIC (refer to the Logistic regression with cross-validation section of this chapter), as follows:

> pred.bic = prediction(test$bic.probs, test$class)

> perf.bic = performance(pred.bic, "tpr", "fpr")

> plot(perf.bic, col=2, add=TRUE)

The add=TRUE parameter in the plot() command adds the line to the existing chart. Finally, we will add the poorly performing model and include a legend, as follows:

> pred.bad = prediction(test$bad.probs, test$class)

> perf.bad = performance(pred.bad, "tpr", "fpr")

> plot(perf.bad, col=3, add=TRUE)

> legend(0.6, 0.6, c("FULL", "BIC", "BAD"), fill = 1:3)
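
Although it is not one of the three model curves, it can also help to draw the chance diagonal discussed earlier as a visual reference. One extra line of base R will do it:

> abline(0, 1, lty = 2)  #dashed 45-degree line: a classifier no better than chance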

We can see that the FULL and BIC model curves are nearly superimposed. As you may recall, the only difference in their confusion matrices was that the BIC model had one more false positive and one fewer false negative. It is also quite clear from the chart that the BAD model performed as poorly as expected, with its curve (drawn in green by col=3) sitting well below the other two.

The final thing that we can do here is compute the AUC. This is again done in the ROCR package by creating a performance object, except that you substitute "auc" for "tpr" and "fpr". The code and output are as follows:

> auc.full = performance(pred.full, "auc")

> auc.full
An object of class "performance"
Slot "x.name":
[1] "None"

Slot "y.name":
[1] "Area under the ROC curve"

Slot "alpha.name":
[1] "none"

Slot "x.values":
list()

Slot "y.values":
[[1]]
[1] 0.9972672

Slot "alpha.values":
list()

The values that we are looking for are under the Slot "y.values" section of the output. The AUC for the full model is 0.997. I've abbreviated the output for the other two models of interest, as follows:

> auc.bic = performance(pred.bic, "auc")

> auc.bic

Slot "y.values":
[[1]]
[1] 0.9944293

> auc.bad = performance(pred.bad, "auc")

> auc.bad

Slot "y.values":
[[1]]
[1] 0.8962056
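
Rather than reading the number out of the printed object each time, you can also extract it directly from the y.values slot of each performance object, for example:

> unlist(auc.full@y.values)  #0.9972672

> unlist(auc.bic@y.values)  #0.9944293

> unlist(auc.bad@y.values)  #0.8962056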

The AUCs were 99.7 percent for the full model, 99.4 percent for the BIC model, and 89.6 percent for the bad model. So, for all intents and purposes, there is no difference in predictive power between the full model and the BIC model. What are we to do? A simple solution would be to rerandomize the train and test sets and try the analysis again, perhaps using a 60/40 split and a different randomization seed; a sketch of how to re-split the data follows this paragraph. However, if we end up with a similar result, then what? I think a statistical purist would recommend selecting the most parsimonious model, while others may be more inclined to include all the variables. It comes down to trade-offs, that is, model accuracy versus interpretability, simplicity, and scalability. In this instance, it seems safe to default to the simpler model, which has essentially the same accuracy. Maybe there is another option? Let me propose that we tackle this problem in the upcoming chapters with more complex techniques and improve our predictive ability. The beauty of machine learning is that there are several ways to skin the proverbial cat.
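
If you want to try the rerandomization idea, the following is a minimal sketch. It assumes your cleaned data frame is called df (substitute whatever name you used) and simply reassigns the rows at random; you would then refit the models on the new train set and rebuild the ROC chart on the new test set:

> set.seed(456)  #an arbitrary, different seed

> ind = sample(2, nrow(df), replace = TRUE, prob = c(0.6, 0.4))  #roughly a 60/40 assignment

> train = df[ind == 1, ]

> test = df[ind == 2, ]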
