We already looked at several of these concepts when we talked about the machine learning pipeline. In this section, we will look at typical terms used in predictive modeling and also discuss model building and evaluation concepts in detail.
The data preparation step, as discussed earlier, involves getting the datasets ready for feature selection and for building the predictive models. We frequently use the following terms in this context:
- Dataset: In our case, this is the german_credit_dataset.csv file from Chapter 5, Credit Risk Detection and Prediction – Descriptive Analytics.
- Data features: Attributes such as credit.rating, account.balance, and so on form the features of our credit risk dataset.

We build the actual predictive models using machine learning algorithms and these data features; the models start giving out predictions once we feed them new data tuples. Some concepts associated with building and evaluating predictive models are discussed in the remainder of this section.
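To make the model-building step concrete, the following is a minimal sketch, assuming the credit.rating column in german_credit_dataset.csv is the binary class label (1 for good, 0 for bad) and that the remaining columns can be used directly as features; the logistic regression model and the 60/40 split here are illustrative choices only, not necessarily the exact approach we follow later.

```r
# Minimal sketch: load the data, split into train/test sets, and fit a simple model.
# Assumes credit.rating is the binary class label; adjust paths and preprocessing as needed.
credit.df <- read.csv("german_credit_dataset.csv", header = TRUE)
credit.df$credit.rating <- as.factor(credit.df$credit.rating)

set.seed(42)
train.indices <- sample(seq_len(nrow(credit.df)), size = 0.6 * nrow(credit.df))
train.data <- credit.df[train.indices, ]
test.data  <- credit.df[-train.indices, ]

# Fit a logistic regression model as an example predictive model
lr.model <- glm(credit.rating ~ ., data = train.data, family = binomial())

# Predict class probabilities on new data tuples (the test set)
lr.probs <- predict(lr.model, newdata = test.data, type = "response")
lr.preds <- ifelse(lr.probs > 0.5, 1, 0)
```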
The most important part of predictive modeling is testing whether the models we build are actually useful. This is done by evaluating a model on the test data and using various metrics to measure its performance. We will discuss some popular model evaluation techniques here. To explain the concepts clearly, we will consider an example with our data. Let us assume the test data has 100 customers, of whom 40 have a bad credit rating (class label 0) and the remaining 60 have a good credit rating (class label 1). Let us now assume that our model predicts 22 of the 40 bad instances as bad and the remaining 18 as good. The model also predicts 40 of the 60 good customers as good and the remaining 20 as bad. We will now see how to evaluate the model's performance with different techniques:
A confusion matrix is a table that summarizes the counts of correct and incorrect predictions for each class. We will depict it in the following figure and discuss some important metrics derived from the confusion matrix, also depicted in the same figure:
In the preceding figure, the values highlighted in the 2x2 matrix are the ones that were correctly predicted by our model, and the ones in white were incorrectly predicted. We can therefore infer the following measures quite easily: TN is 22, FP is 18, TP is 40, and FN is 20. The total number of negative instances (N) is 40 and positive instances (P) is 60, which add up to the 100 customers in our example dataset.
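The same confusion matrix can be reproduced in R with the base table() function; the following sketch simply hard-codes label vectors that match the counts in our example.

```r
# Recreate the example confusion matrix (label vectors made up to match the counts above)
actual    <- c(rep(0, 40), rep(1, 60))        # 40 bad (0), 60 good (1) customers
predicted <- c(rep(0, 22), rep(1, 18),        # 22 bad predicted bad, 18 bad predicted good
               rep(0, 20), rep(1, 40))        # 20 good predicted bad, 40 good predicted good

confusion <- table(Predicted = predicted, Actual = actual)
print(confusion)

TN <- confusion["0", "0"]   # 22
FP <- confusion["1", "0"]   # 18
FN <- confusion["0", "1"]   # 20
TP <- confusion["1", "1"]   # 40
```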
Specificity, also known as the true negative rate, is given by the formula TN / (TN + FP), which is the proportion of actual negative instances that are correctly predicted as negative. In our case, the specificity is 22 / (22 + 18) = 55%.
Sensitivity, also known as the true positive rate or recall, has the formula TP / (TP + FN), which is the proportion of actual positive instances that are correctly predicted as positive. Our example has a sensitivity of 40 / (40 + 20) ≈ 67%.
Precision, also known as the positive predictive value, has the formula TP / (TP + FP), which is the proportion of positive predictions that are actually positive. Our example has a precision of 40 / (40 + 18) ≈ 69%.
Negative predictive value (NPV) has the formula TN / (TN + FN), which is the proportion of negative predictions that are actually negative. Our example has an NPV of 22 / (22 + 20) ≈ 52%.
False positive rate (FPR), also known as fall-out, is the complement of specificity (1 - specificity); its formula is FP / (FP + TN), which is the proportion of actual negative instances that are incorrectly predicted as positive. Our example has an FPR of 18 / (18 + 22) = 45%.
False negative rate (FNR), also known as the miss rate, is the complement of sensitivity (1 - sensitivity); its formula is FN / (FN + TP), which is the proportion of actual positive instances that are incorrectly predicted as negative. Our example has an FNR of 20 / (20 + 40) ≈ 33%.
Accuracy is the metric that denotes how accurate the model is overall in making predictions, with the formula (TP + TN) / (TP + TN + FP + FN). Our prediction accuracy is (40 + 22) / 100 = 62%.
The F1 score is another metric for measuring a model's performance. It takes into account both the precision and recall values by computing their harmonic mean, given by the formula (2 × precision × recall) / (precision + recall). Our model has an F1 score of approximately 68%.
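Continuing the sketch, the metrics above can be computed directly from the four confusion matrix counts; the values should match the percentages quoted in this section.

```r
# Compute the evaluation metrics from the confusion matrix counts above
specificity <- TN / (TN + FP)              # 22 / 40  = 0.55
sensitivity <- TP / (TP + FN)              # 40 / 60  ~ 0.67 (recall)
precision   <- TP / (TP + FP)              # 40 / 58  ~ 0.69
npv         <- TN / (TN + FN)              # 22 / 42  ~ 0.52
fpr         <- FP / (FP + TN)              # 18 / 40  = 0.45 (1 - specificity)
fnr         <- FN / (FN + TP)              # 20 / 60  ~ 0.33 (1 - sensitivity)
accuracy    <- (TP + TN) / (TP + TN + FP + FN)                          # 62 / 100 = 0.62
f1.score    <- 2 * precision * sensitivity / (precision + sensitivity)  # ~ 0.68

round(c(specificity = specificity, sensitivity = sensitivity,
        precision = precision, npv = npv, fpr = fpr, fnr = fnr,
        accuracy = accuracy, f1 = f1.score), 2)
```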
A Receiver Operating Characteristic (ROC) curve is a plot used to visualize model performance as we vary the classification threshold. The ROC plot has the FPR on the x axis and the TPR on the y axis, and each threshold (or each discrete classifier) corresponds to a point in the ROC space. A perfect model would achieve a TPR of 1 at an FPR of 0. An average or baseline model corresponds to the diagonal straight line from (0, 0) to (1, 1), along which the TPR always equals the FPR and the area under the line is 0.5. If our model's ROC curve lies above this diagonal, it is performing better than the baseline. The following figure shows what a typical ROC curve looks like:
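As an illustration, an ROC curve like the one in the figure can be plotted with the ROCR package; this sketch assumes ROCR is installed and reuses the lr.probs probabilities and test.data from the earlier model-building sketch.

```r
# Sketch: plot an ROC curve for the example model using the ROCR package
library(ROCR)

pred <- prediction(predictions = lr.probs, labels = test.data$credit.rating)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")

plot(perf, col = "blue", main = "ROC curve")
abline(a = 0, b = 1, lty = 2, col = "gray")   # baseline diagonal
```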
Area under curve (AUC) is the area under the ROC curve obtained from the model evaluation. The AUC value indicates the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative one; therefore, the higher the AUC, the better. Do check out the performance_plot_utils.R file (shared with the code bundle of the chapter), which has some utility functions to plot and depict these values; we will use them later when we evaluate our models.
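As a small illustration, the AUC itself can be computed with ROCR as well, reusing the prediction object from the ROC sketch above.

```r
# Sketch: compute the AUC from the ROCR prediction object created earlier
auc.perf <- performance(pred, measure = "auc")
auc <- auc.perf@y.values[[1]]
print(auc)
```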
This should give you enough background on important terms and concepts related to predictive modeling, and now we will start with our predictive analysis on the data!