Important concepts in predictive modeling

We already looked at several concepts when we talked about the machine learning pipeline. In this section, we will look at the terminology typically used in predictive modeling, and also discuss model building and evaluation concepts in detail.

Preparing the data

The data preparation step, as discussed earlier, involves preparing the datasets necessary for feature selection and building the predictive models using the data. We frequently use the following terms in this context:

  • Datasets: They are typically a collection of data points or observations. Most datasets correspond to some form of structured data involving a two-dimensional data structure, such as a data matrix or data table (in R this is usually represented using a data frame) containing various values. An example is our german_credit_dataset.csv file from Chapter 5, Credit Risk Detection and Prediction – Descriptive Analytics.
  • Data observations: They are the rows in a dataset where each row consists of a set of observations against a set of attributes. These rows are also often called tuples. For our dataset, each row containing information about a customer is a good example.
  • Data features: They are the columns in a dataset which describe each row in the dataset. These features are often called attributes or variables. Features such as credit.rating, account.balance, and so on form the features of our credit risk dataset.
  • Data transformation: It refers to the act of transforming various data features as needed, based on observations from descriptive analytics. Data type conversions, missing value imputation, and scaling and normalization are some of the most used techniques. Also, for categorical variables, if your algorithm cannot handle the different levels of a variable directly, you need to convert it into several dummy variables; this process is known as one-hot encoding. A short R sketch of these steps follows this list.
  • Training data: It refers to the data which is solely used to train the predictive models. The machine learning algorithm picks up the tuples from this dataset and tries to find out patterns and learn from the various observation instances.
  • Testing data: It refers to the data which is fed to the predictive model to get predictions, whose accuracy we then check against the class labels already present in its tuples. We never train the model with the testing data, because doing so would bias the model and give misleading evaluations.
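
The following is a minimal R sketch of these preparation steps, assuming the dataset loads into a data frame from the german_credit_dataset.csv file; the 60/40 split ratio and the columns chosen for type conversion and one-hot encoding are illustrative choices, not the chapter's exact setup:

```r
# Load the dataset into a data frame
credit.df <- read.csv("german_credit_dataset.csv", header = TRUE)

# Data type conversion: treat categorical features as factors
credit.df$credit.rating <- as.factor(credit.df$credit.rating)
credit.df$account.balance <- as.factor(credit.df$account.balance)

# One-hot encoding: expand a factor into dummy (indicator) variables
account.dummies <- model.matrix(~ account.balance - 1, data = credit.df)

# Train/test split: hold out 40% of the rows as testing data
set.seed(42)
indexes <- sample(1:nrow(credit.df), size = floor(0.6 * nrow(credit.df)))
train.data <- credit.df[indexes, ]
test.data  <- credit.df[-indexes, ]
```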

Building predictive models

We build the actual predictive models using machine learning algorithms and data features; once built, the models start giving out predictions as we feed them new data tuples. Some concepts associated with building predictive models are as follows:

  • Model training: It is the process of building the predictive model, where we use a supervised machine learning algorithm, feed the training data features to it, and let it learn the model.
  • Predictive model: It is based on some machine learning algorithm and is essentially a mathematical model at heart, with some assumptions, formulae, and learned parameter values.
  • Model selection: It is a process where the main objective is to select a predictive model from several iterations of predictive models. The criteria for selecting the best model can vary, depending on the metrics we want to choose, such as maximizing the accuracy, minimizing the error rate, or getting the maximum AUC, which is something we will discuss later. Cross-validation is a good way to run this iterative process.
  • Hyperparameter optimization: It involves choosing the set of hyperparameters used by the algorithm such that the performance of the model is optimal with regard to its prediction accuracy. This is usually done with a grid search over candidate values.
  • Cross validation: It is a model validation technique used to estimate how well a model will perform in a generic fashion. It is mainly used in iterative processes where the end goal is to optimize the model and make sure it is not overfit to the data, so that the model can generalize well to new data and make good predictions. Usually, several rounds of cross validation are run iteratively. Each round involves splitting the data into train and test sets, using the training data to train the model, and then evaluating its performance with the test set. At the end of this, we keep the model which is the best of the lot; a brief R sketch combining cross validation with a grid search follows this list.
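
The following is a hedged sketch of running cross validation together with a hyperparameter grid search using the caret package; the regularized glmnet model, its tuning grid values, and the train.data data frame are illustrative assumptions rather than the chapter's exact setup:

```r
library(caret)

# 10-fold cross validation, repeated for more stable estimates
cv.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Candidate hyperparameter values for a regularized model (glmnet)
tune.grid <- expand.grid(alpha = c(0, 0.5, 1),
                         lambda = c(0.001, 0.01, 0.1))

# train() fits the model for every grid point in every fold and keeps
# the hyperparameter combination with the best cross-validated accuracy
cv.model <- train(credit.rating ~ ., data = train.data,
                  method = "glmnet",
                  trControl = cv.control,
                  tuneGrid = tune.grid,
                  metric = "Accuracy")
print(cv.model$bestTune)
```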

Evaluating predictive models

The most important part in predictive modeling is testing whether the models created are actually useful. This is done by evaluating the models on the testing data and using various metrics to measure the performance of the model. We will discuss some popular model evaluation techniques here. To explain the concepts clearly, we will consider an example with our data. Let us assume we have 100 customers and 40 of them have a bad credit rating with class label 0 and the remaining 60 have a good credit rating with class label 1 in the test data. Let us now assume that our model predicts 22 instances out of the 40 bad instances as bad and the remaining 18 as good. The model also predicts 40 instances out of the 60 good customers as good and the remaining 20 as bad. We will now see how we evaluate the model performance with different techniques:

  • Prediction values: They are usually discrete values which belong to a specific class or category and are often known as class labels. In our case, it is a binary classification problem where we deal with two classes where label 1 indicates customers with good credit rating and 0 indicates bad credit rating.
  • Confusion matrix: It is a nice way to see how the model is predicting the different classes. It is a contingency table, usually with two rows and two columns for a binary classification problem like ours, which reports the number of predicted instances in each class against the actual class values. For our preceding example, the confusion matrix would be a 2x2 matrix where the two rows indicate the predicted class labels and the two columns indicate the actual class labels. The total number of predictions with the bad (0) class label which actually have the bad label is called True Negative (TN), and the remaining bad instances wrongly predicted as good are called False Positive (FP). Correspondingly, the total number of predictions with the good (1) class label that are actually labeled as good is called True Positive (TP), and the remaining good instances wrongly predicted as bad are called False Negative (FN).

    We will depict this in the following figure and discuss some important metrics derived from the confusion matrix, also depicted in the same figure:

    [Figure: confusion matrix for the example, with the metrics derived from it]

In the preceding figure, the values which are highlighted in the 2x2 matrix are the ones which were correctly predicted by our model. The ones in white were incorrectly predicted by the model. We can therefore infer the following measures quite easily: TN is 22, FP is 18, TP is 40, and FN is 20. Total N is 40 and total P is 60, which add up to 100 customers in our example dataset.
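
As a small sketch, we can reproduce this confusion matrix in R from the counts stated above; the actual and predicted vectors below are constructed only to mirror those counts and do not come from a real model:

```r
# 40 customers are actually bad (0) and 60 are actually good (1)
actual <- factor(c(rep(0, 40), rep(1, 60)), levels = c(0, 1))

# The model gets 22 of the bad and 40 of the good customers right
predicted <- factor(c(rep(0, 22), rep(1, 18),    # actual bad: 22 correct, 18 wrong
                      rep(0, 20), rep(1, 40)),   # actual good: 20 wrong, 40 correct
                    levels = c(0, 1))

# Rows are the predicted labels and columns the actual labels, as in the figure
confusion.matrix <- table(Predicted = predicted, Actual = actual)
print(confusion.matrix)
```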

Specificity, also known as the true negative rate, can be represented by the formula specificity = TN / (TN + FP), which gives us the proportion of true negatives out of the total number of instances which are actually negative. In our case, we have a specificity of 22 / 40 = 55%.

Sensitivity, also known as the true positive rate and recall, has the formula sensitivity = TP / (TP + FN), which indicates the proportion of true positives out of the total number of instances which are actually positive. Our example has a sensitivity of 40 / 60 = 67%.

Precision, also known as positive predictive value, has the formula precision = TP / (TP + FP), which indicates the proportion of actual positive instances out of all the positive predictions. Our example has a precision of 40 / 58 = 69%.

Negative predictive value has the formula NPV = TN / (TN + FN), which indicates the proportion of actual negative instances out of all the negative predictions. Our example has an NPV of 22 / 42 = 52%.

False positive rate, also known as fall-out, is the complement of specificity (FPR = 1 - specificity); its formula is FPR = FP / (FP + TN), which indicates the proportion of false positive predictions out of all the instances which are actually negative. Our example has an FPR of 18 / 40 = 45%.

False negative rate, also known as miss rate, is the complement of sensitivity (FNR = 1 - sensitivity); its formula is FNR = FN / (FN + TP), which indicates the proportion of false negative predictions out of all the instances which are actually positive. Our example has an FNR of 20 / 60 = 33%.

Accuracy is the metric which denotes the proportion of all predictions the model gets right, with the formula accuracy = (TP + TN) / (TP + TN + FP + FN). Our prediction accuracy is 62 / 100 = 62%.

F1 score is another metric for measuring a model's accuracy. It takes into account both the precision and recall values by computing their harmonic mean, depicted by the formula F1 = 2 * precision * recall / (precision + recall). Our model has an F1 score of 68%.
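
The following short R sketch verifies the metrics quoted above directly from the example's confusion matrix counts (TN = 22, FP = 18, FN = 20, TP = 40); the counts come from the worked example, while the variable names are just illustrative:

```r
TN <- 22; FP <- 18; FN <- 20; TP <- 40

specificity <- TN / (TN + FP)                     # 0.55
sensitivity <- TP / (TP + FN)                     # ~0.67 (recall)
precision   <- TP / (TP + FP)                     # ~0.69
npv         <- TN / (TN + FN)                     # ~0.52
fpr         <- FP / (FP + TN)                     # 0.45
fnr         <- FN / (FN + TP)                     # ~0.33
accuracy    <- (TP + TN) / (TP + TN + FP + FN)    # 0.62
f1.score    <- 2 * precision * sensitivity / (precision + sensitivity)  # ~0.68

round(c(specificity, sensitivity, precision, npv,
        fpr, fnr, accuracy, f1.score), 2)
```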

A Receiver Operating Characteristic (ROC) curve is basically a plot used to visualize the model performance as we vary its prediction threshold. The ROC plot is defined with the FPR and TPR as the x and y axes respectively; each threshold yields one (FPR, TPR) point in the ROC space, and varying the threshold traces out the curve. A perfect model would have a TPR of 1 and an FPR of 0. A baseline model making random predictions corresponds to the diagonal straight line from (0, 0) to (1, 1), along which TPR equals FPR. If our model's ROC curve lies above this base diagonal line, it indicates that it is performing better than the baseline. The following figure shows what a typical ROC curve looks like:

[Figure: a typical ROC curve rising above the baseline diagonal]

Area under the curve (AUC) is basically the area under the ROC curve obtained from the model evaluation. The AUC is a value which indicates the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative one; therefore, the higher the AUC, the better. Do check out the file performance_plot_utils.R (shared with the code bundle of the chapter), which has some utility functions for plotting and depicting these values; we will use them later when we evaluate our models.
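
As an illustrative sketch (not the chapter's performance_plot_utils.R), the ROC curve and AUC can be obtained with the ROCR package; the predicted.probs vector of class-1 probabilities and the test.data$credit.rating labels are assumed to come from a previously fitted model:

```r
library(ROCR)

# Pair the predicted probabilities with the true class labels
pred.obj <- prediction(predictions = predicted.probs,
                       labels = test.data$credit.rating)

# ROC curve: true positive rate against false positive rate
roc.perf <- performance(pred.obj, measure = "tpr", x.measure = "fpr")
plot(roc.perf, main = "ROC curve")
abline(0, 1, lty = 2)   # baseline diagonal

# Area under the ROC curve
auc <- performance(pred.obj, measure = "auc")@y.values[[1]]
print(auc)
```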

This should give you enough background on important terms and concepts related to predictive modeling, and now we will start with our predictive analysis on the data!
