Chapter 5. Classification

Data scientists are often faced with a problem that requires an automated decision. Is an email an attempt at phishing? Is a customer likely to churn? Is the web user likely to click on an advertisement? These are all classification problems. Classification is perhaps the most important form of prediction: the goal is to predict whether a record is a 0 or a 1 (phishing/not-phishing, click/don’t click, churn/don’t churn), or in some cases, one of several categories (for example, Gmail’s filtering of your inbox into “primary,” “social,” “promotional,” or “forums”).

Often, we need more than a simple binary classification: we want to know the predicted probability that a case belongs to a class.

Rather than having a model simply assign a binary classification, most algorithms can return a probability score (propensity) of belonging to the class of interest. In fact, with logistic regression, the default output from R is on the log-odds scale, and this must be transformed to a propensity. A sliding cutoff can then be used to convert the propensity score to a decision. The general approach is as follows:

  1. Establish a cutoff probability for the class of interest above which we consider a record as belonging to that class.

  2. Estimate (with any model) the probability that a record belongs to the class of interest.

  3. If that probability is above the cutoff probability, assign the new record to the class of interest.

The higher the cutoff, the fewer records predicted as 1—that is, belonging to the class of interest. The lower the cutoff, the more records predicted as 1.
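
As a minimal illustration in R (the propensity values and the 0.5 cutoff here are arbitrary), the cutoff rule in step 3 is a one-line transformation:

propensity <- c(0.1, 0.45, 0.62, 0.80)   # predicted probabilities of the class of interest
cutoff <- 0.5                            # decision threshold
ifelse(propensity > cutoff, 1, 0)        # returns 0 0 1 1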

This chapter covers several key techniques for classification and estimating propensities; additional methods that can be used both for classification and numerical prediction are described in the next chapter.

Naive Bayes

The naive Bayes algorithm uses the probability of observing predictor values, given an outcome, to estimate the probability of observing outcome Y = i, given a set of predictor values.1

To understand Bayesian classification, we can start out by imagining “non-naive” Bayesian classification. For each record to be classified:

  1. Find all the other records with the same predictor profile (i.e., where the predictor values are the same).

  2. Determine what classes those records belong to and which class is most prevalent (i.e., probable).

  3. Assign that class to the new record.

The preceding approach amounts to finding all the records in the sample that are exactly like the new record to be classified in the sense that all the predictor values are identical.

Note

Predictor variables must be categorical (factor) variables in the standard naive Bayes algorithm. See “Numeric Predictor Variables” for two workarounds for using continuous variables.

Why Exact Bayesian Classification Is Impractical

When the number of predictor variables exceeds a handful, many of the records to be classified will be without exact matches. This can be understood in the context of a model to predict voting on the basis of demographic variables. Even a sizable sample may not contain even a single match for a new record who is a male Hispanic with high income from the US Midwest who voted in the last election, did not vote in the prior election, has three daughters and one son, and is divorced. And this is just eight variables, a small number for most classification problems. The addition of just a single new variable with five equally frequent categories reduces the probability of a match by a factor of 5.

Warning

Despite its name, naive Bayes is not considered a method of Bayesian statistics. Naive Bayes is a data-driven, empirical method requiring relatively little statistical expertise. The name comes from the Bayes rule-like calculation in forming the predictions—specifically the initial calculation of predictor value probabilities given an outcome, and then the final calculation of outcome probabilities.

The Naive Solution

In the naive Bayes solution, we no longer restrict the probability calculation to those records that match the record to be classified. Instead, we use the entire data set. The naive Bayes modification is as follows:

  1. For a binary response Y = i (i = 0 or 1), estimate the individual conditional probabilities for each predictor P(Xj|Y=i); these are the probabilities that the predictor value is in the record when we observe Y = i. This probability is estimated by the proportion of Xj values among the Y = i records in the training set.

  2. Multiply these probabilities by each other, and then by the proportion of records belonging to Y = i.

  3. Repeat steps 1 and 2 for all the classes.

  4. Estimate a probability for outcome i by taking the value calculated in step 2 for class i and dividing it by the sum of such values for all classes.

  5. Assign the record to the class with the highest probability for this set of predictor values.

This naive Bayes algorithm can also be stated as an equation for the probability of observing outcome Y = i, given a set of predictor values X1, ..., Xp:

P(Y=i \mid X_1, X_2, \ldots, X_p) = \frac{P(Y=i)\,P(X_1 \mid Y=i)\,P(X_2 \mid Y=i) \cdots P(X_p \mid Y=i)}{P(X_1, X_2, \ldots, X_p)}

The value of P(X1, X2, ..., Xp) is a scaling factor to ensure the probability is between 0 and 1 and does not depend on Y:

P(X_1, X_2, \ldots, X_p) = P(Y=0)\,P(X_1 \mid Y=0)\,P(X_2 \mid Y=0) \cdots P(X_p \mid Y=0) + P(Y=1)\,P(X_1 \mid Y=1)\,P(X_2 \mid Y=1) \cdots P(X_p \mid Y=1)

Why is this formula called “naive”? We have made a simplifying assumption that the exact conditional probability of a vector of predictor values, given observing an outcome, is sufficiently well estimated by the product of the individual conditional probabilities P(Xj|Y=i). In other words, in estimating P(Xj|Y=i) instead of P(X1, X2, ..., Xp|Y=i), we are assuming Xj is independent of all the other predictor variables Xk for k ≠ j.

Several packages in R can be used to estimate a naive Bayes model. The following fits a model using the klaR package:

library(klaR)
naive_model <- NaiveBayes(outcome ~ purpose_ + home_ + emp_len_,
                          data = na.omit(loan_data))
naive_model$table
$purpose_
          var
grouping   credit_card debt_consolidation home_improvement major_purchase
  paid off   0.1875965          0.5521591       0.07150104     0.05359270
  default    0.1515152          0.5757135       0.05981209     0.03727229
          var
grouping      medical      other small_business
  paid off 0.01424728 0.09990737     0.02099599
  default  0.01433549 0.11561025     0.04574126

$home_
          var
grouping   MORTGAGE       OWN      RENT
  paid off 0.489480 0.0808963 0.4296237
  default  0.431344 0.0832782 0.4853778

$emp_len_
          var
grouping     < 1 Year  > 1 Year
  paid off 0.03105289 0.9689471
  default  0.04728508 0.9527149

The output from the model is the conditional probabilities P(Xj|Y=i). The model can be used to predict the outcome of a new loan:

new_loan <- loan_data[147, c('purpose_', 'home_', 'emp_len_')]
row.names(new_loan) <- NULL
new_loan
        purpose_    home_ emp_len_
1 small_business MORTGAGE > 1 Year

In this case, the model predicts a default:

predict(naive_model, new_loan)
$class
[1] default
Levels: paid off default

$posterior
      paid off   default
[1,] 0.3463013 0.6536987

The prediction also returns a posterior estimate of the probability of default. The naive Bayesian classifier is known to produce biased estimates. However, where the goal is to rank records according to the probability that Y = 1, unbiased estimates of probability are not needed and naive Bayes produces good results.
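
To see how the posterior arises from the conditional probabilities printed earlier, here is a hand calculation for the new loan. It assumes roughly equal class priors, as in the balanced loan data; with different priors the result would change:

# conditional probabilities for small_business, MORTGAGE, > 1 Year (from naive_model$table)
lik_paid_off <- 0.02099599 * 0.489480 * 0.9689471
lik_default  <- 0.04574126 * 0.431344 * 0.9527149
# posterior probability of default, assuming priors of 0.5 for each class
0.5 * lik_default / (0.5 * lik_paid_off + 0.5 * lik_default)   # approximately 0.654, matching $posterior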

Numeric Predictor Variables

From the definition, we see that the Bayesian classifier works only with categorical predictors (e.g., with spam classification, where presence or absence of words, phrases, characters, and so on, lies at the heart of the predictive task). To apply naive Bayes to numerical predictors, one of two approaches must be taken:

  • Bin and convert the numerical predictors to categorical predictors and apply the algorithm of the previous section (a brief sketch follows this list).

  • Use a probability model—for example, the normal distribution (see “Normal Distribution”)—to estimate the conditional probability P(Xj|Y=i).
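
Here is a minimal sketch of the binning approach, using base R's cut function on one of the loan variables; the breakpoints are arbitrary:

# convert payment_inc_ratio into a categorical predictor with four bins
loan_data$payment_inc_cat <- cut(loan_data$payment_inc_ratio,
                                 breaks=c(0, 5, 10, 15, Inf), include.lowest=TRUE)
# the resulting factor can be passed to NaiveBayes() like any other categorical predictor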

Caution

When a predictor category is absent in the training data, the algorithm assigns zero probability to the outcome variable in new data, rather than simply ignoring this variable and using the information from other variables, as other methods might. This is something to pay attention to when binning continuous variables.

Further Reading

  • Elements of Statistical Learning, 2nd ed., by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (Springer, 2009).

  • There is a full chapter on naive Bayes in Data Mining for Business Analytics, 3rd ed., by Galit Shmueli, Peter Bruce, and Nitin Patel (Wiley, 2016, with variants for R, Excel, and JMP).

Discriminant Analysis

Discriminant analysis is the earliest statistical classifier; it was introduced by R. A. Fisher in 1936 in an article published in the Annals of Eugenics journal.2

While discriminant analysis encompasses several techniques, the most commonly used is linear discriminant analysis, or LDA. The original method proposed by Fisher was actually slightly different from LDA, but the mechanics are essentially the same. LDA is now less widely used with the advent of more sophisticated techniques, such as tree models and logistic regression.

However, you may still encounter LDA in some applications and it has links to other more widely used methods (such as principal components analysis; see “Principal Components Analysis”). In addition, discriminant analysis can provide a measure of predictor importance, and it is used as a computationally efficient method of feature selection.

Warning

Linear discriminant analysis should not be confused with Latent Dirichlet Allocation, also referred to as LDA. Latent Dirichlet Allocation is used in text and natural language processing and is unrelated to linear discriminant analysis.

Covariance Matrix

To understand discriminant analysis, it is first necessary to introduce the concept of covariance between two or more variables. The covariance measures the relationship between two variables x and z. Denote the mean for each variable by x¯ and z¯ (see “Mean”). The covariance sx,z between x and z is given by:

s_{x,z} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(z_i - \bar{z})}{n-1}

where n is the number of records (note that we divide by n – 1 instead of n: see “Degrees of Freedom, and n or n – 1?”).

As with the correlation coefficient (see “Correlation”), positive values indicate a positive relationship and negative values indicate a negative relationship. Correlation, however, is constrained to be between –1 and 1, whereas covariance is on the same scale as the variables x and z. The covariance matrix Σ for x and z consists of the individual variable variances, s_x^2 and s_z^2, on the diagonal (where row and column are the same variable) and the covariances between variable pairs on the off-diagonals.

\hat{\Sigma} = \begin{pmatrix} s_x^2 & s_{x,z} \\ s_{z,x} & s_z^2 \end{pmatrix}
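
In R, the sample covariance matrix can be computed directly with the cov function; for example, for the two loan predictors used later in this chapter (assuming the loan3000 data set is loaded):

cov(loan3000[, c('borrower_score', 'payment_inc_ratio')])
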
Note

Recall that the standard deviation is used to normalize a variable to a z-score; the covariance matrix is used in a multivariate extension of this standardization process. This is known as Mahalanobis distance (see “Other Distance Metrics”) and is related to the LDA function.

Fisher’s Linear Discriminant

For simplicity, we focus on a classification problem in which we want to predict a binary outcome y using just two continuous numeric variables (x, z). Technically, discriminant analysis assumes the predictor variables are normally distributed continuous variables, but, in practice, the method works well even for nonextreme departures from normality, and for binary predictors. Fisher’s linear discriminant distinguishes variation between groups, on the one hand, from variation within groups on the other. Specifically, seeking to divide the records into two groups, LDA focuses on maximizing the “between” sum of squares SS_between (measuring the variation between the two groups) relative to the “within” sum of squares SS_within (measuring the within-group variation). In this case, the two groups correspond to the records (x_0, z_0) for which y = 0 and the records (x_1, z_1) for which y = 1. The method finds the linear combination w_x x + w_z z that maximizes that sum of squares ratio:

\frac{SS_{\text{between}}}{SS_{\text{within}}}

The between sum of squares is the squared distance between the two group means, and the within sum of squares is the spread around the means within each group, weighted by the covariance matrix. Intuitively, by maximizing the between sum of squares and minimizing the within sum of squares, this method yields the greatest separation between the two groups.

A Simple Example

The MASS package, associated with the book Modern Applied Statistics with S by W. N. Venables and B. D. Ripley (Springer, 1994), provides a function for LDA in R. The following applies this function to a sample of loan data using two predictor variables, borrower_score and payment_inc_ratio, and prints out the estimated linear discriminator weights.

library(MASS)
loan_lda <- lda(outcome ~ borrower_score + payment_inc_ratio,
                     data=loan3000)
loan_lda$scaling
                          LD1
borrower_score 	   7.17583880
payment_inc_ratio -0.09967559

Using Discriminant Analysis for Feature Selection

If the predictor variables are normalized prior to running LDA, the discriminator weights are measures of variable importance, thus providing a computationally efficient method of feature selection.
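
For example, a quick sketch using scale to standardize the two predictors before fitting, so the resulting weights are on comparable scales:

loan_lda_std <- lda(outcome ~ scale(borrower_score) + scale(payment_inc_ratio),
                    data=loan3000)
loan_lda_std$scaling   # weights are now comparable measures of importance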

The lda function can predict the probability of “default” versus “paid off”:

pred <- predict(loan_lda)
head(pred$posterior)
    default  paid off
1 0.5535437 0.4464563
2 0.5589534 0.4410466
3 0.2726962 0.7273038
4 0.5062538 0.4937462
5 0.6099525 0.3900475
6 0.4107406 0.5892594

A plot of the predictions helps illustrate how LDA works. Using the output from the predict function, a plot of the estimated probability of default is produced as follows:

# training-set predictions, with the probability of default, for plotting
lda_df <- cbind(loan3000, prob_default=pred$posterior[, 'default'])

x <- seq(from=.33, to=.73, length=100)
y <- seq(from=0, to=20, length=100)
newdata <- data.frame(borrower_score=x, payment_inc_ratio=y)
pred <- predict(loan_lda, newdata=newdata)
lda_df0 <- cbind(newdata, outcome=pred$class)

ggplot(data=lda_df, aes(x=borrower_score, y=payment_inc_ratio, color=prob_default)) +
  geom_point(alpha=.6) +
  scale_color_gradient2(low='white', high='blue') +
  scale_x_continuous(expand=c(0,0)) +
  scale_y_continuous(expand=c(0,0), lim=c(0, 20)) +
  geom_line(data=lda_df0, col='green', size=2, alpha=.8)

The resulting plot is shown in Figure 5-1.

Figure 5-1. LDA prediction of loan default using two variables: a score of the borrower’s creditworthiness and the payment to income ratio.

Using the discriminant function weights, LDA splits the predictor space into two regions as shown by the solid line. The predictions farther away from the line have a higher level of confidence (i.e., a probability further away from 0.5).

Extensions of Discriminant Analysis

More predictor variables: while the text and example in this section used just two predictor variables, LDA works just as well with more than two predictor variables. The only limiting factor is the number of records (estimating the covariance matrix requires a sufficient number of records per variable, which is typically not an issue in data science applications).

Quadratic Discriminant Analysis: There are other variants of discriminant analysis. The best known is quadratic discriminant analysis (QDA). Despite its name, QDA is still a linear discriminant function. The main difference is that in LDA, the covariance matrix is assumed to be the same for the two groups corresponding to Y = 0 and Y = 1. In QDA, the covariance matrix is allowed to be different for the two groups. In practice, the difference in most applications is not critical.

Further Reading

  • Elements of Statistical Learning, 2nd ed., by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, and its shorter cousin, An Introduction to Statistical Learning, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (both from Springer). Both have a section on discriminant analysis.

  • Data Mining for Business Analytics, 3rd ed., by Galit Shmueli, Peter Bruce, and Nitin Patel (Wiley, 2016, with variants for R, Excel, and JMP) has a full chapter on discriminant analysis.

  • For historical interest, Fisher’s original article on the topic, “The Use of Multiple Measurements in Taxonomic Problems,” as published in 1936 in Annals of Eugenics (now called Annals of Genetics) can be found online.

Logistic Regression

Logistic regression is analogous to multiple linear regression, except the outcome is binary. Various transformations are employed to convert the problem to one in which a linear model can be fit. Like discriminant analysis, and unlike K-Nearest Neighbor and naive Bayes, logistic regression is a structured model approach, rather than a data-centric approach. Due to its fast computational speed and its output of a model that lends itself to rapid scoring of new data, it is a popular method.

How do we get from a binary outcome variable to an outcome variable that can be modeled in linear fashion, then back again to a binary outcome?

Logistic Response Function and Logit

The key ingredients are the logistic response function and the logit, in which we map a probability (which is on a 0–1 scale) to a more expansive scale suitable for linear modeling.

The first step is to think of the outcome variable not as a binary label, but as the probability p that the label is a “1.” Naively, we might be tempted to model p as a linear function of the predictor variables:

p = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q

However, fitting this model does not ensure that p will end up between 0 and 1, as a probability must.

Instead, we model p by applying a logistic response or inverse logit function to the predictors:

p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q)}}

This transform ensures that p stays between 0 and 1.

To get the exponential expression out of the denominator, we consider odds instead of probabilities. Odds, familiar to bettors everywhere, are the ratio of “successes” (1) to “nonsuccesses” (0). In terms of probabilities, odds are the probability of an event divided by the probability that the event will not occur. For example, if the probability that a horse will win is 0.5, the probability of “won’t win” is (1 – 0.5) = 0.5, and the odds are 1.0.

\text{Odds}(Y=1) = \frac{p}{1-p}

We can obtain the probability from the odds using the inverse odds function:

p = \frac{\text{Odds}}{1 + \text{Odds}}

We combine this with the logistic response function, shown earlier, to get:

\text{Odds}(Y=1) = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q}

Finally, taking the logarithm of both sides, we get an expression that involves a linear function of the predictors:

\log(\text{Odds}(Y=1)) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q

The log-odds function, also known as the logit function, maps the probability p from (0, 1) to any value in (−∞, +∞): see Figure 5-2. The transformation circle is complete; we have used a linear model to predict a probability, which, in turn, we can map to a class label by applying a cutoff rule—any record with a probability greater than the cutoff is classified as a 1.
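
In R, the logit and the logistic response function are available as qlogis and plogis, which makes it easy to verify that the two transformations are inverses:

p <- c(0.1, 0.5, 0.9)
log_odds <- qlogis(p)     # logit: log(p / (1 - p)), giving -2.197, 0, 2.197
plogis(log_odds)          # logistic response (inverse logit): recovers 0.1, 0.5, 0.9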

Figure 5-2. The function that maps a probability to a scale suitable for a linear model (logit)

Logistic Regression and the GLM

The response in the logistic regression formula is the log odds of a binary outcome of 1. We only observe the binary outcome, not the log odds, so special statistical methods are needed to fit the equation. Logistic regression is a special instance of a generalized linear model (GLM) developed to extend linear regression to other settings.

In R, to fit a logistic regression, the glm function is used with the family parameter set to binomial. The following code fits a logistic regression to the personal loan data introduced in “K-Nearest Neighbors”.

logistic_model <- glm(outcome ~ payment_inc_ratio + purpose_ +
                        home_ + emp_len_ + borrower_score,
                      data=loan_data, family='binomial')
logistic_model

Call:  glm(formula = outcome ~ payment_inc_ratio + purpose_ + home_ +
	emp_len_ + borrower_score, family = "binomial", data = loan_data)

Coefficients:
               (Intercept)           payment_inc_ratio
                  -1.63809                    -0.07974
purpose_debt_consolidation    purpose_home_improvement
                  -0.24937                    -0.40774
    purpose_major_purchase             purpose_medical
                  -0.2296                     -0.51048
             purpose_other      purpose_small_business
                  -0.62066                    -1.21526
                  home_OWN                   home_RENT
                  -0.04833                    -0.15732
         emp_len_ > 1 Year              borrower_score
                  0.35673                      4.61264

Degrees of Freedom: 45341 Total (i.e. Null);  45330 Residual
Null Deviance: 	62860
Residual Deviance: 57510  	AIC: 57540

The response is outcome, which takes a 0 if the loan is paid off and 1 if the loan defaults. purpose_ and home_ are factor variables representing the purpose of the loan and the home ownership status. As in regression, a factor variable with P levels is represented with P – 1 columns. By default in R, the reference coding is used and the levels are all compared to the reference level (see “Factor Variables in Regression”). The reference levels for these factors are credit_card and MORTGAGE, respectively. The variable borrower_score is a score from 0 to 1 representing the creditworthiness of the borrower (from poor to excellent). This variable was created from several other variables using K-Nearest Neighbor: see “KNN as a Feature Engine”.

Generalized Linear Models

Generalized linear models (GLMs) are, after ordinary linear regression, the most important class of statistical models. GLMs are characterized by two main components:

  • A probability distribution or family (binomial in the case of logistic regression)

  • A link function mapping the response to the predictors (logit in the case of logistic regression)

Logistic regression is by far the most common form of GLM. A data scientist will encounter other types of GLMs. Sometimes a log link function is used instead of the logit; in practice, use of a log link is unlikely to lead to very different results for most applications. The Poisson distribution is commonly used to model count data (e.g., the number of times a user visits a web page in a certain amount of time). Other families include negative binomial and gamma, often used to model elapsed time (e.g., time to failure). In contrast to logistic regression, application of GLMs with these models is more nuanced and involves greater care. These are best avoided unless you are familiar with and understand the utility and pitfalls of these methods.
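
As a sketch of another GLM family, here is a Poisson regression fit to simulated count data (the data are made up purely for illustration):

set.seed(1)
web_data <- data.frame(age=runif(1000, 18, 70))
web_data$n_visits <- rpois(1000, lambda=exp(0.5 + 0.01 * web_data$age))  # simulated counts
visits_model <- glm(n_visits ~ age, data=web_data, family='poisson')     # log link by default
coef(visits_model)   # estimates should be close to the simulated 0.5 and 0.01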

Predicted Values from Logistic Regression

The predicted value from logistic regression is in terms of the log odds: \hat{Y} = \log(\text{Odds}(Y=1)). The predicted probability is given by the logistic response function:

\hat{p} = \frac{1}{1 + e^{-\hat{Y}}}

For example, look at the predictions from the model logistic_model:

pred <- predict(logistic_model)
summary(pred)
      Min.  1st Qu.    Median      Mean    3rd Qu.     Max.
-3.510000 -0.505100  0.008539 -0.002564  0.518800  2.705000

Converting these values to probabilities is a simple transform:

prob <- 1/(1 + exp(-pred))
summary(prob)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.06269 0.37310 0.49790 0.50000 0.62360 0.97100

These are on a scale from 0 to 1 and don’t yet declare whether the predicted value is default or paid off. We could declare any value greater than 0.5 as default, analogous to the K-Nearest Neighbors classifier. In practice, a lower cutoff is often appropriate if the goal is to identify members of a rare class (see “The Rare Class Problem”).
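
Note that predict for a glm object can also return the probabilities directly by setting type='response', which is equivalent to the manual transform above:

prob <- predict(logistic_model, type='response')
summary(prob)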

Interpreting the Coefficients and Odds Ratios

One advantage of logistic regression is that it produces a model that can be scored to new data rapidly, without recomputation. Another is the relative ease of interpretation of the model, as compared with other classification methods. The key conceptual idea is understanding an odds ratio. The odds ratio is easiest to understand for a binary factor variable X:

\text{odds ratio} = \frac{\text{Odds}(Y=1 \mid X=1)}{\text{Odds}(Y=1 \mid X=0)}

This is interpreted as the odds that Y = 1 when X = 1 versus the odds that Y = 1 when X = 0. If the odds ratio is 2, then the odds that Y = 1 are two times higher when X = 1 versus X = 0.

Why bother with an odds ratio, instead of probabilities? We work with odds because the coefficient βj in the logistic regression is the log of the odds ratio for Xj.

An example will make this more explicit. For the model fit in “Logistic Regression and the GLM”, the regression coefficient for purpose_small_business is 1.21526. This means that a loan to a small business, compared to a loan to pay off credit card debt, increases the odds of defaulting versus being paid off by a factor of exp(1.21526) ≈ 3.4. Clearly, loans for the purpose of creating or expanding a small business are considerably riskier than other types of loans.

Figure 5-3 shows the relationship between the odds ratio and log-odds ratio for odds ratios greater than 1. Because the coefficients are on the log scale, an increase of 1 in the coefficient results in an increase of exp(1) ≈ 2.72 in the odds ratio.

Figure 5-3. The relationship between the odds ratio and the log-odds ratio

Odds ratios for numeric variables X can be interpreted similarly: they measure the change in the odds ratio for a unit change in X. For example, the effect of increasing the payment to income ratio from, say, 5 to 6 increases the odds of the loan defaulting by a factor of exp(0.08244) ≈ 1.09. The variable borrower_score is a score on the borrowers’ creditworthiness and ranges from 0 (low) to 1 (high). The odds of the best borrowers relative to the worst borrowers defaulting on their loans is smaller by a factor of exp(-4.61264) ≈ 0.01. In other words, the default risk from the borrowers with the poorest creditworthiness is 100 times greater than that of the best borrowers!
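
Because each coefficient is the log of an odds ratio, the full set of odds ratios can be obtained by exponentiating the fitted coefficients:

exp(coef(logistic_model))   # odds ratios for a one-unit change in each predictor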

Linear and Logistic Regression: Similarities and Differences

Multiple linear regression and logistic regression share many commonalities. Both assume a parametric linear form relating the predictors with the response. Exploring and finding the best model are done in very similar ways. Extensions to the linear model, such as the use of a spline transform of a predictor, are equally applicable in the logistic regression setting. Logistic regression differs in two fundamental ways:

  • The way the model is fit (least squares is not applicable)

  • The nature and analysis of the residuals from the model

Fitting the model

Linear regression is fit using least squares, and the quality of the fit is evaluated using RMSE and R-squared statistics. In logistic regression (unlike in linear regression), there is no closed-form solution and the model must be fit using maximum likelihood estimation (MLE). Maximum likelihood estimation is a process that tries to find the model that is most likely to have produced the data we see. In the logistic regression equation, the response is not 0 or 1 but rather an estimate of the log odds that the response is 1. The MLE finds the solution such that the estimated log odds best describes the observed outcome. The mechanics of the algorithm involve a quasi-Newton optimization that iterates between a scoring step (Fisher’s scoring), based on the current parameters, and an update to the parameters to improve the fit.

Fortunately, most users don’t need to concern themselves with the details of the fitting algorithm since this is handled by the software. Most data scientists will not need to worry about the fitting method, other than understanding that it is a way to find a good model under certain assumptions.

Handling Factor Variables

In logistic regression, factor variables should be coded as in linear regression; see “Factor Variables in Regression”. In R and other software, this is normally handled automatically and generally reference encoding is used. All of the other classification methods covered in this chapter typically use the one hot encoder representation (see “One Hot Encoder”).

Assessing the Model

Like other classification methods, logistic regression is assessed by how accurately the model classifies new data (see “Evaluating Classification Models”). As with linear regression, some additional standard statistical tools are available to assess and improve the model. Along with the estimated coefficients, R reports the standard error of the coefficients (SE), a z-value, and a p-value:

summary(logistic_model)

Call:
glm(formula = outcome ~ payment_inc_ratio + purpose_ + home_ +
	emp_len_ + borrower_score, family = "binomial", data = loan_data)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.15528  -1.07421   0.05853   1.06908   2.51951

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)
(Intercept)                -1.638092   0.073708 -22.224  < 2e-16 ***
payment_inc_ratio          -0.079737   0.002487 -32.058  < 2e-16 ***
purpose_debt_consolidation -0.249373   0.027615  -9.030  < 2e-16 ***
purpose_home_improvement   -0.407743   0.046615  -8.747  < 2e-16 ***
purpose_major_purchase     -0.229628   0.053683  -4.277 1.89e-05 ***
purpose_medical            -0.510479   0.086780  -5.882 4.04e-09 ***
purpose_other              -0.620663   0.039436 -15.738  < 2e-16 ***
purpose_small_business     -1.215261   0.063320 -19.192  < 2e-16 ***
home_OWN                   -0.048330   0.038036  -1.271    0.204
home_RENT                  -0.157320   0.021203  -7.420 1.17e-13 ***
emp_len_ > 1 Year           0.356731   0.052622   6.779 1.21e-11 ***
borrower_score              4.612638   0.083558  55.203  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 62857  on 45341  degrees of freedom
Residual deviance: 57515  on 45330  degrees of freedom
AIC: 57539

Number of Fisher Scoring iterations: 4

Interpretation of the p-value comes with the same caveat as in regression, and should be viewed more as a relative indicator of variable importance (see “Assessing the Model”) than as a formal measure of statistical significance. A logistic regression model, which has a binary response, does not have an associated RMSE or R-squared. Instead, a logistic regression model is typically evaluated using more general metrics for classification; see “Evaluating Classification Models”.

Many other concepts for linear regression carry over to the logistic regression setting (and other GLMs). For example, you can use stepwise regression, fit interaction terms, or include spline terms. The same concerns regarding confounding and correlated variables apply to logistic regression (see “Interpreting the Regression Equation”). You can fit generalized additive models (see “Generalized Additive Models”) using the mgcv package:

library(mgcv)
logistic_gam <- gam(outcome ~ s(payment_inc_ratio) + purpose_ +
                        home_ + emp_len_ + s(borrower_score),
                      data=loan_data, family='binomial')

One area where logistic regression differs is in the analysis of the residuals. As in regression (see Figure 4-9), it is straightforward to compute partial residuals:

terms <- predict(logistic_gam, type='terms')
partial_resid <- resid(logistic_gam) + terms
df <- data.frame(payment_inc_ratio = loan_data[, 'payment_inc_ratio'],
                 terms = terms[, 's(payment_inc_ratio)'],
                 partial_resid = partial_resid[, 's(payment_inc_ratio)'])
ggplot(df, aes(x=payment_inc_ratio, y=partial_resid, solid = FALSE)) +
  geom_point(shape=46, alpha=.4) +
  geom_line(aes(x=payment_inc_ratio, y=terms),
            color='red', alpha=.5, size=1.5) +
  labs(y='Partial Residual')

The resulting plot is displayed in Figure 5-4. The estimated fit, shown by the line, goes between two sets of point clouds. The top cloud corresponds to a response of 1 (defaulted loans), and the bottom cloud corresponds to a response of 0 (loans paid off). This is very typical of residuals from a logistic regression since the output is binary. Partial residuals in logistic regression, while less valuable than in regression, are still useful to confirm nonlinear behavior and identify highly influential records.

Figure 5-4. Partial residuals from logistic regression
Warning

Some of the output from the summary function can effectively be ignored. The dispersion parameter does not apply to logistic regression and is there for other types of GLMs. The residual deviance and the number of scoring iterations are related to the maximum likelihood fitting method; see “Maximum Likelihood Estimation”.

Further Reading

  • The standard reference on logistic regression is Applied Logistic Regression, 3rd ed., by David Hosmer, Stanley Lemeshow, and Rodney Sturdivant (Wiley).

  • Also popular are two books by Joseph Hilbe: Logistic Regression Models (very comprehensive) and Practical Guide to Logistic Regression (compact), both from CRC Press.

  • Elements of Statistical Learning, 2nd ed., by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, and its shorter cousin, An Introduction to Statistical Learning, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (both from Springer), both have a section on logistic regression.

  • Data Mining for Business Analytics, 3rd ed., by Galit Shmueli, Peter Bruce, and Nitin Patel (Wiley, 2016, with variants for R, Excel, and JMP) has a full chapter on logistic regression.

Evaluating Classification Models

It is common in predictive modeling to try out a number of different models, apply each to a holdout sample (also called a test or validation sample), and assess their performance. Fundamentally, this amounts to seeing which produces the most accurate predictions.

A simple way to measure classification performance is to count the proportion of predictions that are correct.

In most classification algorithms, each case is assigned an “estimated probability of being a 1.”3 The default decision point, or cutoff, is typically 0.50 or 50%. If the probability is above 0.5, the classification is “1,” otherwise it is “0.” An alternative default cutoff is the prevalent probability of 1s in the data.

Accuracy is simply the proportion of cases classified correctly:

\text{accuracy} = \frac{\text{TruePositive} + \text{TrueNegative}}{\text{SampleSize}}

Confusion Matrix

At the heart of classification metrics is the confusion matrix. The confusion matrix is a table showing the number of correct and incorrect predictions categorized by type of response. Several packages are available in R to compute a confusion matrix, but in the binary case, it is simple to compute one by hand.

To illustrate the confusion matrix, consider the logistic_gam model that was trained on a balanced data set with an equal number of defaulted and paid-off loans (see Figure 5-4). Following the usual convention, Y = 1 corresponds to the event of interest (e.g., default) and Y = 0 corresponds to a negative (or usual) event (e.g., paid off). The following computes the confusion matrix for the logistic_gam model applied to the entire (unbalanced) training set:

pred <- predict(logistic_gam, newdata=train_set)
pred_y <- as.numeric(pred > 0)
true_y <- as.numeric(train_set$outcome=='default')
true_pos <- (true_y==1) & (pred_y==1)
true_neg <- (true_y==0) & (pred_y==0)
false_pos <- (true_y==0) & (pred_y==1)
false_neg <- (true_y==1) & (pred_y==0)
conf_mat <- matrix(c(sum(true_pos), sum(false_pos),
                     sum(false_neg), sum(true_neg)), 2, 2)
colnames(conf_mat) <- c('Yhat = 1', 'Yhat = 0')
rownames(conf_mat) <- c('Y = 1', 'Y = 0')
conf_mat
      Yhat = 1 Yhat = 0
Y = 1 14295    8376
Y = 0 8052     14619

The predicted outcomes are columns and the true outcomes are the rows. The diagonal elements of the matrix show the number of correct predictions and the off-diagonal elements show the number of incorrect predictions. For example, 14,295 defaulted loans were correctly predicted as a default, but 8,376 defaulted loans were incorrectly predicted as paid off.
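
The overall accuracy defined earlier can be read off the confusion matrix as the proportion of cases on the diagonal:

# accuracy = (true positives + true negatives) / sample size
sum(diag(conf_mat)) / sum(conf_mat)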

Figure 5-5 shows the relationship between the confusion matrix for a binary response Y and different metrics (see “Precision, Recall, and Specificity” for more on the metrics). As with the example for the loan data, the actual response is along the rows and the predicted response is along the columns. (You may see confusion matrices with this reversed.) The diagonal boxes (upper left, lower right) show when the predictions Y^ correctly predict the response. One important metric not explicitly called out is the false positive rate (the mirror image of precision). When 1s are rare, the ratio of false positives to all predicted positives can be high, leading to the unintuitive situation where a predicted 1 is most likely a 0. This problem plagues medical screening tests (e.g., mammograms) that are widely applied: due to the relative rarity of the condition, positive test results most likely do not mean breast cancer. This leads to much confusion in the public.

Figure 5-5. Confusion matrix for a binary response and various metrics

The Rare Class Problem

In many cases, there is an imbalance in the classes to be predicted, with one class much more prevalent than the other—for example, legitimate insurance claims versus fraudulent ones, or browsers versus purchasers at a website. The rare class (e.g., the fraudulent claims) is usually the class of more interest, and is typically designated 1, in contrast to the more prevalent 0s. In the typical scenario, the 1s are the more important case, in the sense that misclassifying them as 0s is costlier than misclassifying 0s as 1s. For example, correctly identifying a fraudulent insurance claim may save thousands of dollars. On the other hand, correctly identifying a nonfraudulent claim merely saves you the cost and effort of going through the claim by hand with a more careful review (which is what you would do if the claim were tagged as “fraudulent”).

In such cases, unless the classes are easily separable, the most accurate classification model may be one that simply classifies everything as a 0. For example, if only 0.1% of the browsers at a web store end up purchasing, a model that predicts that each browser will leave without purchasing will be 99.9% accurate. However, it will be useless. Instead, we would be happy with a model that is less accurate overall, but is good at picking out the purchasers, even if it misclassifies some nonpurchasers along the way.

Precision, Recall, and Specificity

Metrics other than pure accuracy—metrics that are more nuanced—are commonly used in evaluating classification models. Several of these have a long history in statistics—especially biostatistics, where they are used to describe the expected performance of diagnostic tests. The precision measures the accuracy of a predicted positive outcome (see Figure 5-5):

\text{precision} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalsePositive}}

The recall, also known as sensitivity, measures the strength of the model to predict a positive outcome—the proportion of the 1s that it correctly identifies (see Figure 5-5). The term sensitivity is used a lot in biostatistics and medical diagnostics, whereas recall is used more in the machine learning community. The definition of recall is:

\text{recall} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalseNegative}}

Another metric used is specificity, which measures a model’s ability to predict a negative outcome:

\text{specificity} = \frac{\text{TrueNegative}}{\text{TrueNegative} + \text{FalsePositive}}

All three metrics can be computed directly from the confusion matrix calculated earlier:
# precision
conf_mat[1,1]/sum(conf_mat[,1])
# recall
conf_mat[1,1]/sum(conf_mat[1,])
# specificity
conf_mat[2,2]/sum(conf_mat[2,])

ROC Curve

You can see that there is a tradeoff between recall and specificity. Capturing more 1s generally means misclassifying more 0s as 1s. The ideal classifier would do an excellent job of classifying the 1s, without misclassifying more 0s as 1s.

The metric that captures this tradeoff is the “Receiver Operating Characteristics” curve, usually referred to as the ROC curve. The ROC curve plots recall (sensitivity) on the y-axis against specificity on the x-axis.4 It shows the tradeoff between recall and specificity as you change the cutoff used to classify a record. You may encounter two forms in which the x-axis is labeled:

  • Specificity plotted on the x-axis, with 1 on the left and 0 on the right

  • Specificity plotted on the x-axis, with 0 on the left and 1 on the right

The curve looks identical whichever way it is done. The process to compute the ROC curve is:

  1. Sort the records by the predicted probability of being a 1, starting with the most probable and ending with the least probable.

  2. Compute the cumulative specificity and recall based on the sorted records.

Computing the ROC curve in R is straightforward. The following code computes ROC for the loan data:

idx <- order(-pred)
recall <- cumsum(true_y[idx]==1)/sum(true_y==1)
specificity <- (sum(true_y==0) - cumsum(true_y[idx]==0))/sum(true_y==0)
roc_df <- data.frame(recall = recall, specificity = specificity)
ggplot(roc_df, aes(x=specificity, y=recall)) +
  geom_line(color='blue') +
  scale_x_reverse(expand=c(0, 0)) +
  scale_y_continuous(expand=c(0, 0)) +
  geom_line(data=data.frame(x=(0:100)/100), aes(x=x, y=1-x),
            linetype='dotted', color='red')

The result is shown in Figure 5-6. The dotted diagonal line corresponds to a classifier no better than random chance. An extremely effective classifier (or, in medical situations, an extremely effective diagnostic test) will have an ROC that hugs the upper-left corner—it will correctly identify lots of 1s without misclassifying lots of 0s as 1s. For this model, if we want a classifier with a specificity of at least 50%, then the recall is about 75%.

Figure 5-6. ROC curve for the loan data

Precision-Recall Curve

In addition to ROC curves, it can be illuminating to examine the precision-recall (PR) curve. PR curves are computed in a similar way except that the data is ordered from least to most probable and cumulative precision and recall statistics are computed. PR curves are especially useful in evaluating data with highly unbalanced outcomes.
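
A sketch of the computation in R, reusing pred and true_y from the confusion matrix example (records sorted here from most to least probable; the resulting curve is the same either way):

idx <- order(-pred)
precision <- cumsum(true_y[idx] == 1) / seq_along(idx)   # precision at each cutoff
recall <- cumsum(true_y[idx] == 1) / sum(true_y == 1)    # recall at each cutoff
pr_df <- data.frame(recall = recall, precision = precision)
ggplot(pr_df, aes(x=recall, y=precision)) + geom_line(color='blue')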

AUC

The ROC curve is a valuable graphical tool but, by itself, doesn’t constitute a single measure for the performance of a classifier. The ROC curve can be used, however, to produce the area underneath the curve (AUC) metric. AUC is simply the total area under the ROC curve. The larger the value of AUC, the more effective the classifier. An AUC of 1 indicates a perfect classifier: it gets all the 1s correctly classified, and doesn’t misclassify any 0s as 1s.

A completely ineffective classifier—the diagonal line—will have an AUC of 0.5.

Figure 5-7 shows the area under the ROC curve for the loan model. The value of AUC can be computed by a numerical integration:

sum(roc_df$recall[-1] * diff(1-roc_df$specificity))
    [1] 0.6926172

The model has an AUC of about 0.69, corresponding to a relatively weak classifier.

Figure 5-7. Area under the ROC curve for the loan data

False Positive Rate Confusion

False positive/negative rates are often confused or conflated with specificity or sensitivity (even in publications and software!). Sometimes the false positive rate is defined as the proportion of true negatives that test positive. In many cases (such as network intrusion detection), the term is used to refer to the proportion of positive signals that are true negatives.

Lift

Using the AUC as a metric is an improvement over simple accuracy, as it can assess how well a classifier handles the tradeoff between overall accuracy and the need to identify the more important 1s. But it does not completely address the rare-case problem, where you need to lower the model’s probability cutoff below 0.5 to avoid having all records classified as 0. In such cases, for a record to be classified as a 1, it might be sufficient to have a probability of 0.4, 0.3, or lower. In effect, we end up overidentifying 1s, reflecting their greater importance.

Changing this cutoff will improve your chances of catching the 1s (at the cost of misclassifying more 0s as 1s). But what is the optimum cutoff?

The concept of lift lets you defer answering that question. Instead, you consider the records in order of their predicted probability of being 1s. Say, of the top 10% classified as 1s, how much better did the algorithm do, compared to the benchmark of simply picking blindly? If you can get 0.3% response in this top decile instead of the 0.1% you get overall picking randomly, the algorithm is said to have a lift (also called gains) of 3 in the top decile. A lift chart (gains chart) quantifies this over the range of the data. It can be produced decile by decile, or continuously over the range of the data.

To compute a lift chart, you first produce a cumulative gains chart that shows the recall on the y-axis and the total number of records on the x-axis. The lift curve is the ratio of the cumulative gains to the diagonal line corresponding to random selection. Decile gains charts are one of the oldest techniques in predictive modeling, dating from the days before internet commerce. They were particularly popular among direct mail professionals. Direct mail is an expensive method of advertising if applied indiscriminately, and advertisers used predictive models (quite simple ones, in the early days) to identify the potential customers with the likeliest prospect of payoff.
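
A sketch of the cumulative gains and lift calculation in R, again reusing pred and true_y; the lift in the top decile is the value at 10% of the records:

idx <- order(-pred)
gains <- cumsum(true_y[idx] == 1) / sum(true_y == 1)   # cumulative proportion of 1s captured
frac <- seq_along(idx) / length(idx)                   # cumulative proportion of records
lift <- gains / frac                                   # ratio to blind (random) selection
lift[round(0.1 * length(idx))]                         # lift in the top decile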

Uplift

Sometimes the term uplift is used to mean the same thing as lift. An alternate meaning is used in a more restrictive setting, when an A-B test has been conducted and the treatment (A or B) is then used as a predictor variable in a predictive model. The uplift is the improvement in response predicted for an individual case with treatment A versus treatment B. This is determined by scoring the individual case first with the predictor set to A, and then again with the predictor toggled to B. Marketers and political campaign consultants use this method to determine which of two messaging treatments should be used with which customers or voters.

A lift curve lets you look at the consequences of setting different probability cutoffs for classifying records as 1s. It can be an intermediate step in settling on an appropriate cutoff level. For example, a tax authority might only have a certain amount of resources that it can spend on tax audits, and want to spend them on the likeliest tax cheats. With its resource constraint in mind, the authority would use a lift chart to estimate where to draw the line between tax returns selected for audit and those left alone.

Further Reading

Evaluation and assessment are typically covered in the context of a particular model (e.g., K-Nearest Neighbors or decision trees); three books that devote a chapter to the topic are:

  • Data Mining, 3rd ed., by Ian Witten, Eibe Frank, and Mark Hall (Morgan Kaufmann, 2011).

  • Modern Data Science with R by Benjamin Baumer, Daniel Kaplan, and Nicholas Horton (CRC Press, 2017).

  • Data Mining for Business Analytics, 3rd ed., by Galit Shmueli, Peter Bruce, and Nitin Patel (Wiley, 2016, with variants for R, Excel, and JMP).

Strategies for Imbalanced Data

The previous section dealt with evaluation of classification models using metrics that go beyond simple accuracy, and are suitable for imbalanced data—data in which the outcome of interest (purchase on a website, insurance fraud, etc.) is rare. In this section, we look at additional strategies that can improve predictive modeling performance with imbalanced data.

Undersampling

If you have enough data, as is the case with the loan data, one solution is to undersample (or downsample) the prevalent class, so the data to be modeled is more balanced between 0s and 1s. The basic idea in undersampling is that the data for the dominant class has many redundant records. Dealing with a smaller, more balanced data set yields benefits in model performance, and makes it easier to prepare the data, and to explore and pilot models.

How much data is enough? It depends on the application, but in general, having tens of thousands of records for the less dominant class is enough. The more easily distinguishable the 1s are from the 0s, the less data needed.

The loan data analyzed in “Logistic Regression” was based on a balanced training set: half of the loans were paid off and the other half were in default. The predicted values were similar: half of the probabilities were less than 0.5 and half were greater than 0.5. In the full data set, only about 19% of the loans were in default:

mean(full_train_set$outcome=='default')
[1] 0.1889455

What happens if we use the full data set to train the model?

full_model <- glm(outcome ~ payment_inc_ratio + purpose_ +
                   home_ + emp_len_+ dti + revol_bal + revol_util,
                 data=full_train_set, family='binomial')
pred <- predict(full_model)
mean(pred > 0)
[1] 0.003942094

Only 0.39% of the loans are predicted to be in default, about 1/48 of the expected number. The loans that were paid off overwhelm the loans in default because the model is trained using all the data equally. Thinking about it intuitively, the presence of so many nondefaulting loans, coupled with the inevitable variability in predictor data, means that, even for a defaulting loan, the model is likely to find some nondefaulting loans that it is similar to, by chance. When a balanced sample was used, roughly 50% of the loans were predicted to be in default.

Oversampling and Up/Down Weighting

One criticism of the undersampling method is that it throws away data and is not using all the information at hand. If you have a relatively small data set, and the rarer class contains a few hundred or a few thousand records, then undersampling the dominant class has the risk of throwing out useful information. In this case, instead of downsampling the dominant case, you should oversample (upsample) the rarer class by drawing additional rows with replacement (bootstrapping).

You can achieve a similar effect by weighting the data. Many classification algorithms take a weight argument that will allow you to up/down weight the data. For example, apply a weight vector to the loan data using the weight argument to glm:

wt <- ifelse(full_train_set$outcome=='default',
              1/mean(full_train_set$outcome == 'default'), 1)
full_model <- glm(outcome ~ payment_inc_ratio + purpose_ +
                    home_ + emp_len_+ dti + revol_bal + revol_util,
                  data=full_train_set, weight=wt, family='quasibinomial')
pred <- predict(full_model)
mean(pred > 0)
[1] 0.5767208

The weights for loans that default are set to 1/p, where p is the probability of default. The nondefaulting loans have a weight of 1. The sums of the weights for the defaulted and nondefaulted loans are roughly equal. The mean of the predicted values is now about 58% instead of 0.39%.

Note that weighting provides an alternative to both upsampling the rarer class and downsampling the dominant class.

Adapting the Loss Function

Many classification and regression algorithms optimize a certain criterion or loss function. For example, logistic regression attempts to minimize the deviance. In the literature, some propose to modify the loss function in order to avoid the problems caused by a rare class. In practice, this is hard to do: classification algorithms can be complex and difficult to modify. Weighting is an easy way to change the loss function, discounting errors for records with low weights in favor of records with higher weights.

Data Generation

A variation of upsampling via bootstrapping (see “Oversampling and Up/Down Weighting”) is data generation by perturbing existing records to create new records. The intuition behind this idea is that since we only observe a limited set of instances, the algorithm doesn’t have a rich set of information to build classification “rules.” By creating new records that are similar but not identical to existing records, the algorithm has a chance to learn a more robust set of rules. This notion is similar in spirit to ensemble statistical models such as boosting and bagging (see Chapter 6).

The idea gained traction with the publication of the SMOTE algorithm, which stands for “Synthetic Minority Oversampling Technique.” The SMOTE algorithm finds a record that is similar to the record being upsampled (see “K-Nearest Neighbors”) and creates a synthetic record that is a randomly weighted average of the original record and the neighboring record, where the weight is generated separately for each predictor. The number of synthetic oversampled records created depends on the oversampling ratio required to bring the data set into approximate balance, with respect to outcome classes.

There are several implementations of SMOTE in R. The most comprehensive package for handling unbalanced data is unbalanced. It offers a variety of techniques, including a “Racing” algorithm to select the best method. However, the SMOTE algorithm is simple enough that it can be implemented directly in R using the FNN package.
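
To make the idea concrete, here is a bare-bones sketch of generating a single SMOTE-style synthetic record with the FNN package; a real implementation (such as the unbalanced package) also handles categorical predictors, the oversampling ratio, and edge cases:

library(FNN)
# numeric predictors for the minority (default) class only
defaults <- full_train_set[full_train_set$outcome == 'default',
                           c('payment_inc_ratio', 'dti')]
nn_idx <- get.knn(defaults, k=5)$nn.index        # 5 nearest minority-class neighbors
i <- sample(nrow(defaults), 1)                   # pick a minority record at random
j <- nn_idx[i, sample(5, 1)]                     # and one of its neighbors
w <- runif(ncol(defaults))                       # a separate random weight per predictor
synthetic <- w * unlist(defaults[i, ]) + (1 - w) * unlist(defaults[j, ])
synthetic                                        # one new synthetic minority record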

Cost-Based Classification

In practice, accuracy and AUC are a poor man’s way to choose a classification rule. Often, an estimated cost can be assigned to false positives versus false negatives, and it is more appropriate to incorporate these costs to determine the best cutoff when classifying 1s and 0s. For example, suppose the expected cost of a default of a new loan is C and the expected return from a paid-off loan is R. Then the expected return for that loan is:

\text{expected return} = P(Y=0) \times R + P(Y=1) \times C

Instead of simply labeling a loan as default or paid off, or determining the probability of default, it makes more sense to determine if the loan has a positive expected return. Predicted probability of default is an intermediate step, and it must be combined with the loan’s total value to determine expected profit, which is the ultimate planning metric of business. For example, a smaller value loan might be passed over in favor of a larger one with a slightly higher predicted default probability.
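
A sketch of the calculation in R, with hypothetical values for R and C and assuming the model's predicted probability (type='response') is the probability of default:

p_default <- predict(full_model, type='response')  # predicted probability of default
R_paid <- 2000        # hypothetical return from a paid-off loan
C_default <- -10000   # hypothetical (negative) return from a default
expected_return <- (1 - p_default) * R_paid + p_default * C_default
mean(expected_return > 0)   # fraction of loans with a positive expected return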

Exploring the Predictions

A single metric, such as AUC, cannot capture all aspects of the appropriateness of a model for a situation. Figure 5-8 displays the decision rules for four different models fit to the loan data using just two predictor variables: borrower_score and payment_inc_ratio. The models are linear discriminant analysis (LDA), logistic linear regression, logistic regression fit using a generalized additive model (GAM) and a tree model (see “Tree Models”). The region to the upper-left of the lines corresponds to a predicted default. LDA and logistic linear regression give nearly identical results in this case. The tree model produces the least regular rule: in fact, there are situations in which increasing the borrower score shifts the prediction from “paid-off” to “default”! Finally, the GAM fit of the logistic regression represents a compromise between the tree models and the linear models.

Figure 5-8. Comparison of the classification rules for four different methods

It is not easy to visualize the prediction rules in higher dimensions, or in the case of the GAM and tree model, even generate the regions for such rules.

In any case, exploratory analysis of predicted values is always warranted.

Further Reading

  • The SMOTE algorithm discussed in this section is described in Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” Journal of Artificial Intelligence Research 16 (2002): 321–357.

Summary

Classification, the process of predicting which of two (or a small number of) categories a record belongs to, is a fundamental tool of predictive analytics. Will a loan default (yes or no)? Will it prepay? Will a web visitor click on a link? Will she purchase something? Is an insurance claim fraudulent? Often in classification problems, one class is of primary interest (e.g., the fraudulent insurance claim) and, in binary classification, this class is designated as a 1, with the other, more prevalent class being a 0. Often, a key part of the process is estimating a propensity score, a probability of belonging to the class of interest. A common scenario is one in which the class of interest is relatively rare. The chapter concludes with a discussion of a variety of model assessment metrics that go beyond simple accuracy; these are important in the rare-class situation, when classifying all records as 0s can yield high accuracy.

1 This and subsequent sections in this chapter © 2017 Datastats, LLC, Peter Bruce and Andrew Bruce, used by permission.

2 It is certainly surprising that the first article on statistical classification was published in a journal devoted to eugenics. Indeed, there is a disconcerting connection between the early development of statistics and eugenics.

3 Not all methods provide unbiased estimates of probability. In most cases, it is sufficient that the method provide a ranking equivalent to the rankings that would result from an unbiased probability estimate; the cutoff method is then functionally equivalent.

4 The ROC curve was first used during World War II to describe the performance of radar receiving stations, whose job was to correctly identify (classify) reflected radar signals, and alert defense forces to incoming aircraft.
