Chapter 6: Logistic Regression

Introduction

Dependence Technique

The Linear Probability Model

The Logistic Function

A Straightforward Example Using JMP

Create a Dummy Variable

Use a Contingency Table to Determine the Odds Ratio

Calculate the Odds Ratio

A Realistic Logistic Regression Statistical Study

Understand the Model-Building Approach

Run Bivariate Analyses

Run the Initial Regression and Examine the Results

Convert a Continuous Variable to Discrete Variables

Produce Interaction Variables

Validate and Use the Model

Exercises

Introduction

Linear regression is designed for a continuous dependent variable, such as interest rates, budgets, and so on. Very often the dependent variable is not continuous, but discrete. There are many important situations in which the dependent variable is binary; that is, it can take on only two possible values. For example, will the loan applicant default? Will the cellphone customer switch to another carrier? Will a consumer buy a particular product? All these situations have a binary dependent variable, and, as will be seen in this chapter, linear regression cannot be used for a binary dependent variable. Consequently, statisticians have developed a specialized form of regression called logistic regression to handle these situations. This chapter shows you how to use logistic regression to run a regression when the dependent variable is binary.

Dependence Technique

Logistic regression, as shown in our multivariate analysis framework in Figure 6.1, is one of the dependence techniques in which the dependent variable is discrete and, more specifically, binary. That is, it takes on only two possible values. The following are some examples:

   Will a credit card applicant pay off a bill or not?

   Will a mortgage applicant default?

   Will someone who receives a direct mail solicitation respond to the solicitation?

In each of these cases, the answer is either “yes” or “no.” Such a categorical variable cannot directly be used as a dependent variable in a regression. But a simple transformation solves the problem: Let the dependent variable Y take on the value 1 for “yes” and 0 for “no.”

Figure 6.1: A Framework for Multivariate Analysis

image

Because Y takes on only the values 0 and 1, you know E[Yi] = 1 * P[Yi = 1] + 0 * P[Yi = 0] = P[Yi = 1]. But from the theory of regression, you also know that E[Yi] = a + b * Xi. (Simple regression is used here, but the same holds true for multiple regression.) Combining these two results, you have P[Yi = 1] = a + b * Xi. You can see that, in the case of a binary dependent variable, the regression might be interpreted as a probability. You then seek to use this regression to estimate the probability that Y takes on the value 1. If the estimated probability is high enough (for example, above .5), then you predict 1. Conversely, if the estimated probability of a 1 is low enough (for example, below .5), then you predict 0.

The Linear Probability Model

When linear regression is applied to a binary dependent variable, it is commonly called the linear probability model (LPM). Traditional linear regression is designed for a continuous dependent variable, and is not well-suited to handling a binary dependent variable. Three primary difficulties arise in the LPM. First, the predictions from a linear regression do not necessarily fall between zero and one. What are you to make of a predicted probability greater than one? How do you interpret a negative probability? A model that is capable of producing such nonsensical results does not inspire confidence.

Second, for any given predicted value of y (denoted ŷ), the residual (resid = y − ŷ) can take only two values. For example, if ŷ = 0.37, then the only possible values for the residual are resid = −0.37 or resid = 0.63 (= 1 − 0.37), because it has to be the case that ŷ + resid equals zero or 1. Clearly, the residuals will not be normal. Plotting a graph of ŷ versus resid will not produce a nice scatter of points, but two parallel lines. You should verify this assertion by running such a regression and making the requisite scatterplot. A further implication of the fact that the residual can take on only two values for any ŷ is that the residuals are heteroscedastic. This violates the linear regression assumption of homoscedasticity (constant variance). The estimates of the standard errors of the regression coefficients will not be stable, and inference will be unreliable.
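You can verify this outside of JMP as well. The following Python sketch (an illustration only, assuming numpy and matplotlib are installed; the data are simulated, not from any data set in this book) fits an LPM by ordinary least squares and plots the residuals against the fitted values. The points fall on two parallel lines, just as described.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Simulate a binary outcome whose probability rises with x
x = rng.uniform(0, 100, 200)
y = (rng.uniform(size=200) < x / 100).astype(float)

# Linear probability model: ordinary least squares of y on x
b1, b0 = np.polyfit(x, y, 1)    # slope, intercept
y_hat = b0 + b1 * x
resid = y - y_hat

# For a given y_hat, the residual is either -y_hat (when y = 0)
# or 1 - y_hat (when y = 1), so the plot shows two parallel lines
plt.scatter(y_hat, resid, s=10)
plt.xlabel("fitted value")
plt.ylabel("residual")
plt.show()
```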

Third, the linearity assumption is likely to be invalid, especially at the extremes of the independent variable. Suppose that you are modeling the probability that a consumer will pay back a $10,000 loan as a function of his or her income. The dependent variable is binary: 1 = the consumer pays back the loan, and 0 = the consumer does not pay back the loan. The independent variable is income, measured in dollars. A consumer whose income is $50,000 might have a probability of 0.5 of paying back the loan. If the consumer’s income is increased by $5,000, then the probability of paying back the loan might increase to 0.55, so that every $1,000 increase in income increases the probability of paying back the loan by 1%. A person with an income of $150,000 (who can pay the loan back very easily) might have a probability of 0.99 of paying back the loan. What happens to this probability when the consumer’s income is increased by $5,000? Probability cannot increase by 5%, because then it would exceed 100%. Yet, according to the linearity assumption of linear regression, it must do so.

The Logistic Function

A better way to model P[Yi = 1] would be to use a function that is not linear, one that increases slowly when P[Yi = 1] is close to zero or one and that increases more rapidly in between. It would have an “S” shape. One such function is the logistic function, whose cumulative distribution function is shown in Figure 6.2:

G(z) = 1/(1 + e^(−z)) = e^z/(1 + e^z)

Figure 6.2: The Logistic Function

image

Another useful representation of the logistic function is the following:

1 − G(z) = e^(−z)/(1 + e^(−z))

Recognize that the y-axis, G(z), is a probability, and let G(z) = π, the probability of the event’s occurring. You can form the odds ratio (the probability of the event occurring divided by the probability of the event not occurring) and do some simplifying:

π/(1 − π) = G(z)/(1 − G(z)) = [1/(1 + e^(−z))] / [e^(−z)/(1 + e^(−z))] = 1/e^(−z) = e^z

Consider taking the natural logarithm of both sides. The left side will become log[π/(1 − π)]. The log of the odds ratio is called the logit. The right side will become z (since log(e^z) = z), so that you have the following relation, which is called the logit transformation:

log[π/(1 − π)] = z

If you model the logit as a linear function of X (that is, let z = β0 + β1X), then you have the following:

log[π/(1 − π)] = β0 + β1X

You could estimate this model by linear regression and obtain estimates b0 of β0 and b1 of β1 if only you knew the log of the odds ratio for each observation. Since you do not know the log of the odds ratio for each observation, you will use a form of nonlinear regression called logistic regression to estimate the following model:

E[Yi] = πi = G(β0 + β1Xi) = 1/(1 + e^(−β0 − β1Xi))

In so doing, you obtain the desired estimates b0 of β0 and b1 of β1. The estimated probability for an observation Xi will be as follows:

P[Yi = 1] = π̂i = 1/(1 + e^(−b0 − b1Xi))

And the corresponding estimated logit will be the following:

log[π̂/(1 − π̂)] = b0 + b1X

This leads to a natural interpretation of the estimated coefficient in a logistic regression: b1 is the estimated change in the logit (log odds) for a one-unit change in X.
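To make this interpretation concrete, here is a minimal numerical sketch in Python (the values of b0 and b1 are hypothetical, chosen purely for illustration): increasing X by one unit changes the logit by exactly b1.

```python
import numpy as np

def logistic(z):
    """G(z) = 1 / (1 + exp(-z)), the logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -2.0, 0.05    # hypothetical estimates
for x in (40.0, 41.0):
    p = logistic(b0 + b1 * x)      # estimated P[Y = 1]
    logit = np.log(p / (1 - p))    # log odds
    print(f"x = {x}: p = {p:.4f}, logit = {logit:.4f}")
# The two printed logits differ by exactly b1 = 0.05
```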

A Straightforward Example Using JMP

To make these ideas concrete, suppose you open a small data set toylogistic.jmp that contains students’ midterm exam scores (MidtermScore) and whether the student passed the class (PassClass = 1 if pass, PassClass = 0 if fail). A passing grade for the midterm is 70.

Create a Dummy Variable

The first thing to do is create a dummy variable to indicate whether the student passed the midterm: PassMidterm = 1 if MidtermScore ≥ 70; and PassMidterm = 0 otherwise.

1.   Select Cols ▶ New Columns to open the New Column dialog box.

2.   In the Column Name text box, for your new dummy variable, type PassMidterm.

3.   Click the drop-down menu for modeling type and change it to Nominal.

4.   Click the drop-down menu for Column Properties and select Formula. The Formula dialog box appears.

5.   Under Functions, click Conditional ▶ If.

6.   Under Table Columns, click MidtermScore so that it appears in the top box to the right of the If.

7.   Under Functions, click Comparison ▶ “a>=b”. In the formula box to the right of >=, enter 70. Press the Tab key.

8.   Click in the box to the right of the arrow, and enter the number 1. Similarly, enter 0 for the else clause. The Formula dialog box should look like Figure 6.3. Click OK, and then click OK again.

Figure 6.3: Formula Dialog Box

image

Use a Contingency Table to Determine the Odds Ratio

First, you will use a traditional contingency table analysis to determine the odds ratio. Make sure that both PassClass and PassMidterm are classified as nominal variables:

1.   Right-click in the data grid of the column PassClass and select Column Info.

2.   Click the black triangle next to Modeling Type, select Nominal, and click OK. Do the same for PassMidterm.

3.   Select Analyze ▶ Tabulate to open the Tabulate dialog box. It shows the general layout for a table.

4.   Drag PassClass to the Drop zone for columns. Now that data have been added, the words Drop zone for rows will no longer be visible. But the Drop zone for rows will still be in the lower left panel of the table. See Figure 6.4.

5.   Drag PassMidterm to the panel immediately to the left of the 8 in the table. Click Done. A contingency table identical to Figure 6.5 will appear.

Figure 6.4: Control Panel for Tabulate

image

Figure 6.5: Contingency Table from toylogistic.jmp

image

The probability of passing the class when you did not pass the midterm is as follows:

P(PassClass = 1 | PassMidterm = 0) = 2/7

The probability of not passing the class when you did not pass the midterm is as follows (similar to row percentages):

P(PassClass = 0 | PassMidterm = 0) = 5/7

The odds of passing the class when you have failed the midterm are as follows:

P(PassClass = 1 | PassMidterm = 0) / P(PassClass = 0 | PassMidterm = 0) = (2/7)/(5/7) = 2/5

Similarly, you calculate the odds of passing the class when you have passed the midterm as follows:

P(PassClass = 1 | PassMidterm = 1) / P(PassClass = 0 | PassMidterm = 1) = (10/13)/(3/13) = 10/3

In words: for the students who did pass the midterm, the odds of passing the class are the number of students who passed the class divided by the number of students who did not.
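If you want to check the arithmetic in code, the same calculation takes a few lines of Python, using the counts read off Figure 6.5 (2 passed and 5 failed the class among those who failed the midterm; 10 passed and 3 failed among those who passed it):

```python
# Counts from the contingency table in Figure 6.5
pass_class_fail_mid, fail_class_fail_mid = 2, 5
pass_class_pass_mid, fail_class_pass_mid = 10, 3

odds_given_fail_mid = pass_class_fail_mid / fail_class_fail_mid   # 2/5 = 0.4
odds_given_pass_mid = pass_class_pass_mid / fail_class_pass_mid   # 10/3 = 3.33
print(odds_given_fail_mid, odds_given_pass_mid)
```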

Calculate the Odds Ratio

So far, you have considered only odds. Now calculate an odds ratio. It is important to note that this can be done in two equivalent ways.

Method 1: Compute the Probabilities

Suppose you want to know the odds ratio of passing the class by comparing those who pass the midterm (PassMidterm = 1 in the numerator) to those who fail the midterm (PassMidterm = 0 in the denominator). The usual calculation leads to the following:

(odds of passing the class, given passed the midterm) / (odds of passing the class, given failed the midterm) = (10/3)/(2/5) = 50/6 = 8.33

This equation has the following interpretation: the odds of passing the class if you pass the midterm are 8.33 times the odds of passing the class if you fail the midterm. This odds ratio can be converted into a probability. You know that P(Y = 1)/P(Y = 0) = 8.33, and, by definition, P(Y = 1) + P(Y = 0) = 1. So solving two equations in two unknowns yields P(Y = 0) = 1/(1 + 8.33) = 1/9.33 = 0.1072 and P(Y = 1) = 0.8928. As a quick check, observe that 0.8928/0.1072 = 8.33. Note that the log-odds are ln(8.33) = 2.120. Of course, you don’t have to perform all these calculations by hand; JMP will do them automatically. When a logistic regression has been run, simply clicking the red triangle and selecting Odds Ratios will do the trick.
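The odds-ratio-to-probability conversion is equally easy to check (plain Python, continuing with the counts above):

```python
import math

odds_ratio = (10 / 3) / (2 / 5)           # 8.33
p_fail = 1 / (1 + odds_ratio)             # P(Y = 0), about 0.107
p_pass = odds_ratio / (1 + odds_ratio)    # P(Y = 1), about 0.893
print(round(odds_ratio, 2), round(p_fail, 4), round(p_pass, 4))
print(round(math.log(odds_ratio), 3))     # log-odds, about 2.12
```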

Method 2: Run a Logistic Regression

Equivalently, you could compare those who fail the midterm (PassMidterm=0 in the numerator) to those who pass the midterm (PassMidterm = 1 in the denominator) and calculate as follows:

(odds of passing the class, given failed the midterm) / (odds of passing the class, given passed the midterm) = (2/5)/(10/3) = 6/50 = 1/8.33 = 0.12

This calculation tells you that the odds of passing the class for a student who fails the midterm are 0.12 times the odds for a student who passes the midterm. Since failing the midterm (PassMidterm = 0) is now in the numerator of this odds ratio (OR), you must interpret it in terms of the event of failing the midterm. It is easier to interpret an odds ratio less than 1 by using the following transformation: (OR − 1) * 100%. Compared to a person who passes the midterm, a person who fails the midterm is 12% as likely to pass the class. Or, equivalently, a person who fails the midterm is 88% less likely, (OR − 1) * 100% = (0.12 − 1) * 100% = −88%, to pass the class than someone who passed the midterm. Note that the log-odds are ln(0.12) = −2.12.

The relationships between probabilities, odds (ratios), and log-odds (ratios) are straightforward. An event with a small probability has small odds and small log-odds. An event with a large probability has large odds and large log-odds. Probabilities are always between zero and unity. Odds are bounded below by zero but can be arbitrarily large. Log-odds can be positive or negative and are not bounded, as shown in Figure 6.6. In particular, if the odds ratio is 1 (so the probability of either event is 0.50), then the log-odds equal zero. Suppose π = 0.55, so the odds are 0.55/0.45 = 1.222. Then you say that the event in the numerator is (1.222 − 1) * 100% = 22.2% more likely to occur than the event in the denominator.

Different software applications adopt different conventions for handling the expression of odds ratios in logistic regression. By default, JMP uses the “log odds of 0/1” convention, which puts the 0 in the numerator and the 1 in the denominator. This is a consequence of the sort order of the columns, which will be addressed shortly.

Figure 6.6: Ranges of Probabilities, Odds, and Log-odds

image

To see the practical importance of this, rather than compute a table and perform the above calculations, you can simply run a logistic regression. It is important to make sure that PassClass is nominal and that PassMidterm is continuous. If PassMidterm is nominal, JMP will fit a different but mathematically equivalent model that will give different (but mathematically equivalent) results. The reason for this is beyond the scope of this book, but, in JMP, interested readers can consult Help ▶ Books ▶ Modeling and Multivariate Methods and see Appendix A.

If you have been following along with this book, both variables ought to be classified as nominal, so PassMidterm needs to be changed to continuous:

1.   Right-click in the column PassMidterm in the data grid, and select Column Info.

2.   Click the black triangle next to Modeling Type, and select Continuous.

3.   Click OK.

Now that the dependent and independent variables are correctly classified as Nominal and Continuous, respectively, run the logistic regression:

1.   From the top menu, select Analyze ▶ Fit Model.

2.   Select PassClass ▶ Y.

3.   Select PassMidterm ▶ Add. The Fit Model dialog box should now look like Figure 6.7.

4.   Click Run.

Figure 6.7: Fit Model Dialog Box

image

Figure 6.8 displays the logistic regression results.

Figure 6.8: Logistic Regression Results for toylogistic.jmp

image

Examine the Parameter Estimates

Examine the parameter estimates in Figure 6.8. The intercept is 0.91629073, and the slope is −2.1202635. The slope gives the expected change in the logit for a one-unit change in the independent variable (that is, the expected change in the log of the odds ratio). If you simply exponentiate the slope, e^(−2.1202635) = 0.12, then you get the 0/1 odds ratio.

There is no need for you to exponentiate the coefficient manually. JMP will do this for you:

Click the red triangle and click Odds Ratios. The Odds Ratios tables are added to the JMP output as shown in Figure 6.9.

Figure 6.9: Odds Ratios Tables Using the Nominal Independent Variable PassMidterm

image

Unit Odds Ratios refers to the expected change in the odds ratio for a one-unit change in the independent variable. Range Odds Ratios refers to the expected change in the odds ratio when the independent variable changes from its minimum to its maximum. Since the present independent variable is a binary 0/1 variable, these two definitions are the same. You get not only the odds ratio, but also a confidence interval. Notice the right-skewed confidence interval; this is typical of confidence intervals for odds ratios.

Change the Default Convention

To change from the default convention (log odds of 0/1), which puts the 0 in the numerator and the 1 in the denominator, do as follows:

1.   In the data table, select the PassClass column.

2.   Right-click to select Column Info.

3.   In the Column Info dialog box, under Column Properties, select Value Ordering.

4.   In the new dialog box, click on the value 1, and click Move Up as in Figure 6.10.

5.   Click OK.

Figure 6.10: Changing the Value Order

image

Then, when you rerun the logistic regression, although the parameter estimates will not change, the odds ratios will change to reflect the fact that the 1 is now in the numerator and the 0 is in the denominator.

If you make this change, you can revert to the default convention by selecting Value Ordering again, clicking the value 0, and clicking Move Up.

Examine the Changed Results

The independent variable is not limited to being nominal (or ordinal); it can be continuous. In particular, examine the results, using the actual score on the midterm, with MidtermScore as an independent variable:

1.   Select Analyze ▶ Fit Model.

2.   Select PassClass ▶ Y.

3.   Select MidtermScore ▶ Add.

4.   Click Run.

This time the intercept is 25.6018754, and the slope is −0.3637609. So you expect the log-odds to decrease by 0.3637609 for every additional point scored on the midterm, as shown in Figure 6.11.

Figure 6.11: Parameter Estimates

image

To view the effect on the odds ratio itself, click the red triangle as before and click Odds Ratios. Figure 6.12 displays the Odds Ratios tables.

Figure 6.12: Odds Ratios Tables Using the Continuous Independent Variable MidtermScore

image

For a one-unit increase in the midterm score, the new odds will be 69.51% of the old odds. Or, equivalently, you expect to see a 30.5% reduction in the odds: (0.695057 − 1) * 100% = −30.5%. For example, suppose a hypothetical student has a midterm score of 75%. The student’s log odds of failing the class would be as follows:

25.6018754 − 0.3637609 * 75 = −1.680192

So the student’s odds of failing the class would be exp(−1.680192) = 0.1863382. That is, the student is much more likely to pass than fail. Converting odds to probabilities (0.1863382/(1 + 0.1863382) = 0.157070), you see that the student’s probability of failing the class is 0.15707, and the probability of passing the class is 0.84293. Now, if the student’s score increased by one point to 76, then the log odds of failing the class would be 25.6018754 − 0.3637609 * 76 = −2.043953. Thus, the student’s odds of failing the class become exp(−2.043953) = 0.1295157. So, the probability of passing the class would rise to 0.885334, and the probability of failing the class would fall to 0.114666. With respect to the Unit Odds Ratio, which equals 0.695057, you see that a one-unit increase in the test score changes the odds of failing from 0.1863382 to 0.1295157. In accordance with the estimated coefficient for the logistic regression, the new odds are 69.5% of the old odds because 0.1295157/0.1863382 = 0.695057.
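Here is the whole worked example as a short Python sketch, using the intercept and slope reported in Figure 6.11, so you can reproduce the odds and probabilities for midterm scores of 75 and 76:

```python
import math

b0, b1 = 25.6018754, -0.3637609   # estimates from Figure 6.11

def odds_and_prob_of_failing(score):
    logit = b0 + b1 * score       # log odds of failing
    odds = math.exp(logit)        # odds of failing
    return odds, odds / (1 + odds)

for score in (75, 76):
    odds, p_fail = odds_and_prob_of_failing(score)
    print(score, round(odds, 7), round(p_fail, 6), round(1 - p_fail, 6))

# The ratio of the two odds equals the Unit Odds Ratio exp(b1) = 0.695057
print(odds_and_prob_of_failing(76)[0] / odds_and_prob_of_failing(75)[0])
```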

Compute Probabilities for Each Observation

Finally, you can use the logistic regression to compute probabilities for each observation. As noted, the logistic regression will produce an estimated logit for each observation. These estimated logits can be used, in the obvious way, to compute probabilities for each observation. Consider a student whose midterm score is 70. The student’s estimated logit is 25.6018754 − 0.3637609(70) = 0.1386124. Since exp(0.1386124) = 1.148679 = p/(1 − p), you can solve for p (the probability of failing) = 0.534597.

You can obtain the estimated logits and probabilities by clicking the red triangle on Nominal Logistic Fit and selecting Save Probability Formula. Four columns will be added to the worksheet: Lin[0], Prob[0], Prob[1], and Most Likely PassClass. For each observation, these four columns give the estimated logit, the probability of failing the class, the probability of passing the class, and the predicted class, respectively. Observe that the sixth student has a midterm score of 70. Look up this student’s estimated probability of failing (Prob[0]); it is very close to what you just calculated above. See Figure 6.13. The difference is that the computer carries 16 digits through its calculations, but you carried only six.

Figure 6.13: Verifying Calculation of Probability of Failing

image

The fourth column (Most Likely PassClass) classifies the observation as either 1 or 0, depending on whether the probability is greater than or less than 50%. You can observe how well your model classifies all the observations (using this cut-off point of 50%) by producing a confusion matrix. Click the red triangle and click Confusion matrix.

Figure 6.14 displays the confusion matrix for the example. The rows of the confusion matrix are the actual classification (that is, whether PassClass is 0 or 1). The columns are the predicted classification from the model (that is, the predicted 0/1 values from the fourth column, using your logistic model and a cutpoint of 0.50). Correct classifications are along the main diagonal from upper left to lower right. You see that the model classified 6 students as not passing the class who actually did not pass the class, and 10 students as passing the class who actually did. The values on the other diagonal, both equal to 2, are misclassifications. The results of the confusion matrix will be examined in more detail in the discussion of model comparison in Chapter 14.

Figure 6.14: Confusion Matrix

image
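If you were to export the scored data table to a CSV file, the confusion matrix could be reproduced with a single cross-tabulation. In the pandas sketch below, the file name is hypothetical; PassClass and Prob[1] are the actual class and the saved probability of passing:

```python
import pandas as pd

# Hypothetical export of the JMP data table with the saved probability columns
df = pd.read_csv("toylogistic_scored.csv")

# Classify with a 0.50 cutoff, then cross-tabulate actual vs. predicted
df["predicted"] = (df["Prob[1]"] >= 0.5).astype(int)
print(pd.crosstab(df["PassClass"], df["predicted"],
                  rownames=["actual"], colnames=["predicted"]))
```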

Check the Model

Of course, before you can use the model, you have to check the model’s assumptions. The first step is to verify the linearity of the logit. This can be done by plotting the estimated logit against MidtermScore:

1.   Select Graph ▶ Scatterplot Matrix.

2.   Select Lin[0] ▶ Y, columns.

3.   Select MidtermScore ▶ X.

4.   Click OK. As shown in Figure 6.15, the linearity assumption appears to be perfectly satisfied.

Figure 6.15: Scatterplot of Lin[0] and MidtermScore

image

Returning to the logistic output (Figure 6.8), the analog to the ANOVA F test for linear regression is found under the Whole Model Test (Figure 6.16), in which the Full and Reduced models are compared. The null hypothesis for this test is that all the slope parameters are equal to zero. Since Prob>ChiSq is 0.0004, this null hypothesis is soundly rejected. For a discussion of other statistics found here, such as BIC and Entropy R2, see the JMP Help.

Figure 6.16: Whole Model Test for the Toylogistic Data Set

image

The next important part of model checking is the Lack of Fit test. Figure 6.17 compares the model actually fitted to the saturated model. The saturated model is a model generated by JMP that contains as many parameters as there are observations. So it fits the data very well. The null hypothesis for this test is that there is no difference between the estimated model and the saturated model. If this hypothesis is rejected, then more variables (such as cross-product or squared terms) need to be added to the model. In the present case, as can be seen, Prob>ChiSq = 0.7032. You can therefore conclude that you do not need to add more terms to the model.

Figure 6.17: Lack-of-Fit-Test for Current Model

image

A Realistic Logistic Regression Statistical Study

Turn now to a more realistic data set with several independent variables. This discussion will also briefly present some of the issues that should be addressed and some of the thought processes during a statistical study.

Cellphone companies are very interested in determining which customers might switch to another company. This switching is called churning. Predicting which customers might be about to churn enables the company to make special offers to these customers, possibly stemming their defection. Churn.jmp contains data on 3,333 cellphone customers, including the variable Churn (0 means the customer stayed with the company and 1 means the customer left the company).

Understand the Model-Building Approach

Before you can begin constructing a model for customer churn, you need to understand model building for logistic regression. Statistics and econometrics texts devote entire chapters to this concept. In a few pages, only a sketch of the broad outline can be given.

The first thing to do is make sure that the data are loaded correctly. Observe that Churn is classified as Continuous; be sure to change it to Nominal. One way is to right-click in the Churn column in the data table, select Column Info, and under Modeling Type, click Nominal. Another way is to look at the list of variables on the left side of the data table, find Churn, click the blue triangle (which denotes a continuous variable), and change it to Nominal (the blue triangle then becomes a red histogram).

Make sure that all binary variables are classified as Nominal. This includes Intl_Plan, VMail_Plan, E_VMAIL_PLAN, and D_VMAIL_PLAN. Should Area_Code be classified as Continuous or Nominal? (Nominal is the correct answer!) CustServ_Call, the number of calls to customer service, could be treated as either continuous or nominal/ordinal; you treat it as continuous.

Suppose that you were building a linear regression model and that the number of variables was not so large that this could not be done manually. One place to begin is by examining histograms and scatterplots of the continuous variables, and crosstabs of the categorical variables, as discussed in Chapter 4. Another very useful device, also discussed in Chapter 4, is the scatterplot/correlation matrix, which can, at a glance, suggest potentially useful independent variables that are correlated with the dependent variable. The scatterplot/correlation matrix approach cannot be used directly with logistic regression, which is nonlinear, but a similar method can be applied.

You are now faced with a situation similar to that discussed in Chapter 4. Your goal is to build a model that follows the principle of parsimony—that is, a model that explains as much as possible of the variation in Y and uses as few significant independent variables as possible. However, now with multiple logistic regression, you are in a nonlinear situation. You have four approaches that you could take. These approaches and some of their advantages and disadvantages are as follows:

●   Inclusion of all the variables. In this approach, you just enter all the independent variables into the model. An obvious advantage of this approach is that it is fast and easy. However, depending on the data set, most likely several independent variables will be insignificantly related to the dependent variable. Including variables that are not significant can cause severe problems: it weakens the interpretation of the coefficients and lessens the prediction accuracy of the model. This approach definitely does not follow the principle of parsimony, and it can cause numerical problems for the nonlinear solver that can lead to a failure to obtain an answer.

●   Bivariate method. In this approach, you search for independent variables that might have predictive value for the dependent variable by running a series of bivariate logistic regressions. That is, you run a logistic regression for each of the independent variables, searching for “significant” relationships. A major advantage of this approach is that it is the one most agreed upon by statisticians (Hosmer and Lemeshow, 2001). On the other hand, this approach is not automated, is very tedious, and is limited by the analyst’s ability to run the regressions. That is, it is not practical with very large data sets. Further, it misses interaction terms, which, as you shall see, can be very important.

●   Stepwise approach. In this approach, you would use the Fit Model platform, change the Personality to Stepwise and Direction to Mixed. The Mixed option is like Forward Stepwise, but variables can be dropped after they have been added. An advantage of this approach is that it is automated; so, it is fast and easy. The disadvantage of the stepwise approach is that it can lead to possible interpretation errors and prediction errors, depending on the data set. However, using the Mixed option, as opposed to the Forward or Backward Direction option, tends to lessen the magnitude and likelihood of these problems.

   Decision trees. A decision tree is a data mining technique that can be used for variable selection and will be discussed in Chapter 10. The advantage of using the decision tree technique is that it is automated, fast, and easy to run. Further, it is a popular variable reduction approach taken by many data mining analysts (Pollack, 2008). However, somewhat like the stepwise approach, the decision tree approach can lead to some statistical issues. In this case, significant variables identified by a decision tree are very sample-dependent. These issues will be discussed further in Chapter 10.

No one approach is a clear winner. Nevertheless, it is recommended that you use the bivariate approach when the data set and your time permit it. If the data set is too large and/or you do not have the time, you should run both the stepwise and decision tree models and compare the results. The data set churn.jmp is not too large, so you will apply the bivariate approach.

It is traditional to choose α = 0.05. But in this preliminary stage, you adopt a more lax standard, α = 0.25. The reason for this is that you want to include, if possible, a group of variables that are not individually significant but together are significant. Having identified an appropriate set of candidate variables, run a logistic regression that includes all of them. Compare the coefficient estimates from the multiple logistic regression with the estimates from the bivariate logistic regressions. Look for coefficients that have changed in sign or have dramatically changed in magnitude, as well as changes in significance. Such changes indicate the inadequacy of the simple bivariate models, and confirm the necessity of adding more variables to the model.
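This screening step can also be scripted. The sketch below (Python with pandas, statsmodels, and scipy; it assumes churn.jmp has been exported to a hypothetical churn.csv with the column names used in this chapter) runs one bivariate test per candidate and keeps the variables whose p-values fall below 0.25:

```python
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2_contingency

df = pd.read_csv("churn.csv")   # hypothetical export of churn.jmp

continuous = ["Day_Mins", "Eve_Mins", "Night_Mins", "Intl_Mins",
              "Intl_Calls", "VMail_Message", "CustServ_Call"]
categorical = ["State", "Intl_Plan", "D_VMAIL_PLAN", "Area_Code"]

keep = []
for col in continuous:                      # bivariate logistic regressions
    X = sm.add_constant(df[col])
    p = sm.Logit(df["Churn"], X).fit(disp=0).pvalues[col]
    if p < 0.25:
        keep.append(col)
for col in categorical:                     # contingency-table tests
    _, p, _, _ = chi2_contingency(pd.crosstab(df[col], df["Churn"]))
    if p < 0.25:
        keep.append(col)
print(keep)
```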

Three important ways to improve a model are as follows:

   If the logit appears to be nonlinear when plotted against some continuous variable, one resolution is to convert the continuous variable to a few dummy variables (for example, three) that cut the variable at its 25th, 50th, and 75th percentiles (a scripted sketch of this idea appears after this list).

   If a histogram shows that a continuous variable has an excess of observations at zero (which can lead to nonlinearity in the logit), add a dummy variable that equals one if the continuous variable is zero and equals zero otherwise.

   Finally, a seemingly numeric variable that is actually discrete can be broken up into a handful of dummy variables (for example, ZIP codes).
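As a sketch of the first point, here is one way to cut a continuous variable at its quartiles in pandas (the hypothetical churn.csv again; Day_Mins is used only as an example):

```python
import pandas as pd

df = pd.read_csv("churn.csv")   # hypothetical export of churn.jmp

# Cut Day_Mins at its 25th, 50th, and 75th percentiles...
df["Day_Mins_Q"] = pd.qcut(df["Day_Mins"], q=4,
                           labels=["Q1", "Q2", "Q3", "Q4"])
# ...and represent the quartiles as dummy variables
dummies = pd.get_dummies(df["Day_Mins_Q"], prefix="Day_Mins", drop_first=True)
print(dummies.head())
```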

Before you can begin modeling, you must first explore the data. With your churn data set, creating and examining the histograms of the continuous variables reveals nothing much of interest, except VMail_Message, which has an excess of zeros. (See the second point in the previous list.) Figure 6.18 shows plots for Intl_Calls and VMail_Message. To produce such plots, follow these steps:

1.   Select Analyze ▶ Distribution.

2.   Click Intl_Calls, and then Y, Columns.

3.   Select VMail_Message ▶ Y, Columns, and then click OK.

4.   To add the Normal Quantile Plot, click the red arrow next to Intl_Calls and select Normal Quantile Plot.

Here it is obvious that Intl_Calls is skewed right. Note that a logarithmic transformation of this variable might be needed in order to reduce the skewness, but you need not pursue the idea here.

Figure 6.18: Distribution of Intl_Calls and VMail_Message

image

A correlation matrix of the continuous variables (select Graph ▶ Scatterplot Matrix and put the desired variables in Y, Columns) turns up a curious pattern. Day_Charge and Day_Mins, Eve_Charge and Eve_Mins, Night_Charge and Night_Mins, and Intl_Charge and Intl_Mins all are perfectly correlated. The charge is obviously a linear function of the number of minutes. Therefore, you can drop the Charge variables from your analysis. (You could also drop the “Mins” variables instead; it doesn’t matter which one you drop.) If your data set had a very large number of variables, the scatterplot matrix would be too big to comprehend. In such a situation, you would choose groups of variables for which to make scatterplot matrices, and examine those.
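Perfect correlations like these are easy to flag programmatically. A pandas sketch (same hypothetical churn.csv) that reports any Mins/Charge pair correlated above 0.999:

```python
import pandas as pd

df = pd.read_csv("churn.csv")   # hypothetical export of churn.jmp

cols = ["Day_Mins", "Day_Charge", "Eve_Mins", "Eve_Charge",
        "Night_Mins", "Night_Charge", "Intl_Mins", "Intl_Charge"]
corr = df[cols].corr()

# Report pairs that are (nearly) perfectly correlated -- drop one of each pair
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if abs(corr.loc[a, b]) > 0.999:
            print(f"{a} and {b} are perfectly correlated: drop one of them")
```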

A scatterplot matrix for the four binary variables turns up an interesting association. E_VMAIL_PLAN and D_VMAIL_PLAN are perfectly correlated. Both have common 1s, and where the former has -1, the latter has zero. It would be a mistake to include both of these variables in the same regression (try it and see what happens). Delete E_VMAIL_PLAN from the data set, and also delete VMail_Plan because it agrees perfectly with E_VMAIL_PLAN: When the former has a “no,” the latter has a “−1,” and similarly for “yes” and “+1.”

Phone is more or less unique to each observation. (You ignore the possibility that two phone numbers are the same but have different area codes.) Therefore, it should not be included in the analysis. So, you will drop Phone from the analysis.

A scatterplot matrix between the remaining continuous and binary variables produces a curious pattern. D_VMAIL_PLAN and VMail_Message have a correlation of 0.96. They have zeros in common, and where the former has 1s, the latter has numbers. (See again the second point in the above list. You won’t have to create a dummy variable to solve the problem because D_VMAIL_PLAN will do the job nicely.)

To summarize, you have dropped 7 of the original 23 variables from the data set (Phone, Day_Charge, Eve_Charge, Night_Charge, Intl_Charge, E_VMAIL_PLAN, and VMail_Plan). So there are now 16 variables left, one of which is the dependent variable, Churn. You have 15 possible independent variables to consider.

Run Bivariate Analyses

Next comes the time-consuming task of running several bivariate (two variables, one dependent and one independent) analyses, some of which will be logistic regressions (when the independent variable is continuous) and some of which will be contingency tables (when the independent variable is categorical). In total, you have 15 bivariate analyses to run. What about Area_Code? JMP reads it as a continuous variable, but it’s really nominal. So be sure to change it from continuous to nominal. Similarly, make sure that D_VMAIL_PLAN is set as a nominal variable, not continuous.

Do not try to keep track of the results in your head, or by referring to the 15 bivariate analyses that would fill your computer screen. Make a list of all 15 variables that need to be tested, and write down the test result (for example, the relevant p-value) and your conclusion (for example, “include” or “exclude”). This not only prevents simple errors; it is a useful record of your work should you have to come back to it later. There are few things more pointless than conducting an analysis that concludes with a 13-variable logistic regression, only to have some reason to rerun the analysis and now wind up with a 12-variable logistic regression. Unless you have documented your work, you will have no idea why the discrepancy exists or which is the correct regression.

Below you will see how to conduct both types of bivariate analyses, one for a nominal independent variable and one for a continuous independent variable. The other 13 are left to the reader.

Make a contingency table of Churn versus State:

1.   Select Analyze ▶ Fit Y by X.

2.   Click Churn (which is nominal) and then click Y, Response.

3.   Click State and then click X, Factor.

4.   Click OK.

At the bottom of the table of results are the Likelihood Ratio and Pearson tests, both of which test the null hypothesis that State does not affect Churn, and both of which reject the null. The conclusion is that the variable State matters. On the other hand, perform a logistic regression of Churn on VMail_Message:

1.   Select Analyze ▶ Fit Y by X.

2.   Click Churn.

3.   Click Y, Response.

4.   Click VMail_Message and click X, Factor.

5.   Click OK.

Under “Whole Model Test,” observe that Prob>ChiSq, the p-value, is less than 0.0001, so you conclude that VMail_Message affects Churn. Remember that for all these tests, you are setting α (probability of Type I error) = 0.25.

In the end, you have 10 candidate variables for possible inclusion in your multiple logistic regression model:

   State

   Intl_Plan

   D_VMAIL_PLAN

   VMail_Message

   Day_Mins

   Eve_Mins

   Night_Mins

   Intl_Mins

   Intl_Calls

   CustServ_Call

Remember that the first three of these variables should be set to nominal, and the rest to continuous. (Of course, leave the dependent variable Churn as nominal!)

Run the Initial Regression and Examine the Results

Run your initial multiple logistic regression with Churn as the dependent variable and the above 10 variables as independent variables:

1.   Select Analyze ▶ Fit Model. Select Churn ▶ Y.

2.   Select the above 10 variables. (To select variables that are not consecutive, click on each variable while holding down the Ctrl key.) Click Add.

3.   Check the box next to Keep dialog open.

4.   Click Run.

The effect summary report, as shown in Figure 6.19, is similar to the effect summary report discussed in Chapter 5 in Figure 5.6, for multiple regression. Likewise here, the report identifies the significant effects of the independent variables in the model.

The Whole Model Test lets you know that the included variables, taken together, have a significant effect on Churn, with a p-value less than .0001, as shown in Figure 6.19.

Figure 6.19: Whole Model Test and Lack of Fit for the Churn Data Set

image

The lack-of-fit test tells you that you have done a good job explaining Churn. From the Lack of Fit, you see that −LogLikelihood for the Full model is 1037.4471. Now, linear regression minimizes the sum of squared residuals. So when you compare two linear regressions, the preferred one has the smaller sum of squared residuals. In the same way, the nonlinear optimization of the logistic regression minimizes the −LogLikelihood (which is equivalent to maximizing the LogLikelihood). So the model with the smaller −LogLikelihood is preferred to a model with a larger −LogLikelihood.

Examining the p-values of the independent variables in the Parameter Estimates, you find that a variable for which Prob>ChiSq is less than 0.05 is said to be significant. Otherwise, it is said to be insignificant, similar to what is practiced in linear regression. The regression output gives two sets of tests, one for the “Parameter Estimates” and another for “Effect Likelihood Ratio Tests.” The following focuses on the latter. To see why, consider the State variable, which is really not one variable but many dummy variables. You are not so much interested in whether any particular state is significant or not (which is what the Parameter Estimates tell you) but whether, overall, the collection of state dummy variables is significant. This is what the Effect Likelihood Ratio Tests tells you: The effect of all the state dummy variables is significant with a Prob>ChiSq of 0.0010. True, many of the State dummies are insignificant, but overall State is significant. You will keep this variable as it is. It might prove worthwhile to reduce the number of state dummy variables into a handful of significant states and small clusters of “other” states that are not significant, but you will not pursue this line of inquiry here.

You can see that all the variables in the model are significant. You might be able to derive some new variables that help improve the model. Below are two examples of deriving new variables: (1) converting a continuous variable into discrete variables and (2) producing interaction variables.

Convert a Continuous Variable to Discrete Variables

Try to break up a continuous variable into a handful of discrete variables. An obvious candidate is CustServ_Call. Look at its distribution in Figure 6.20:

1.   Select Analyze ▶ Distribution.

2.   Select CustServ_Call ▶ Y, Columns, and click OK.

3.   Click the red triangle next to CustServ_Call, and uncheck Outlier Box Plot.

4.   Then select Histogram Options ▶ Show Counts.

Figure 6.20: Histogram of CustServ_Call

image

Now create a new nominal variable called CustServ, so that all the counts for 5 and greater are collapsed into a single cell:

1.   Select Cols ▶ New Columns.

2.   For column name, enter CustServ. Change Modeling Type to Nominal. Then click the drop-down menu for Column Properties, and click Formula.

3.   In the Formula dialog box, select Conditional ▶ If. Then, in the top left, click CustServ_Call and enter <=4.

4.   In the top then clause, click CustServ_Call. For the else clause, enter 5. See Figure 6.21.

5.   Click OK, and click OK.

Figure 6.21: Creating the CustServ Variable

image

Now drop the CustServ_Call variable from the Logistic Regression and add the new CustServ nominal variable, which is equivalent to adding some dummy variables. Your new value of −LogLikelihood is 970.6171, which constitutes a very substantial improvement in the model.
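The equivalent recode takes one line in pandas (same hypothetical churn.csv): clip collapses every count of 5 or more into the single value 5, which is then treated as categorical:

```python
import pandas as pd

df = pd.read_csv("churn.csv")   # hypothetical export of churn.jmp

# Collapse 5, 6, 7, ... into a single "5 or more" category
df["CustServ"] = df["CustServ_Call"].clip(upper=5).astype("category")
print(df["CustServ"].value_counts().sort_index())
```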

Produce Interaction Variables

Another possible important way to improve a model is to introduce interaction terms, that is, the product of two or more variables. Best practice would be to consult with subject-matter experts and seek their advice. Some thought is necessary to determine meaningful interactions, but it can pay off in substantially improved models.

Thinking about what might make a cellphone customer want to switch to another carrier, you have all heard a friend complain about being charged an outrageous amount for making an international call. Based on this observation, you might conjecture that customers who make international calls and who are not on the international calling plan might be more irritated and more likely to churn. A quick bivariate analysis shows that there are more than a few such persons in the data set:

1.   Select Analyze ▶ Tabulate, and drag Intl_Plan to Drop zone for columns.

2.   Drag Intl_Calls to Drop zone for rows.

3.   Right-click Intl_Calls in the table, and choose Use as Grouping column.

Observe that almost all customers make international calls, but most of them are not on the international plan (which gives cheaper rates for international calls). For example, for the customers who made no international call, all 18 of them were not on the international calling plan. For the customers who made 8 international calls, 106 were not on the international calling plan, and only 10 of them were. It seems there is quite the potential for irritated customers here. However, this is not confirmed when you examine the output from the previous logistic regression. The parameter estimate for Intl_Plan[no] is positive and significant. This means that, when a customer does not have an international plan, the probability that he or she does not churn increases.

Customers who make international calls and don’t get the cheap rates are perhaps more likely to churn than customers who make international calls and get cheap rates. Hence, the interaction term Int_Plan * Intl_Mins might be important. To create this interaction term, you have to create a new dummy variable for Intl_Plan, because the present variable is not numeric and cannot be multiplied by Intl_Mins:

   1.   Click on the Intl_Plan column in the data table to select it.

   2.   Select Cols ▶ Recode.

   3.   Under New Value, where it has No, enter 0, and right below that where it has Yes, enter 1.

   4.   On the Done menu, select New Column and click OK. The new variable Intl_Plan2 is created. However, it is still nominal.

   5.   Right-click on this column. Under Column Info, change the Data Type to Numeric and the Modeling Type to Continuous. Click OK. (This variable has to be continuous so that you can use it in the interaction term, which is created by multiplication. Nominal variables cannot be multiplied.)

To create the interaction term, complete the following steps:

1.   Select Cols ▶ New Columns and call the new variable IntlPlanMins.

2.   Under Column Properties, click Formula.

3.   Click Intl_Plan2, click the times sign (x) in the middle of the dialog box, click Intl_Mins, and click OK.

4.   Click OK again.

Now add the variable IntlPlanMins as the 11th independent variable in the multiple logistic regression that includes CustServ, and run it. The variable IntlPlanMins is significant, and the −LogLikelihood has dropped to 947.1450, as shown in Figure 6.22. This is a substantial drop for adding one variable. Undoubtedly, other useful interaction terms could be added to this model.

Figure 6.22: Logistic Regression Results with Interaction Term Added

image
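For comparison, here is how the dummy recode and the interaction term could be built in a script (pandas and statsmodels, same hypothetical churn.csv; to keep the sketch short, the model below includes only the plan dummy, the minutes, and their interaction rather than all 11 variables):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("churn.csv")   # hypothetical export of churn.jmp

# Recode Intl_Plan ("no"/"yes") to a numeric 0/1 dummy, then interact
df["Intl_Plan2"] = (df["Intl_Plan"] == "yes").astype(int)
df["IntlPlanMins"] = df["Intl_Plan2"] * df["Intl_Mins"]

X = sm.add_constant(df[["Intl_Plan2", "Intl_Mins", "IntlPlanMins"]])
fit = sm.Logit(df["Churn"], X).fit(disp=0)
print(fit.params)    # coefficient on IntlPlanMins
print(-fit.llf)      # -LogLikelihood: smaller is better
```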

Validate and Use the Model

Now that you have built an acceptable model, it is time to validate the model. You have already checked the Lack of Fit, but now you have to check linearity of the logit. From the red arrow, click Save Probability Formula, which adds four variables to the data set: Lin[0] (which is the logit), Prob[0], Prob[1], and the predicted value of Churn, Most Likely Churn.

Now you have to plot the logit against each of the continuous independent variables. The categorical independent variables do not offer much opportunity to reveal nonlinearity (plot some, and see this for yourself). All the relationships of the continuous variables can be quickly viewed by generating a scatterplot matrix and then clicking the red triangle and Fit Line. Nearly all the red fitted lines are horizontal or near horizontal. For all of the logit versus independent variable plots, there is no evidence of nonlinearity.

You can also see how well your model is predicting by examining the confusion matrix, which is shown in Figure 6.23.

Figure 6.23: Confusion Matrix

image

The actual number of churners in the data set is 326 + 157 = 483. The model predicted a total of 258 (= 101 + 157) churners. The number of bad predictions made by the model is 326 + 101 = 427: 326 customers who were predicted not to churn actually did churn, and 101 who were predicted to churn did not churn. Further, observe in the Prob[1] column of the data table that you have the probability that any customer will churn. Right-click this column, and select Sort. This will sort all the observations in the data set according to the probability of churning. Scroll to the top of the data set.

Look at the Churn column. It has mostly ones and some zeros here at the top, where the probabilities are all above 0.85. Scroll all the way to the bottom and see that the probabilities now are all below 0.01, and the values of Churn are all zero. You really have modeled the probability of churning.

Now that you have built a model for predicting churn, how might you use it? You could take the next month’s data (when you do not yet know who has churned) and predict who is likely to churn. Then these customers can be offered special deals to keep them with the company, so that they do not churn.

Exercises

1.   Consider the logistic regression for the toy data set, where π is the probability of failing the class:

log[π̂/(1 − π̂)] = 25.60188 − 0.363761 * MidtermScore

Consider two students, one who scores 67% on the midterm and one who scores 73% on the midterm. What are the odds that each fails the class? What is the probability that each fails the class?

Consider the first logistic regression for the Churn data set, the one with 10 independent variables. Consider two customers, one with an international plan and one without. What are the odds that each churns? What is the probability that each churns?

2.   You have already found that the interaction term IntlPlanMins significantly improves the model. Find another interaction term that does so.

3.   Without deriving new variables such as CustServ or creating interaction terms such as IntlPlanMins, use a stepwise method to select variables for the Churn data set. Compare your results to the bivariate method used in the chapter; pay particular attention to the fit of the model and the confusion matrix.

4.   Use the Freshmen1.jmp data set and build a logistic regression model to predict whether a student returns. Perhaps the continuous variables Miles from Home and Part Time Work Hours do not seem to have an effect. See whether turning them into discrete variables makes a difference. (In essence, turn Miles from Home into some dummy variables, such as 0‒20 miles, 21‒100 miles, or more than 100 miles.)
