Chapter 9
Multiple Regression and Model Building

9.1 An Example of Multiple Regression

Chapter 8 examined regression modeling for the simple linear regression case of a single predictor and a single response. Clearly, however, data miners and predictive analysts are usually interested in the relationship between the target variable and a set of (two or more) predictor variables. Most data mining applications enjoy a wealth of data, with some data sets including hundreds or thousands of variables, many of which may have a linear relationship with the target (response) variable. Multiple regression modeling provides an elegant method of describing such relationships. Compared to simple linear regression, multiple regression models provide improved precision for estimation and prediction, analogous to the improved precision of regression estimates over univariate estimates. A multiple regression model uses a linear surface, such as a plane or hyperplane, to approximate the relationship between a continuous response (target) variable, and a set of predictor variables. While the predictor variables are typically continuous, categorical predictor variables may be included as well, through the use of indicator (dummy) variables.

In simple linear regression, we used a straight line (of dimension 1) to approximate the relationship between the response and one predictor. Now, suppose we would like to approximate the relationship between a response and two continuous predictors. In this case, we would need a plane to approximate such a relationship, because a plane is linear in two dimensions.

For example, returning to the cereals data set, suppose we are interested in trying to estimate the value of the target variable, nutritional rating, but this time using two variables, sugars and fiber, rather than sugars alone as in Chapter 8.1 The three-dimensional scatter plot of the data is shown in Figure 9.1. High fiber levels seem to be associated with high nutritional rating, while high sugar levels seem to be associated with low nutritional rating.

c09f001

Figure 9.1 A plane approximates the linear relationship between one response and two continuous predictors.

These relationships are approximated by the plane that is shown in Figure 9.1, in a manner analogous to the straight-line approximation for simple linear regression. The plane tilts downward to the right (for high sugar levels) and toward the front (for low fiber levels).

We may also examine the relationship between rating and its predictors, sugars, and fiber, one at a time, as shown in Figure 9.2. This more clearly illustrates the negative relationship between rating and sugars and the positive relationship between rating and fiber. The multiple regression should reflect these relationships.

c09f002

Figure 9.2 Individual variable scatter plots of rating versus sugars and fiber.

Let us examine the results (Table 9.1) of a multiple regression of nutritional rating on both predictor variables. The regression equation for multiple regression with two predictor variables takes the form:

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$

For a multiple regression with m variables, the regression equation takes the form:

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_m x_m$

From Table 9.1, we have

  • b0 = 52.174
  • b1 = −2.2436 (sugars)
  • b2 = 2.8665 (fiber)
  • s ≈ 6.13
  • R² = 81.6%

Thus, the regression equation for this example is

$\hat{y} = 52.174 - 2.2436(\text{sugars}) + 2.8665(\text{fiber})$

That is, the estimated nutritional rating equals 52.174 minus 2.2436 times the grams of sugar plus 2.8665 times the grams of fiber. Note that the coefficient for sugars is negative, indicating a negative relationship between sugars and rating, while the coefficient for fiber is positive, indicating a positive relationship. These results concur with the characteristics of the graphs in Figures 9.1 and 9.2. The straight lines shown in Figure 9.2 represent the value of the slope coefficients for each variable, −2.2436 for sugars and 2.8665 for fiber.

Table 9.1 Results from regression of nutritional rating on sugars and fiber

c09t001

The interpretations of the slope coefficients b1 and b2 are slightly different than for the simple linear regression case. For example, to interpret b1, we say that "the estimated decrease in nutritional rating for a unit increase in sugar content is 2.2436 points, when fiber content is held constant." Similarly, we interpret b2 as follows: "the estimated increase in nutritional rating for a unit increase in fiber content is 2.8665 points, when sugar content is held constant." In general, for a multiple regression with m predictor variables, we would interpret coefficient bi as follows: "the estimated change in the response variable for a unit increase in variable xi is bi, when all other predictor variables are held constant."

Recall that errors in prediction are measured by the residual, y − ŷ. In simple linear regression, this residual represented the vertical distance between the actual data point and the regression line. In multiple regression, the residual is represented by the vertical distance between the data point and the regression plane or hyperplane.

For example, Spoon Size Shredded Wheat has x1 = 0 grams of sugar, x2 = 3 grams of fiber, and a nutritional rating of 72.8018. The estimated regression equation would predict, however, that the nutritional rating for this cereal would be

$\hat{y} = 52.174 - 2.2436(0) + 2.8665(3) = 60.7735$

Therefore, we have a residual for Spoon Size Shredded Wheat of y − ŷ = 72.8018 − 60.7735 = 12.0283, illustrated in Figure 9.3. As the residual is positive, the data value lies above the regression plane.

c09f003

Figure 9.3 Estimation error is the vertical distance between the actual data point and the regression plane or hyperplane.
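
To reproduce the fitted model and the residual calculation above in software, the following is a minimal sketch using Python and statsmodels. The file name cereals.csv and the column names rating, sugars, and fiber are assumptions about how the data are stored, so adjust them to match your copy of the data set.

```python
# Minimal sketch: multiple regression of rating on sugars and fiber.
import pandas as pd
import statsmodels.api as sm

cereals = pd.read_csv("cereals.csv").dropna(subset=["rating", "sugars", "fiber"])

X = sm.add_constant(cereals[["sugars", "fiber"]])   # adds the intercept column
y = cereals["rating"]
multiple = sm.OLS(y, X).fit()
print(multiple.params)                              # b0, b1 (sugars), b2 (fiber)

# Fitted value and residual for a cereal with 0 g of sugar and 3 g of fiber
# (the Spoon Size Shredded Wheat example discussed above).
x_new = pd.DataFrame({"const": [1.0], "sugars": [0.0], "fiber": [3.0]})
y_hat = multiple.predict(x_new).iloc[0]
residual = 72.8018 - y_hat                          # actual rating minus fitted rating
print(y_hat, residual)
```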

Each observation has its own residual, which, taken together, leads to the calculation of the sum of squares error (SSE) as an overall measure of the estimation errors. Just as for the simple linear regression case, we may again calculate the three sums of squares, as follows:

  • $\text{SSE} = \sum (y - \hat{y})^2$
  • $\text{SSR} = \sum (\hat{y} - \bar{y})^2$
  • $\text{SST} = \sum (y - \bar{y})^2 = \text{SSR} + \text{SSE}$

We may again present the regression statistics succinctly in a convenient analysis of variance (ANOVA) table, shown here in Table 9.2, where m represents the number of predictor variables. Finally, for multiple regression, we have the so-called multiple coefficient of determination,2 which is simply

$R^2 = \dfrac{\text{SSR}}{\text{SST}}$

For multiple regression, R² is interpreted as the proportion of the variability in the target variable that is accounted for by its linear relationship with the set of predictor variables.

Table 9.2 The ANOVA table for multiple regression

Source of Variation    Sum of Squares      Degrees of Freedom    Mean Square                  F
Regression             SSR                 m                     MSR = SSR / m                F = MSR / MSE
Error (or residual)    SSE                 n − m − 1             MSE = SSE / (n − m − 1)
Total                  SST = SSR + SSE     n − 1

From Table 9.1, we can see that the value of R² is 81.6%, which means that 81.6% of the variability in nutritional rating is accounted for by the linear relationship (the plane) between rating and the set of predictors, sugar content and fiber content. Now, would we expect R² to be greater than the value for the coefficient of determination we got from the simple linear regression of nutritional rating on sugars alone? The answer is yes. Whenever a new predictor variable is added to the model, the value of R² always goes up. If the new variable is useful, the value of R² will increase significantly; if the new variable is not useful, the value of R² may barely increase at all.

Table 8.7, here reproduced as Table 9.3, provides us with the coefficient of determination, r², for the simple linear regression of rating on sugars alone. The difference between the multiple regression value of 81.6% and that simple regression value represents the additional variability in the nutritional rating that is accounted for by adding the new predictor, fiber content, to the model. This seems like a significant increase, but we shall defer this determination until later.

Table 9.3 Results for regression of nutritional rating versus sugar content alone

c09t003

The typical error in estimation is provided by the standard error of the estimate, s. The value of s here is about 6.13 rating points. Therefore, our estimation of the nutritional rating of the cereals, based on sugar and fiber content, is typically in error by about 6.13 points. Now, would we expect this error to be greater or less than the value for s obtained by the simple linear regression of nutritional rating on sugars alone? In general, the answer depends on the usefulness of the new predictor. If the new variable is useful, then s will decrease, but if the new variable is not useful for predicting the target variable, then s may in fact increase. This type of behavior makes s, the standard error of the estimate, a more attractive indicator than R² of whether a new variable should be added to the model, because R² always increases when a new variable is added, regardless of its usefulness.

Table 9.3 shows that the value for s from the regression of rating on sugars alone was about 9.17. Thus, the addition of fiber content as a predictor decreased the typical error in estimating nutritional content from 9.17 points to 6.13 points, a decrease of 3.04 points. Thus, adding a second predictor to our regression analysis decreased the prediction error (or, equivalently, increased the precision) by about three points.
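
As a rough check of this comparison, one can fit both models and read off R² and s directly; s is recovered as the square root of the mean square error. A minimal sketch, continuing from the cereals data frame and the fitted model named multiple in the earlier sketch:

```python
# Sketch: compare R-squared and the standard error of the estimate s
# for the simple model (sugars only) and the multiple model (sugars + fiber).
import numpy as np
import statsmodels.api as sm

simple = sm.OLS(cereals["rating"], sm.add_constant(cereals[["sugars"]])).fit()

for name, fit in [("sugars only", simple), ("sugars + fiber", multiple)]:
    s = np.sqrt(fit.mse_resid)          # standard error of the estimate
    print(f"{name}: R-sq = {fit.rsquared:.3f}, s = {s:.2f}")
```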

Next, before we turn to inference in multiple regression, we first examine the details of the population multiple regression equation.

9.2 The Population Multiple Regression Equation

We have seen that, for simple linear regression, the regression model takes the form:

$y = \beta_0 + \beta_1 x + \varepsilon \qquad (9.1)$

with β0 and β1 as the unknown values of the true regression coefficients, and ε the error term, with its associated assumptions discussed in Chapter 8. The multiple regression model is a straightforward extension of the simple linear regression model in equation (9.1):

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m + \varepsilon$

where the error term ε satisfies the same assumptions as in the simple linear regression case.

Just as we did for the simple linear regression case, we can derive implications for the behavior of the response variable, y, from these model assumptions.

9.3 Inference in Multiple Regression

We shall examine five inferential methods in this chapter, which are as follows:

  1. The t-test for the relationship between the response variable y and a particular predictor variable xi, in the presence of the other predictor variables, x(i), where x(i) = {x1, x2, …, xm} \ {xi} denotes the set of all predictors, not including xi.
  2. The F-test for the significance of the regression as a whole.
  3. The confidence interval for βi, the slope of the ith predictor variable.
  4. The confidence interval for the mean of the response variable y, given a set of particular values for the predictor variables x1, x2, …, xm.
  5. The prediction interval for a random value of the response variable y, given a set of particular values for the predictor variables x1, x2, …, xm.

9.3.1 The t-Test for the Relationship Between y and xi

The hypotheses for this test are given by

$H_0\colon \beta_i = 0 \qquad \text{versus} \qquad H_a\colon \beta_i \neq 0$

The models implied by these hypotheses are given by

Under $H_0$: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_{i-1} x_{i-1} + \beta_{i+1} x_{i+1} + \cdots + \beta_m x_m + \varepsilon$
Under $H_a$: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_{i-1} x_{i-1} + \beta_i x_i + \beta_{i+1} x_{i+1} + \cdots + \beta_m x_m + \varepsilon$

Note that the only difference between the two models is the presence or absence of the ith term. All other terms are the same in both models. Therefore, interpretations of the results for this t-test must include some reference to the other predictor variables being held constant.

Under the null hypothesis, the test statistic t = bi/s_bi follows a t distribution with n − m − 1 degrees of freedom, where s_bi refers to the standard error of the slope for the ith predictor variable. We proceed to perform the t-test for each of the predictor variables in turn, using the results displayed in Table 9.1.

9.3.2 t-Test for Relationship Between Nutritional Rating and Sugars

  • H0: β1 = 0
  • Ha: β1 ≠ 0
  • In Table 9.1, under "Coef" in the "Sugars" row is found the value, b1 = −2.2436.
  • Under "SE Coef" in the "Sugars" row is found the value of the standard error of the slope for sugar content, s_b1.
  • Under "T" is found the value of the t-statistic; that is, the test statistic for the t-test, t = b1/s_b1.
  • Under "P" is found the p-value of the t-statistic. As this is a two-tailed test, this p-value takes the following form: p-value = 2 · P(t > |t_obs|), where t_obs represents the observed value of the t-statistic from the regression results. Here, p-value ≈ 0.000, although, of course, no continuous p-value ever precisely equals zero.

The p-value method is used, whereby the null hypothesis is rejected when the p-value of the test statistic is small. Here, we have p-value ≈ 0.000, which is smaller than any reasonable threshold of significance. Our conclusion is therefore to reject the null hypothesis. The interpretation of this conclusion is that there is evidence for a linear relationship between nutritional rating and sugar content, in the presence of fiber content.

9.3.3 t-Test for Relationship Between Nutritional Rating and Fiber Content

  • H0: β2 = 0
  • Ha: β2 ≠ 0
  • In Table 9.1, under "Coef" in the "Fiber" row is found the value, b2 = 2.8665.
  • Under "SE Coef" in the "Fiber" row is found the value of the standard error of the slope for fiber, s_b2.
  • Under "T" is found the test statistic for the t-test, t = b2/s_b2.
  • Under "P" is found the p-value of the t-statistic. Again, p-value ≈ 0.000.

Thus, our conclusion is again to reject the null hypothesis. We interpret this to mean that there is evidence for a linear relationship between nutritional rating and fiber content, in the presence of sugar content.
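
The quantities used in these two t-tests (coefficient, standard error of the coefficient, t-statistic, and two-tailed p-value) can be read from a fitted statsmodels model, mirroring the "Coef," "SE Coef," "T," and "P" columns discussed above. A minimal sketch, continuing from the model named multiple fitted earlier:

```python
# Sketch: t-tests for the individual coefficients of the sugars + fiber model.
import pandas as pd

t_table = pd.DataFrame({
    "coef": multiple.params,       # b_i
    "se_coef": multiple.bse,       # s_{b_i}
    "t": multiple.tvalues,         # b_i / s_{b_i}
    "p_value": multiple.pvalues,   # two-tailed p-value
})
print(t_table)   # tiny p-values for sugars and fiber lead us to reject each H0
```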

9.3.4 The F-Test for the Significance of the Overall Regression Model

Next, we introduce the F-test for the significance of the overall regression model. Figure 9.4 illustrates the difference between the t-test and the F-test. One may apply a separate t-test for each predictor x1, x2, …, or xm, examining whether a linear relationship exists between the target variable y and that particular predictor. However, the F-test considers the linear relationship between the target variable y and the set of predictors (e.g., {x1, x2, …, xm}), taken as a whole.

c09f004

Figure 9.4 The F-test considers the relationship between the target and the set of predictors, taken as a whole.

The hypotheses for the F-test are given by

$H_0\colon \beta_1 = \beta_2 = \cdots = \beta_m = 0 \qquad \text{versus} \qquad H_a\colon \text{at least one } \beta_i \neq 0$

The null hypothesis asserts that there is no linear relationship between the target variable y, and the set of predictors, x1, x2, …, xm. Thus, the null hypothesis states that the coefficient βi for each predictor xi exactly equals zero, leaving the null model to be

$y = \beta_0 + \varepsilon$

The alternative hypothesis does not assert that all the regression coefficients differ from zero. For the alternative hypothesis to be true, it is sufficient for a single, unspecified, regression coefficient to differ from zero. Hence, the alternative hypothesis for the F-test does not specify a particular model, because it would be true if any, some, or all of the coefficients differed from zero.

As shown in Table 9.2, the F-statistic consists of a ratio of two mean squares, the mean square regression (MSR) and the mean square error (MSE). A mean square represents a sum of squares divided by the degrees of freedom associated with that sum of squares statistic. As the sums of squares are always nonnegative, so are the mean squares. To understand how the F-test works, we should consider the following.

The MSE is always a good estimate of the overall variance (see model assumption 2) σ², regardless of whether the null hypothesis is true or not. (In fact, recall that we use the standard error of the estimate, s, as a measure of the usefulness of the regression, without reference to an inferential model.) Now, the MSR is also a good estimate of σ², but only on the condition that the null hypothesis is true. If the null hypothesis is false, then MSR overestimates σ².

So, consider the value of F = MSR/MSE, with respect to the null hypothesis. Suppose MSR and MSE are close to each other, so that the value of F is small (near 1.0). As MSE is always a good estimate of σ², and MSR is only a good estimate of σ² when the null hypothesis is true, then the circumstance that MSR and MSE are close to each other will only occur when the null hypothesis is true. Therefore, when the value of F is small, this is evidence that the null hypothesis is true.

However, suppose that MSR is much greater than MSE, so that the value of F is large. MSR is large (overestimates σ²) when the null hypothesis is false. Therefore, when the value of F is large, this is evidence that the null hypothesis is false. Therefore, for the F-test, we shall reject the null hypothesis when the value of the test statistic F is large.

The observed F-statistic F = MSR/MSE follows an F distribution with m and n − m − 1 degrees of freedom. As all F values are nonnegative, the F-test is a right-tailed test. Thus, we will reject the null hypothesis when the p-value is small, where the p-value is the area in the tail to the right of the observed F statistic. That is, p-value = P(F > F_obs), and we reject the null hypothesis when P(F > F_obs) is small.

9.3.5 F-Test for Relationship Between Nutritional Rating and {Sugar and Fiber}, Taken Together

  • H0: β1 = β2 = 0.
  • Ha: at least one of β1, β2 ≠ 0.
  • The model implied by Ha is not specified, and may be any one of the following:
    • $y = \beta_0 + \beta_1(\text{sugars}) + \varepsilon$
    • $y = \beta_0 + \beta_2(\text{fiber}) + \varepsilon$
    • $y = \beta_0 + \beta_1(\text{sugars}) + \beta_2(\text{fiber}) + \varepsilon$.
  • In Table 9.1, under "MS" in the "Regression" row of the "Analysis of Variance" table, is found the value of MSR, 6094.3.
  • Under "MS" in the "Residual Error" row of the "Analysis of Variance" table is found the value of MSE, 37.5.
  • Under "F" in the "Regression" row of the "Analysis of Variance" table is found the value of the test statistic F = MSR/MSE.
  • The degrees of freedom for the F-statistic are given in the column marked "DF," so that we have m = 2, and n − m − 1 = 73.
  • Under "P" in the "Regression" row of the "Analysis of Variance" table is found the p-value of the F-statistic. Here, the p-value is approximately zero, although again no continuous p-value ever precisely equals zero.

This p-value of approximately zero is less than any reasonable threshold of significance. Our conclusion is therefore to reject the null hypothesis. The interpretation of this conclusion is the following. There is evidence for a linear relationship between nutritional rating on the one hand, and the set of predictors, sugar content and fiber content, on the other. More succinctly, we may simply say that the overall regression model is significant.
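
A sketch of the same F-test in code, continuing from the fitted model above; the F-statistic and its p-value are reported directly by statsmodels, and the statistic can also be rebuilt from MSR and MSE.

```python
# Sketch: the overall F-test for the sugars + fiber model.
msr = multiple.mse_model             # mean square regression, SSR / m
mse = multiple.mse_resid             # mean square error, SSE / (n - m - 1)
print("F  =", multiple.fvalue, "=", msr / mse)
print("p  =", multiple.f_pvalue)     # essentially zero for this model
print("df =", multiple.df_model, multiple.df_resid)
```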

9.3.6 The Confidence Interval for a Particular Coefficient, βi

Just as for simple linear regression, we may construct a 100(1 − α)% confidence interval for a particular coefficient, βi, as follows. We can be 100(1 − α)% confident that the true value of a particular coefficient βi lies within the following interval:

$b_i \pm \left(t_{\alpha/2,\, n-m-1}\right)\left(s_{b_i}\right)$

where t_{α/2} is based on n − m − 1 degrees of freedom, and s_bi represents the standard error of the ith coefficient estimate.

For example, let us construct a 95% confidence interval for the true value of the coefficient β1 for x1, sugar content. From Table 9.1, the point estimate is given as b1 = −2.2436. The t-critical value for 95% confidence and n − m − 1 = 73 degrees of freedom is t ≈ 1.99. The standard error of the coefficient estimate is s_b1 ≈ 0.163. Thus, our confidence interval is as follows:

$b_1 \pm \left(t_{\alpha/2,\, n-m-1}\right)\left(s_{b_1}\right) = -2.2436 \pm (1.99)(0.163) \approx (-2.57,\ -1.92)$

We are 95% confident that the value for the coefficient β1 lies between −2.57 and −1.92. In other words, for every additional gram of sugar, the nutritional rating will decrease by between 1.92 and 2.57 points, when fiber content is held constant. For example, suppose a nutrition researcher claimed that nutritional rating would fall two points for every additional gram of sugar, when fiber is held constant. As −2.0 lies within the 95% confidence interval, we would not reject this hypothesis, with 95% confidence.
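
The same interval can be computed two ways, continuing from the fitted model above: once by hand from the t critical value, and once with the conf_int method. This is a sketch, not a reproduction of any particular software output.

```python
# Sketch: 95% confidence interval for the sugars coefficient.
from scipy import stats

b1 = multiple.params["sugars"]
se1 = multiple.bse["sugars"]
t_crit = stats.t.ppf(0.975, df=multiple.df_resid)   # about 1.99 for 73 df
print(b1 - t_crit * se1, b1 + t_crit * se1)          # interval built by hand
print(multiple.conf_int(alpha=0.05))                 # intervals for all coefficients
```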

9.3.7 The Confidence Interval for the Mean Value of y, Given x1, x2, …, xm

We may find confidence intervals for the mean value of the target variable y, given a particular set of values for the predictors x1, x2, …, xm. The formula is a multivariate extension of the analogous formula from Chapter 8, requires matrix multiplication, and may be found in Draper and Smith.3 For example, the bottom of Table 9.1 ("Values of Predictors for New Observations") shows that we are interested in finding the confidence interval for the mean of the distribution of all nutritional ratings, when the cereal contains 5.00 grams of sugar and 5.00 grams of fiber.

The resulting 95% confidence interval is given, under "Predicted Values for New Observations," as "95% CI" = (53.062, 57.516). That is, we can be 95% confident that the mean nutritional rating of all cereals with 5.00 grams of sugar and 5.00 grams of fiber lies between 53.062 points and 57.516 points.

9.3.8 The Prediction Interval for a Randomly Chosen Value of y, Given x1, x2, …, xm

Similarly, we may find a prediction interval for a randomly selected value of the target variable, given a particular set of values for the predictors x1, x2, …, xm. We refer to Table 9.1 for our example of interest: 5.00 grams of sugar and 5.00 grams of fiber. Under "95% PI," we find the prediction interval to be (42.876, 67.702). In other words, we can be 95% confident that the nutritional rating for a randomly chosen cereal with 5.00 grams of sugar and 5.00 grams of fiber lies between 42.876 points and 67.702 points. Again, note that the prediction interval is wider than the confidence interval, as expected.
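
Both intervals can be produced from the fitted model's get_prediction method. A minimal sketch for a cereal with 5 grams of sugar and 5 grams of fiber, continuing from the earlier fit:

```python
# Sketch: 95% CI for the mean rating and 95% PI for one new cereal
# with sugars = 5 and fiber = 5.
import pandas as pd

x_new = pd.DataFrame({"const": [1.0], "sugars": [5.0], "fiber": [5.0]})
pred = multiple.get_prediction(x_new).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])   # confidence interval for the mean
print(pred[["obs_ci_lower", "obs_ci_upper"]])             # prediction interval for one cereal
```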

9.4 Regression With Categorical Predictors, Using Indicator Variables

Thus far, our predictors have all been continuous. However, categorical predictor variables may also be used as inputs to regression models, through the use of indicator variables (dummy variables). For example, in the cereals data set, consider the variable shelf, which indicates which supermarket shelf the particular cereal was located on. Of the 76 cereals, 19 were located on shelf 1, 21 were located on shelf 2, and 36 were located on shelf 3.

A dot plot of the nutritional rating for the cereals on each shelf is provided in Figure 9.5, with the shelf means indicated by the triangles. Now, if we were to use only the categorical variables (such as shelf and manufacturer) as predictors, then we could perform ANOVA.4 However, we are interested in using the categorical variable shelf along with continuous variables such as sugar content and fiber content. Therefore, we shall use multiple regression analysis with indicator variables.

c09f005

Figure 9.5 Is there evidence that shelf location affects nutritional rating?

On the basis of the comparison dot plot in Figure 9.5, does there seem to be evidence that shelf location affects nutritional rating? It would seem that shelf 2 cereals, with their average nutritional rating of 34.97, lag somewhat behind the cereals on shelves 1 and 3, with their respective average nutritional ratings of 45.90 and 45.22. However, it is not clear whether this difference is significant. Further, this dot plot does not take into account the other variables, such as sugar content and fiber content; it is unclear how any "shelf effect" would manifest itself, in the presence of these other variables.

For use in regression, a categorical variable with k categories must be transformed into a set of k − 1 indicator variables. An indicator variable, also known as a flag variable, or a dummy variable, is a binary 0/1 variable, which takes the value 1 if the observation belongs to the given category, and takes the value 0 otherwise.

For the present example, we define the following indicator variables:

$\text{Shelf 1} = \begin{cases} 1 & \text{if cereal is located on shelf 1} \\ 0 & \text{otherwise} \end{cases}$

$\text{Shelf 2} = \begin{cases} 1 & \text{if cereal is located on shelf 2} \\ 0 & \text{otherwise} \end{cases}$

Table 9.4 indicates the values taken by these indicator variables, for cereals located on shelves 1, 2, and 3, respectively. Note that it is not necessary to define a third indicator variable “shelf 3,” because cereals located on shelf 3 will have zero values for each of the shelf 1 and shelf 2 indicator variables, and this is sufficient to distinguish them. In fact, one should not define this third dummy variable because the resulting covariate matrix will be singular, and the regression will not work. The category that is not assigned an indicator variable is denoted the reference category. Here, shelf 3 is the reference category. Later, we shall measure the effect of the location of a given cereal (e.g., on shelf 1) on nutritional rating, with respect to (i.e., with reference to) shelf 3, the reference category.

Table 9.4 Values taken by the indicator variables, for cereals located on shelves 1, 2, and 3, respectively

Cereal Location Value of Variable Shelf 1 Value of Variable Shelf 2
Shelf 1 1 0
Shelf 2 0 1
Shelf 3 0 0

So, let us construct a multiple regression model using only the two indicator variables shown in Table 9.4. In this case, our regression equation is

$y = \beta_0 + \beta_3(\text{shelf 1}) + \beta_4(\text{shelf 2}) + \varepsilon$

Before we run the regression, let us think about what the regression coefficient values might be. On the basis of Figure 9.5, we would expect the coefficient for shelf 2, b4, to be negative, because the shelf 2 cereals have a lower mean rating compared with shelf 3 cereals. We might also expect the coefficient for shelf 1, b3, to be essentially negligible but slightly positive, reflecting the slightly greater mean rating for shelf 1 cereals compared with shelf 3 cereals.

Table 9.5 contains the results of the regression of nutritional rating on shelf 1 and shelf 2 only. Note that the coefficient for the shelf 2 dummy variable is −10.247, which is equal (after rounding) to the difference in the mean nutritional ratings between cereals on shelves 2 and 3: 34.97 − 45.22. Similarly, the coefficient for the shelf 1 dummy variable is 0.679, which equals (after rounding) the difference in the mean ratings between cereals on shelves 1 and 3: 45.90 − 45.22. These values fulfill our expectations, based on Figure 9.5.

Table 9.5 Results of regression of nutritional rating on shelf location only

c09t005
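
A sketch of this dummies-only regression in code; the column name shelf and its coding as 1, 2, 3 are assumptions about the data file. The dummy coefficients should match the differences in mean rating between each shelf and the reference shelf.

```python
# Sketch: regression of rating on shelf location only (shelf 3 as the reference category).
import pandas as pd
import statsmodels.api as sm

dummies = pd.get_dummies(cereals["shelf"], prefix="shelf").astype(float)
X = sm.add_constant(dummies[["shelf_1", "shelf_2"]])   # omit shelf_3: the reference category
shelf_fit = sm.OLS(cereals["rating"], X).fit()
print(shelf_fit.params)

# Each dummy coefficient equals the difference in mean rating versus shelf 3.
means = cereals.groupby("shelf")["rating"].mean()
print(means[1] - means[3], means[2] - means[3])
```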

Next, let us proceed to perform multiple regression, for the linear relationship between nutritional rating and sugar content, fiber content, and shelf location, using the two dummy variables from Table 9.4. The regression equation is given as

$y = \beta_0 + \beta_1(\text{sugars}) + \beta_2(\text{fiber}) + \beta_3(\text{shelf 1}) + \beta_4(\text{shelf 2}) + \varepsilon$

For cereals located on shelf 1 (shelf 1 = 1, shelf 2 = 0), the regression equation looks like the following:

$\hat{y} = b_0 + b_1(\text{sugars}) + b_2(\text{fiber}) + b_3(1) + b_4(0) = (b_0 + b_3) + b_1(\text{sugars}) + b_2(\text{fiber})$

For cereals located on shelf 2 (shelf 1 = 0, shelf 2 = 1), the regression equation is

$\hat{y} = b_0 + b_1(\text{sugars}) + b_2(\text{fiber}) + b_3(0) + b_4(1) = (b_0 + b_4) + b_1(\text{sugars}) + b_2(\text{fiber})$

Finally, for cereals located on shelf 3 (shelf 1 = 0, shelf 2 = 0), the regression equation is as follows:

$\hat{y} = b_0 + b_1(\text{sugars}) + b_2(\text{fiber})$

Note the relationship of the model equations to each other. The three models represent parallel planes, as illustrated in Figure 9.6. (Note that the planes do not, of course, directly represent the shelves themselves, but the fit of the regression model to the nutritional rating, for the cereals on the various shelves.) The results for the regression of nutritional rating on sugar content, fiber content, and shelf location are provided in Table 9.6. The general form of the regression equation looks like:

$\hat{y} = b_0 - 2.3183(\text{sugars}) + 3.1314(\text{fiber}) + 2.101(\text{shelf 1}) + 3.915(\text{shelf 2})$

where b0 denotes the constant term reported in Table 9.6. Thus, the regression equation for cereals located on the various shelves is given as the following:

Shelf 1: $\hat{y} = (b_0 + 2.101) - 2.3183(\text{sugars}) + 3.1314(\text{fiber})$
Shelf 2: $\hat{y} = (b_0 + 3.915) - 2.3183(\text{sugars}) + 3.1314(\text{fiber})$
Shelf 3: $\hat{y} = b_0 - 2.3183(\text{sugars}) + 3.1314(\text{fiber})$

Note that these estimated regression equations are exactly the same, except for the y-intercept. This means that cereals on each shelf are modeled as following the exact same slope in the sugars dimension (−2.3183) and the exact same slope in the fiber dimension (3.1314), which gives us the three parallel planes shown in Figure 9.6. The only difference lies in the value of the y-intercept for the cereals on the three shelves.

c09f006

Figure 9.6 The use of indicator variables in multiple regression leads to a set of parallel planes (or hyperplanes).

Table 9.6 Results for the regression of nutritional rating on sugar content, fiber content, and shelf location

c09t006

The reference category in this case is shelf 3. What is the vertical distance between the shelf 3 plane and, for example, the shelf 1 plane? Note from the derivations above that the estimated regression equation for the cereals on shelf 1 is given as

$\hat{y} = (b_0 + 2.101) - 2.3183(\text{sugars}) + 3.1314(\text{fiber})$

so that the y-intercept is b0 + b3 = b0 + 2.101. We also have the estimated regression equation for the cereals on shelf 3 to be

$\hat{y} = b_0 - 2.3183(\text{sugars}) + 3.1314(\text{fiber})$

Thus, the difference between the y-intercepts is (b0 + b3) − b0 = b3 = 2.101, which is the value of the coefficient for the shelf 1 indicator reported in Table 9.6. The vertical distance between the planes representing shelves 1 and 3 is everywhere 2.101 rating points, as shown in Figure 9.7.

c09f007

Figure 9.7 The indicator variables coefficients estimate the difference in the response value, compared to the reference category.

Of particular importance is the interpretation of this value for c09-math-0160. Now, the y-intercept represents the estimated nutritional rating when both sugars and fiber equal zero. However, as the planes are parallel, the difference in the y-intercepts among the shelves remains constant throughout the range of sugar and fiber values. Thus, the vertical distance between the parallel planes, as measured by the coefficient for the indicator variable, represents the estimated effect of the particular indicator variable on the target variable, with respect to the reference category.

In this example, b3 = 2.101 represents the estimated difference in nutritional rating for cereals located on shelf 1, compared with the cereals on shelf 3. As b3 is positive, this indicates that the estimated nutritional rating for shelf 1 cereals is higher. We thus interpret b3 as follows: the estimated increase in nutritional rating for cereals located on shelf 1, as compared with cereals located on shelf 3, is b3 = 2.101 points, when sugars and fiber content are held constant. It is similar for the cereals on shelf 2. We have the estimated regression equation for these cereals as:

$\hat{y} = (b_0 + 3.915) - 2.3183(\text{sugars}) + 3.1314(\text{fiber})$

so that the difference between the y-intercepts for the planes representing shelves 2 and 3 is (b0 + b4) − b0 = b4 = 3.915, which is the value for the shelf 2 indicator coefficient reported in Table 9.6. That is, the vertical distance between the planes representing shelves 2 and 3 is everywhere 3.915 rating points, as shown in Figure 9.7. Therefore, the estimated increase in nutritional rating for cereals located on shelf 2, as compared with cereals located on shelf 3, is b4 = 3.915 points, when sugars and fiber content are held constant.

We may then infer the estimated difference in nutritional rating between shelves 2 and 1. This is given as b4 − b3 = 3.915 − 2.101 = 1.814 points. The estimated increase in nutritional rating for cereals located on shelf 2, as compared to cereals located on shelf 1, is 1.814 points, when sugars and fiber content are held constant.

Now, recall Figure 9.5, where we encountered evidence that shelf 2 cereals had the lowest nutritional rating, with an average of about 35, compared to average ratings of 46 and 45 for the cereals on the other shelves. How can this knowledge be reconciled with the dummy variable results, which seem to show the highest rating for shelf 2?

The answer is that our indicator variable results are accounting for the presence of the other variables, sugar content and fiber content. It is true that the cereals on shelf 2 have the lowest nutritional rating; however, as shown in Table 9.7, these cereals also have the highest sugar content (average 9.62 grams, compared to 5.11 and 6.53 grams for shelves 1 and 3) and the lowest fiber content (average 0.91 grams, compared to 1.63 and 3.14 grams for shelves 1 and 3). Because of the negative correlation between sugar and rating, and the positive correlation between fiber and rating, the shelf 2 cereals already have a relatively low estimated nutritional rating based on these two predictors alone.

Table 9.7 Using sugars and fiber only, the regression model underestimates the nutritional rating of shelf 2 cereals

Shelf Mean Sugars Mean Fiber Mean Rating Mean Estimated Ratinga Mean Error
1 5.11 1.63 45.90 45.40 −0.50
2 9.62 0.91 34.97 33.19 −1.78
3 6.53 3.14 45.22 46.53 +1.31

a Rating estimated using sugars and fiber only, and not shelf location.5

Table 9.7 shows the mean fitted values (estimated ratings) for the cereals on the various shelves, when sugar and fiber content are included in the model, but shelf location is not included as a predictor. Note that, on average, the nutritional rating of the shelf 2 cereals is underestimated by 1.78 points. However, the nutritional rating of the shelf 3 cereals is overestimated by 1.31 points. Therefore, when shelf location is introduced into the model, these over-/underestimates can be compensated for. Note from Table 9.7 that the relative estimation error difference between shelves 2 and 3 is 1.31 + 1.78 = 3.09. Thus, we would expect that if shelf location were going to compensate for the underestimate of shelf 2 cereals relative to shelf 3 cereals, it would add a factor in the neighborhood of 3.09 rating points. Recall from Figure 9.6 that b4 = 3.915, which is in the ballpark of 3.09. Also, note that the relative estimation error difference between shelves 1 and 3 is 1.31 + 0.50 = 1.81. We would expect that the shelf indicator variable compensating for this estimation error would be not far from 1.81, and, indeed, we have the relevant coefficient as b3 = 2.101.

This example illustrates the flavor of working with multiple regression, in that the relationship of the set of predictors with the target variable is not necessarily dictated by the individual bivariate relationships the target variable has with each of the predictors. For example, Figure 9.5 would have led us to believe that shelf 2 cereals would have had an indicator variable adjusting the estimated nutritional rating downward. But the actual multiple regression model, which included sugars, fiber, and shelf location, had an indicator variable adjusting the estimated nutritional rating upward, because of the effects of the other predictors.

Consider again Table 9.6. Note that the p-values for the sugars coefficient and the fiber coefficient are both quite small (near zero), so that we may include both of these predictors in the model. However, the p-value for the shelf 1 coefficient is somewhat large (0.246), indicating that the relationship between this variable and nutritional rating is not statistically significant. In other words, in the presence of sugars and fiber content, the difference in nutritional rating between shelf 1 cereals and shelf 3 cereals is not significant. We may therefore consider eliminating the shelf 1 indicator variable from the model. Suppose we go ahead and eliminate the shelf 1 indicator variable from the model, because of its large p-value, but retain the shelf 2 indicator variable. The results from the regression of nutritional rating on sugar content, fiber content, and shelf 2 (compared to shelf 3) location are given in Table 9.8.

Table 9.8 Results from regression of nutritional rating on sugars, fiber, and the shelf 2 indicator variable

c09t008

Note from Table 9.8 that the p-value for the shelf 2 dummy variable has increased from 0.039 to 0.077, indicating that it may no longer belong in the model. The effect of adding or removing predictors on the other predictors is not always predictable. This is why variable selection procedures exist to perform this task methodically, such as stepwise regression. We cover these methods later in this chapter.

9.5 Adjusting R2: Penalizing Models For Including Predictors That Are Not Useful

Recall that adding a variable to the model will increase the value of the coefficient of determination R², regardless of the usefulness of the variable. This is not a particularly attractive feature of this measure, because it may lead us to prefer models with marginally larger values for R², simply because they have more variables, and not because the extra variables are useful. Therefore, in the interests of parsimony, we should find some way to penalize the R² measure for models that include predictors that are not useful. Fortunately, such a penalized form for R² does exist, and is known as the adjusted R². The formula for adjusted R² is as follows:

$R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\dfrac{n-1}{n-m-1}$

If R²_adj is much less than R², then this is an indication that at least one variable in the model may be extraneous, and the analyst should consider omitting that variable from the model.

As an example of calculating R²_adj by hand, consider the model whose results are reported in Table 9.6, where we have

  • the value of R² reported in the output
  • n = 76
  • m = 4
c09f008

Figure 9.8 Matrix plot of the predictor variables shows correlation between fiber and potassium.

Then, R²_adj = 1 − (1 − R²)(n − 1)/(n − m − 1) = 1 − (1 − R²)(75/71).

Let us now compare Tables 9.6 and 9.8, where the regression model was run with and without the shelf 1 indicator variable, respectively. The shelf 1 indicator variable was found to be not useful for estimating nutritional rating. How did its removal affect R² and R²_adj?

Comparing the values of R² and R²_adj reported in the two tables, the regression model not including shelf 1 suffers a smaller penalty (a smaller gap between R² and R²_adj) than does the model that includes it, which makes sense, because shelf 1 is not a helpful predictor. However, in this instance, the penalty is not very large in either case. Just remember: when one is building models in multiple regression, one should use R²_adj and s, rather than the raw R².
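
A sketch of the adjusted R² computation, applied to the sugars and fiber fit from the earlier sketch; statsmodels reports the same quantity directly as rsquared_adj.

```python
# Sketch: adjusted R-squared by hand versus the value reported by statsmodels.
n = int(multiple.nobs)
m = int(multiple.df_model)                       # number of predictors
r2 = multiple.rsquared
r2_adj = 1 - (1 - r2) * (n - 1) / (n - m - 1)
print(r2_adj, multiple.rsquared_adj)             # the two values agree
```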

9.6 Sequential Sums of Squares

Some analysts use the information provided in the sequential sums of squares, provided by many software packages, to help them get a better idea of which variables to include in the model. The sequential sums of squares represent a partitioning of SSR, the regression sum of squares. Recall that SSR represents the portion of the variability in the target variable that is explained by the linear relationship of the target variable with the set of predictor variables. The sequential sums of squares partition the SSR into the unique portions of the SSR that are explained by the particular predictors, given any earlier predictors. Thus, the values of the sequential sums of squares depend on the order in which the variables are entered into the model. For example, the sequential sums of squares for the model:

$y = \beta_0 + \beta_1(\text{sugars}) + \beta_2(\text{fiber}) + \beta_3(\text{shelf 1}) + \beta_4(\text{shelf 2}) + \varepsilon$

are found in Table 9.6, and repeated here in Table 9.9. The sequential sum of squares shown for sugars is 8711.9, and represents the variability in nutritional rating that is explained by the linear relationship between rating and sugar content. In other words, this first sequential sum of squares is exactly the value for SSR from the simple linear regression of nutritional rating on sugar content.6

The second sequential sum of squares, for fiber content, equals 3476.6. This represents the amount of unique additional variability in nutritional rating that is explained by the linear relationship of rating with fiber content, given that the variability explained by sugars has already been extracted. The third sequential sum of squares, for shelf 1, is 7.0. This represents the amount of unique additional variability in nutritional rating that is accounted for by location on shelf 1 (compared to the reference class shelf 3), given that the variability accounted for by sugars and fiber has already been separated out. This tiny value for the sequential sum of squares for shelf 1 indicates that the variable is probably not useful for estimating nutritional rating. Finally, the sequential sum of squares for shelf 2 is a moderate 159.9.

Table 9.9 The sequential sums of squares for the model: y = β0 + β1(sugars) + β2(fiber) + β3(Shelf 1) + β4(Shelf 2) + ϵ

c09t009

Now, suppose we changed the ordering of the variables into the regression model. This would change the values of the sequential sums of squares. For example, suppose we perform an analysis based on the following model:

$y = \beta_0 + \beta_1(\text{shelf 1}) + \beta_2(\text{shelf 2}) + \beta_3(\text{sugars}) + \beta_4(\text{fiber}) + \varepsilon$

The results for this regression are provided in Table 9.10. Note that all the results in Table 9.10 are exactly the same as in Table 9.6 (apart from ordering), except the values of the sequential sums of squares. This time, the indicator variables are able to “claim” their unique portions of the variability before the other variables are entered, thus giving them larger values for their sequential sums of squares. See Neter, Wasserman, and Kutner7 for more information on applying sequential sums of squares for variable selection. We use the sequential sums of squares, in the context of a partial F-test, to perform variable selection later on in this chapter.

Table 9.10 Changing the ordering of the variables into the model changes nothing except the sequential sums of squares

c09t010
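
Sequential (Type I) sums of squares can be obtained with statsmodels' ANOVA utilities. The sketch below assumes shelf is coded 1, 2, 3 in the data; fitting the same four predictors in two different orders reproduces the behavior described above, in that the fitted model is identical and only the sequential sums of squares change.

```python
# Sketch: sequential sums of squares for two orderings of the same predictors.
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = cereals.assign(shelf_1=(cereals["shelf"] == 1).astype(float),
                    shelf_2=(cereals["shelf"] == 2).astype(float))

fit_a = smf.ols("rating ~ sugars + fiber + shelf_1 + shelf_2", data=df).fit()
fit_b = smf.ols("rating ~ shelf_1 + shelf_2 + sugars + fiber", data=df).fit()

print(anova_lm(fit_a, typ=1)["sum_sq"])   # sugars and fiber claim variability first
print(anova_lm(fit_b, typ=1)["sum_sq"])   # shelf indicators claim variability first
```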

9.7 Multicollinearity

Suppose that we are now interested in adding the predictor potassium to the model, so that our new regression equation looks like:

$y = \beta_0 + \beta_1(\text{sugars}) + \beta_2(\text{fiber}) + \beta_3(\text{shelf 2}) + \beta_4(\text{potassium}) + \varepsilon$

Now, data miners need to guard against multicollinearity, a condition where some of the predictor variables are correlated with each other. Multicollinearity leads to instability in the solution space, leading to possible incoherent results. For example, in a data set with severe multicollinearity, it is possible for the F-test for the overall regression to be significant, while none of the t-tests for the individual predictors are significant.

Consider Figures 9.9 and 9.10. Figure 9.9 illustrates a situation where the predictors x1 and x2 are not correlated with each other; that is, they are orthogonal, or independent. In such a case, the predictors form a solid basis, on which the response surface y may rest sturdily, thereby providing stable coefficient estimates b1 and b2, each with small variability s_b1 and s_b2. However, Figure 9.10 illustrates a multicollinear situation where the predictors x1 and x2 are correlated with each other, so that as one of them increases, so does the other. In this case, the predictors no longer form a solid basis, on which the response surface may firmly rest. Instead, when the predictors are correlated, the response surface is unstable, providing highly variable coefficient estimates b1 and b2, because of the inflated values for s_b1 and s_b2.

c09f009

Figure 9.9 When the predictors x1 and x2 are uncorrelated, the response surface y rests on a solid basis, providing stable coefficient estimates.

c09f010

Figure 9.10 Multicollinearity: When the predictors are correlated, the response surface is unstable, resulting in dubious and highly variable coefficient estimates.

The high variability associated with the estimates means that different samples may produce coefficient estimates with widely different values. For example, one sample may produce a positive coefficient estimate for a given predictor, while a second sample may produce a negative coefficient estimate. This situation is unacceptable when the analytic task calls for an explanation of the relationship between the response and the predictors, individually. Even if such instability is avoided, inclusion of variables that are highly correlated tends to overemphasize a particular component of the model, because the component is essentially being double counted.

To avoid multicollinearity, the analyst should investigate the correlation structure among the predictor variables (ignoring for the moment the target variable). Table 9.11 provides the correlation coefficients among the predictors for our present model. For example, the correlation coefficient between sugars and fiber is −0.139, while the correlation coefficient between sugars and potassium is 0.001. Unfortunately, there is one pair of variables that are strongly correlated: fiber and potassium, with r = 0.912. Another method of assessing whether the predictors are correlated is to construct a matrix plot of the predictors, such as Figure 9.8. The matrix plot supports the finding that fiber and potassium are positively correlated.

Table 9.11 Correlation coefficients among the predictors: We have a problem

c09t011

However, suppose we did not check for the presence of correlation among our predictors, and went ahead and performed the regression anyway. Is there some way that the regression results can warn us of the presence of multicollinearity? The answer is yes: We may ask for the variance inflation factors (VIFs) to be reported.

What do we mean by VIFs? First, recall that s_bi represents the variability associated with the coefficient bi for the ith predictor variable xi. We may express s_bi as a product of s, the standard error of the estimate, and c_i, which is a constant whose value depends on the observed predictor values. That is, s_bi = s · c_i. Now, s is fairly robust with respect to the inclusion of correlated variables in the model, so, in the presence of correlated predictors, we would look to c_i to help explain large changes in s_bi.

We may express c_i as the following:

$c_i = \sqrt{\dfrac{1}{(n-1)s_i^2} \cdot \dfrac{1}{1-R_i^2}}$

where s_i² represents the sample variance of the observed values of the ith predictor, xi, and R_i² represents the R² value obtained by regressing xi on the other predictor variables. Note that R_i² will be large when xi is highly correlated with the other predictors.

Note that, of the two factors in c_i, the first factor, 1/((n − 1)s_i²), measures only the intrinsic variability within the ith predictor, xi. It is the second factor, 1/(1 − R_i²), that measures the correlation between the ith predictor xi and the remaining predictor variables. For this reason, this second factor is denoted as the VIF for xi:

$\text{VIF}_i = \dfrac{1}{1-R_i^2}$

Can we describe the behavior of the VIF? Suppose that xi is completely uncorrelated with the remaining predictors, so that R_i² = 0. Then we will have VIF_i = 1/(1 − 0) = 1. That is, the minimum value for VIF is 1, and is reached when xi is completely uncorrelated with the remaining predictors. However, as the degree of correlation between xi and the other predictors increases, R_i² will also increase. In that case, VIF_i will increase without bound, as R_i² approaches 1. Thus, there is no upper limit to the value that VIF_i can take.

What effect do these changes in VIF_i have on s_bi, the variability of the ith coefficient? We have s_bi = s · c_i = s · √(VIF_i / ((n − 1)s_i²)). If xi is uncorrelated with the other predictors, then VIF_i = 1, and the standard error of the coefficient s_bi will not be inflated. However, if xi is correlated with the other predictors, then the large VIF_i will produce an inflation of the standard error of the coefficient s_bi. As you know, inflating the variance estimates will result in a degradation in the precision of the estimation. A rough rule of thumb for interpreting the value of the VIF is to consider VIF_i ≥ 5 to be an indicator of moderate multicollinearity, and to consider VIF_i ≥ 10 to be an indicator of severe multicollinearity. A VIF_i of 5 corresponds to R_i² = 0.80, while a VIF_i of 10 corresponds to R_i² = 0.90.

Getting back to our example, suppose we went ahead with the regression of nutritional rating on sugars, fiber, the shelf 2 indicator, and the new variable potassium, which is correlated with fiber. The results, including the estimated regression equation and the observed VIFs, are shown in Table 9.12.

The p-value for potassium is not very small (0.082), so at first glance, the variable may or may not be included in the model. Also, the p-value for the shelf 2 indicator variable (0.374) has increased to such an extent that we should perhaps not include it in the model. However, we should probably not put too much credence into any of these results, because the observed VIFs seem to indicate the presence of a multicollinearity problem. We need to resolve the evident multicollinearity before moving forward with this model.

Table 9.12 Regression results, with variance inflation factors indicating a multicollinearity problem

c09t012

Note that only 74 cases were used, because the potassium content of Almond Delight and Cream of Wheat is missing, along with the sugar content of Quaker Oats.

The VIF for fiber is 6.952 and the VIF for potassium is 7.157, with both values indicating moderate-to-strong multicollinearity. At least the problem is localized with these two variables only, as the other VIFs are reported at acceptably low values.
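
A sketch of computing the VIFs with statsmodels; the potassium column name potass follows the usual cereals data file and is an assumption here.

```python
# Sketch: variance inflation factors for sugars, fiber, shelf 2, and potassium.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

data = cereals.dropna(subset=["potass"]).assign(
    shelf_2=lambda d: (d["shelf"] == 2).astype(float))
X = sm.add_constant(data[["sugars", "fiber", "shelf_2", "potass"]])

vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)   # fiber and potassium show the elevated VIFs noted above
```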

How shall we deal with this problem? Some texts suggest choosing one of the variables and eliminating it from the model. However, this should be viewed only as a last resort, because the omitted variable may have something to teach us. As we saw in Chapter 4, principal components can be a powerful method for using the correlation structure in a large group of predictors to produce a smaller set of independent components. Principal components analysis is a definite option here. Another option might be to construct a user-defined composite, as discussed in Chapter 4. Here, our user-defined composite will be as simple as possible, the mean of fiber_z and potassium_z, where the z subscript indicates that the variables have been standardized. Thus, our composite W is defined as W = (fiber_z + potassium_z)/2. Note that we need to standardize the variables involved in the composite, to avoid the possibility that the greater variability of one of the variables will overwhelm that of the other variable. For example, the standard deviation of fiber among all cereals is 2.38 grams, while the standard deviation of potassium is 71.29 milligrams. (The grams/milligrams scale difference is not at issue here. What is relevant is the difference in variability, even on their respective scales.) Figure 9.11 illustrates the difference in variability.9

c09f011

Figure 9.11 Fiber and potassium have different variabilities, thus requiring standardization before construction of user-defined composite.
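
A sketch of constructing the standardized composite W described above; again, potass is the assumed column name for potassium.

```python
# Sketch: standardize fiber and potassium (z-scores), then form the composite W.
fiber_z = (cereals["fiber"] - cereals["fiber"].mean()) / cereals["fiber"].std()
potass_z = (cereals["potass"] - cereals["potass"].mean()) / cereals["potass"].std()
cereals["W"] = (fiber_z + potass_z) / 2

# Standardized sugars, used as a predictor in the regression that follows.
cereals["sugars_z"] = ((cereals["sugars"] - cereals["sugars"].mean())
                       / cereals["sugars"].std())
```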

We therefore proceed to perform the regression of nutritional rating on the following variables:

  • sugars_z
  • shelf 2
  • W = (fiber_z + potassium_z)/2

The results are provided in Table 9.13.

Table 9.13 Results from regression of rating on sugars, shelf 2, and the fiber/potassium composite

c09t013

Note first that the multicollinearity problem seems to have been resolved, with the VIF values all near 1. Note also, however, that the regression results are rather disappointing, with the values of R², R²_adj, and s all underperforming the model results found in Table 9.8, from the model containing sugars, fiber, and the shelf 2 indicator, which did not even include the potassium variable.

What is going on here? The problem stems from the fact that the fiber variable is a very good predictor of nutritional rating, especially when coupled with sugar content, as we shall see later on when we perform best subsets regression. Therefore, using the fiber variable to form a composite with a variable that has weaker correlation with rating dilutes the strength of fiber's strong association with rating, and so degrades the efficacy of the model.

Thus, reluctantly, we put aside the model containing sugars_z, shelf 2, and the composite W. One possible alternative is to change the weights in the composite, to increase the weight of fiber with respect to potassium; for example, we could assign fiber_z a greater weight than potassium_z, rather than weighting them equally. However, the model performance would still be slightly below that of using fiber alone. Instead, the analyst may be better advised to pursue principal components.

Now, depending on the task confronting the analyst, multicollinearity may not in fact present a fatal defect. Weiss10 notes that multicollinearity “does not adversely affect the ability of the sample regression equation to predict the response variable.” He adds that multicollinearity does not significantly affect point estimates of the target variable, confidence intervals for the mean response value, or prediction intervals for a randomly selected response value. However, the data miner must therefore strictly limit the use of a multicollinear model to estimation and prediction of the target variable. Interpretation of the model would not be appropriate, because the individual coefficients may not make sense, in the presence of multicollinearity.

9.8 Variable Selection Methods

To assist the data analyst in determining which variables should be included in a multiple regression model, several different variable selection methods have been developed, including

  • forward selection;
  • backward elimination;
  • stepwise selection;
  • best subsets.

These variable selection methods are essentially algorithms to help construct the model with the optimal set of predictors.

9.8.1 The Partial F-Test

In order to discuss variable selection methods, we first need to learn about the partial F-test. Suppose that we already have p variables in the model, x1, x2, …, xp, and we are interested in whether one extra variable x* should be included in the model or not. Recall earlier where we discussed the sequential sums of squares. Here, we would calculate the extra (sequential) sum of squares from adding x* to the model, given that x1, x2, …, xp are already in the model. Denote this quantity by SS_extra = SS(x* | x1, x2, …, xp). Now, this extra sum of squares is computed by finding the regression sum of squares for the full model (including x1, x2, …, xp and x*), denoted SS_full, and subtracting the regression sum of squares from the reduced model (including only x1, x2, …, xp), denoted SS_reduced. In other words:

$\text{SS}_{\text{extra}} = \text{SS}_{\text{full}} - \text{SS}_{\text{reduced}}$

that is,

$\text{SS}(x^{*} \mid x_1, x_2, \ldots, x_p) = \text{SS}(x_1, x_2, \ldots, x_p, x^{*}) - \text{SS}(x_1, x_2, \ldots, x_p)$

The null hypothesis for the partial F-test is as follows:

  H0: No, the extra sum of squares SS(x* | x1, x2, …, xp) associated with x* does not contribute significantly to the regression sum of squares for a model already containing x1, x2, …, xp. Therefore, do not include x* in the model.

The alternative hypothesis is:

  Ha: Yes, the extra sum of squares SS(x* | x1, x2, …, xp) associated with x* does contribute significantly to the regression sum of squares for a model already containing x1, x2, …, xp. Therefore, do include x* in the model.

The test statistic for the partial F-test is the following:

$F(x^{*} \mid x_1, x_2, \ldots, x_p) = \dfrac{\text{SS}_{\text{extra}}}{\text{MSE}_{\text{full}}}$

where MSE_full denotes the mean square error term from the full model, including x1, x2, …, xp and x*. This is known as the partial F-statistic for x*. When the null hypothesis is true, this test statistic follows an F distribution with 1 and n − p − 2 degrees of freedom. We would therefore reject the null hypothesis when the partial F-statistic is large, or when its associated p-value is small.
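
A sketch of the partial F-test in code, using the running example of adding potassium to a model that already contains sugars, fiber, and the shelf 2 indicator; statsmodels' compare_f_test performs the same comparison. Column names are assumptions as before.

```python
# Sketch: partial F-test for adding potassium to the sugars/fiber/shelf 2 model.
import statsmodels.formula.api as smf
from scipy import stats

data = cereals.dropna(subset=["sugars", "fiber", "potass"]).assign(
    shelf_2=lambda d: (d["shelf"] == 2).astype(float))

reduced = smf.ols("rating ~ sugars + fiber + shelf_2", data=data).fit()
full = smf.ols("rating ~ sugars + fiber + shelf_2 + potass", data=data).fit()

ss_extra = full.ess - reduced.ess             # SS(x* | x1, ..., xp)
f_stat = ss_extra / full.mse_resid            # partial F-statistic
p_value = stats.f.sf(f_stat, 1, full.df_resid)
print(f_stat, p_value)
print(full.compare_f_test(reduced))           # (F, p-value, df difference)
```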

An alternative to the partial F-test is the t-test. Now, an F-test with 1 and n − p − 2 degrees of freedom is equivalent to a t-test with n − p − 2 degrees of freedom. This is due to the distributional relationship that $F_{1,\,q} = \left(t_q\right)^2$. Thus, either the F-test or the t-test may be performed. Similarly to our treatment of the t-test earlier in the chapter, the hypotheses are given by

$H_0\colon \beta^{*} = 0 \qquad \text{versus} \qquad H_a\colon \beta^{*} \neq 0$

The associated models are

Under $H_0$: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$
Under $H_a$: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \beta^{*} x^{*} + \varepsilon$

Under the null hypothesis, the test statistic t = b*/s_b* follows a t distribution with n − p − 2 degrees of freedom. Reject the null hypothesis when the two-tailed p-value, 2 · P(t > |t_obs|), is small.

Finally, we need to discuss the difference between sequential sums of squares, and partial sums of squares. The sequential sums of squares are as described earlier in the chapter. As each variable is entered into the model, the sequential sum of squares represents the additional unique variability in the response explained by that variable, after the variability accounted for by variables entered earlier in the model has been extracted. That is, the ordering of the entry of the variables into the model is germane to the sequential sums of squares.

However, ordering is not relevant to the partial sums of squares. For a particular variable, the partial sum of squares represents the additional unique variability in the response explained by that variable, after the variability accounted for by all the other variables in the model has been extracted. Table 9.14 shows the difference between sequential and partial sums of squares, for a model with four predictors, x1, x2, x3, x4.

Table 9.14 The difference between sequential SS and partial SS

Variable   Sequential SS          Partial SS
x1         SS(x1)                 SS(x1 | x2, x3, x4)
x2         SS(x2 | x1)            SS(x2 | x1, x3, x4)
x3         SS(x3 | x1, x2)        SS(x3 | x1, x2, x4)
x4         SS(x4 | x1, x2, x3)    SS(x4 | x1, x2, x3)
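In R, these two kinds of sums of squares are obtained from different functions; a minimal sketch, assuming a hypothetical data frame dat with response y and predictors x1 through x4:

  fit <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
  anova(fit)              # sequential SS: each variable adjusted only for variables entered before it
  drop1(fit, test = "F")  # partial SS: each variable adjusted for all other variables in the model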

9.8.2 The Forward Selection Procedure

The forward selection procedure starts with no variables in the model.

  • Step 1. For the first variable to enter the model, select the predictor most highly correlated with the target. (Without loss of generality, denote this variable x1.) If the resulting model is not significant, then stop and report that no variables are important predictors; otherwise, proceed to step 2. Note that the analyst may choose the level of α; lower values make it more difficult to enter the model. A common choice is α = 0.05, but this is not set in stone.
  • Step 2. For each remaining variable, compute the sequential F-statistic for that variable, given the variables already in the model. For example, in this first pass through the algorithm, these sequential F-statistics would be F(x2 | x1), F(x3 | x1), and F(x4 | x1). On the second pass through the algorithm, these might be F(x3 | x1, x2) and F(x4 | x1, x2). Select the variable with the largest sequential F-statistic.
  • Step 3. For the variable selected in step 2, test for the significance of the sequential F-statistic. If the resulting model is not significant, then stop, and report the current model without adding the variable from step 2. Otherwise, add the variable from step 2 into the model and return to step 2. (A brief R sketch of forward selection follows this list.)
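A minimal sketch of forward selection in R using F-tests, again with the hypothetical data frame dat and predictors x1 through x4; add1() computes the sequential F-statistic for each candidate not yet in the model:

  current <- lm(y ~ 1, data = dat)                         # start with no predictors
  add1(current, scope = ~ x1 + x2 + x3 + x4, test = "F")   # sequential F for each candidate
  # Suppose x1 has the largest, significant F; enter it and repeat the scan:
  current <- update(current, . ~ . + x1)
  add1(current, scope = ~ x1 + x2 + x3 + x4, test = "F")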

9.8.3 The Backward Elimination Procedure

The backward elimination procedure begins with all the variables, or all of a user-specified set of variables, in the model.

  • Step 1. Perform the regression on the full model; that is, using all available variables. For example, perhaps the full model has four variables, x1, x2, x3, x4.
  • Step 2. For each variable in the current model, compute the partial F-statistic. In the first pass through the algorithm, these would be F(x1 | x2, x3, x4), F(x2 | x1, x3, x4), F(x3 | x1, x2, x4), and F(x4 | x1, x2, x3). Select the variable with the smallest partial F-statistic. Denote this value F_min.
  • Step 3. Test for the significance of F_min. If F_min is not significant, then remove the variable associated with F_min from the model, and return to step 2. If F_min is significant, then stop the algorithm and report the current model. If this is the first pass through the algorithm, then the current model is the full model. If this is not the first pass, then the current model has been reduced by one or more variables from the full model. Note that the analyst may choose the level of α needed to remove variables; lower values make it more difficult to keep variables in the model. (A brief R sketch of backward elimination follows this list.)
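A minimal sketch of backward elimination in R, using partial F-tests via drop1() (hypothetical data frame dat, predictors x1 through x4):

  current <- lm(y ~ x1 + x2 + x3 + x4, data = dat)   # begin with the full model
  drop1(current, test = "F")                         # partial F for each variable in the model
  # Suppose x3 has the smallest, nonsignificant partial F; remove it and rescan:
  current <- update(current, . ~ . - x3)
  drop1(current, test = "F")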

9.8.4 The Stepwise Procedure

The stepwise procedure represents a modification of the forward selection procedure. A variable that has been entered into the model early in the forward selection process may turn out to be nonsignificant, once other variables have been entered into the model. The stepwise procedure checks on this possibility, by performing at each step a partial F-test, using the partial sum of squares, for each variable currently in the model. If there is a variable in the model that is no longer significant, then the variable with the smallest partial F-statistic is removed from the model. The procedure terminates when no further variables can be entered or removed. The analyst may choose both the level of α_enter required to enter the model, and the level of α_remove needed to remove variables, with α_remove chosen to be somewhat larger than α_enter.
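Base R's step() function carries out this kind of enter-and-remove search, although it ranks candidate moves by AIC rather than by partial F-tests; a sketch, again with the hypothetical data frame dat:

  null_model <- lm(y ~ 1, data = dat)
  stepwise   <- step(null_model,
                     scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3 + x4),
                     direction = "both")   # variables may enter and later be removed
  summary(stepwise)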

9.8.5 The Best Subsets Procedure

For data sets where the number of predictors is not too large, the best subsets procedure represents an attractive variable selection method. However, if there are more than 30 or so predictors, then the best subsets method encounters a combinatorial explosion, and becomes intractably slow.

The best subsets procedure works as follows:

  • Step 1. The analyst specifies how many (k) models of each size he or she would like reported, as well as the maximum number of predictors (p) the analyst wants in the model.
  • Step 2. All models of one predictor are built. Their R², adjusted R², Mallows' Cp (see below), and s values are calculated. The best k models are reported, based on these measures.
  • Step 3. Then all models of two predictors are built. Their R², adjusted R², Mallows' Cp, and s values are calculated, and the best k models are reported.
  • The procedure continues in this way until the maximum number of predictors (p) is reached. The analyst then has a listing of the best models of each size, 1, 2, …, p, to assist in the selection of the best overall model. (An R sketch using the leaps package follows this list.)
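The best subsets search is implemented by regsubsets() in the leaps package (listed in the R references at the end of this chapter); a minimal sketch with the hypothetical data frame dat:

  library(leaps)
  bs <- regsubsets(y ~ x1 + x2 + x3 + x4, data = dat,
                   nbest = 2,   # report the two best models of each size
                   nvmax = 4)   # maximum number of predictors considered
  bs_summary <- summary(bs)
  cbind(p = rowSums(bs_summary$which) - 1,   # number of predictors in each reported model
        R2 = bs_summary$rsq, adjR2 = bs_summary$adjr2, Cp = bs_summary$cp)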

9.8.6 The All-Possible-Subsets Procedure

The four methods of model selection we have discussed are essentially optimization algorithms over a large search space. Because of that, there is no guarantee that the globally optimal model will be found; that is, there is no guarantee that these variable selection algorithms will uncover the model with the lowest s, the highest adjusted R², and so on (Draper and Smith11; Kleinbaum, Kupper, Nizam, and Muller12). The only way to ensure that the absolute best model has been found is simply to perform all the possible regressions. Unfortunately, in data mining applications, there are usually so many candidate predictor variables available that this method is simply not practicable. Not counting the null model y = β0 + ε, there are 2^p − 1 possible models to be built, using p predictors.

For small numbers of predictors, it is not a problem to construct all possible regressions. For example, for p = 5 predictors, there are 2^5 − 1 = 31 possible models. However, as the number of predictors starts to grow, the search space grows exponentially. For instance, for p = 10 predictors, there are 2^10 − 1 = 1023 possible models, while for p = 20 predictors, there are 2^20 − 1 = 1,048,575 possible models. Thus, for most data mining applications, in which there may be hundreds of predictors, the all-possible-regressions procedure is not applicable. Therefore, the data miner may be inclined to turn to one of the four variable selection procedures discussed above. Even though there is no guarantee that the globally best model is found, these methods usually provide a useful set of models, which can provide positive results. The analyst can then adopt these models as starting points, and apply tweaks and modifications to coax the best available performance out of them.

9.9 Gas Mileage Data Set

At this point, it may be helpful to turn to a new data set to illustrate the nuts and bolts of variable selection methods. We shall use the Gas Mileage data set,13 where the target variable MPG (miles per gallon) is estimated using four predictors: cab space, horsepower, top speed, and weight. Let us explore this data set a bit. Figure 9.12 shows scatter plots of the target MPG with each of the predictors. The relationship between MPG and horsepower does not appear to be linear. Using the bulging rule from Chapter 8, we therefore take the natural log of both MPG and horsepower.

c09f012

Figure 9.12 Scatter plots of MPG with each predictor. Some non-linearity.

The resulting scatter plots, shown in Figure 9.13, show improved linearity. We therefore proceed to perform linear regression of ln MPG on cab space, ln HP, top speed, and weight.

c09f013

Figure 9.13 Scatter plots of ln MPG with each predictor (including ln HP). Improved linearity.
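A sketch of this step in R, assuming the Gas Mileage data have been read into a data frame named mileage with columns MPG, cab_space, horsepower, top_speed, and weight (hypothetical names):

  mileage$ln_mpg <- log(mileage$MPG)          # natural log of the response
  mileage$ln_hp  <- log(mileage$horsepower)   # natural log of horsepower
  fit <- lm(ln_mpg ~ cab_space + ln_hp + top_speed + weight, data = mileage)
  summary(fit)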

9.10 An Application of Variable Selection Methods

We would like the most parsimonious model that does not leave out any significant predictors. We shall apply the variable selection methods described above, using commonly chosen thresholds of significance for variables entering and leaving the model, α_enter and α_remove.

9.10.1 Forward Selection Procedure Applied to the Gas Mileage Data Set

Table 9.15 shows the results for the forward selection method. We begin with no variables in the model. Then the variable most strongly correlated with ln MPG is selected and, if significant, entered into the model. This variable is weight, which has the highest correlation with ln MPG among the predictors, and it appears in the upper left of Table 9.15 as the first variable entered.

Table 9.15 Forward selection results

c09t015

Then the sequential F-tests are performed, such as F(cab space | weight), F(ln HP | weight), and so on. It turns out that the highest sequential F-statistic is given by the significance test of F(ln HP | weight), so the variable ln HP becomes the second variable entered into the model, as shown in Table 9.15. Once again, the sequential F-tests are performed, but no further significant variables are found. Thus, the forward selection method prefers the following model:

estimated ln MPG = b0 + b1(weight) + b2(ln HP)

Table 9.15 contains the ANOVA tables for the two models selected by the forward selection procedure. We may use these ANOVA results to calculate the sequential F-statistics. Model 1 represents the model with weight as the only predictor. Model 2 represents the model with both weight and ln HP entered as predictors.

As SS_extra = SS_full − SS_reduced, we have

SS(ln HP | weight) = SS(weight, ln HP) − SS(weight)

From Table 9.15, we obtain

  • SS(weight, ln HP), the regression sum of squares for Model 2, and
  • SS(weight), the regression sum of squares for Model 1, giving us:
  • SS(ln HP | weight) = SS(weight, ln HP) − SS(weight).

The test statistic for the partial (or, in this case, sequential) F-test is the following:

F(ln HP | weight) = SS(ln HP | weight) / MSE(weight, ln HP)

From Table 9.15, we also obtain

  • MSE(weight, ln HP), the mean square error for Model 2, giving us:
  • the value of the sequential F-statistic F(ln HP | weight).

With a sample size of n = 82 and p = 2 predictors in the full model, this test statistic follows an F distribution with 1 and n − p − 1 = 79 degrees of freedom. The p-value for this test statistic is approximately zero, so we reject the null hypothesis and conclude that ln HP should be included in the model after weight.
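In R, the same sequential F-statistic and p-value may be obtained by comparing the two nested models directly; a sketch, reusing the hypothetical mileage data frame from above:

  model1 <- lm(ln_mpg ~ weight, data = mileage)           # Model 1: weight only
  model2 <- lm(ln_mpg ~ weight + ln_hp, data = mileage)   # Model 2: weight and ln HP
  anova(model1, model2)                                   # sequential F and its p-value
  # Equivalently, given the F value: pf(F_value, df1 = 1, df2 = 79, lower.tail = FALSE)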

9.10.2 Backward Elimination Procedure Applied to the Gas Mileage Data Set

In the backward elimination procedure, we begin with all of the variables in the model. The partial F-statistic is then calculated for each variable in the model (e.g., F(cab space | ln HP, top speed, weight)). The variable with the smallest partial F-statistic, F_min, is examined; in this case it belongs to cab space. Because F_min is not significant here, the variable is dropped from the model: cab space is the first variable to be removed, as shown in Table 9.16. On the next pass, the variable with the smallest partial F-statistic is top speed, which again is not significant. Thus, top speed becomes the second variable omitted from the model. No other variables are removed, so the backward elimination method prefers the same model as the forward selection method.

Table 9.16 Backward elimination results

c09t016

9.10.3 The Stepwise Selection Procedure Applied to the Gas Mileage Data Set

The stepwise selection procedure is a modification of the forward selection procedure, where the algorithm checks at each step whether all variables currently in the model are still significant. In this example, each variable that had been entered remained significant when the other variables were also entered. Thus, for this example, the results were the same as for the forward selection procedure, with the same model summaries as shown in Table 9.15.

9.10.4 Best Subsets Procedure Applied to the Gas Mileage Data Set

Table 9.17 provides the results from Minitab's application of the best subsets procedure on the gas mileage data set. The predictor variable names are given on the upper right, formatted vertically. Each horizontal line in the table represents a separate model, with an "X" shown under each predictor included in that model. The best subsets procedure reports the two best models with p = 1 predictor, the two best models with p = 2 predictors, and so on. Thus, the first model has only weight; the second model has only ln HP; the third model has ln HP and weight; the fourth model has top speed and weight; and so on.

Table 9.17 Best subsets results for Gas Mileage data set ("best" model highlighted)

c09t017

Four model selection criteria are reported for each model: R², adjusted R², Mallows' Cp, and s.

9.10.5 Mallows' Cp Statistic

We now discuss the Cp statistic, developed by C. L. Mallows.14 Mallows' Cp statistic takes the form:

Cp = SSE_p / MSE_full − [n − 2(p + 1)]

where p represents the number of predictors in the current (working) model, SSE_p represents the error sum of squares of the model with p predictors, and MSE_full represents the mean square error of the full model; that is, the model with all predictors entered.

For a model that fits well, it can be shown15 that E(Cp) = p + 1. Thus, we would expect the value of Cp for a well-fitting model to take a value not far from p + 1. However, models that show a considerable lack of fit will take values of Cp above (and sometimes far above) p + 1. The full model, with all variables entered, always has Cp = p + 1 exactly, but is often not the best model.
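A small sketch of the computation in R; sse_p, mse_full, n, and p are hypothetical variable names standing for the quantities just defined:

  mallows_cp <- function(sse_p, mse_full, n, p) {
    # Cp = SSE_p / MSE_full - [n - 2(p + 1)], with p the number of predictors in the working model
    sse_p / mse_full - (n - 2 * (p + 1))
  }
  # In practice, leaps::regsubsets() reports Cp for each candidate model via summary(bs)$cp.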

It is useful to plot the value of Mallows' Cp against the number of predictors, p. Figure 9.14 shows such a plot for the gas mileage data set regression. (To increase granularity, the model with the largest Cp value is omitted.) One heuristic for choosing the best model is to select the model where the value of Cp first approaches or crosses the line Cp = p + 1, as p increases.

c09f014

Figure 9.14 A plot of Mallows' Cp against the number of predictors, p, can help select the best model.

Consider Figure 9.14. The general trend is for the values of Cp to fall as p increases, as can be seen in the figure. As we reach p = 2, the Cp value is approaching the line Cp = p + 1. This represents the model chosen by the other three variable selection methods.

c09f015

Figure 9.15 Normal probability plot shows skewness.

Finally, when we reach p = 3, one of the models has a Cp value that falls below the line Cp = p + 1. Therefore, the Mallows' Cp heuristic would be to select this model as the working model. This model contains ln HP, top speed, and weight as predictors.

Thus, we have two candidate working models:

Model A: estimated ln MPG = b0 + b1(ln HP) + b2(weight)
Model B: estimated ln MPG = b0 + b1(ln HP) + b2(top speed) + b3(weight)

Model A is supported by forward selection, backward elimination, and the stepwise procedure, and was nearly favored by best subsets. Model B is preferred by best subsets, but only barely. Let us mention that one need not report only one model as a final model. Two or three models may be carried forward, and input sought from managers about which model best addresses the business or research problem. Nevertheless, it is often convenient to have a single "working model" selected, because of the complexity of model building in the multivariate environment. Recall also the principle of parsimony, which states: all things being equal, choose the simpler model. Because of parsimony, and because Model A did so well with most of the variable selection methods, we recommend considering Model A to be our working model. The regression results for Model A are shown in Table 9.18.

Table 9.18 Regression results for model chosen by variable selection criteria

c09t018

Checking for the regression assumptions, each of the graphs in Figure 9.16 shows an outlier, the Subaru Loyale, which got lower gas mileage than expected, given its predictor values. Table 9.19 shows the regression results when this outlier is omitted. The precision of the regression is improved; for example, the standard error of the estimate, s, has decreased by 6.6%.

c09f016

Figure 9.16 Outlier uncovered.

Table 9.19 Regression results improved a bit with outlier removed

c09t019

Figure 9.17 shows the plots for validation of the regression assumptions. With some slight right-skewness in the residuals, and some curvature in the residuals versus fitted values plot, these are not as tight as we might wish; in the exercises, we will try to deal with these issues. On the whole, however, we are satisfied that our regression model provides a decent summary of the linear relationship between ln MPG and the predictors. Nevertheless, there remains the problem of moderate multicollinearity, as shown by the VIF values close to 5 for the predictors. We therefore turn to a method designed to deal with multicollinearity: principal components analysis.

c09f017

Figure 9.17 Regression assumptions.
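The VIF values mentioned above can be checked with the vif() function from the car package (listed in the R references); a sketch, using the hypothetical mileage data frame and Model A:

  library(car)
  model_a <- lm(ln_mpg ~ ln_hp + weight, data = mileage)
  vif(model_a)   # variance inflation factors; values approaching 5 indicate moderate multicollinearity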

9.11 Using the Principal Components as Predictors in Multiple Regression

Principal components16 may be used as predictors in a multiple regression model. Each record has a component value for each principal component, as shown in the rightmost four columns in Table 9.20. These component values may be used as predictors in a regression model, or, indeed, any analytical model.

Table 9.20 Each record has component weight values for each component

Make/Model MPG ln HP ln MPG Cab Space_z Horsepower_z Top Speed_z Weight_z PrinComp1 PrinComp2 PrinComp3 PrinComp4
GM/GeoMetroXF1 65.400 3.892 4.181 −0.442 −1.199 −1.169 −1.648 −0.770 −0.246 −1.454 2.449
GM/GeoMetro 56.000 4.007 4.025 −0.307 −1.093 −1.098 −1.341 −0.805 −0.167 −1.081 1.896
GM/GeoMetroLSI 55.900 4.007 4.024 −0.307 −1.093 −1.098 −1.341 −0.805 −0.167 −1.081 1.896
SuzukiSwift 49.000 4.248 3.892 −0.307 −0.829 −0.528 −1.341 −0.173 −0.081 −1.518 0.115
DaihatsuCharade 46.500 3.970 3.839 −0.307 −1.128 −1.169 −1.341 −0.885 −0.177 −1.026 2.094
GM/GeoSprintTurbo 46.200 4.248 3.833 −0.442 −0.829 −0.528 −1.341 −0.199 −0.229 −1.450 0.079
GM/GeoSprint 45.400 4.007 3.816 −0.307 −1.093 −1.098 −1.341 −0.805 −0.167 −1.081 1.896
HondaCivicCRXHF 59.200 4.127 4.081 −2.202 −0.970 −1.027 −1.034 −1.229 −2.307 0.302 1.012
HondaCivicCRXHF 53.300 4.127 3.976 −2.202 −0.970 −1.027 −1.034 −1.229 −2.307 0.302 1.012
DaihatsuCharade 43.400 4.382 3.770 −0.217 −0.653 −0.386 −1.034 −0.118 −0.039 −1.189 −0.246
SubaruJusty 41.100 4.290 3.716 −0.442 −0.776 −0.671 −1.034 −0.473 −0.328 −0.860 0.686
HondaCivicCRX 40.900 4.522 3.711 −2.202 −0.442 0.042 −1.034 −0.027 −2.145 −0.528 −1.953

First, the predictors from the original data set are all standardized, using z-scores. Then principal components analysis is performed on the standardized predictors, with varimax rotation. The variance-explained results are shown in Table 9.21, and an R sketch of these steps follows the table. The varimax-rotated solution attains nearly 100% of the variance explained with three components. We therefore extract three components, to be used as predictors for our regression model.17

Table 9.21 Percentage of variance explained for the rotated solution for three components is nearly 100%

Total Variance Explained
Initial Eigenvalues Extraction Sums of Squared Loadings Rotation Sums of Squared Loadings
Component Total % of Variance Cumulative % Total % of Variance Cumulative % Total % of Variance Cumulative %
1 2.689 67.236 67.236 2.689 67.236 67.236 2.002 50.054 50.054
2 1.100 27.511 94.747 1.100 27.511 94.747 1.057 26.436 76.490
3 0.205 5.137 99.884 0.205 5.137 99.884 0.935 23.386 99.876
4 0.005 0.116 100.000 0.005 0.116 100.000 0.005 0.124 100.000

Extraction method: Principal component analysis.
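A sketch of these steps in R, using the principal() function from the psych package (listed in the R references) and the hypothetical mileage data frame; the column names are assumptions:

  library(psych)
  predictors_z <- scale(mileage[, c("cab_space", "horsepower", "top_speed", "weight")])  # z-scores
  pca <- principal(predictors_z, nfactors = 3, rotate = "varimax", scores = TRUE)
  pca$loadings                               # rotated component weights (compare Table 9.22)
  pc_data <- data.frame(ln_mpg = mileage$ln_mpg, pca$scores)
  pc_fit  <- lm(ln_mpg ~ ., data = pc_data)  # regression of ln MPG on the component scores
  summary(pc_fit)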

Table 9.22 shows the unrotated and rotated component weights, with weights less than 0.5 hidden, for clarity. Brief component profiles for the rotated solution are as follows:

  • Component 1: Muscle. This component combines top speed and horsepower.
  • Component 2: Roominess. The only variable is cab space.
  • Component 3: Weight. The only variable is weight.

Table 9.22 Component weights, for the unrotated and rotated solutions

c09t022

Regression of ln MPG on the three principal components is performed, with the results shown in Table 9.23 and the residual plots shown in Figure 9.15. Note that the multicollinearity problem has been solved, because the VIF statistics all equal a perfect 1.0. However, the normal probability plot of the residuals shows concave curvature, indicating right-skewness. We therefore apply the following Box–Cox transformation to MPG, to reduce the skewness:

MPG_BC = (MPG^0.75 − 1) / 0.75
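A sketch of this transformation in R, reusing the pca object and mileage data frame from the earlier sketch; the MASS package's boxcox() function (MASS is listed in the R references) could instead be used to choose the Box–Cox parameter from the data:

  lambda <- 0.75
  bc_data <- data.frame(mpg_bc = (mileage$MPG^lambda - 1) / lambda,  # Box-Cox transformed response
                        pca$scores)                                  # principal component scores
  bc_fit  <- lm(mpg_bc ~ ., data = bc_data)
  summary(bc_fit)
  # MASS::boxcox(lm(MPG ~ ., data = data.frame(MPG = mileage$MPG, pca$scores)))
  # plots the profile likelihood for lambda, if a data-driven choice is preferred.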

The residual plots for the resulting regression of the Box–Cox-transformed MPG on the principal components are shown in Figure 9.18. The skewness has mostly been dealt with. These plots are not perfect. Specifically, there appears to be a systematic difference for the set of vehicles near the end of the data set in observation order. A glance at the data set indicates these are luxury cars, such as a Rolls–Royce and a Jaguar, which may follow a somewhat different gas mileage model. Overall, we find the plots indicate broad validation of the regression assumptions. Remember, in the world of dirty data, perfect validation of the assumptions may be elusive.

Table 9.23 Regression using principal components solves the multicollinearity problem

c09t023
c09f018

Figure 9.18 Observation order shows luxury cars may be different.

The results for the regression of the Box–Cox-transformed MPG on the principal components are shown in Table 9.24. Note the following:

  • Multicollinearity remains vanquished, with all VIF = 1.0.
  • The proportion of variance explained, shown in Table 9.24, is not quite as good as the 93.5% for the model not accounting for multicollinearity.
  • The group of the last four unusual observations, all high leverage points, consists of a Mercedes, a Jaguar, a BMW, and a Rolls–Royce. The Rolls–Royce is the most extreme outlier.

Table 9.24 Regression of the Box–Cox-transformed MPG (λ = 0.75) on the principal components

c09t024

In the exercises, we invite the analyst to further improve this model, either by tweaking the Box–Cox transformation, or through an indicator variable for the luxury cars, or some other means.

R References

  1. Harrell FE Jr. rms: Regression Modeling Strategies. R package version 4.1-3; 2014. http://CRAN.R-project.org/package=rms.
  2. Fox J, Weisberg S. An R Companion to Applied Regression. 2nd ed. Thousand Oaks, CA: Sage; 2011. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion.
  3. Ligges U, Mächler M. Scatterplot3d – an R package for visualizing multivariate data. Journal of Statistical Software 2003;8(11):1–20.
  4. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2012. ISBN 3-900051-07-0. http://www.R-project.org/.
  5. Revelle W. psych: Procedures for Personality and Psychological Research. R package version 1.4.2. Evanston, IL: Northwestern University; 2013. http://CRAN.R-project.org/package=psych.
  6. Lumley T (using Fortran code by Miller A). leaps: Regression Subset Selection. R package version 2.9; 2009. http://CRAN.R-project.org/package=leaps.
  7. Venables WN, Ripley BD. Modern Applied Statistics with S. 4th ed. New York: Springer; 2002. ISBN 0-387-95457-0.

Exercises

Clarifying The Concepts

1. Indicate whether the following statements are true or false. If the statement is false, alter it so that the statement becomes true.

  1. If we would like to approximate the relationship between a response and two continuous predictors, we would need a plane.
  2. In linear regression, while the response variable is typically continuous, it may be categorical as well.
  3. In general, for a multiple regression with m predictor variables, we would interpret coefficient bi as follows: "the estimated change in the response variable for a unit increase in variable xi is bi."
  4. In multiple regression, the residual is represented by the vertical distance between the data point and the regression plane or hyperplane.
  5. Whenever a new predictor variable is added to the model, the value of R² always goes up.
  6. The alternative hypothesis in the F-test for the overall regression asserts that the regression coefficients all differ from zero.
  7. The standard error of the estimate is a valid measure of the usefulness of the regression, without reference to an inferential model (i.e., the assumptions need not be relevant).
  8. If we were to use only the categorical variables as predictors, then we would have to use analysis of variance and could not use linear regression.
  9. For use in regression, a categorical variable with k categories must be transformed into a set of k indicator variables.
  10. The first sequential sum of squares is exactly the value for SSR from the simple linear regression of the response on the first predictor.
  11. The VIF has a minimum of zero, but no upper limit.
  12. A variable that has been entered into the model early in the forward selection process will remain significant, once other variables have been entered into the model.
  13. The variable selection criteria for choosing the best model account for the multicollinearity among the predictors.
  14. The VIFs for principal components using varimax rotation always equal 1.0.

2. Clearly explain why s and adjusted R² are preferable to R² as measures for model building.

3. Explain the difference between the t-test and the F-test for assessing the significance of the predictors.

4. Construct indicator variables for the categorical variable class, which takes four values, freshman, sophomore, junior, and senior.

5. When using indicator variables, explain the meaning and interpretation of the indicator variable coefficients, graphically and numerically.

6. Discuss the concept of the level of significance α. At what value should it be set? Who should decide the value of α? What if the observed p-value is close to α? Describe a situation where a particular p-value will lead to two different conclusions, given two different values for α.

7. Explain what it means when adjusted R² is much less than R².

8. Explain the difference between the sequential sums of squares and the partial sums of squares. For which procedures do we need these statistics?

9. Explain some of the drawbacks of a set of predictors with high multicollinearity.

10. Which statistics report the presence of multicollinearity in a set of predictors? Explain, using the formula, how this statistic works. Also explain the effect that large and small values of this statistic will have on the standard error of the coefficient.

11. Compare and contrast the effects that multicollinearity has on the point and interval estimates of the response versus those of the predictor coefficients.

12. Describe the differences and similarities among the forward selection procedure, the backward elimination procedure, and the stepwise procedure.

13. Describe how the best subsets procedure works. Why not always use the best subsets procedure?

14. Describe the behavior of Mallows' Cp statistic, including the heuristic for choosing the best model.

15. Suppose we wished to limit the number of predictors in the regression model to a lesser number than those obtained using the default settings in the variable selection criteria. How should we alter each of the selection criteria? Now, suppose we wished to increase the number of predictors. How then should we alter each of the selection criteria?

16. Explain the circumstances under which the value for R² would reach 100%. Now explain how the p-value for any test statistic could reach zero.

Working With The Data

For Exercises 17–27, consider the multiple regression output from SPSS in Table 9.25, using the nutrition data set, found on the book web site, www.DataMiningConsultant.com.

Table 9.25 Regression results for Exercises 17–27

c09t025

17. What is the response? What are the predictors?

18. What is the conclusion regarding the significance of the overall regression? How do you know? Does this mean that all of the predictors are important? Explain.

19. What is the typical error in prediction? (Hint: This may take a bit of digging.)

20. How many foods are included in the sample?

21. How are we to interpret the value of b0, the coefficient for the constant term? Is this coefficient significantly different from zero? Explain how this makes sense.

22. Which of the predictors probably does not belong in the model? Explain how you know this. What might be your next step after viewing these results?

23. Suppose we omit cholesterol from the model and rerun the regression. Explain what will happen to the value of R².

24. Which predictor is negatively associated with the response? Explain how you know this.

25. Discuss the presence of multicollinearity. Evaluate the strength of evidence for the presence of multicollinearity. On the basis of this, should we turn to principal components analysis?

26. Clearly and completely express the interpretation for the coefficient for sodium.

27. Suppose a certain food was predicted to have 60 calories fewer than it actually has, based on its content of the predictor variables. Would this be considered unusual? Explain specifically how you would determine this.

For Exercises 28–29, next consider the multiple regression output from SPSS in Table 9.26. Three predictor variables have been added to the analysis in Exercises 17–27: saturated fat, monounsaturated fat, and polyunsaturated fat.

Table 9.26 Regression results for Exercises 28–29

Coefficientsa
Unstandardized coefficients Standardized Coefficients Collinearity Statistics
Model B Std. Error Beta t Sig. Tolerance VIF
1 (Constant) −0.158 0.772 −0.205 0.838
PROTEIN 4.278 0.088 0.080 48.359 0.000 0.457 2.191
FAT 9.576 1.061 0.585 9.023 0.000 0.000 3379.867
CHOLEST 1.539E−02 0.008 0.003 1.977 0.048 0.420 2.382
CARBO 3.860 0.014 0.558 285.669 0.000 0.325 3.073
IRON −1.672 0.314 −0.010 −5.328 0.000 0.377 2.649
SODIUM 5.183E−03 0.001 0.006 3.992 0.000 0.555 1.803
SAT_FAT −1.011 1.143 −0.020 −0.884 0.377 0.002 412.066
MONUNSAT −0.974 1.106 −0.025 −0.881 0.379 0.002 660.375
POLUNSAT −0.600 1.111 −0.013 −0.541 0.589 0.002 448.447

a Dependent variable: CALORIES.

28. Evaluate the strength of evidence for the presence of multicollinearity.

29. On the basis of this, should we turn to principal components analysis?

For Exercises 30–37, consider the multiple regression output from SPSS in Table 9.27, using the New York data set, found on the book web site, www.DataMiningConsultant.com. The data set contains demographic information about a set of towns in New York state. The response “MALE_FEM” is the number of males in the town for every 100 females. The predictors are the percentage under the age of 18, the percentage between 18 and 64, and the percentage over 64 living in the town (all expressed in percents such as “57.0”), along with the town's total population.

Table 9.27 Regression results for Exercises 30–37

c09t027

30. Note that the variable PCT_O64 was excluded. Explain why this variable was automatically excluded from the analysis by the software. (Hint: Consider the analogous case of using too many indicator variables to define a particular categorical variable.)

31. What is the conclusion regarding the significance of the overall regression?

32. What is the typical error in prediction?

33. How many towns are included in the sample?

34. Which of the predictors probably does not belong in the model? Explain how you know this. What might be your next step after viewing these results?

35. Suppose we omit TOT_POP from the model and rerun the regression. Explain what will happen to the value of R².

36. Discuss the presence of multicollinearity. Evaluate the strength of evidence for the presence of multicollinearity. On the basis of this, should we turn to principal components analysis?

37. Clearly and completely express the interpretation for the coefficient for PCT_U18. Discuss whether this makes sense.

Hands-On Analysis

For Exercises 38–41, use the nutrition data set, found on the book web site, www.DataMiningConsultant.com.

38. Build the best multiple regression model you can for the purposes of predicting calories, using all the other variables as the predictors. Do not worry about whether the predictor coefficients are stable or not. Compare and contrast the results from the forward selection, backward elimination, and stepwise variable selection procedures.

39. Apply the best subsets procedure, and compare against the previous methods.

40. (Extra credit). Write a script that will perform all possible regressions. Did the variable selection algorithms find the best regression?

41. Next, build the best multiple regression model you can for the purposes both of predicting the response and of profiling the predictors' individual relationship with the response. Make sure you account for multicollinearity.

For Exercises 42–44, use the New York data set, found on the book web site.

42. Build the best multiple regression model you can for the purposes of predicting the response, using the gender ratio as the response, and all the other variables as the predictors. Compare and contrast the results from the forward selection, backward elimination, and stepwise variable selection procedures.

43. Apply the best subsets procedure, and compare against the previous methods.

44. Perform all possible regressions. Did the variable selection algorithms find the best regression?

For Exercises 45–49, use the crash data set, found on the book web site.

45. Build the best multiple regression model you can for the purposes of predicting head injury severity, using all the other variables as the predictors.

46. Determine which variables must be made into indicator variables.

47. Determine which variables might be superfluous.

48. Build two parallel models, one where we account for multicollinearity, and another where we do not. For which purposes may each of these models be used?

49. Continuing with the crash data set, combine the four injury measurement variables into a single variable, defending your choice of combination function. Build the best multiple regression model you can for the purposes of predicting injury severity, using all the other variables as the predictors. Build two parallel models, one where we account for multicollinearity, and another where we do not. For which purposes may each of these models be used?

For Exercises 50–51, see if you can improve on the regression model of ln MPG on ln HP and weight.

50. Use a Box–Cox transformation to try to eliminate the skewness in the normal probability plot.

51. Do you see some curvature in the residuals versus fitted values plot? Produce a plot of the residuals against each of the predictors. Any curvature? Add a quadratic term for one of the predictors (e.g., weight²) to the model, and see if this helps.

52. Using the four criteria from Chapter 5, determine the best number of principal components to extract for the gas mileage data.

53. Take a shot at improving the regression of the Box–Cox-transformed MPG on the principal components. For example, you may wish to tweak the Box–Cox transformation, or you may wish to use an indicator variable for the luxury cars. Using whatever means you can bring to bear, obtain your best model that deals with multicollinearity and validates the regression assumptions.
