Chapter 14

Regression: Linear and Multiple

IN THIS CHAPTER

Summarizing a relationship

Working with regression

Hypothesis testing and regression

Balancing many relationships

One of the main things you do when you work with statistics is make predictions. The idea is to take data on one or more variables, and use these data to predict a value of another variable. To do this, you have to understand how to summarize relationships among variables, and to test hypotheses about those relationships.

In this chapter, I introduce regression, a statistical way to do just that. Regression also enables you to use the details of relationships to make predictions. First, I show you how to analyze the relationship between one variable and another. Then I show you how to analyze the relationship between a variable and two others. These analyses involve a good bit of calculation, and Excel is more than equal to the task.

The Plot of Scatter

Sahutsket University is an exciting, dynamic institution. Every year, the school receives thousands of applications. One challenge the Admissions Office faces is this: Applicants want the Office to predict what their GPAs (grade-point averages on a 4.0 scale) will be if they attend Sahutsket.

What’s the best prediction? Without knowing anything about an applicant, and knowing only its own students’ GPAs, the answer is clear: It’s the average GPA at Sahutsket U. Regardless of who the applicant is, that’s all the Admissions Office can say if its knowledge is limited.

With more knowledge about the students and about the applicants, a more accurate prediction becomes possible. For example, if Sahutsket keeps records on its students’ total SAT scores, the Admissions Office can match up each student’s GPA with his or her SAT score and see if the two pieces of data are somehow related. If they are, an applicant can supply his or her SAT score, and the Admissions Office can use that score to help make a prediction.

Figure 14-1 shows the GPA-SAT matchup in a graphic way. Because the points are scattered, it’s called a scatterplot. By convention, the vertical axis (the y-axis) represents what you’re trying to predict. That’s also called the dependent variable or the y-variable. In this case, that’s GPA. Also by convention, the horizontal axis (the x-axis) represents what you’re using to make your prediction. That’s also called the independent variable or the x-variable. Here, that’s SAT.

image

FIGURE 14-1: SATs and GPAs in the Sahutsket University student body.

Each point in the graph represents an individual student’s GPA and SAT. In a real scatterplot of a university student body, you’d see many more points than I show here. The general tendency of the set of points seems to be that high SAT scores are associated with high GPAs and low SAT scores are associated with low GPAs.

I singled out one of the points. It shows a Sahutsket student with an SAT score of 1,000 and a GPA of 2.5. I also show the average GPA to give you a sense that knowing the GPA-SAT relationship provides an advantage over knowing only the mean.

How do you make that advantage work for you? You start by summarizing the relationship between SAT and GPA. The summary is a line through the points. How and where do you draw the line?

I get to that in a minute. First, I have to tell you about lines in general.

Graphing Lines

In the world of mathematics, a line is a way to picture a relationship between an independent variable (x) and a dependent variable (y). In this relationship,

y = 4 + 2x

if I supply a value for x, I can figure out the corresponding value for y. The equation says to take the x-value, multiply by 2, and then add 4.

If x = 1, for example, y = 6. If x = 2, y = 8. Table 14-1 shows a number of x-y pairs in this relationship, including the pair in which x = 0.

TABLE 14-1 x-y Pairs in y = 4 + 2x

x    y
0    4
1    6
2    8
3    10
4    12
5    14
6    16

Figure 14-2 shows these pairs as points on a set of x-y axes, along with a line through the points. Each time I list an x-y pair in parentheses, the x-value is first.

image

FIGURE 14-2: The graph for y = 4 + 2x.

As the figure shows, the points fall neatly onto the line. The line graphs the equation y = 4 + 2x. In fact, whenever you have an equation like this, where x isn’t squared or cubed or raised to any power higher than 1, you have what mathematicians call a linear equation. (If x is raised to a higher power than 1, you connect the points with a curve, not a line.)

remember A couple of things to keep in mind about a line: You can describe a line in terms of how slanted it is, and where it runs into the y-axis.

The how-slanted-it-is part is called the slope. The slope tells you how much y changes when x changes by one unit. In the line in Figure 14-2, when x changes by one (from 4 to 5, for example), y changes by two (from 12 to 14).

The where-it-runs-into-the-y-axis part is called the y-intercept (or sometimes just the intercept). That’s the value of y when x = 0. In Figure 14-2, the y-intercept is 4.

You can see these numbers in the equation. The slope is the number that multiplies x, and the intercept is the number you add to that product. In general,

y = a + bx

where a represents the intercept and b represents the slope.

The slope can be a positive number, a negative number, or zero. In Figure 14-2, the slope is positive. If the slope is negative, the line is slanted in a direction opposite to what you see in Figure 14-2: y decreases as x increases. If the slope is zero, the line is parallel to the horizontal axis, and y doesn’t change as x changes.

The same applies to the intercept — it can be a positive number, a negative number, or zero. If the intercept is positive, the line crosses the y-axis above the x-axis. If the intercept is negative, the line crosses the y-axis below the x-axis. If the intercept is zero, the line passes through the point where the y-axis meets the x-axis, the point called the origin.

And now, back to what I was originally talking about.

Regression: What a Line!

I mention earlier that a line is the best way to summarize the relationship in the scatterplot in Figure 14-1. It’s possible to draw an infinite number of straight lines through the scatterplot. Which one best summarizes the relationship?

Intuitively, the “best fitting” line ought to be the one that goes through the maximum number of points and isn’t too far away from the points it doesn’t go through. For statisticians, that line has a special property: If you draw that line through the scatterplot, then draw distances (in the vertical direction) between the points and the line, and then square those distances and add them up, the sum of the squared distances is a minimum.

Statisticians call this line the regression line, and indicate it as

y’ = a + bx

Each y’ is a point on the line. It represents the best prediction of y for a given value of x.

To figure out exactly where this line is, you calculate its slope and its intercept. For a regression line, the slope and intercept are called regression coefficients.

The formulas for the regression coefficients are pretty straightforward. For the slope, the formula is

b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²

The intercept formula is

a = ȳ - bx̄

I illustrate with an example. To keep the numbers manageable and comprehensible, I use a small sample instead of the thousands of students you’d find in a scatterplot of an entire university student body. Table 14-2 shows a sample of data from 20 Sahutsket University students.

TABLE 14-2 SAT Scores and GPAs for 20 Sahutsket University Students

Student              SAT        GPA
1                    990        2.2
2                    1150       3.2
3                    1080       2.6
4                    1100       3.3
5                    1280       3.8
6                    990        2.2
7                    1110       3.2
8                    920        2.0
9                    1000       2.2
10                   1200       3.6
11                   1000       2.1
12                   1150       2.8
13                   1070       2.2
14                   1120       2.1
15                   1250       2.4
16                   1020       2.2
17                   1060       2.3
18                   1550       3.9
19                   1480       3.8
20                   1010       2.0
Mean                 1126.5     2.705
Variance             26171.32   0.46
Standard Deviation   161.78     0.82

For this set of data, the slope of the regression line is

b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² = 1703.35 / 497255.08 ≈ 0.0034

The intercept is

a = ȳ - bx̄ = 2.705 - (0.003426)(1126.5) ≈ -1.154

So the equation of the best-fitting line through these 20 points is

y’ = -1.154 + 0.0034x

or in terms of GPAs and SATs:

Predicted GPA = -1.154 + 0.0034 × SAT

Using regression for forecasting

Based on this sample and this regression line, you can take an applicant’s SAT score — say, 1230 — and predict the applicant’s GPA:

Predicted GPA = -1.154 + (0.003426)(1230) ≈ 3.06

Without this rule, the only prediction is the mean GPA, 2.705.
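If you want to double-check the arithmetic outside the worksheet, the slope, intercept, and forecast are easy to reproduce. Here’s a quick sketch in Python using the Table 14-2 data (the variable names are mine, not Excel’s):

```python
# SAT scores and GPAs for the 20 students in Table 14-2
sat = [990, 1150, 1080, 1100, 1280, 990, 1110, 920, 1000, 1200,
       1000, 1150, 1070, 1120, 1250, 1020, 1060, 1550, 1480, 1010]
gpa = [2.2, 3.2, 2.6, 3.3, 3.8, 2.2, 3.2, 2.0, 2.2, 3.6,
       2.1, 2.8, 2.2, 2.1, 2.4, 2.2, 2.3, 3.9, 3.8, 2.0]

n = len(sat)
mean_x = sum(sat) / n      # 1126.5
mean_y = sum(gpa) / n      # 2.705

# Slope: sum of deviation cross-products over sum of squared x-deviations
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sat, gpa))
sxx = sum((x - mean_x) ** 2 for x in sat)
b = sxy / sxx              # slope, roughly 0.0034
a = mean_y - b * mean_x    # intercept, roughly -1.154

# Forecast for an applicant with an SAT of 1230
predicted = a + b * 1230   # roughly 3.06
```

The same numbers come out of Excel’s SLOPE, INTERCEPT, and FORECAST.LINEAR functions, which I describe later in this chapter.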

Variation around the regression line

In Chapter 5, I describe how the mean doesn’t tell the whole story about a set of data. You have to show how the scores vary around the mean. For that reason, I introduce the variance and standard deviation.

You have a similar situation here. To get the full picture of the relationship in a scatterplot, you have to show how the scores vary around the regression line. Here, I introduce the residual variance and standard error of estimate, which are analogous to the variance and the standard deviation.

The residual variance is sort of an average of the squared deviations of the observed y-values around the predicted y-values. Each deviation of a data point from a predicted point (y - y’) is called a residual, hence the name. The formula is

s²yx = Σ(y - y’)² / (N - 2)

I say sort of because the denominator is N-2 rather than N. The reason for the -2 is beyond the scope of this discussion. As I discuss earlier, the denominator of a variance estimate is degrees of freedom (df), and that concept comes in handy in a little while.

The standard error of estimate is

syx = √( Σ(y - y’)² / (N - 2) )

To show you how the residual variance and the standard error of estimate play out for the data in the example, here’s Table 14-3. This table extends Table 14-2 by showing the predicted GPA for each SAT score.

TABLE 14-3 SAT Scores, GPAs, and Predicted GPAs for 20 Sahutsket University Students

Student              SAT        GPA     Predicted GPA
1                    990        2.2     2.24
2                    1150       3.2     2.79
3                    1080       2.6     2.55
4                    1100       3.3     2.61
5                    1280       3.8     3.23
6                    990        2.2     2.24
7                    1110       3.2     2.65
8                    920        2.0     2.00
9                    1000       2.2     2.27
10                   1200       3.6     2.96
11                   1000       2.1     2.27
12                   1150       2.8     2.79
13                   1070       2.2     2.51
14                   1120       2.1     2.68
15                   1250       2.4     3.13
16                   1020       2.2     2.34
17                   1060       2.3     2.48
18                   1550       3.9     4.16
19                   1480       3.8     3.92
20                   1010       2.0     2.31
Mean                 1126.5     2.705
Variance             26171.32   0.46
Standard Deviation   161.78     0.82

As the table shows, sometimes the predicted GPA is pretty close, and sometimes it’s not. One predicted value (4.16) is impossible.

For these data, the residual variance is

s²yx = Σ(y - y’)² / (N - 2) = 2.91 / 18 ≈ 0.16

The standard error of estimate is

syx = √0.16 ≈ 0.40
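As a cross-check, both statistics can be computed directly from the raw data. A minimal Python sketch (the coefficients are recomputed from scratch so the block stands alone; the names are mine):

```python
import math

# SAT scores and GPAs for the 20 students in Table 14-2
sat = [990, 1150, 1080, 1100, 1280, 990, 1110, 920, 1000, 1200,
       1000, 1150, 1070, 1120, 1250, 1020, 1060, 1550, 1480, 1010]
gpa = [2.2, 3.2, 2.6, 3.3, 3.8, 2.2, 3.2, 2.0, 2.2, 3.6,
       2.1, 2.8, 2.2, 2.1, 2.4, 2.2, 2.3, 3.9, 3.8, 2.0]

n = len(sat)
mean_x, mean_y = sum(sat) / n, sum(gpa) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sat, gpa))
sxx = sum((x - mean_x) ** 2 for x in sat)
b = sxy / sxx
a = mean_y - b * mean_x

# Residuals: observed y minus predicted y'
residuals = [y - (a + b * x) for x, y in zip(sat, gpa)]

resid_var = sum(r ** 2 for r in residuals) / (n - 2)  # divide by N-2, not N
std_err_est = math.sqrt(resid_var)
```

The residual variance comes out near 0.16 and the standard error of estimate near 0.40, matching the values above.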

If the residual variance and the standard error of estimate are small, the regression line is a good fit to the data in the scatterplot. If the residual variance and the standard error of estimate are large, the regression line is a poor fit.

What’s “small”? What’s “large”? What’s a “good” fit?

Keep reading.

Testing hypotheses about regression

The regression equation you are working with,

y’ = a + bx

summarizes a relationship in a scatterplot of a sample. The regression coefficients a and b are sample statistics. You can use these statistics to test hypotheses about population parameters, and that’s what you do in this section.

The regression line through the population that produces the sample (like the entire Sahutsket University student body, past and present) is the graph of an equation that consists of parameters, rather than statistics. By convention, remember, Greek letters stand for parameters, so the regression equation for the population is

y = α + βx + ε

The first two Greek letters on the right are α (alpha) and β (beta), the equivalents of a and b. What about that last one? It looks something like the Greek equivalent of e. What’s it doing there?

That last term is the Greek letter epsilon. It represents error in the population. In a way, error is an unfortunate term. It’s a catchall for “things you don’t know or things you have no control over.” Error is reflected in the residuals — the deviations from the predictions. The more you understand about what you’re measuring, the more you decrease the error.

You can’t measure the error in the relationship between SAT and GPA, but it’s lurking there. Someone might score low on the SAT, for example, and then go on to have a wonderful college career with a higher-than-predicted GPA. On a scatterplot, this person’s SAT-GPA point looks like an error in prediction. As you find out more about that person, you might discover that he or she was sick on the day of the SAT, and that explains the “error.”

You can test hypotheses about α, β, and ε, and that’s what you do in the upcoming subsections.

Testing the fit

You begin with a test of how well the regression line fits the scatterplot. This is a test of ε, the error in the relationship.

The objective is to decide whether the line really does represent a relationship between the variables. It’s possible that what looks like a relationship is just due to chance and the equation of the regression line doesn’t mean anything (because the amount of error is overwhelming) — or it’s possible that the variables are strongly related.

These possibilities are testable, and you set up hypotheses to test them:

H0: No real relationship

H1: Not H0

Although those hypotheses make nice light reading, they don’t set up a statistical test. To set up the test, you have to consider the variances. To consider the variances, you start with the deviations. Figure 14-3 focuses on one point in a scatterplot and its deviation from the regression line (the residual) and from the mean of the y-variable. It also shows the deviation between the regression line and the mean.

image

FIGURE 14-3: The deviations in a scatterplot.

As the figure shows, the distance between the point and the regression line and the distance between the regression line and the mean add up to the distance between the point and the mean:

(y - ȳ) = (y - y’) + (y’ - ȳ)

This sets the stage for some other important relationships.

Start by squaring each deviation. That gives you (y - y’)², (y’ - ȳ)², and (y - ȳ)². If you add up each of the squared deviations, you have

Σ(y - y’)²

You just saw this one. That’s the numerator for the residual variance. It represents the variability around the regression line — the “error” I mention earlier. In the terminology of Chapter 12, the numerator of a variance is called a Sum of Squares, or SS. So this is SSResidual.

Σ(y’ - ȳ)²

This one is new. The deviation (y’ - ȳ) represents the gain in prediction due to using the regression line rather than the mean. The sum reflects this gain, and is called SSRegression.

Σ(y - ȳ)²

I show you this one in Chapter 5 — although I use x rather than y. That’s the numerator of the variance of y. In Chapter 12 terms, it’s the numerator of total variance. This one is SSTotal.

This relationship holds among these three sums:

SSRegression + SSResidual = SSTotal

Each one is associated with a value for degrees of freedom — the denominator of a variance estimate. As I point out in the preceding section, the df for SSResidual is N-2. The df for SSTotal is N-1. (See Chapters 5 and 12.) As with the SS, the degrees of freedom add up:

dfRegression + dfResidual = dfTotal

This leaves one degree of freedom for Regression.

Where is this all headed, and what does it have to do with hypothesis testing? Well, since you asked, you get variance estimates by dividing SS by df. Each variance estimate is called a Mean Square, abbreviated MS (again, see Chapter 12):

MSRegression = SSRegression / dfRegression
MSResidual = SSResidual / dfResidual
MSTotal = SSTotal / dfTotal

Now for the hypothesis part. If H0 is true and what looks like a relationship between x and y is really no big deal, the piece that represents the gain in prediction because of the regression line (MSRegression) should be no greater than the variability around the regression line (MSResidual). If H0 is not true, and the gain in prediction is substantial, then MSRegression should be a lot bigger than MSResidual.

So the hypotheses now set up as

H0: σ²Regression ≤ σ²Residual

H1: σ²Regression > σ²Residual

These are hypotheses you can test. How? To test a hypothesis about two variances, you use an F test. (See Chapter 11.) The test statistic here is

F = MSRegression / MSResidual

To show you how it all works, I apply the formulas to the Sahutsket example. The MSResidual is the same as s²yx from the preceding section, and that value is 0.16. The MSRegression is

MSRegression = SSRegression / dfRegression = 5.83 / 1 = 5.83

This sets up the F:

F = MSRegression / MSResidual = 5.8348 / 0.1619 ≈ 36.03

With 1 and 18 df and α = .05, the critical value of F is 4.41. (You can use the worksheet function F.INV.RT to verify.) The calculated F is greater than the critical F, so the decision is to reject H0. That means the regression line provides a good fit to the data in the sample.
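Here’s the whole F test in miniature, recomputed in Python from the Table 14-2 data: sums of squares, mean squares, and the F ratio (a sketch; the variable names are mine):

```python
# SAT scores and GPAs for the 20 students in Table 14-2
sat = [990, 1150, 1080, 1100, 1280, 990, 1110, 920, 1000, 1200,
       1000, 1150, 1070, 1120, 1250, 1020, 1060, 1550, 1480, 1010]
gpa = [2.2, 3.2, 2.6, 3.3, 3.8, 2.2, 3.2, 2.0, 2.2, 3.6,
       2.1, 2.8, 2.2, 2.1, 2.4, 2.2, 2.3, 3.9, 3.8, 2.0]

n = len(sat)
mean_x, mean_y = sum(sat) / n, sum(gpa) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sat, gpa))
sxx = sum((x - mean_x) ** 2 for x in sat)
b = sxy / sxx
a = mean_y - b * mean_x

ss_total = sum((y - mean_y) ** 2 for y in gpa)
ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(sat, gpa))
ss_regression = ss_total - ss_residual   # the sums of squares add up

ms_regression = ss_regression / 1        # df for Regression is 1
ms_residual = ss_residual / (n - 2)      # df for Residual is N-2
f_ratio = ms_regression / ms_residual    # roughly 36, well above 4.41
```

Because the computed F clears the critical value of 4.41, the code reaches the same reject-H0 decision as the worksheet.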

Testing the slope

Another question that arises in linear regression is whether the slope of the regression line is significantly different from zero. If it’s not, the mean is just as good a predictor as the regression line.

The hypotheses for this test are:

H0: β ≤ 0

H1: β > 0

The statistical test is t, which I discuss in Chapters 9, 10, and 11 in connection with means. The t-test for the slope is

t = b / sb

with df = N-2. The denominator estimates the standard error of the slope. This term sounds more complicated than it is. The formula is

sb = syx / (sx √(N - 1))

where sx is the standard deviation of the x-variable. For the data in the example

sb = 0.40 / (161.78 × √19) ≈ 0.00057
t = 0.0034 / 0.00057 ≈ 5.96

The actual value is 6.00. If you round syx and sb to a manageable number of decimal places before calculating, you end up with 5.96. Either way, this is larger than the critical value of t for this one-tailed test with 18 df and α = .05 (1.73), so the decision is to reject H0. This example, by the way, shows why it’s important to test hypotheses. The slope, 0.0034, looks like a very small number. (Possibly because it’s a very small number.) Still, it’s big enough to reject H0 in this case.
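The slope test is equally quick to verify. This Python sketch recomputes sb and the t statistic at full precision from the Table 14-2 data (variable names are mine):

```python
import math

# SAT scores and GPAs for the 20 students in Table 14-2
sat = [990, 1150, 1080, 1100, 1280, 990, 1110, 920, 1000, 1200,
       1000, 1150, 1070, 1120, 1250, 1020, 1060, 1550, 1480, 1010]
gpa = [2.2, 3.2, 2.6, 3.3, 3.8, 2.2, 3.2, 2.0, 2.2, 3.6,
       2.1, 2.8, 2.2, 2.1, 2.4, 2.2, 2.3, 3.9, 3.8, 2.0]

n = len(sat)
mean_x, mean_y = sum(sat) / n, sum(gpa) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sat, gpa))
sxx = sum((x - mean_x) ** 2 for x in sat)
b = sxy / sxx
a = mean_y - b * mean_x

# Standard error of estimate
ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(sat, gpa))
syx = math.sqrt(ss_residual / (n - 2))

# Standard error of the slope: syx / (sx * sqrt(N-1)),
# which is the same as syx / sqrt(sum of squared x-deviations)
sb = syx / math.sqrt(sxx)

t = b / sb   # roughly 6.00 at full precision
```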

Testing the intercept

technicalstuff For completeness, I include the hypothesis test for the intercept. I doubt you’ll have much use for it, but it appears in the output of some of Excel’s regression-related capabilities. I want you to understand all aspects of that output (which I tell you about in a little while), so here it is.

The hypotheses are

H0: α = 0

H1: α ≠ 0

The test, once again, is a t-test. The formula is

t = a / sa

The denominator is the estimate of the standard error of the intercept. Without going into detail, the formula for sa is

sa = syx √( 1/N + x̄² / ((N - 1)sx²) )

where x̄² is the squared mean of the x-variable and sx² is the variance of the x-variable. Applying this formula to the data in the example,

sa = 0.40 × √( 1/20 + (1126.5)² / ((19)(26171.32)) ) ≈ 0.65

The t-test is

t = -1.154 / 0.65 ≈ -1.78

With 18 degrees of freedom, and the probability of a Type I error at .05, the critical t is 2.10 for a two-tailed test. It’s a two-tailed test because H1 is that the intercept doesn’t equal zero — it doesn’t specify whether the intercept is greater than zero or less than zero. Because the calculated value (-1.78) isn’t more negative than the negative critical value (-2.10), the decision is to not reject H0.
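For completeness, the intercept test can be hand-checked the same way. A Python sketch using the Table 14-2 data (names are mine; note that (N-1)sx² is the same quantity as the sum of squared x-deviations):

```python
import math

# SAT scores and GPAs for the 20 students in Table 14-2
sat = [990, 1150, 1080, 1100, 1280, 990, 1110, 920, 1000, 1200,
       1000, 1150, 1070, 1120, 1250, 1020, 1060, 1550, 1480, 1010]
gpa = [2.2, 3.2, 2.6, 3.3, 3.8, 2.2, 3.2, 2.0, 2.2, 3.6,
       2.1, 2.8, 2.2, 2.1, 2.4, 2.2, 2.3, 3.9, 3.8, 2.0]

n = len(sat)
mean_x, mean_y = sum(sat) / n, sum(gpa) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sat, gpa))
sxx = sum((x - mean_x) ** 2 for x in sat)
b = sxy / sxx
a = mean_y - b * mean_x

ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(sat, gpa))
syx = math.sqrt(ss_residual / (n - 2))

# Standard error of the intercept; sxx here plays the role of (N-1)*sx**2
sa = syx * math.sqrt(1 / n + mean_x ** 2 / sxx)

t_intercept = a / sa   # roughly -1.78
```

Because -1.78 isn’t more negative than -2.10, the decision matches the text: don’t reject H0.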

Worksheet Functions for Regression

Excel is a big help for computation-intensive work like linear regression. An assortment of functions and data analysis tools makes life a lot easier. In this section, I concentrate on the worksheet functions and on two array functions.

Figure 14-4 shows the data I use to illustrate each function. The data are GPA and SAT scores for 20 students in the example I show you earlier. As the figure shows, the SAT scores are in C3:C22 and the GPAs are in D3:D22. The SAT is the x-variable and GPA is the y-variable.

image

FIGURE 14-4: Data for the regression-related worksheet functions.

To clarify what the functions do, I define names for the data arrays. I define SAT as the name for C3:C22 and I define GPA as the name for D3:D22. That way, I can use those names in the arguments for the functions. If you don’t know how to define a name for an array, go to Chapter 2.

SLOPE, INTERCEPT, STEYX

These three functions work the same way, so I give a general description and provide details as necessary for each function. Follow these steps:

  1. With the data entered, select a cell.
  2. From the Statistical Functions menu, select a regression function to open its Function Arguments dialog box.

    • To calculate the slope of a regression line through the data, select SLOPE.
    • To calculate the intercept, select INTERCEPT.
    • To calculate the standard error of estimate, select STEYX.

    Figures 14-5, 14-6, and 14-7 show the Function Arguments dialog boxes for these three functions.

  3. In the Function Arguments dialog box, enter the appropriate values for the arguments.

In the Known_y’s box, I enter the name for the cell range that holds the scores for the y-variable. For this example, that’s GPA (defined as the name for D3:D22).

In the Known_x’s box, I enter the name for the cell range that holds the scores for the x-variable. For this example, it’s SAT (defined as the name for C3:C22). After I enter this name, the answer appears in the dialog box.

  4. Click OK to put the answer into the selected cell.
image

FIGURE 14-5: The Function Arguments dialog box for SLOPE.

image

FIGURE 14-6: The Function Arguments dialog box for INTERCEPT.

image

FIGURE 14-7: The Function Arguments dialog box for STEYX.

FORECAST.LINEAR

This one is a bit different from the preceding three. In addition to the columns for the x and y variables, for FORECAST.LINEAR (renamed from FORECAST in earlier Excels, but still called FORECAST in the Mac version), you supply a value for x and the answer is a prediction based on the linear regression relationship between the x-variable and the y-variable.

Figure 14-8 shows the Function Arguments dialog box for FORECAST.LINEAR. In the X box, I entered 1290. For this SAT, the figure shows the predicted GPA is 3.265070236.

image

FIGURE 14-8: The Function Arguments dialog box for FORECAST.LINEAR.

Array function: TREND

TREND is a versatile function. You can use TREND to generate a set of predicted y-values for the x-values in the sample.

You can also supply a new set of x-values and generate a set of predicted y-values, based on the linear relationship in your sample. It’s like applying FORECAST repeatedly in one fell swoop.

In this section, I describe both uses.

Predicting y’s for the x’s in your sample

First, I use TREND to predict GPAs for the 20 students in the sample. Figure 14-9 shows TREND set up to do this. I include the Formula bar in this screen shot so that you can see what the formula looks like for this use of TREND.

  1. With the data entered, select a column for TREND’s answers.

    I select E3:E22. That puts the predicted GPAs right next to the sample GPAs.

  2. From the Statistical Functions menu, select TREND to open the Function Arguments dialog box for TREND.
  3. In the Function Arguments dialog box, enter the appropriate values for the arguments.

    In the Known_y’s box, I enter the cell range that holds the scores for the y-variable. For this example, that’s D3:D22.

    In the Known_x’s box, enter the cell range that holds the scores for the x-variable. For this example, it’s C3:C22.

    Why didn’t I just enter GPA and SAT as in the previous examples? When I do that, and complete the next step, Excel doesn’t calculate the values. (This happened in Excel 2013 as well.)

    I left the New_x’s box blank.

    In the Const box, I typed TRUE (or I could leave it blank) to calculate the y-intercept, or I would type FALSE to set the y-intercept to zero.

    (I really don’t know why you’d enter FALSE.) A note of caution: In the dialog box, the instruction for the Const box refers to b. That’s the y-intercept. Earlier in the chapter, I use a to represent the y-intercept, and b to represent the slope. No usage is standard for this.

  4. IMPORTANT: Do not click OK. Because this is an array function, press Ctrl+Shift+Enter to put TREND’s answers into the selected array.
image

FIGURE 14-9: The Function Arguments dialog box for TREND, along with data. TREND is set up to predict GPAs for the sample SATs.

Figure 14-10 shows the answers in E3:E22. I include the Formula bar so you can see that Excel surrounds the completed array formula with curly brackets.

image

FIGURE 14-10: The results of TREND: Predicted GPAs for the sample SATs.

Predicting a new set of y’s for a new set of x’s

Here, I use TREND to predict GPAs for four new SAT scores. Figure 14-11 shows TREND set up for this, with the name New_SAT defined for the cell range that holds the new scores. The figure also shows the selected cell range for the results. Once again, I include the Formula bar to show you the formula for this use of the function.

  1. With the data entered, select a cell range for TREND’s answers.

    I selected G8:G11.

  2. From the Statistical Functions menu, select TREND to open the Function Arguments dialog box for TREND.
  3. In the Function Arguments dialog box, enter the appropriate values for the arguments.

    In the Known_y’s box, enter the name of the cell range that holds the scores for the y-variable. For this example, that’s D3:D22.

    In the Known_x’s box, enter the name of the cell range that holds the scores for the x-variable. For this example, it’s C3:C22.

    In the New_x’s box, enter the name of the cell range that holds the new scores for the x-variable. Here, that’s New_SAT. (A named array works in this context.)

    In the Const box, the choices are to enter TRUE (or leave it blank) to calculate the y-intercept, or FALSE to set the y-intercept to zero. I entered TRUE. (Again, I really don’t know why you’d enter FALSE.)

  4. IMPORTANT: Do not click OK. Because this is an array function, press Ctrl+Shift+Enter to put TREND’s answers into the selected column.

    Figure 14-12 shows the answers in G8:G11. Again, I include the Formula bar to show you that Excel surrounds the completed array formula with curly brackets.

image

FIGURE 14-11: The Function Arguments dialog box for TREND, along with data. TREND is set up to predict GPAs for a new set of SATs.

image

FIGURE 14-12: The results of TREND: Predicted GPAs for a new set of SATs.
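If you’re curious what TREND is doing under the hood, the new-x’s case is just the regression equation applied to each new score. Here’s a Python sketch; the four new SAT scores below are hypothetical stand-ins, because the text doesn’t list the values in the New_SAT range:

```python
# SAT scores and GPAs for the 20 students in Table 14-2
sat = [990, 1150, 1080, 1100, 1280, 990, 1110, 920, 1000, 1200,
       1000, 1150, 1070, 1120, 1250, 1020, 1060, 1550, 1480, 1010]
gpa = [2.2, 3.2, 2.6, 3.3, 3.8, 2.2, 3.2, 2.0, 2.2, 3.6,
       2.1, 2.8, 2.2, 2.1, 2.4, 2.2, 2.3, 3.9, 3.8, 2.0]

n = len(sat)
mean_x, mean_y = sum(sat) / n, sum(gpa) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(sat, gpa)) / \
    sum((x - mean_x) ** 2 for x in sat)
a = mean_y - b * mean_x

# Hypothetical stand-ins for the New_SAT range
new_sat = [1050, 1175, 1300, 1425]

# One predicted GPA per new SAT score, like TREND's output array
predicted = [a + b * x for x in new_sat]
```

Like TREND with a New_x’s argument, the result is one prediction per new score, produced in a single pass.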

Array function: LINEST

LINEST combines SLOPE, INTERCEPT, and STEYX, and throws in a few extras. Figure 14-13 shows the Function Arguments dialog box for LINEST, along with the data and the selected array for the answers. Notice that it’s a five-row-by-two-column array. For linear regression, that’s what the selected array has to be. How would you know the exact row-column dimensions of the array if I didn’t tell you? Well … you wouldn’t.

image

FIGURE 14-13: The Function Arguments dialog box for LINEST, along with the data and the selected array for the results.

Here are the steps for using LINEST:

  1. With the data entered, select a five-row-by-two-column array of cells for LINEST’s results.

    I selected G7:H11.

  2. From the Statistical Functions menu, select LINEST to open the Function Arguments dialog box for LINEST.
  3. In the Function Arguments dialog box, enter the appropriate values for the arguments.

    In the Known_y’s box, I entered the cell range that holds the scores for the y-variable. For this example, that’s D3:D22.

    In the Known_x’s box, I entered the cell range that holds the scores for the x-variable. For this example, it’s C3:C22. (Named arrays won’t work here, either.)

    In the Const box, the choices are to enter TRUE (or leave it blank) to calculate the y-intercept, or FALSE to set the y-intercept to zero. I entered TRUE.

    In the Stats box, the choices are to enter TRUE to return regression statistics in addition to the slope and the intercept, or FALSE (or leave it blank) to return just the slope and the intercept. I entered TRUE.

    In the dialog box, b refers to intercept and m-coefficient refers to slope. As I mention earlier, no set of symbols is standard for this.

  4. IMPORTANT: Do not click OK. Because this is an array function, press Ctrl+Shift+Enter to put LINEST’s answers into the selected array.

Figure 14-14 shows LINEST’s results. They’re not labeled in any way, so I added the labels for you in the worksheet. The left column gives you the slope, standard error of the slope, something called “R Square,” F, and the SSRegression. What’s R Square? That’s another measure of the strength of the relationship between SAT and GPA in the sample. I discuss it in detail in Chapter 15.

image

FIGURE 14-14: LINEST’s results in the selected array.

The right column provides the intercept, standard error of the intercept, standard error of estimate, degrees of freedom, and SSResidual.
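To make LINEST’s unlabeled five-row-by-two-column array concrete, here’s a Python sketch that computes the same ten quantities from the sample data and arranges them in the same layout (the variable names are mine):

```python
import math

# SAT scores and GPAs for the 20 students in Table 14-2
sat = [990, 1150, 1080, 1100, 1280, 990, 1110, 920, 1000, 1200,
       1000, 1150, 1070, 1120, 1250, 1020, 1060, 1550, 1480, 1010]
gpa = [2.2, 3.2, 2.6, 3.3, 3.8, 2.2, 3.2, 2.0, 2.2, 3.6,
       2.1, 2.8, 2.2, 2.1, 2.4, 2.2, 2.3, 3.9, 3.8, 2.0]

n = len(sat)
mean_x, mean_y = sum(sat) / n, sum(gpa) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sat, gpa))
sxx = sum((x - mean_x) ** 2 for x in sat)

slope = sxy / sxx
intercept = mean_y - slope * mean_x

ss_total = sum((y - mean_y) ** 2 for y in gpa)
ss_residual = sum((y - (intercept + slope * x)) ** 2
                  for x, y in zip(sat, gpa))
ss_regression = ss_total - ss_residual

df = n - 2
syx = math.sqrt(ss_residual / df)            # standard error of estimate
se_slope = syx / math.sqrt(sxx)              # standard error of the slope
se_intercept = syx * math.sqrt(1 / n + mean_x ** 2 / sxx)
r_squared = ss_regression / ss_total
f_ratio = (ss_regression / 1) / (ss_residual / df)

# LINEST's five-row-by-two-column layout
linest_array = [
    [slope,         intercept],
    [se_slope,      se_intercept],
    [r_squared,     syx],
    [f_ratio,       df],
    [ss_regression, ss_residual],
]
```

For these data, R Square comes out near 0.67 — the same strength-of-relationship measure I take up in Chapter 15.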

Data Analysis Tool: Regression

Excel’s Regression data analysis tool does everything LINEST does (and more) and labels the output for you, too. Figure 14-15 shows the Regression tool’s dialog box, along with the data for the SAT-GPA example.

image

FIGURE 14-15: The Regression data analysis tool dialog box, along with the SAT-GPA data.

The steps for using this tool are:

  1. Type the data into the worksheet, and include labels for the columns.
  2. Select Data| Data Analysis to open the Data Analysis dialog box.
  3. In the Data Analysis dialog box, scroll down the Analysis Tools list and select Regression.
  4. Click OK to open the Regression dialog box.

    This is the dialog box shown in Figure 14-15.

  5. In the Input Y Range box, enter the cell range that holds the data for the y-variable.

    For the example, the GPAs (including the label) are in $D$2:$D$22.

    Note the $ signs for absolute referencing. Excel adds these when you select the cells in the spreadsheet.

  6. In the Input X Range box, enter the cell range that holds the data for the x-variable.

    The SATs (including the label) are in $C$2:$C$22.

  7. If the cell ranges include column headings, select the Labels check box.

    I included the headings in the ranges, so I selected the box.

  8. The Alpha box has 0.05 as a default. Change that value if you want a different alpha.
  9. In the Output Options, select a radio button to indicate where you want the results.

    I selected New Worksheet Ply to put the results on a new page in the worksheet.

  10. The Residuals area provides four capabilities for viewing the deviations between the data points and the predicted points. Check as many as you like.

    I selected all four. I explain them when I show you the output.

  11. Select the Normal Probability Plots option if you want to produce a graph of the percentiles of the y-variable.

    I selected this one so I could show it to you in the output.

  12. Click OK.

    Because I selected New Worksheet Ply, a newly created page opens with the results.

tip A word about Steps 5 and 6: Notice that I didn’t enter the names for the cell ranges (GPA and SAT). Instead, I entered the ranges ($D$2:$D$22 and $C$2:$C$22). Why? When I define a name for a cell range, I don’t include the cell that holds the name (for reasons I explain in Chapter 2). Following this practice, however, creates a small hurdle when you use a data analysis tool: Because the defined names don’t include the label cells, you can’t select the Labels check box if you enter the names. Selecting that check box makes the variable names show up in the output — which is a good thing. So … I just enter each cell range including its label cell and select the Labels check box.

Tabled output

Figure 14-16 shows the upper half of the tool’s tabled output, after I expand the columns. The title is Summary Output. This part of the output features one table for the regression statistics, another for ANOVA, and one for the regression coefficients.

image

FIGURE 14-16: The upper half of the Regression data analysis tool’s tabled output.

The first three rows of the Regression Statistics table present information related to R2, a measure of the strength of the SAT-GPA relationship in the sample. The fourth row shows the standard error of estimate and the fifth gives the number of individuals in the sample.

The ANOVA table shows the results of testing

H0: σ²Regression ≤ σ²Residual

H1: σ²Regression > σ²Residual

If the value in the Significance F column is less than .05 (or whatever alpha level you’re using), reject H0. In this example, it’s less than .05.

Just below the ANOVA table is a table that gives the information on the regression coefficients. Excel doesn’t name it, but I refer to it as the coefficients table. The Coefficients column provides the values for the intercept and the slope. The slope is labeled with the name of the x-variable. The Standard Error column presents the standard error of the intercept and the standard error of the slope.

The remaining columns provide the results for the t-tests of the intercept and the slope. The P-value column lets you know whether to reject H0 for each test. If the value is less than your alpha, reject H0. In this example, the decision is to reject H0 for the slope, but not for the intercept.
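The numbers in the coefficients table aren’t magic — you can reproduce them outside Excel. Here’s a minimal pure-Python sketch (my own illustration, with a tiny made-up data set rather than the chapter’s GPA-SAT sample) that computes the slope, the intercept, their standard errors, and the t-statistics for a simple linear regression:

```python
import math

def simple_regression(x, y):
    """Least-squares slope and intercept, with standard errors and t-statistics."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)                      # sum of squares for x
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))   # sum of cross-products
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual variance, with n - 2 degrees of freedom
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)
    se_slope = math.sqrt(s2 / sxx)
    se_intercept = math.sqrt(s2 * (1 / n + mx ** 2 / sxx))
    return {
        "slope": slope, "intercept": intercept,
        "se_slope": se_slope, "se_intercept": se_intercept,
        "t_slope": slope / se_slope, "t_intercept": intercept / se_intercept,
    }

# Made-up illustrative data, not the chapter's sample
result = simple_regression([1, 2, 3, 4], [2, 4, 5, 8])
print(result["slope"], result["intercept"])  # slope 1.9, intercept ≈ 0.0
```

Dividing each coefficient by its standard error gives exactly the t columns you see in the output; Excel then converts those t values to the P-values it reports.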

Figure 14-17 shows the lower half of the Regression tool’s tabled output.

image

FIGURE 14-17: The lower half of the Regression data analysis tool’s tabled output.

Here, you find the Residual Output and the Probability Output. The Residual Output is a table that shows the predicted value and the residual (y-y’) for each individual in the sample. It also shows the standard residual for each observation, which is

images

The tabled data on residuals and standard residuals are useful for analyzing the variability around the regression line. You can scan these data for outliers, for example, and see if outliers are associated with particular values of the x-variable. (If they are, it might mean that something weird is going on in your sample.)
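Scanning for outliers is easier with scaled residuals. A rough sketch of that scaling (my own helper; I divide each residual by the standard deviation of the residuals with n − 1 in the denominator — that matches my reading of the tool’s output, but treat the exact denominator as an assumption and check it against your own worksheet):

```python
import math

def standard_residuals(residuals):
    """Scale each residual by the standard deviation of the residuals.
    The n - 1 denominator is an assumption; verify against your Excel output."""
    n = len(residuals)
    sd = math.sqrt(sum(r * r for r in residuals) / (n - 1))
    return [r / sd for r in residuals]

# Made-up residuals for illustration, not the chapter's sample
print(standard_residuals([0.1, 0.2, -0.7, 0.4]))
```

Values beyond about ±2 on this scale are the ones worth a second look.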

The Probability Output is a table of the percentiles in the y-variable data in the sample. (Yes, Percentile Output would be a better name.)

Graphic output

Figures 14-18, 14-19, and 14-20 show the Regression tool’s graphic output. The Normal Probability Plot in Figure 14-18 is a graphic version of the Probability Output table. The SAT Residual Plot in Figure 14-19 shows the residuals graphed against the x-variable: For each SAT score in the sample, this plot shows the corresponding residual. Figure 14-20 shows the SAT Line Fit Plot — a look at the observed and the predicted y-values.

image

FIGURE 14-18: The Normal Probability Plot shows the percentile data for the y-variable.

image

FIGURE 14-19: The SAT Residual plot graphs the residuals against the x-variable.

image

FIGURE 14-20: The SAT Line Fit Plot shows the observed y-values and the predicted y-values.

If you’ve used the Regression tool in Excel 2007, you’ll notice a change in the Normal Probability Plot and the Line Fit Plot. In Excel 2007, they appear as column charts; you can use Excel’s graphics capabilities to easily change them to the formats shown here.

Juggling Many Relationships at Once: Multiple Regression

Linear regression is a great tool for making predictions. When you know the slope and the intercept of the line that relates two variables, you can take a new x-value and predict a new y-value. In the example you’ve been working through, you take an SAT score and predict a GPA for a Sahutsket University student.

What if you knew more than just the SAT score for each student? What if you had the student’s high-school average (on a 100 scale), and you could use that information, too? If you could combine SAT score with HS average, you might have a more accurate predictor than SAT score alone.

When you work with more than one independent variable, you’re in the realm of multiple regression. As in linear regression, you find regression coefficients for the best-fitting line through a scatterplot. Once again, best-fitting means that the sum of the squared distances from the data points to the line is a minimum.

With two independent variables, however, you can’t show a scatterplot in two dimensions. You need three dimensions, and that becomes difficult to draw. Instead, I just show you the equation of the regression line:

y′ = a + b₁x₁ + b₂x₂

For the SAT-GPA example, that translates to

Predicted GPA = a + b₁(SAT) + b₂(High School Average)

You can test hypotheses about the overall fit, and about all three of the regression coefficients.

I won’t go through all the formulas for finding the coefficients, because that gets really complicated. Instead, I go right to the Excel capabilities.
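The idea behind those formulas is compact, though: The least-squares coefficients are the solution of a small system of “normal equations.” Here’s a rough pure-Python sketch of that idea (my own illustration of the textbook method, not Excel’s internal algorithm), using Gaussian elimination on a made-up data set:

```python
def multiple_regression(xs, y):
    """Least-squares coefficients [a, b1, b2, ...] for y' = a + b1*x1 + b2*x2 + ...
    xs is a list of predictor columns; solves the normal equations X'Xb = X'y."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in xs] for i in range(n)]  # design matrix
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[pivot] = xtx[pivot], xtx[col]
        xty[col], xty[pivot] = xty[pivot], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    coeffs = [0.0] * k
    for r in range(k - 1, -1, -1):
        coeffs[r] = (xty[r] - sum(xtx[r][c] * coeffs[c]
                                  for c in range(r + 1, k))) / xtx[r][r]
    return coeffs  # [intercept, b1, b2, ...]

# Made-up data that follows y = 1 + 2*x1 + 3*x2 exactly
x1, x2 = [1, 2, 3, 4], [2, 1, 4, 3]
y = [9, 8, 19, 18]
print(multiple_regression([x1, x2], y))  # ≈ [1.0, 2.0, 3.0]
```

With two x-variables, that system is only three equations in three unknowns — but the bookkeeping grows quickly, which is why handing it to Excel makes sense.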

A few things to bear in mind before I proceed:

  • You can have any number of x-variables. I just use two in the upcoming example.
  • Expect the coefficient for SAT to change from linear regression to multiple regression. Expect the intercept to change, too.
  • Expect the standard error of estimate to decrease from linear regression to multiple regression. Because multiple regression uses more information than linear regression, it typically reduces the error.

Excel Tools for Multiple Regression

The good news about Excel’s multiple regression tools is that they’re the same ones I just told you about for linear regression: You just use them in a slightly different way.

The bad news is … well … uh … I can’t think of any bad news!

TREND revisited

I begin with TREND. Earlier, I show you how to use this function to predict values based on one x-variable. Change what you enter into the dialog box, and it predicts values based on more than one.

Figure 14-21 shows the TREND dialog box and data for 20 students. In the data, I add a column for each student’s high-school average. I define HS_Average as the name for the data in this column. The figure also shows the selected column for TREND’s predictions. I include the Formula bar in this screen shot so you can see what the formula looks like.

image

FIGURE 14-21: The Function Arguments dialog box for TREND along with data. TREND is set up to predict GPAs for the sample SATs and high-school averages.

Follow these steps:

  1. With the data entered, select a column for TREND’s answers.

    I selected F3:F22. That puts the predicted GPAs right next to the sample GPAs.

  2. From the Statistical Functions menu, select TREND to open the Function Arguments dialog box for TREND.
  3. In the Function Arguments dialog box, enter the appropriate values for the arguments.

    In the Known_y’s box, I entered the cell range that holds the scores for the y-variable. For this example, that’s E3:E22.

    In the Known_x’s box, I entered the cell range that holds the scores for the x-variables. This range is C3:D22, the cells that hold the SAT scores and the high-school averages.

    I entered the ranges rather than defined names because names don’t work with this function in my copy of Excel.

    I left the New_x’s box blank.

    In the Const box, the choices are TRUE (or leave it blank) to calculate the y-intercept, or FALSE to set the y-intercept to zero. I entered TRUE. (I really don’t know why you’d enter FALSE.) A note of caution: In the dialog box, the instruction for the Const box refers to b. That’s the y-intercept. Earlier in the chapter, I use a to represent the y-intercept, and b to represent the slope. No particular usage is standard for this. Also, the dialog box makes it sound like this function works only for linear regression. As you’re about to see, it works for multiple regression, too.

  4. IMPORTANT: Do not click OK. Because this is an array function, press Ctrl+Shift+Enter to put TREND’s answers into the selected column.

Figure 14-22 shows the answers in F3:F22. Note the difference between the Formula bar in Figure 14-21 and the one in Figure 14-22. After the function completes its work, Excel adds curly brackets to indicate an array formula.

image

FIGURE 14-22: The results of TREND: Predicted GPAs for the sample SATs and high-school averages.

So TREND predicts the values, and I haven’t even shown you how to find the coefficients yet!
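Behind the scenes, TREND’s fit-then-predict behavior amounts to applying the fitted equation to each row of x-values. A tiny sketch of that last step (the coefficients here are hypothetical placeholders, not values from the chapter’s worksheet):

```python
def predict(coeffs, rows):
    """Apply y' = a + b1*x1 + b2*x2 + ... to each row of x-values."""
    a, *bs = coeffs
    return [a + sum(b * x for b, x in zip(bs, row)) for row in rows]

# Hypothetical coefficients and two students' (SAT, HS average) rows
gpas = predict([0.5, 0.002, 0.01], [(1100, 85), (1300, 92)])
print(gpas)  # ≈ [3.55, 4.02]
```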

LINEST revisited

To find the multiple regression coefficients, I turn again to LINEST.

In Figure 14-23, you can see the data and the dialog box for LINEST, along with the selected array for the answers. The selected array is five rows by three columns. It’s always five rows. The number of columns equals the number of regression coefficients: For linear regression, it’s two — the slope and the intercept. For this case of multiple regression, it’s three.

image

FIGURE 14-23: The Function Arguments dialog box for LINEST, along with the data and the selected array for the results of a multiple regression.

Here are the steps for using LINEST for multiple regression with three coefficients:

  1. With the data entered, select a five-row-by-three-column array of cells for LINEST’s results.

    I selected H3:J7.

  2. From the Statistical Functions menu, select LINEST to open the Function Arguments dialog box for LINEST.
  3. In the Function Arguments dialog box, enter the appropriate values for the arguments.

    In the Known_y’s box, enter the column that holds the scores for the y-variable. For this example, that’s E3:E22, the GPAs.

    In the Known_x’s box, enter the cell range that holds the scores for the x-variables. For this example, it’s C3:D22, the SAT scores and the high-school averages.

    In the Const box, enter TRUE (or leave it blank) to calculate the y-intercept. Enter FALSE to set the y-intercept to zero. I typed TRUE.

    In the Stats box, enter TRUE to return regression statistics in addition to the coefficients, or FALSE (or leave it blank) to return just the coefficients. I entered TRUE. The dialog box refers to the intercept as b and to the other coefficients as m-coefficients. I use a to represent the intercept and b to refer to the other coefficients. No set of symbols is standard.

  4. IMPORTANT: Do not click OK. Because this is an array function, press Ctrl+Shift+Enter to put LINEST’s answers into the selected array.

Figure 14-24 shows LINEST’s results. They’re not labeled in any way, so I added the labels for you in the worksheet. I also drew a box around part of the results to clarify what goes with what.

image

FIGURE 14-24: LINEST’s multiple results in the selected array.

The entries that stand out are the ugly #N/A symbols in the last three rows of the rightmost column. These indicate that LINEST doesn’t put anything into these cells.

The top two rows of the array provide the values and standard errors for the coefficients. I drew the box around those rows to separate them from the three remaining rows, which present information in a different way. Before I get to those rows, I’ll just tell you that the top row gives you the information for writing the regression equation:

images

In terms of SAT, GPA, and high-school average, it’s

images

The third row has R Square (a measure of the strength of the relationship between GPA and the other two variables, which I cover in Chapter 15) and the standard error of estimate. Compare the standard error of estimate for the multiple regression with the standard error for the linear regression, and you’ll see that the multiple one is smaller. (Never mind. I’ll do it for you. It’s .40 for the linear and .35 for the multiple.)

The fourth row shows the F-ratio that tests the hypothesis about whether or not the line is a good fit to the scatterplot, and the df for the denominator of the F. The df for the numerator (not shown) is the number of coefficients minus 1. You can use F.INV.RT to verify that this F with df = 2 and 17 is significant.

The last row gives you SSRegression and SSResidual.
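All the quantities in those bottom three rows can be computed directly from the observed and predicted y-values. Here’s a generic sketch (my own helper, demonstrated on a small made-up one-predictor example rather than the chapter’s figures; k is the number of x-variables):

```python
import math

def fit_statistics(y, y_pred, k):
    """R Square, standard error of estimate, F-ratio, residual df, and the
    two sums of squares, as LINEST's bottom rows report them."""
    n = len(y)
    my = sum(y) / n
    ss_resid = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_total = sum((yi - my) ** 2 for yi in y)
    ss_reg = ss_total - ss_resid
    df_resid = n - k - 1          # denominator df; numerator df is k
    return {
        "r_square": ss_reg / ss_total,
        "se_estimate": math.sqrt(ss_resid / df_resid),
        "f": (ss_reg / k) / (ss_resid / df_resid),
        "df_resid": df_resid,
        "ss_reg": ss_reg,
        "ss_resid": ss_resid,
    }

# One-predictor example with easy numbers (not the chapter's sample)
stats = fit_statistics([2, 4, 5, 8], [1.9, 3.8, 5.7, 7.6], k=1)
print(round(stats["f"], 4))  # ≈ 51.5714
```

Note that the numerator df is k — the number of coefficients minus 1 — which matches the df = 2 and 17 in the chapter’s two-predictor example with 20 students.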

Regression data analysis tool revisited

To use the Regression data analysis tool for multiple regression, you trot out the same technique you’d use with TREND and LINEST: Specify the appropriate array for the x-variables, and you’re off and running.

Here are the steps:

  1. Type the data into the worksheet, and include labels for the columns.
  2. Select Data | Data Analysis to open the Data Analysis dialog box.
  3. In the Data Analysis dialog box, scroll down the Analysis Tools list and select Regression.
  4. Click OK to open the Regression dialog box.

    This is the dialog box shown back in Figure 14-15.

  5. In the Input Y Range box, enter the cell range that holds the data for the y-variable.

    The GPAs (including the label) are in $E$2:$E$22. Note the $ signs for absolute referencing. Excel adds them when you select the cell range in the worksheet.

  6. In the Input X Range box, enter the cell range that holds the data for the x-variable.

    The SATs and the high-school averages (including the labels) are in $C$2:$D$22.

  7. If the cell ranges include column headings, select the Labels check box.

    I included the labels in the ranges, so I selected the box.

  8. The Alpha box has 0.05 as a default. Change that value if you want a different alpha.

  9. In the Output Options, select a radio button to indicate where you want the results.

    I selected New Worksheet Ply to put the results on a new page in the worksheet.
  10. The Residuals area provides four capabilities for viewing the deviations between the data points and the predicted points. Select as many as you like.

    I selected all four.

  11. Select the Normal Probability Plots option if you want to produce a graph of the percentiles of the y-variable.

    I selected this one.

  12. Click OK.

Go back to the section “Data Analysis Tool: Regression” for the details of what’s in the output. It’s the same as earlier, with a couple of changes and additions because of the new variable. Figure 14-25 shows the Regression Statistics, the ANOVA table, and the Coefficients table.

image

FIGURE 14-25: Part of the output from the Regression data analysis tool: Regression Statistics, ANOVA table, and Coefficients table.

The ANOVA table shows the new df (2, 17, and 19 for Regression, Residual, and Total, respectively). The coefficients table adds information for the HS Average. It shows the values of all the coefficients, as well as standard errors, and t-test information for hypothesis testing.

If you go through the example, you’ll see the table of residuals in the output. Compare the absolute values of the residuals from the linear regression with the absolute values of the residuals from the multiple regression; you’ll see the multiple ones are smaller, on average.
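A quick way to make that comparison is to average the absolute residuals from each fit — for example (a trivial helper, with made-up residual lists standing in for the tool’s actual output):

```python
def mean_abs(residuals):
    """Average absolute residual: a quick summary for comparing two fits."""
    return sum(abs(r) for r in residuals) / len(residuals)

# Made-up residuals for illustration, not the chapter's actual output
linear_resids = [0.5, -0.3, 0.4, -0.6]
multiple_resids = [0.2, -0.1, 0.3, -0.2]
print(mean_abs(linear_resids), mean_abs(multiple_resids))  # ≈ 0.45 and 0.2
```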

The graphic output has some additions, too: A scatterplot of HS Average and GPA that also shows predicted GPAs, and a plot of residuals and HS Average.
