4.3 Measuring the Fit of the Regression Model

A regression equation can be developed for any variables X and Y, even random numbers. We certainly would not have any confidence in the ability of one random number to predict the value of another random number. How do we know that the model is actually helpful in predicting Y based on X? Should we have confidence in this model? Does the model provide better predictions (smaller errors) than simply using the average of the Y values?

In the Triple A Construction example, sales figures (Y) varied from a low of 4.5 to a high of 9.5, and the mean was 7. If each sales value is compared with the mean, we see how far they deviate from the mean, and we could compute a measure of the total variability in sales. Because Y is sometimes higher and sometimes lower than the mean, there may be both positive and negative deviations. Simply summing these values would be misleading because the negatives would cancel out the positives, making it appear that the numbers are closer to the mean than they actually are. To prevent this problem, we will use the sum of squares total (SST) to measure the total variability in Y:

SST = Σ(Y − Ȳ)²
(4-6)

If we did not use X to predict Y, we would simply use the mean of Y as the prediction, and the SST would measure the accuracy of our predictions. However, a regression line may be used to predict the value of Y, and while there are still errors involved, the sum of these squared errors will be less than the total sum of squares just computed. The sum of squares error (SSE) is

SSE = Σe² = Σ(Y − Ŷ)²
(4-7)

Table 4.3 provides the calculations for the Triple A Construction example. The mean (Ȳ = 7) is compared to each value, and we get

SST = 22.5

The prediction (Ŷ) for each observation is computed and compared to the actual value. This results in

SSE = 6.875

The SSE is much lower than the SST. Using the regression line has reduced the variability in the sum of squares by 22.5 − 6.875 = 15.625. This reduction is called the sum of squares regression (SSR) and indicates how much of the total variability in Y is explained by the regression model. Mathematically, this can be calculated as

SSR = Σ(Ŷ − Ȳ)²
(4-8)

Table 4.3 indicates

SSR = 15.625

There is a very important relationship among the sums of squares that we have computed:

(Sum of squares total) = (Sum of squares due to regression) + (Sum of squares error)

SST = SSR + SSE
(4-9)
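This identity can be checked directly from the data. The sketch below is plain Python; the observations, the mean, and the fitted line Ŷ = 2 + 1.25X are all taken from Table 4.3, nothing else is assumed.

```python
# Triple A Construction data (Table 4.3): sales Y, payroll X
Y = [6, 8, 9, 5, 4.5, 9.5]
X = [3, 4, 6, 4, 2, 5]

y_bar = sum(Y) / len(Y)               # mean of Y = 7
Y_hat = [2 + 1.25 * x for x in X]     # fitted values from the regression line

# The three sums of squares from Equations 4-6, 4-7, and 4-8
SST = sum((y - y_bar) ** 2 for y in Y)                 # total variability
SSE = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))    # unexplained (error)
SSR = sum((yh - y_bar) ** 2 for yh in Y_hat)           # explained by regression

print(SST, SSE, SSR)                  # 22.5 6.875 15.625
print(SST == SSR + SSE)               # the identity SST = SSR + SSE holds
```

Running this reproduces the column totals of Table 4.3 and confirms Equation 4-9 for this data set.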

Figure 4.2 displays the data for Triple A Construction. The regression line is shown, as is a line representing the mean of the Y values. The errors used in computing the sums of squares are shown on this graph. Notice how the sample points are closer to the regression line than they are to the mean.

Table 4.3 Sum of Squares for Triple A Construction

| Y   | X | (Y − Ȳ)²          | Ŷ                  | (Y − Ŷ)² | (Ŷ − Ȳ)² |
|-----|---|-------------------|--------------------|----------|----------|
| 6   | 3 | (6 − 7)² = 1      | 2 + 1.25(3) = 5.75 | 0.0625   | 1.5625   |
| 8   | 4 | (8 − 7)² = 1      | 2 + 1.25(4) = 7.00 | 1        | 0        |
| 9   | 6 | (9 − 7)² = 4      | 2 + 1.25(6) = 9.50 | 0.25     | 6.25     |
| 5   | 4 | (5 − 7)² = 4      | 2 + 1.25(4) = 7.00 | 4        | 0        |
| 4.5 | 2 | (4.5 − 7)² = 6.25 | 2 + 1.25(2) = 4.50 | 0        | 6.25     |
| 9.5 | 5 | (9.5 − 7)² = 6.25 | 2 + 1.25(5) = 8.25 | 1.5625   | 1.5625   |

Ȳ = 7

Σ(Y − Ȳ)² = 22.5, so SST = 22.5

Σ(Y − Ŷ)² = 6.875, so SSE = 6.875

Σ(Ŷ − Ȳ)² = 15.625, so SSR = 15.625


Figure 4.2 Deviations from the Regression Line and from the Mean

Coefficient of Determination

The SSR is sometimes called the explained variability in Y, while the SSE is the unexplained variability in Y. The proportion of the variability in Y that is explained by the regression equation is called the coefficient of determination and is denoted by r². Thus,

r² = SSR/SST = 1 − SSE/SST
(4-10)

Either the SSR or the SSE can be used to find r². For Triple A Construction, we have

r² = 15.625/22.5 = 0.6944

This means that about 69% of the variability in sales (Y) is explained by the regression equation based on payroll (X).
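As a quick numerical check, both forms of Equation 4-10 give the same answer. The snippet below is plain Python using only the sums of squares from Table 4.3:

```python
# Sums of squares for Triple A Construction (Table 4.3)
SST = 22.5     # total variability in Y
SSE = 6.875    # unexplained (error) variability
SSR = 15.625   # variability explained by the regression

r_squared = SSR / SST          # explained proportion of variability
print(round(r_squared, 4))     # 0.6944

# Equivalent form using the error sum of squares
print(round(1 - SSE / SST, 4)) # 0.6944
```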

If every point in the sample were on the regression line (meaning all errors are 0), then 100% of the variability in Y could be explained by the regression equation, so r² = 1 and SSE = 0. The lowest possible value of r² is 0, indicating that X explains 0% of the variability in Y. Thus, r² can range from a low of 0 to a high of 1. In developing regression equations, a good model will have an r² value close to 1.

Correlation Coefficient

Another measure related to the coefficient of determination is the coefficient of correlation. This measure also expresses the degree or strength of the linear relationship. It is usually expressed as r and can be any number between and including −1 and +1. Figure 4.3 illustrates possible scatter diagrams for different values of r. The value of r is the square root of r². It is negative if the slope is negative, and it is positive if the slope is positive. Thus,

r = ±√r²
(4-11)

For the Triple A Construction example with r² = 0.6944,

r = √0.6944 = 0.8333

We know it is positive because the slope is 1.25.
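A small sketch, again in plain Python, attaches the slope's sign to the square root; the slope b₁ = 1.25 is the coefficient from the Triple A regression line, and r² is the value computed above.

```python
import math

r_squared = 0.6944   # coefficient of determination from Equation 4-10
b1 = 1.25            # slope of the Triple A regression line

# r takes the magnitude sqrt(r^2) and the sign of the slope (Equation 4-11)
r = math.copysign(math.sqrt(r_squared), b1)
print(round(r, 4))   # 0.8333
```

Because b1 is positive, copysign leaves the square root positive; a negative slope would flip the sign of r without changing its magnitude.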


Figure 4.3 Four Values of the Correlation Coefficient
