Both the MSE and $r^2$ provide a measure of accuracy in a regression model. However, when the sample size is too small, it is possible to get good values for both of these even if there is no relationship between the variables in the regression model. To determine whether these values are meaningful, it is necessary to test the model for significance.
To see if there is a linear relationship between X and Y, a statistical hypothesis test is performed. The underlying linear model was given in Equation 4-1 as

$$Y = \beta_0 + \beta_1 X + \epsilon$$

If $\beta_1 = 0$, then Y does not depend on X in any way. The null hypothesis says there is no linear relationship between the two variables (i.e., $\beta_1 = 0$). The alternate hypothesis is that there is a linear relationship (i.e., $\beta_1 \neq 0$). If the null hypothesis can be rejected, then there is statistically significant evidence that a linear relationship exists, so X is helpful in predicting Y. The F distribution is used for testing this hypothesis. Appendix D contains values for the F distribution that can be used when calculations are performed by hand. See Chapter 2 for a review of the F distribution. The results of the test can also be obtained from both Excel and QM for Windows.
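The F statistic used in the hypothesis test is based on the MSE (seen in the previous section) and the mean squared regression (MSR). The MSR is calculated as

$$\text{MSR} = \frac{\text{SSR}}{k}$$

where

$k$ = number of independent variables in the model

The F statistic is

$$F = \frac{\text{MSR}}{\text{MSE}}$$

Based on the assumptions regarding the errors in a regression model, this calculated F statistic is described by the F distribution with

degrees of freedom for the numerator = $df_1 = k$

degrees of freedom for the denominator = $df_2 = n - k - 1$

where

$k$ = the number of independent (X) variables

$n$ = the sample size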
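The calculation of MSR, MSE, F, and the two degrees of freedom can be sketched in a few lines of Python. This is only an illustration: the function name `f_statistic` and the input values are hypothetical and are chosen to make the arithmetic easy to follow, not taken from any example in the text.

```python
def f_statistic(ssr, sse, n, k):
    """Return (MSR, MSE, F, df1, df2) for a regression with k X variables.

    ssr: sum of squares due to regression
    sse: sum of squared errors (residual sum of squares)
    n:   sample size
    k:   number of independent (X) variables
    """
    df1 = k              # numerator degrees of freedom
    df2 = n - k - 1      # denominator degrees of freedom
    if df1 < 1 or df2 < 1:
        raise ValueError("need k >= 1 and n > k + 1")
    msr = ssr / df1      # mean squared regression
    mse = sse / df2      # mean squared error
    return msr, mse, msr / mse, df1, df2


# Hypothetical inputs (not from the text), used only to show the mechanics.
msr, mse, f_value, df1, df2 = f_statistic(ssr=40.0, sse=10.0, n=12, k=1)
print(f"MSR = {msr}, MSE = {mse}, F = {f_value}, df = ({df1}, {df2})")
```

With these made-up inputs the error term is small relative to the regression term, so F comes out large, which is the pattern the text associates with a useful model.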
If there is very little error, the denominator (MSE) of the F statistic is very small relative to the numerator (MSR), and the resulting F statistic will be large. This is an indication that the model is useful. A significance level related to the value of the F statistic is then found. Whenever the F value is large, the observed significance level (p-value) will be low, indicating that an F value this large would be extremely unlikely to occur by chance if there were no linear relationship. When the F value is large (with a resulting small significance level), we can reject the null hypothesis that there is no linear relationship. This means that there is a linear relationship and the values of MSE and $r^2$ are meaningful.
The hypothesis test just described is summarized here:
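1. Specify null and alternative hypotheses:

   $$H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0$$

2. Select the level of significance ($\alpha$). Common values are 0.01 and 0.05.
3. Calculate the value of the test statistic using the formula

   $$F = \frac{\text{MSR}}{\text{MSE}}$$

4. Make a decision using one of the following methods:
   a. Reject the null hypothesis if the test statistic is greater than the F value from the table in Appendix D. Otherwise, do not reject the null hypothesis:

      $$\text{Reject if } F_{\text{calculated}} > F_{\alpha, df_1, df_2}, \qquad df_1 = k, \quad df_2 = n - k - 1$$

   b. Reject the null hypothesis if the observed significance level, or p-value, is less than the level of significance ($\alpha$). Otherwise, do not reject the null hypothesis:

      $$p\text{-value} = P(F > \text{calculated test statistic}), \qquad \text{Reject if } p\text{-value} < \alpha$$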
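The two decision methods in the last step can be written as a pair of small Python helpers. This is a minimal sketch: the function names are hypothetical, the critical value is assumed to be looked up by hand in an F table, and the numbers in the usage lines are placeholders rather than results from the text.

```python
def reject_by_critical_value(f_calculated, f_critical):
    """Critical-value method: reject H0 when the calculated F exceeds the
    tabulated F for the chosen alpha and degrees of freedom."""
    return f_calculated > f_critical


def reject_by_p_value(p_value, alpha=0.05):
    """p-value method: reject H0 when the observed significance level
    is smaller than the chosen level of significance."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return p_value < alpha


# Placeholder values for illustration only.
print(reject_by_critical_value(f_calculated=12.3, f_critical=4.96))  # True
print(reject_by_p_value(p_value=0.20, alpha=0.05))                   # False
```

Both helpers answer the same question. For a given calculated F, significance level, and pair of degrees of freedom, the two methods lead to the same decision, because the p-value falls below the significance level exactly when the calculated F exceeds the tabulated value.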
To illustrate the process of testing the hypothesis about a significant relationship, consider the Triple A Construction example. Appendix D will be used to provide values for the F distribution.
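Step 1. Specify the null and alternative hypotheses:

$$H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0$$

Step 2. Select $\alpha = 0.05$.

Step 3. Calculate the value of the test statistic. The MSE was already calculated to be 1.7188. The MSR is then calculated so that F can be found:

$$\text{MSR} = \frac{\text{SSR}}{k} = \frac{15.6250}{1} = 15.6250$$

$$F = \frac{\text{MSR}}{\text{MSE}} = \frac{15.6250}{1.7188} = 9.09$$

Step 4. Reject the null hypothesis if the test statistic is greater than the F value from the table in Appendix D:

$$df_1 = k = 1 \qquad df_2 = n - k - 1 = 6 - 1 - 1 = 4$$

The value of F associated with a 5% level of significance and with degrees of freedom 1 and 4 is found in Appendix D. Figure 4.5 illustrates this:

$$F_{0.05,1,4} = 7.71 \qquad F_{\text{calculated}} = 9.09$$

Reject $H_0$ because $9.09 > 7.71$.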
Thus, there is sufficient evidence to conclude that there is a statistically significant relationship between X and Y, so the model is helpful. The strength of this relationship is measured by $r^2 = 0.69$. Thus, we can conclude that about 69% of the variability in sales (Y) is explained by the regression model based on local payroll (X).
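The hand calculation for Triple A Construction can be checked with a short Python script. The MSE of 1.7188 and the degrees of freedom 1 and 4 come from the text. The SSR of 15.625 and the sample size of 6 are assumptions that are consistent with those figures, and 7.71 is the standard tabulated F value for a 0.05 significance level with 1 and 4 degrees of freedom.

```python
# Triple A Construction inputs.
mse = 1.7188        # from the text
k = 1               # one independent variable (local payroll)
n = 6               # assumed; implied by df2 = n - k - 1 = 4
ssr = 15.625        # assumed; consistent with the MSE and F in the text
f_critical = 7.71   # tabulated F for alpha = 0.05 with (1, 4) df

msr = ssr / k                    # mean squared regression
f_calculated = msr / mse         # test statistic
df1, df2 = k, n - k - 1          # numerator and denominator df
reject_h0 = f_calculated > f_critical

print(f"MSR = {msr:.4f}, F = {f_calculated:.2f}, df = ({df1}, {df2})")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

The script reproduces the comparison made by hand: the calculated F is larger than the tabulated value, so the null hypothesis of no linear relationship is rejected.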
When software such as Excel or QM for Windows is used to develop regression models, the output provides the observed significance level, or p-value, for the calculated F value. This is then compared to the level of significance to make the decision.
Table 4.4 provides summary information about the ANOVA table. This shows how the numbers in the last three columns of the table are computed. The last column of this table, labeled Significance F, is the p-value, or observed significance level, which can be used in the hypothesis test about the regression model.
| | DF | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | k | SSR | MSR = SSR/k | MSR/MSE | P(F > MSR/MSE) |
| Residual | n − k − 1 | SSE | MSE = SSE/(n − k − 1) | | |
| Total | n − 1 | SST | | | |
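The DF, SS, MS, and F columns of the ANOVA table can be assembled from SSR, SSE, n, and k, as in the following Python sketch. The function name `anova_rows` is hypothetical. The inputs are assumed Triple A Construction values (SSR = 15.625, SSE = 6.875, n = 6) that are consistent with the MSE and F reported in the text. The Significance F column is left out here because it requires a tail probability of the F distribution.

```python
def anova_rows(ssr, sse, n, k):
    """Build the DF, SS, MS, and F entries of a regression ANOVA table."""
    df_reg, df_res = k, n - k - 1
    if df_reg < 1 or df_res < 1:
        raise ValueError("need k >= 1 and n > k + 1")
    msr = ssr / df_reg
    mse = sse / df_res
    return [
        {"source": "Regression", "df": df_reg, "ss": ssr, "ms": msr, "f": msr / mse},
        {"source": "Residual", "df": df_res, "ss": sse, "ms": mse, "f": None},
        {"source": "Total", "df": n - 1, "ss": ssr + sse, "ms": None, "f": None},
    ]


def cell(value):
    """Format one table cell; empty cells are stored as None."""
    if value is None:
        return ""
    if isinstance(value, float):
        return f"{value:.4f}"
    return str(value)


# Assumed inputs, consistent with MSE = 1.7188 and F = 9.09 in the text.
rows = anova_rows(ssr=15.625, sse=6.875, n=6, k=1)
print(f"{'':<12}{'DF':>4}{'SS':>10}{'MS':>10}{'F':>10}")
for row in rows:
    print(f"{row['source']:<12}{cell(row['df']):>4}{cell(row['ss']):>10}"
          f"{cell(row['ms']):>10}{cell(row['f']):>10}")
```

Storing empty cells as `None` keeps the layout of the printed table the same as the summary table, where the Residual and Total rows have no F entry.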
The Excel output that includes the ANOVA table for the Triple A Construction data is shown in the next section. The observed significance level for $F = 9.09$ is given to be 0.0394. This means

$$P(F > 9.09) = 0.0394$$

Because this probability is less than $\alpha = 0.05$, we would reject the hypothesis of no linear relationship and conclude that there is a linear relationship between X and Y. Note in Figure 4.5 that the area under the curve to the right of 9.09 is clearly less than 0.05, which is the area to the right of the F value associated with a 0.05 level of significance.
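For readers without Excel or QM for Windows, the observed significance level can be approximated with the Python standard library alone. The sketch below is one possible approach, not the method used by either program: it evaluates the upper tail of the F distribution through the regularized incomplete beta function, computed with a standard continued-fraction expansion. All function names are hypothetical, and the F value of 9.0909 is the text's rounded 9.09 carried to four decimals, which corresponds to the assumed ratio 15.625/1.71875.

```python
import math


def _beta_cf(a, b, x, max_iter=200, eps=1e-12):
    """Continued fraction for the incomplete beta function (Lentz's method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((a - 1.0 + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (a + b + m) * x / ((a + m2) * (a + 1.0 + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_p_value(f_value, df1, df2):
    """Upper-tail probability P(F > f_value) for an F(df1, df2) variable."""
    if f_value <= 0.0:
        return 1.0
    x = df2 / (df2 + df1 * f_value)
    return reg_inc_beta(df2 / 2.0, df1 / 2.0, x)


alpha = 0.05
p_value = f_p_value(9.0909, df1=1, df2=4)
print(f"p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```

The result should land close to the 0.0394 reported in the Excel output, and because it is below 0.05 the decision matches the one reached with the critical-value method. In routine work, a statistics package's F distribution functions would normally replace the hand-written helpers.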