CHAPTER 24

Analysis of Frequencies, Analysis of Variance, Regression and Correlation Analysis

He who knows not, and knows not that he knows not, is a fool.
Shun him.
He who knows not, and knows that he knows not, is simple.
Teach him.
He who knows, and knows not that he knows, is asleep.
Wake him.
He who knows, and knows that he knows, is wise.
Follow him.

– Arab Proverb

SYNOPSIS

The logic and methodology of statistical tests of significance discussed in the previous chapter are extended in this chapter to cover the comparison of results from more than two samples, for both attribute-type and measurement (variable)-type data. The method of analysis for attribute data is termed 'frequency analysis' and that for measurement data is referred to as 'analysis of variance'.

Analysis of frequencies: illustration

Type 1

Table 24.1 gives the data of 660 patients in a hospital, stratified by (a) the type of insurance they had and (b) the number of days of hospital stay. There are three types of insurance and three categories of hospital stay, as can be seen in Table 24.1. The area of interest is whether there is an association between the type of insurance and the length of stay in the hospital. The data is summarised in Table 24.1 and the details of the analysis in Tables 24.2 and 24.3.

TABLE 24.1 Number of Patients Classified as per Duration of Stay and Type of Insurance


 

TABLE 24.2 Test of Significance: Frequencies

Step no. as per Table 23.3 Answer
1. Is the duration of stay statistically associated with the type of insurance on the assumption that it is not so?
2. Chi-square distribution with associated degrees of freedom (d.f.)
3. Test statistic is χ² = Σ (fo − fe)²/fe with d.f. = (r − 1)(c − 1)
where fo is the observed frequency, fe the expected frequency, r the number of rows, c the number of columns and Σ the symbol for the summation of (fo − fe)²/fe over each of the cells numbering r × c
4. Computed value of the statistic χ² based on the data is 24.315 with d.f. = (3 − 1)(3 − 1) = 4.
Details of the calculations are presented in Table 24.3
5. The type of test: one-sided, higher
6. Level of significance = α = 0.01
7. Value of χ²(0.01, 4) from Table C in Chapter 23 is 13.277
8. Computed value of the test statistic, 24.315 with d.f. 4, exceeds the value obtained from Table C in Chapter 23
9. There is a reason to believe in statistical association between duration and insurance

Note: When the computed value of χ² is very low, it means that the observed frequencies tally remarkably well with the expected frequencies. In such a case, the validity of the data that has led to such a coincidence should be examined. In fact, when Mendel's pea data were subjected to the Chi-square test, the Chi-square value was so small that it led to the suspicion that the data were fudged.
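The Type 1 analysis above can be sketched in code. The cell counts below are hypothetical (the body of Table 24.1 is not reproduced here); only the total of 660 patients and the 3 × 3 layout follow the text. `scipy.stats.chi2_contingency` computes the expected frequencies and the χ² statistic with d.f. = (r − 1)(c − 1).

```python
# Sketch of the test of independence (Type 1 frequency analysis).
# Rows = duration of stay, columns = type of insurance; counts are
# hypothetical, chosen only to total 660 patients as in the text.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [90, 80, 50],   # short stay
    [70, 75, 65],   # medium stay
    [40, 60, 130],  # long stay
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, d.f. = {dof}, p-value = {p:.4f}")
# d.f. = (3 - 1)(3 - 1) = 4; at alpha = 0.01 the critical value is 13.277,
# so independence is rejected when the computed chi-square exceeds it
```

With these illustrative counts the computed χ² far exceeds 13.277, mirroring the chapter's conclusion of an association between stay and insurance.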



TABLE 24.3 Calculation of (fo − fe)²/fe


Type 2

There are six general hospitals A, B, C, D, E and F run by the city corporation. The data on monthly average number of in-patients and their discharges are given in Table 24.4. The area of interest is to know whether the discharge rates differ from hospital to hospital (Table 24.5).

 

TABLE 24.4 Hospital and Patient Load


 

TABLE 24.5 Tests of Significance: Comparison of Frequencies

Step no. as per Table 23.3 Answer
1. Do the discharge rates differ from hospital to hospital, on the assumption that they do not?
2. χ² distribution
3. Test-statistic χ² is given by
χ² = Σ (xi − ni p̂)² / (ni p̂ q̂) with d.f. = k − 1
where k is the number of hospitals (groups), ni the sample size for the ith group (in-patients), xi the number in the ith sample with the point of interest (discharges), n the total of all the samples in the k groups, x = Σ xi the total of all the numbers with the point of interest in the k groups, p̂ = x/n and q̂ = 1 − p̂
4. For the data on hand, it can be seen that k = 6, n = 995, x = 148
ch24-ueq6
5. The type of test: one-sided, higher
6. Level of significance, α = 0.05
7. Value of χ20.05, 5 from the Table C in Chapter 23 is 11.07
8. The value of χ² computed from the data does not exceed that obtained from the table
9. Hence there is no reason to believe that the rate of discharge varies significantly from hospital to hospital

Note: Comparison of frequencies can also be looked upon as comparison of proportions.
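The Type 2 comparison can be sketched with the χ² formula for k proportions given above. The per-hospital figures below are hypothetical (Table 24.4 is not reproduced); only the totals n = 995 and x = 148 match the worked example.

```python
# Sketch of the Type 2 analysis: comparing k discharge proportions.
# Per-hospital splits are hypothetical; their totals match the text
# (n = 995 in-patients, x = 148 discharges, k = 6 hospitals).
import numpy as np

n_i = np.array([150, 180, 160, 170, 165, 170])  # in-patients per hospital
x_i = np.array([22, 28, 23, 26, 24, 25])        # discharges per hospital

n, x = n_i.sum(), x_i.sum()
p_hat = x / n
q_hat = 1 - p_hat

# chi-square statistic with k - 1 degrees of freedom
chi2 = np.sum((x_i - n_i * p_hat) ** 2 / (n_i * p_hat * q_hat))
dof = len(n_i) - 1
print(f"n = {n}, x = {x}, chi-square = {chi2:.3f}, d.f. = {dof}")
# Compare against chi-square(0.05, 5) = 11.07 from Table C in Chapter 23
```

For these illustrative splits the statistic stays well below 11.07, consistent with the chapter's conclusion that the discharge rates do not differ significantly.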

Analysis of variance (ANOVA): comparison of averages of more than two samples, one way classification

ANOVA is, in fact, the analysis of means by comparing variances, and is explained here by an example.

There are four different training methods. Each method is meant to increase the output. Sixteen new employees were assigned for training, four for each method. Data on output obtained after training is given in Table 24.6.

TABLE 24.6 Output and Training Method


Now the point of interest is to know whether the average outputs are statistically different from one another reflecting the superiority of certain method(s). This comparison of more than two means is done by ANOVA.

Here, the data is stratified on the basis of the factor training method. Hence, the ANOVA is called ANOVA with one-way classification. There are also ANOVAs with more than one-way classification. The point to be noted is the logic of the method of analysis, which is common to any type of ANOVA. Steps for carrying out the test of significance for the above data are given in Table 24.7.

Certain points to note

Before proceeding with the steps involved in the test of significance of the differences in the averages between the samples, the following points are to be noted.

  1. Variance ‘s2’ in the set of observations can be computed based on the values within each training method. This is termed the variance within a sample, and all the within-sample variances can be pooled. This is also referred to as error variance.
  2. Variance ‘s2’ can be computed based on the averages of the samples.
  3. On the assumption that the sample averages do not differ statistically, the two variances obtained in (1) and (2) do not statistically differ from one another. This is as per statistical law.
  4. Therefore, the two variances in (1) and (2) are compared through the F-test to establish their statistical significance.
  5. Here, F is defined as:
    F = higher variance / lower variance
    with d.f.1 and d.f.2 corresponding to the higher and lower variance, respectively.
    The higher variance is always taken as the variance due to columns (methods) and the lower variance as the variance within samples. A one-sided α (higher) test is applied.
  6. In case the test turns out to be not significant, it means that the error variance is so high that it has masked the possible differences, if any, due to the columns (methods). Hence, the way the data are collected needs to be examined. One should look for the influence of any special cause operating in each column that may be masking the effect of the factor represented by the column.
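The points above can be sketched in code with `scipy.stats.f_oneway`. The sixteen output values are hypothetical (Table 24.6 is not reproduced): four new employees per method, four methods, giving d.f. 3 and 12.

```python
# Sketch of one-way ANOVA for the training-method example.
# The output figures are hypothetical, chosen only to illustrate the test.
from scipy.stats import f_oneway

method_a = [48, 55, 50, 58]
method_b = [54, 47, 57, 52]
method_c = [51, 60, 46, 53]
method_d = [58, 50, 55, 49]

f_stat, p_value = f_oneway(method_a, method_b, method_c, method_d)
print(f"F = {f_stat:.2f} with d.f. 3 and 12, p-value = {p_value:.4f}")
# Compare F against F(0.05; 3, 12) = 3.49 from Table D in Chapter 23:
# here the within-sample (error) variance dominates, F stays below 3.49,
# and there is no reason to believe the methods differ
```

With these illustrative figures the test is not significant, matching the chapter's own conclusion for Table 24.6.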

Test of significance—comparison of means (one-way ANOVA)

TABLE 24.7 Test of Significance—Comparison of Means

Step no. as per Table 23.3 Answer
1. Do the averages of output obtained from each method differ statistically from one another on the assumption that it is not so?
2. F-distribution
3. Test statistic F is
F = variance due to columns (methods) / error variance
with degrees of freedom d.f.c and d.f.e corresponding to the columns and error variance, respectively
4. The computed value of the test statistic F from data is F = 2.45 with d.f. 3 and 12. Table 24.8 gives the details of the calculations
5. The type of test: one-sided higher
6. Level of significance = α = 0.05
7. Value of F(0.05; 3, 12) from Table D in Chapter 23 is 3.49
8. Computed value of F is less than that obtained from the table
9. There is no reason to believe that any method has a greater impact on increasing the output than the others; all are equal in their impact. Note point 6 above, where a non-significant test of significance is explained.

 

TABLE 24.8 Calculations: ANOVA One-way Classification


ANOVA: two-way classification

Data on doffing time was compiled for 10 operators on each of five machines. The object is to know whether any difference exists in the efficiency of the workers or in the working of the machines. The data is presented in Table 24.9.

TABLE 24.9 Doffing Time Classified by Operator and Machine


Unit: min.

Now the point of interest is to know whether the average doffing time is statistically different from

  1. One operator to another
  2. One machine to another

Here, the data is stratified in two ways, machine-wise and operator-wise, two known special causes. The same points stated in the case of the one-way classification hold good here also. In addition, the variance due to ‘rows’ (machines) needs to be obtained and compared with the error variance through the F-test.

There is one more special cause: the interaction of machines and operators, whereby an operator may perform better on a few machines when compared to others. To test this phenomenon, the data collection needs modification: multiple observations (the same number) need to be taken for each machine-operator combination.

For this data, tests of significance are applied as given in Table 24.10 and calculations in Table 24.11.
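The two-way decomposition can be sketched by computing the row, column and error sums of squares directly. The 5 × 10 doffing times below are hypothetical (Table 24.9 is not reproduced); the degrees of freedom 9, 4 and 36 match the chapter's analysis.

```python
# Sketch of two-way ANOVA without replication for the doffing-time layout.
# rows = machines (5), columns = operators (10); values are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=4.0, scale=0.3, size=(5, 10))  # doffing time, min.

r, c = data.shape
grand = data.mean()

ss_rows = c * np.sum((data.mean(axis=1) - grand) ** 2)   # machines
ss_cols = r * np.sum((data.mean(axis=0) - grand) ** 2)   # operators
ss_total = np.sum((data - grand) ** 2)
ss_error = ss_total - ss_rows - ss_cols

df_rows, df_cols = r - 1, c - 1
df_error = df_rows * df_cols  # 4 x 9 = 36

f_rows = (ss_rows / df_rows) / (ss_error / df_error)
f_cols = (ss_cols / df_cols) / (ss_error / df_error)
print(f"F(rows) = {f_rows:.2f} d.f. {df_rows},{df_error}; "
      f"F(cols) = {f_cols:.2f} d.f. {df_cols},{df_error}")
# Compare with F(0.05; 4, 36) = 2.61 for rows and F(0.05; 9, 36) = 2.12 for columns
```

Each mean square is compared with the error mean square through the F-test, exactly as in steps 3 and 4 of Table 24.10.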

TABLE 24.10 Test of Significance—Comparison of Means (Two-way ANOVA)

Step no. as per Table 23.3 Answer
1.
  1. Does the average doffing time statistically differ from operator to operator?
  2. Does the average doffing time statistically differ from machine to machine?

Under the assumption that they do not differ
2. F distribution
3. Test statistic F is
  1. F = variance due to columns (operators) / error variance
    Degrees of freedom d.f.c and d.f.e corresponding to the columns (operators) and error variance, respectively
  2. F = variance due to rows (machines) / error variance
    Degrees of freedom d.f.r and d.f.e corresponding to the rows (machines) and error variance, respectively
4. Computed values of F corresponding to the above
  1. F (columns) is less than 1 with d.f. 9, 36
  2. F (rows) is less than 1 with d.f. 4, 36
Table 24.11 gives the details of the calculations
5. The type of test: single-sided, higher
6. Level of significance = α = 0.05
7. Value of F(0.05; 9, 36) = 2.12 and F(0.05; 4, 36) = 2.61 from Table D in Chapter 23
8. The computed values of F in both cases are less than the corresponding values of F obtained from the table
9. There is no difference in doffing time either between operators or machines

 

TABLE 24.11 Calculations: ANOVA Two-way Classification


Components of variation

When the result of the analysis shows statistical significance, the analysis of variance model helps to assess the magnitude of the variation due to rows and columns. This knowledge is valuable as it indicates that the factors represented by the rows and columns are sources of variation. Once this is recognised, it leads to several technological investigations into which factors affect the result and the reasons for the variation, thus leading to actions to reduce variation.

In a two-way example of five rows and 10 columns, the ANOVA had shown the results as given in Table 24.12.

TABLE 24.12 ANOVA Table (Two-way Without Replication)


The results indicate that the factor represented by the rows as well as that represented by the columns have a significant effect. What is the magnitude of the variation contributed by each? This is furnished by the components of variation. The analysis, as applicable to the two-way example in Table 24.12, is as follows.

ch24-ueq11
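For reference, the standard component-of-variation estimates for a two-way layout without replication (r rows, c columns) are obtained from the mean squares of the ANOVA table; this is a general sketch, not the chapter's own numerical working.

```latex
% Component-of-variation estimates from the ANOVA mean squares
% (two-way layout without replication, r rows, c columns)
\begin{align}
  \hat{\sigma}^2_{e}             &= MS_{\text{error}} \\
  \hat{\sigma}^2_{\text{rows}}   &= \frac{MS_{\text{rows}} - MS_{\text{error}}}{c} \\
  \hat{\sigma}^2_{\text{cols}}   &= \frac{MS_{\text{cols}} - MS_{\text{error}}}{r}
\end{align}
```

A negative estimate is conventionally read as a component of zero, i.e., that factor contributes no variation beyond error.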

Thus, priority has to be given to the factor represented by the rows, in the present case machines. Find the cause, take action and evaluate the effectiveness. The actions taken are effective if re-evaluation shows that the variability is below 1.76.

Next, the variability of 3.39 is itself high. It includes the interaction between the factors represented by the rows and columns. For the machine-operator example, this interaction can mean that a few operators may be comfortable handling certain machines while others are not. This needs to be investigated.

From the discussion, it follows that it is necessary to collect special data on the factors and responses in a manner that admits ANOVA analysis to get newer insights into the sources of variation and the means to reduce variation.

Regression analysis

In Chapter 19, the scatter diagram is discussed. Suppose the scatter diagram for a certain pair, a factor (x) and its response (y), shows that they are related as in Figure 24.1. The question that arises is whether the linear relationship shown by the scatter diagram is statistically significant enough to accept that x and y are linearly related.

Figure 24.1 Scatter diagram


In both the cases, the nature of the relationship is linear and hence, it is possible to fit the data with a straight line y = b + ax.

If the relationship is statistically significant, it means that a statistically differs from zero, and the same is the case with b.

In the case of the ‘linearity’ of an instrument, if both a and b are not statistically different from zero, then the instrument is free from the linearity error. To be free from the linearity deficiency, a and b should be zero ‘statistically’.

Exercise: regression analysis

In an investigation of the linearity of an instrument, the following data was obtained (Table 24.13). The average bias was based on 12 repeated measurements at each of the five reference values, made on the same component matched with the reference value.

TABLE 24.13 Data on Reference Value and Bias

Ref. value (x) Average bias (y)
2 0.492
4 0.125
6 0.025
8 −0.292
10 −0.617

Is there reason to believe that the instrument is free from linearity statistically?
To answer this question, the first step is to fit the line of best fit to the data, of the form

yi = axi + b

where xi is the reference value, yi the average bias corresponding to xi, a the slope and b the intercept.

The next step is to subject the values of a and b to a statistical test of significance, to know whether each one is statistically different from zero. If they statistically differ from zero, the instrument has a linearity error.
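The fit can be reproduced directly from the data of Table 24.13 using the least-squares formulas given below in this section; a short sketch:

```python
# Fitting the line of best fit y = a*x + b to the data of Table 24.13
# (reference value x vs. average bias y) by least squares.
import numpy as np

x = np.array([2, 4, 6, 8, 10], dtype=float)
y = np.array([0.492, 0.125, 0.025, -0.292, -0.617])

n = len(x)
# slope: a = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2); intercept: b = (Sy - a*Sx) / n
a = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x ** 2) - x.sum() ** 2)
b = (y.sum() - a * x.sum()) / n

print(f"slope a = {a:.5f}, intercept b = {b:.5f}")
# -> slope a = -0.13175, intercept b = 0.73710, matching the chapter's line of best fit
```

The same coefficients would be returned by `np.polyfit(x, y, 1)`.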

Fitting the line of best fit

The layout of the data on reference value (x) and average bias (y) is shown in Table 24.14. Note that the number of observations n is five. Complete the calculations as indicated in Table 24.14.

TABLE 24.14 Layout of Data and Calculations


It is required to fit the line of best fit to the given data

y = ax + b, where a is the slope and b the intercept.

The slope of the best-fitting estimating line is

a = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²)

and the intercept of the best-fitting line is

b = (Σy − aΣx) / n

Thus, the line of best fit is y = 0.73710 − 0.13175x. For any value of ‘x’, the average bias is estimated by the given equation (Table 24.15).

TABLE 24.15 Tests of Significance for a and b of Regression Line

Step no. as per Table 23.3 Answer
1. Is each of the values a = −0.13175 and b = 0.7371 different from zero statistically, on the assumption that it is not so?
2. t distribution with associated d.f.
3. Test statistics t for a and b are as under:
t = a / SE(a) and t = b / SE(b), where SE(a) and SE(b) are the standard errors of a and b (see Table 24.16), each with d.f. = n − 2
4. Computed values of |t| for a and b are 11.53 and 9.8, with d.f. for each being 3. Details of the calculation of the t values are in Table 24.16
5. Type of test: two-sided
6. Level of significance = α = 0.05
7. Value of t(0.025, 3) obtained from the t-table is 3.182
8. Calculated values of t corresponding to a as well as b exceed the value of t obtained from the table
9. The values of a as well as b are statistically different from zero. Hence, linearity is confirmed

 

TABLE 24.16 Calculations of t Values for a and b

1. Note that the values of a, b and n are
a = −0.13175 (slope), b = 0.7371 (intercept), n = 5
2. The standard error of the estimate is
se = √[(Σy² − bΣy − aΣxy)/(n − 2)]
The following results can be verified from the layout of data already given: Σy² = 0.724267, bΣy = −0.19681, aΣxy = 0.905386
3. se = √[(0.724267 + 0.19681 − 0.905386)/3] = √0.00523 = 0.0723
4. SE(a) = se/√(Σx² − n x̄²) = 0.0723/√(220 − 5 × 6²) = 0.0723/√40 = 0.01143
5. SE(b) = se × √[Σx²/(n(Σx² − n x̄²))] = 0.0723 × √(220/200) = 0.0758
6. Thus the t value for a is
t = a/SE(a) = −0.13175/0.01143 = −11.53
and, similarly, the t value for b is b/SE(b) = 0.7371/0.0758 = 9.72 ≈ 9.8
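These t values can be cross-checked with SciPy's linear regression, which reports the standard errors of both the slope and the intercept:

```python
# Cross-checking the t values of Table 24.16 with scipy.stats.linregress,
# using the data of Table 24.13.
import numpy as np
from scipy.stats import linregress

x = np.array([2, 4, 6, 8, 10], dtype=float)
y = np.array([0.492, 0.125, 0.025, -0.292, -0.617])

res = linregress(x, y)
t_slope = res.slope / res.stderr                     # t for a
t_intercept = res.intercept / res.intercept_stderr   # t for b

print(f"slope = {res.slope:.5f}, t(slope) = {t_slope:.2f}")
print(f"intercept = {res.intercept:.5f}, t(intercept) = {t_intercept:.2f}")
# |t| of about 11.5 and 9.7 against t(0.025, 3) = 3.182: both significant
```

Both |t| values exceed 3.182, confirming that the slope and intercept differ statistically from zero.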

Correlation analysis

Correlation analysis is another statistical tool used to describe the degree to which one variable (x) is linearly related to another.

Consider the illustration given in Figure 24.2.

 

Figure 24.2 Understanding variation due to relationship


The linear relationship between x (temperature) and y (yield) as well as that between x (temperature) and y (purity) is found to be statistically significant.

From Figure 24.2, it can be seen that total variability in y is due to two sources (a) variability in y at any given value of x and (b) variability in y caused by changing the value of x from x1 to x2, x3, …, x5, termed as due to regression. It can also be noted that the variability in y due to regression is more pronounced in the case of purity compared to that of yield. This is a very valuable technological insight that could be appropriately used. This is a common feature of any regression analysis. This points to the need for having a measure of the total variability in y that is explained by regression.

Figure 24.3 Understanding inverse, direct and no relationship


Consider the three possible situations given in Figure 24.3 governing the relationship between x and y.

A measure of association that describes the given type of situation is also needed as a supplement to regression analysis.

Thus, two measures are needed to understand and interpret regression line after its statistical significance is established. They are

  1. Coefficient of determination r2, that explains how much of the total variation in y is due to linear relationship with x.
  2. Coefficient of correlation r, to indicate the strength of the relationship between x and y.

Coefficient of determination r², when the line of regression fitted is y = a + bx, is obtained as under:

r² = (aΣy + bΣxy − n(ȳ)²) / (Σy² − n(ȳ)²)

where a is the intercept, b the slope, n the number of data points, x the value of the independent variable, y the value of the dependent variable and ȳ the mean of the observed values of y.
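As a quick check of this formula, a sketch that reuses the instrument-bias data of Table 24.13 (since the body of Table 24.17 is not reproduced here): r² computed from the formula should equal the squared correlation coefficient.

```python
# Computing r-squared from the formula and verifying it against the
# squared correlation coefficient, using the data of Table 24.13.
import numpy as np

x = np.array([2, 4, 6, 8, 10], dtype=float)
y = np.array([0.492, 0.125, 0.025, -0.292, -0.617])
n = len(x)

# Fitted line y = a + b*x (a = intercept, b = slope, matching the formula)
b = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x ** 2) - x.sum() ** 2)
a = (y.sum() - b * x.sum()) / n

y_bar = y.mean()
r2 = (a * y.sum() + b * np.sum(x * y) - n * y_bar ** 2) / (np.sum(y ** 2) - n * y_bar ** 2)

print(f"r^2 = {r2:.4f}")
print(f"corrcoef^2 = {np.corrcoef(x, y)[0, 1] ** 2:.4f}")  # agrees with r^2
```

For this data about 98 per cent of the variation in the average bias is explained by the reference value, which is why the linearity error found earlier is so pronounced.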

Illustrative example

For the data on R&D expenditure (x) and profit (y) obtain the value of r2. The regression line for the data is y = 20 + 2x (Table 24.17).

Approximately, 82.6 per cent of the variation in (y) profit is explained by variation in (x) R&D expenditure.

TABLE 24.17 Layout of Data and Calculations

From the calculations laid out in Table 24.17, r² = 0.826.

Certain points to note: interpretation

  1. Cause–effect way of interpretation. The cause–effect relationship is generally not implied in regression analysis.
    It may not be so where technical factors such as temperature, pressure, rate of addition, yield and purity are involved, provided there is an underlying technological reason for a cause–effect relationship. In such cases, regression analysis helps to look at the relationship more closely and with the authority of data.
    In the majority of cases where non-technical factors (sociological, managerial) are involved, the cause–effect way of looking at regression analysis may be ridiculous. In the above example, if one were to apply cause–effect logic to interpret the results, it amounts to saying: increase expenditure on R&D in order to increase profit. This is ridiculous. The correct interpretation would be that there is a strong association suggesting that a well thought out plan for the flow of more funds to R&D, and their effective utilisation, would enhance profits.
  2. Extrapolation. The regression line is valid only over the same range as that from which the sample was taken. Hence, extrapolation, by estimating the result or conjecturing without verification that the same type of result would apply even beyond the range covered by the regression line, can be erroneous.
  3. While using the results of the past regression analysis, check for the continued validity of the past data and if found valid, use the past results. Validity covers technology, practices, methods, etc.
  4. Coefficient of correlation. The coefficient of correlation is denoted by r, and r = ±√(r²), where the square root takes the positive sign (+) when the slope of the regression line is positive and the negative sign (−) when the slope is negative. Thus, the numerical value of r should carry its sign, + or −.

    For the exercise on R&D expenditure and profit, the sign of the slope in its regression line is positive.

    Therefore, r = +√0.826 = +0.91.

    Coefficient of correlation r is subjected to misinterpretation as a value that explains the amount of variation due to regression. For example, if r = 0.75, to interpret this as 75 per cent of the total variation in (y) due to its relationship with (x) is wrong. Such an interpretation is left to r2 and not r.

Conclusion

Practitioners of continual improvement need to be familiar with the analysis of attribute data by frequency analysis and of variable data by analysis of variance, regression and correlation analysis. More fundamentally, we should be aware of the method of collecting the data so as to facilitate appropriate analysis. These can be learnt with only a knowledge of basic mathematics. One may not get many opportunities to use these techniques routinely, but that is not the important point; the importance lies in the way one's ability to think logically is enhanced by exposure to such techniques. On this issue, one cannot afford an elitist view that confines learning programmes on higher-order statistical techniques to only certain categories of people.
