Chapter 6
Multivariate Statistics

So far we have discussed inference methods for one variable at a time. Data analysts are also interested in multivariate inferential methods, where the relationships between two variables, or between one target variable and a set of predictor variables, are analyzed.

We begin with bivariate analysis, where we have two independent samples and wish to test for significant differences in the means or proportions of the two samples. When would data miners be interested in using bivariate analysis? In Chapter 7, we illustrate how the data is partitioned into a training data set and a test data set for cross-validation purposes. Data miners can use the hypothesis tests shown here to determine whether significant differences exist between the means of various variables in the training and test data sets. If such differences exist, then the cross-validation is invalid, because the training data set is nonrepresentative of the test data set.

  • For a continuous variable, use the two-sample t-test for the difference in means.
  • For a flag variable, use the two-sample Z-test for the difference in proportions.
  • For a multinomial variable, use the test for the homogeneity of proportions.

Of course, there are presumably many variables in each of the training set and test set. However, spot-checking of a few randomly chosen variables is usually sufficient.

6.1 Two-Sample t-Test for Difference in Means

To test for the difference in population means, we use the following test statistic:

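$$t_{\text{data}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where $\bar{x}_i$, $s_i$, and $n_i$ are the sample mean, sample standard deviation, and sample size of sample i. This statistic follows an approximate t distribution with degrees of freedom the smaller of $n_1 - 1$ and $n_2 - 1$, whenever either both populations are normally distributed or both samples are large.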

For example, we partitioned the churn data set into a training set of 2529 records and a test set of 804 records (the reader's partition will differ). We would like to assess the validity of the partition by testing whether the population mean number of customer service calls differs between the two data sets. The summary statistics are given in Table 6.1.

Table 6.1 Summary statistics for customer service calls, training data set, and test data set

Data Set        Sample Mean      Sample Standard Deviation   Sample Size
Training set    c06-math-0004    c06-math-0005               $n_1 = 2529$
Test set        c06-math-0007    c06-math-0008               $n_2 = 804$

Now, the sample means do not look very different, but we would like to have the results of the hypothesis test just to make sure. The hypotheses are

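$$H_0: \mu_1 = \mu_2 \quad \text{versus} \quad H_a: \mu_1 \neq \mu_2$$

where $\mu_1$ and $\mu_2$ are the population mean numbers of customer service calls for the training and test data sets, respectively.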

The test statistic is

equation

The two-tailed p-value for c06-math-0012 is

equation

Since the p-value is large, there is no evidence that the mean number of customer service calls differs between the training data set and the test data set. For this variable at least, the partition seems valid.
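The calculation in this section can be sketched in Python using only the standard library. The summary statistics in the sketch are illustrative placeholders rather than the values of Table 6.1 (only the sample sizes 2529 and 804 come from the text), and the function names are assumptions made for this example. The p-value is obtained by integrating the t density numerically with Simpson's rule, which stands in for the statistical software one would normally use.

```python
import math

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return the t statistic and the conservative degrees of freedom."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    t_data = (mean1 - mean2) / standard_error
    df = min(n1 - 1, n2 - 1)  # the smaller of n1 - 1 and n2 - 1
    return t_data, df

def t_two_tailed_p(t_data, df, steps=2000):
    """Two-tailed p-value, integrating the t density with Simpson's rule."""
    upper = abs(t_data)
    if upper == 0.0:
        return 1.0
    # Log of the normalizing constant of the t density with df degrees of freedom.
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))

    def density(x):
        return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

    h = upper / steps  # steps must be even for Simpson's rule
    total = density(0.0) + density(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    central_area = total * h / 3  # P(0 < T < |t_data|)
    return 2 * (0.5 - central_area)

# Illustrative summary statistics (assumed, not taken from Table 6.1).
t_data, df = two_sample_t(1.57, 1.31, 2529, 1.54, 1.33, 804)
p_value = t_two_tailed_p(t_data, df)
print(f"t = {t_data:.4f}, df = {df}, p-value = {p_value:.4f}")
```

A large p-value from such a check is read as in the text: no evidence that the means differ between the two data sets.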

6.2 Two-Sample Z-Test for Difference in Proportions

Of course not all variables are numeric, like customer service calls. What if we have a 0/1 flag variable – such as membership in the Voice Mail Plan – and wish to test whether the proportions of records with value 1 differ between the training data set and test data set? We could turn to the two-sample Z-test for the difference in proportions. The test statistic is

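$$Z_{\text{data}} = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_{\text{pooled}}\,(1 - \hat{p}_{\text{pooled}})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where $\hat{p}_{\text{pooled}} = \dfrac{x_1 + x_2}{n_1 + n_2}$, and $x_i$ and $\hat{p}_i$ represent the number and the proportion of records with value 1 for sample i, respectively.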

For example, our partition resulted in c06-math-0018 of the $n_1 = 2529$ customers in the training set belonging to the Voice Mail Plan, while c06-math-0020 of the $n_2 = 804$ customers in the test set belong to it, so that c06-math-0022, c06-math-0023, and c06-math-0024.

The hypotheses are

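$$H_0: \pi_1 = \pi_2 \quad \text{versus} \quad H_a: \pi_1 \neq \pi_2$$

where $\pi_i$ denotes the population proportion of records with value 1 for data set i.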

The test statistic is

equation

The p-value is

equation

Thus, there is no evidence that the proportion of Voice Mail Plan members differs between the training and test data sets. For this variable, the partition is valid.
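As a minimal sketch of the two-sample Z-test, the following Python function pools the two samples, computes the test statistic, and obtains the two-tailed p-value from the standard normal distribution through `math.erfc`. The counts passed in are hypothetical (only the sample sizes 2529 and 804 come from the text), and the function name is an assumption for illustration.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Two-sample Z-test for the difference in proportions (pooled estimate)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z_data = (p1 - p2) / standard_error
    # erfc(|z| / sqrt(2)) equals 2 * P(Z > |z|) for a standard normal Z.
    p_value = math.erfc(abs(z_data) / math.sqrt(2))
    return z_data, p_value

# Hypothetical counts of records with flag value 1 in each data set.
z_data, p_value = two_proportion_z(700, 2529, 220, 804)
print(f"Z = {z_data:.4f}, p-value = {p_value:.4f}")
```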

6.3 Test for the Homogeneity of Proportions

Multinomial data is an extension of binomial data to k > 2 categories. For example, suppose a multinomial variable marital status takes the values married, single, and other. Suppose we have a training set of 1000 people and a test set of 250 people, with the frequencies shown in Table 6.2.

Table 6.2 Observed frequencies

Data Set Married Single Other Total
Training set 410 340 250 1000
Test set 95 85 70 250
Total 505 425 320 1250

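To determine whether significant differences exist between the multinomial proportions of the two data sets, we could turn to the test for the homogeneity of proportions [1]. The hypotheses are

$$H_0: p_{\text{married, training}} = p_{\text{married, test}}, \quad p_{\text{single, training}} = p_{\text{single, test}}, \quad p_{\text{other, training}} = p_{\text{other, test}}$$

$$H_a: \text{at least one of the claims in } H_0 \text{ is wrong}$$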

To determine whether these observed frequencies represent proportions that are significantly different for the training and test data sets, we compare them with the frequencies that we would expect if $H_0$ were true. For example, to find the expected frequency for the number of married people in the training set, we (i) find the overall proportion of married people in both the training and test sets, $505/1250$, and (ii) multiply this overall proportion by the number of people in the training set, 1000, giving us the expected frequency of married people in the training set:

$$\text{Expected frequency}_{\text{married, training}} = 1000 \cdot \frac{505}{1250} = 404$$

We use the overall proportion in (i) because $H_0$ states that the training and test proportions are equal. Generalizing, for each cell in the table, the expected frequencies are calculated as follows:

$$\text{Expected frequency} = \frac{(\text{row total})(\text{column total})}{\text{grand total}}$$

Applying this formula to each cell in the table gives us the table of expected frequencies in Table 6.3.

Table 6.3 Expected frequencies

Data Set Married Single Other Total
Training set 404 340 256 1000
Test set 101 85 64 250
Total 505 425 320 1250

The observed frequencies (O) and the expected frequencies (E) are compared using a test statistic from the $\chi^2$ (chi-square) distribution:

$$\chi^2_{\text{data}} = \sum \frac{(O - E)^2}{E}$$

Large differences between the observed and expected frequencies, and thus a large value for $\chi^2_{\text{data}}$, will lead to a small p-value, and a rejection of the null hypothesis. Table 6.4 illustrates how the test statistic is calculated.

Table 6.4 Calculating the test statistic $\chi^2_{\text{data}}$

Cell                Observed Frequency   Expected Frequency   $(O - E)^2 / E$
Married, training   410                  404                  $(410 - 404)^2 / 404 = 0.09$
Married, test       95                   101                  $(95 - 101)^2 / 101 = 0.36$
Single, training    340                  340                  $(340 - 340)^2 / 340 = 0$
Single, test        85                   85                   $(85 - 85)^2 / 85 = 0$
Other, training     250                  256                  $(250 - 256)^2 / 256 = 0.14$
Other, test         70                   64                   $(70 - 64)^2 / 64 = 0.56$
                                                              Sum: $\chi^2_{\text{data}} = 1.15$

The p-value is the area to the right of $\chi^2_{\text{data}}$ under the $\chi^2$ curve with degrees of freedom equal to (number of rows − 1)(number of columns − 1) = (1)(2) = 2:

$$p\text{-value} = P(\chi^2 > \chi^2_{\text{data}}) = P(\chi^2 > 1.15) \approx 0.563$$

Because this p-value is large, there is no evidence that the observed frequencies represent proportions that are significantly different for the training and test data sets. In other words, for this variable, the partition is valid.
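The steps of this section (expected frequencies, test statistic, p-value) can be reproduced with a short Python sketch applied to the observed frequencies of Table 6.2. The function names are assumptions for illustration. The p-value helper relies on a closed form for the right-tail chi-square area that holds only for even degrees of freedom, which is enough for the 2 degrees of freedom here; it is not a general chi-square routine.

```python
import math

def expected_frequencies(observed):
    """Expected counts under H0: (row total)(column total) / (grand total)."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]

def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

def chi_square_p_value(chi2_data, df):
    """Right-tail area under the chi-square curve (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch supports only positive even df")
    half = chi2_data / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(df // 2))

# Observed frequencies from Table 6.2 (rows: training, test).
observed = [[410, 340, 250], [95, 85, 70]]
expected = expected_frequencies(observed)
chi2_data = chi_square_statistic(observed, expected)
df = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = chi_square_p_value(chi2_data, df)
print(expected)
print(f"chi-square = {chi2_data:.2f}, df = {df}, p-value = {p_value:.3f}")
```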

This concludes our coverage of the tests to apply when checking the validity of a partition.

6.4 Chi-Square Test for Goodness of Fit of Multinomial Data

Next, suppose a multinomial variable marital status takes the values married, single, and other, and suppose that we know that 40% of the individuals in the population are married, 35% are single, and 25% report another marital status. We are taking a sample and would like to determine whether the sample is representative of the population. We could turn to the $\chi^2$ (chi-square) goodness of fit test.

The hypotheses for this $\chi^2$ goodness of fit test would be as follows:

$$H_0: p_{\text{married}} = 0.40, \quad p_{\text{single}} = 0.35, \quad p_{\text{other}} = 0.25$$

$$H_a: \text{at least one of the proportions in } H_0 \text{ is wrong}$$

Our sample of size n = 100 yields the following observed frequencies (represented by the letter “O”):

$$O_{\text{married}} = 36, \quad O_{\text{single}} = 35, \quad O_{\text{other}} = 29$$

To determine whether these counts represent proportions that are significantly different from those expressed in $H_0$, we compare these observed frequencies with the frequencies that we would expect if $H_0$ were true. If $H_0$ were true, then we would expect 40% of our sample of 100 individuals to be married, that is, the expected frequency for married is

$$E_{\text{married}} = n \cdot p_{\text{married}} = 100 \cdot 0.40 = 40$$

Similarly,

$$E_{\text{single}} = 100 \cdot 0.35 = 35, \quad E_{\text{other}} = 100 \cdot 0.25 = 25$$

These frequencies are compared using the test statistic:

$$\chi^2_{\text{data}} = \sum \frac{(O - E)^2}{E}$$

Again, large differences between the observed and expected frequencies, and thus a large value for $\chi^2_{\text{data}}$, will lead to a small p-value, and a rejection of the null hypothesis. Table 6.5 illustrates how the test statistic is calculated.

Table 6.5 Calculating the test statistic $\chi^2_{\text{data}}$

Marital Status   Observed Frequency   Expected Frequency   $(O - E)^2 / E$
Married          36                   40                   $(36 - 40)^2 / 40 = 0.40$
Single           35                   35                   $(35 - 35)^2 / 35 = 0$
Other            29                   25                   $(29 - 25)^2 / 25 = 0.64$
                                                           Sum: $\chi^2_{\text{data}} = 1.04$

The p-value is the area to the right of $\chi^2_{\text{data}}$ under the $\chi^2$ curve with k − 1 degrees of freedom, where k is the number of categories (here k = 3):

$$p\text{-value} = P(\chi^2 > \chi^2_{\text{data}}) = P(\chi^2 > 1.04) \approx 0.5945$$

Thus, there is no evidence that the observed frequencies represent proportions that differ significantly from those in the null hypothesis. In other words, our sample is representative of the population.
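The goodness of fit calculation can be checked with a few lines of Python, using the observed frequencies of Table 6.5 and the proportions stated in the null hypothesis. The function name is an assumption for illustration, and the p-value line uses the closed form exp(−x/2), which is valid only because the test has exactly 2 degrees of freedom.

```python
import math

def goodness_of_fit(observed, null_proportions):
    """Return expected frequencies, the chi-square statistic, and df = k - 1."""
    n = sum(observed)
    expected = [n * p for p in null_proportions]
    chi2_data = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return expected, chi2_data, df

observed = [36, 35, 29]                 # married, single, other
null_proportions = [0.40, 0.35, 0.25]   # proportions claimed by H0
expected, chi2_data, df = goodness_of_fit(observed, null_proportions)

# With df = 2, the right-tail chi-square area is exp(-x / 2).
p_value = math.exp(-chi2_data / 2)
print(f"chi-square = {chi2_data:.2f}, df = {df}, p-value = {p_value:.4f}")
```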

6.5 Analysis of Variance

In an extension of the situation for the two-sample t-test, suppose that we have a threefold partition of the data set, and wish to test whether the mean value of a continuous variable is the same across all three subsets. We could turn to one-way analysis of variance (ANOVA). To understand how ANOVA works, consider the following small example. We have samples from Groups A, B, and C, of four observations each, for the continuous variable age, shown in Table 6.6.

Table 6.6 Sample ages for Groups A, B, and C

Group A Group B Group C
30 25 25
40 30 30
50 50 40
60 55 45

The hypotheses are

$$H_0: \mu_A = \mu_B = \mu_C \quad \text{versus} \quad H_a: \text{not all the population means are equal}$$

The sample mean ages are $\bar{x}_A = 45$, $\bar{x}_B = 40$, and $\bar{x}_C = 35$. A comparison dotplot of the data (Figure 6.1) shows that there is a considerable amount of overlap among the three data sets. So, despite the difference in sample means, the dotplot offers little or no evidence to reject the null hypothesis that the population means are all equal.

Figure 6.1 Dotplot of Groups A, B, and C shows considerable overlap.

Next, consider the following samples from Groups D, E, and F, for the continuous variable age, shown in Table 6.7.

Table 6.7 Sample ages for Groups D, E, and F

Group D Group E Group F
43 37 34
45 40 35
45 40 35
47 43 36

Once again, the sample mean ages are $\bar{x}_D = 45$, $\bar{x}_E = 40$, and $\bar{x}_F = 35$. A comparison dotplot of this data (Figure 6.2) illustrates that there is very little overlap among the three data sets. Thus, Figure 6.2 offers good evidence to reject the null hypothesis that the population means are all equal.

Figure 6.2 Dotplot of Groups D, E, and F shows little overlap.

To recapitulate, Figure 6.1 shows no evidence of difference in group means, while Figure 6.2 shows good evidence of differences in group means, even though the respective sample means are the same in both cases. The distinction stems from the overlap among the groups, which itself is a result of the spread within each group. Note that the spread is large for each group in Figure 6.1, and small for each group in Figure 6.2. When the spread within each sample is large (Figure 6.1), the difference in sample means seems small. When the spread within each sample is small (Figure 6.2), the difference in sample means seems large.
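The contrast between the two sets of groups can be confirmed numerically. The following sketch uses Python's `statistics` module on the data of Tables 6.6 and 6.7; the dictionary layout and variable names are choices made for this example. The group means match pairwise (A with D, B with E, C with F), while the sample standard deviations are much larger for Groups A, B, and C than for Groups D, E, and F.

```python
import statistics

# Ages from Table 6.6 (Groups A, B, C) and Table 6.7 (Groups D, E, F).
groups = {
    "A": [30, 40, 50, 60], "B": [25, 30, 50, 55], "C": [25, 30, 40, 45],
    "D": [43, 45, 45, 47], "E": [37, 40, 40, 43], "F": [34, 35, 35, 36],
}

# Sample mean and sample standard deviation (n - 1 divisor) per group.
summary = {name: (statistics.mean(ages), statistics.stdev(ages))
           for name, ages in groups.items()}

for name, (mean, sd) in summary.items():
    print(f"Group {name}: mean = {mean:.1f}, standard deviation = {sd:.2f}")
```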

ANOVA works by performing the following comparison. Compare

  1. the between-sample variability, that is, the variability in the sample means, such as $\bar{x}_A = 45$, $\bar{x}_B = 40$, and $\bar{x}_C = 35$, with
  2. the within-sample variability, that is, the variability within each sample, measured, for example, by the sample standard deviations.

When (1) is much larger than (2), this represents evidence that the population means are not equal. Thus, the analysis depends on measuring variability, hence the term analysis of variance.

Let $\bar{\bar{x}}$ represent the overall sample mean, that is, the mean of all observations from all groups. We measure the between-sample variability by finding the variance of the k sample means, weighted by sample size, and expressed as the mean square treatment (MSTR):

$$\text{MSTR} = \frac{\sum_{i=1}^{k} n_i (\bar{x}_i - \bar{\bar{x}})^2}{k - 1}$$

We measure the within-sample variability by finding the weighted mean of the sample variances, expressed as the mean square error (MSE):

$$\text{MSE} = \frac{\sum_{i=1}^{k} (n_i - 1) s_i^2}{n_t - k}$$

Here $n_i$, $\bar{x}_i$, and $s_i$ denote the size, mean, and standard deviation of sample i, and $n_t = \sum n_i$ is the total number of observations. We compare these two quantities by taking their ratio:

$$F_{\text{data}} = \frac{\text{MSTR}}{\text{MSE}}$$

which follows an F distribution, with degrees of freedom $df_1 = k - 1$ and $df_2 = n_t - k$. The numerator of MSTR is the sum of squares treatment, SSTR, and the numerator of MSE is the sum of squares error, SSE. The total sum of squares (SST) is the sum of SSTR and SSE. A convenient way to display the above quantities is in the ANOVA table, shown in Table 6.8.

Table 6.8 ANOVA table

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square                        F
Treatment             SSTR             $df_1 = k - 1$       $\text{MSTR} = \text{SSTR}/df_1$   $F_{\text{data}} = \text{MSTR}/\text{MSE}$
Error                 SSE              $df_2 = n_t - k$     $\text{MSE} = \text{SSE}/df_2$
Total                 SST

The test statistic $F_{\text{data}}$ will be large when the between-sample variability is much greater than the within-sample variability, which is indicative of a situation calling for rejection of the null hypothesis. The p-value is $P(F > F_{\text{data}})$; reject the null hypothesis when the p-value is small, which happens when $F_{\text{data}}$ is large.

For example, let us verify our claim that Figure 6.1 showed little or no evidence that the population means were not equal. Table 6.9 shows the Minitab ANOVA results.

Table 6.9 ANOVA results for H0 : μA = μB = μC

Source      DF   SS       MS      F      P
Treatment    2    200.0   100.0   0.64   0.548
Error        9   1400.0   155.6
Total       11   1600.0

The p-value of 0.548 indicates that there is no evidence against the null hypothesis that all population means are equal. This bears out our earlier claim. Next let us verify our claim that Figure 6.2 showed evidence that the population means were not equal. Table 6.10 shows the Minitab ANOVA results.

Table 6.10 ANOVA results for H0 : μD = μE = μF

Source      DF   SS       MS       F       P
Treatment    2   200.00   100.00   32.14   0.000
Error        9    28.00     3.11
Total       11   228.00

The p-value of approximately zero indicates that there is strong evidence that not all the population mean ages are equal, thus supporting our earlier claim. For more on ANOVA, see Larose (2013) [1].
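The two ANOVA results can also be reproduced by hand from the data in Tables 6.6 and 6.7. The Python sketch that follows computes MSTR, MSE, and the F statistic directly from their definitions; the function names are assumptions for illustration. The p-value helper uses a closed form for the right-tail F area that holds only when the numerator degrees of freedom equal 2, that is, for exactly three groups, so it is a shortcut for this example rather than a general F distribution routine.

```python
def one_way_anova(groups):
    """Return (MSTR, MSE, F, df1, df2) for a list of samples."""
    k = len(groups)
    n_t = sum(len(g) for g in groups)
    overall_mean = sum(sum(g) for g in groups) / n_t
    means = [sum(g) / len(g) for g in groups]
    # SSTR: squared deviations of group means from the overall mean, weighted by n_i.
    sstr = sum(len(g) * (m - overall_mean) ** 2 for g, m in zip(groups, means))
    # SSE: squared deviations of observations from their own group mean.
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df1, df2 = k - 1, n_t - k
    mstr, mse = sstr / df1, sse / df2
    return mstr, mse, mstr / mse, df1, df2

def f_p_value_three_groups(f_data, df2):
    """P(F > f_data) when df1 = 2 (three groups only)."""
    return (1 + 2 * f_data / df2) ** (-df2 / 2)

abc = [[30, 40, 50, 60], [25, 30, 50, 55], [25, 30, 40, 45]]  # Table 6.6
def_ = [[43, 45, 45, 47], [37, 40, 40, 43], [34, 35, 35, 36]]  # Table 6.7

for label, data in (("A, B, C", abc), ("D, E, F", def_)):
    mstr, mse, f_data, df1, df2 = one_way_anova(data)
    p_value = f_p_value_three_groups(f_data, df2)
    print(f"Groups {label}: MSTR = {mstr:.2f}, MSE = {mse:.2f}, "
          f"F = {f_data:.2f}, p-value = {p_value:.5f}")
```

Both data sets share the same MSTR because their sample means are identical; only the MSE, and therefore F, differs.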

Regression analysis represents another multivariate technique, comparing a single predictor with the target in the case of Simple Linear Regression, and comparing a set of predictors with the target in the case of Multiple Regression. We cover these topics in their own chapters, Chapters 8 and 9, respectively.

Reference

  1. Much more information regarding the topics covered in this chapter may be found in any introductory statistics textbook, such as Discovering Statistics, second edition, by Daniel T. Larose, W. H. Freeman, New York, 2013.

Exercises

1. In Chapter 7, we will learn to split the data set into a training data set and a test data set. To test whether there exist unwanted differences between the training and test sets, which hypothesis test do we perform for each of the following types of variables?

  a. Flag variable
  b. Multinomial variable
  c. Continuous variable

2. Table 6.11 contains information on the mean duration of customer service calls for a training and a test data set. Test whether the partition is valid for this variable, using c06-math-0093.

Table 6.11 Summary statistics for duration of customer service calls

Data Set        Sample Mean      Sample Standard Deviation   Sample Size
Training set    c06-math-0096    c06-math-0097               c06-math-0098
Test set        c06-math-0099    c06-math-0100               c06-math-0101

3. Our partition shows that 800 of the 2000 customers in our test set own a tablet, while 230 of the 600 customers in our training set own a tablet. Test whether the partition is valid for this variable, using c06-math-0094.

4. Table 6.12 contains the counts for the marital status variable for the training and test set data. Test whether the partition is valid for this variable, using c06-math-0095.

Table 6.12 Observed frequencies for marital status

Data Set        Married   Single   Other   Total
Training set    800       750      450     2000
Test set        240       250      110     600
Total           1040      1000     560     2600

5. The multinomial variable payment preference takes the values credit card, debit card, and check. Now, suppose we know that 50% of the customers in our population prefer to pay by credit card, 20% prefer debit card, and 30% prefer to pay by check. We have taken a sample from our population, and would like to determine whether it is representative of the population. The sample of size 200 shows 125 customers preferring to pay by credit card, 25 by debit card, and 50 by check. Test whether the sample is representative of the population, using c06-math-0102.

6. Suppose we wish to test for difference in population means among three groups.

  a. Explain why it is not sufficient to simply look at the differences among the sample means, without taking into account the variability within each group.
  b. Describe what we mean by between-sample variability and within-sample variability.
  c. Which statistics measure the concepts in (b)?
  d. Explain how ANOVA would work in this situation.

7. Table 6.13 contains the amount spent (in dollars) in a random sample of purchases where the payment was made by credit card, debit card, and check, respectively. Test whether the population mean amount spent differs among the three groups, using c06-math-0103.

Table 6.13 Purchase amounts for three payment methods

Credit Card   Debit Card   Check
100           80           50
110           120          70
90            90           80
100           110          80

8. Refer to the previous exercise. Now test whether the population mean amount spent differs among the three groups, using c06-math-0104. Describe any conflict between your two conclusions. Suggest at least two courses of action to ameliorate the situation.