Chapter 15

Correlation: The Rise and Fall of Relationships

IN THIS CHAPTER

Understanding what correlation is all about

Connecting correlation and regression

Drawing conclusions from correlations

Analyzing items

In Chapter 14, I show you the ins and outs of regression, a tool for summarizing relationships between (and among) variables. In this chapter, I introduce you to the ups and downs of correlation, another tool for looking at relationships.

I use the example of SAT scores and GPA from Chapter 14, and show how to think about the data in a slightly different way. The new concepts connect to what I show you in the preceding chapter, and you’ll see how that works. I also show you how to test hypotheses about relationships and how to use Excel functions and data analysis tools for correlation.

Scatterplots Again

A scatterplot is a graphical way of showing a relationship between two variables. Figure 15-1 is a scatterplot that represents the GPAs and SAT scores of 20 students at the fictional Sahutsket University. The GPAs are on a 4.0 scale and the SATs are total SAT scores.


FIGURE 15-1: A scatterplot of 20 students’ SATs and GPAs at Sahutsket University.

Each point represents one student. A point’s location in the horizontal direction represents the student’s SAT. That same point’s location in the vertical direction represents the student’s GPA.

Understanding Correlation

In Chapter 14, I refer to the SAT as the independent variable and to the GPA as the dependent variable. The objective in Chapter 14 is to use SAT to predict GPA. Here’s a very important point: Although I use scores on one variable to predict scores on the other, I do not mean that the score on one variable causes a score on the other. Relationship doesn’t necessarily mean causality.

Correlation is a statistical way of looking at a relationship. When two things are correlated, it means that they vary together. Positive correlation means that high scores on one are associated with high scores on the other, and that low scores on one are associated with low scores on the other. The scatterplot in Figure 15-1 is an example of positive correlation.

Negative correlation, on the other hand, means that high scores on the first thing are associated with low scores on the second. Negative correlation also means that low scores on the first are associated with high scores on the second. An example is the correlation between body weight and the time spent on a weight-loss program. If the program is effective, the higher the amount of time spent on the program, the lower the body weight. Also, the lower the amount of time spent on the program, the higher the body weight.

Table 15-1, a repeat of Table 14-2, shows the data from the scatterplot.

TABLE 15-1 SAT Scores and GPAs for 20 Sahutsket University Students

Student              SAT        GPA

1                    990        2.2

2                    1150       3.2

3                    1080       2.6

4                    1100       3.3

5                    1280       3.8

6                    990        2.2

7                    1110       3.2

8                    920        2.0

9                    1000       2.2

10                   1200       3.6

11                   1000       2.1

12                   1150       2.8

13                   1070       2.2

14                   1120       2.1

15                   1250       2.4

16                   1020       2.2

17                   1060       2.3

18                   1550       3.9

19                   1480       3.8

20                   1010       2.0

Mean                 1126.5     2.705

Variance             26171.32   0.46

Standard Deviation   161.78     0.68

In keeping with the way I use SAT and GPA in Chapter 14, SAT is the x-variable and GPA is the y-variable.

The formula for calculating the correlation between the two is

$$ r = \frac{\sum (x - \bar{x})(y - \bar{y}) \,/\, (N-1)}{s_x s_y} $$

The term on the left, r, is called the correlation coefficient. It’s also called Pearson’s product-moment correlation coefficient, after its creator, Karl Pearson.

The two terms in the denominator on the right are the standard deviation of the x-variable and the standard deviation of the y-variable. The term in the numerator is called the covariance. So another way to write this formula is

$$ r = \frac{\mathrm{cov}_{xy}}{s_x s_y} $$

The covariance represents x and y varying together. Dividing the covariance by the product of the two standard deviations imposes some limits. The lower limit of the correlation coefficient is –1.00, and the upper limit is +1.00.

A correlation coefficient of –1.00 represents perfect negative correlation (low x-scores associated with high y-scores, and high x-scores associated with low y-scores). A correlation of +1.00 represents perfect positive correlation (low x-scores associated with low y-scores, and high x-scores associated with high y-scores). A correlation of 0.00 means that the two variables have no linear relationship.

Applying the formula to the data in Table 15-1,

$$ r = \frac{\mathrm{cov}_{xy}}{s_x s_y} = \frac{89.65}{(161.78)(0.68)} \approx .82 $$
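Excel can run this same check for you. Here’s a sketch that assumes the SAT scores are in C3:C22 and the GPAs in D3:D22 — the layout I use in the CORREL example later in this chapter:

=COVARIANCE.S(C3:C22,D3:D22)/(STDEV.S(C3:C22)*STDEV.S(D3:D22))

COVARIANCE.S and STDEV.S both use N-1, matching the formula for r, so this returns the same answer as CORREL. (More on all these functions in a moment.)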

What, exactly, does this number mean? I’m about to tell you.

Correlation and Regression

Figure 15-2 shows the scatterplot with the line that "best fits" the points. It’s possible to draw an infinite number of lines through these points. Which one is best?


FIGURE 15-2: Scatterplot of the 20 students, including the regression line.

To be “best,” a line has to meet a specific standard: If you draw the distances in the vertical direction between the points and the line, and you square those distances, and then you add those squared distances, the best-fitting line is the one that makes the sum of those squared distances as small as possible. This line is called the regression line.

The regression line’s purpose in life is to enable you to make predictions. As I mention in Chapter 14, without a regression line, your best predicted value of the y-variable is the mean of the y’s. A regression line takes the x-variable into account and delivers a more precise prediction. Each point on the regression line represents a predicted value for y. In the symbology of regression, each predicted value is a y’.
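If all you want is a quick predicted value, Excel can give you one without your having to construct the line. Here’s a sketch that uses the FORECAST worksheet function, assuming the same C3:C22 (SAT) and D3:D22 (GPA) layout; the SAT score of 1200 is a value I picked just for illustration:

=FORECAST(1200,D3:D22,C3:C22)

FORECAST’s first argument is the x-value you’re predicting from; the next two are the known y’s and the known x’s, in that order. (Newer versions of Excel also offer FORECAST.LINEAR, which works the same way.)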

Why do I tell you all of this? Because correlation is closely related to regression. Figure 15-3 focuses on one point in the scatterplot, and its distance to the regression line and to the mean. (This is a repeat of Figure 14-3.)


FIGURE 15-3: One point in the scatterplot and its associated distances.

Notice the three distances laid out in the figure. The distance labeled (y-y’) is the difference between the point and the regression line’s prediction for where the point should be. (In Chapter 14, I call that a residual.) The distance labeled (y-ȳ) is the difference between the point and the mean of the y’s. The distance labeled (y’-ȳ) is the gain in prediction capability that you get from using the regression line to predict the point instead of using the mean to predict the point.

Figure 15-3 shows the three distances are related like this:

$$ (y - \bar{y}) = (y - y') + (y' - \bar{y}) $$

As I point out in Chapter 14, you can square all the residuals and add them, square all the deviations of the predicted points from the mean and add them, and square all the deviations of the actual points from the mean and add them, too.

It turns out that these sums of squares are related in the same way as the deviations I just showed you:

$$ SS_{\text{Total}} = SS_{\text{Residual}} + SS_{\text{Regression}} $$

If SSRegression is large in comparison to SSResidual, it indicates that the relationship between the x-variable and the y-variable is a strong one. It means that throughout the scatterplot, the variability around the regression line is small.

On the other hand, if SSRegression is small in comparison to SSResidual, it means that the relationship between the x-variable and the y-variable is weak. In this case, the variability around the regression line is large throughout the scatterplot.

One way to test SSRegression against SSResidual is to divide each by its degrees of freedom (1 for SSRegression and N-2 for SSResidual) to form variance estimates (also known as Mean Squares, or MS), and then divide one by the other to calculate an F. If MSRegression is significantly larger than MSResidual, you have evidence that the x-y relationship is strong. (See Chapter 14 for details.)
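On a worksheet, that test boils down to two short formulas. In this sketch, the cell locations are hypothetical — assume MSRegression is in G5 and MSResidual is in G6:

=G5/G6

=F.DIST.RT(G5/G6,1,18)

The first formula is F; the second returns the probability of an F at least that large if H0 is true, with 1 and N-2 = 18 degrees of freedom for this example. A probability less than α means a significant relationship.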

Here’s the clincher, as far as correlation is concerned: Another way to assess the size of SSRegression is to compare it with SSTotal. Divide the first by the second. If the ratio is large, this tells you the x-y relationship is strong. This ratio has a name. It’s called the coefficient of determination. Its symbol is r2. Take the square root of this coefficient, and you have … the correlation coefficient!

$$ r = \pm\sqrt{\frac{SS_{\text{Regression}}}{SS_{\text{Total}}}} $$

The plus-or-minus sign (±) means that r is either the positive or negative square root, depending on whether the slope of the regression line is positive or negative.

So, if you calculate a correlation coefficient and you quickly want to know what its value signifies, just square it. The answer — the coefficient of determination — lets you know the proportion of the SSTotal that’s tied up in the relationship between the x-variable and the y-variable. If it’s a large proportion, the correlation coefficient signifies a strong relationship. If it’s a small proportion, the correlation coefficient signifies a weak relationship.

In the GPA-SAT example, the correlation coefficient is .817. The coefficient of determination is

$$ r^2 = (.817)^2 = .667 $$

In this sample of 20 students, the SSRegression is 66.7 percent of the SSTotal. Sounds like a large proportion, but what’s large? What’s small? These questions scream out for hypothesis tests.
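To see this proportion in Excel, you can square CORREL’s answer directly (again assuming the C3:C22 and D3:D22 layout from earlier):

=CORREL(C3:C22,D3:D22)^2

which returns .667 (along with additional decimal places).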

Testing Hypotheses About Correlation

In this section, I show you how to answer important questions about correlation. Like any other kind of hypothesis testing, the idea is to use sample statistics to make inferences about population parameters. Here, the sample statistic is r, the correlation coefficient. By convention, the population parameter is ρ (rho), the Greek equivalent of r. (Yes, it does look like the letter p, but it really is the Greek equivalent of r.)

Two kinds of questions are important in connection with correlation: (1) Is a correlation coefficient greater than zero? (2) Are two correlation coefficients different from one another?

Is a correlation coefficient greater than zero?

Returning once again to the Sahutsket SAT-GPA example, you can use the sample r to test hypotheses about the population ρ — the correlation coefficient for all students at Sahutsket University.

Assuming you know in advance (before you gather any sample data) that any correlation between SAT and GPA should be positive, the hypotheses are

H0: ρ ≤ 0

H1: ρ > 0

I set α = .05.

The appropriate statistical test is a t-test. The formula is

$$ t = \frac{r - \rho}{s_r} $$

This test has N-2 df.

For the example, the values in the numerator are set: r is .817 and ρ (in H0) is zero. What about the denominator? I won’t burden you with the details. I’ll just tell you that it’s

$$ s_r = \sqrt{\frac{1 - r^2}{N - 2}} $$

With a little algebra, the formula for the t-test simplifies to

$$ t = \frac{r\sqrt{N-2}}{\sqrt{1 - r^2}} $$

For the example,

$$ t = \frac{.817\sqrt{20-2}}{\sqrt{1 - (.817)^2}} = 6.01 $$

With df = 18 and α = .05 (one-tailed), the critical value of t is 1.73. Because the calculated value is greater than the critical value, the decision is to reject H0.
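Excel can supply that critical value. T.INV returns the t-value with a given probability to its left, so for a one-tailed test with α = .05 and 18 df, you ask for the value that leaves .95 below it:

=T.INV(0.95,18)

The answer is 1.73 and change.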

Do two correlation coefficients differ?

In a sample of 24 students at Farshimmelt College, the correlation between SAT and GPA is .752. Is this different from the correlation (.817) at Sahutsket University? If you have no way of assuming that one correlation should be higher than the other, the hypotheses are

H0: ρSahutsket = ρFarshimmelt

H1: ρSahutsket ≠ ρFarshimmelt

Again, α = .05.

For highly technical reasons, you can’t set up a t-test for this one. In fact, you can’t even work with .817 and .752, the two correlation coefficients.

Instead, what you do is transform each correlation coefficient into something else and then work with the two “something elses” in a formula that gives you — believe it or not — a z-test.

technicalstuff The transformation is called Fisher’s r to z transformation. Fisher is the statistician who’s remembered as the F in the F-test. The transformation turns r into z like this:

$$ z_r = \frac{1}{2}\log_e\left(\frac{1+r}{1-r}\right) $$

If you know what log_e (the natural logarithm) means, fine. If not, don’t worry about it. (I explain it in Chapter 22.) Excel takes care of all of this for you, as you see in a moment.

Anyway, for this example

$$ z_{.817} = \frac{1}{2}\log_e\left(\frac{1+.817}{1-.817}\right) = 1.1477 $$

$$ z_{.752} = \frac{1}{2}\log_e\left(\frac{1+.752}{1-.752}\right) = 0.9775 $$

After you transform r to z, the formula is

$$ Z = \frac{z_1 - z_2}{\sigma_{z_1 - z_2}} $$

The denominator turns out to be easier than you might think. It is

$$ \sigma_{z_1 - z_2} = \sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}} $$

For this example:

$$ \sigma_{z_1 - z_2} = \sqrt{\frac{1}{20 - 3} + \frac{1}{24 - 3}} = 0.3263 $$

The whole formula is

$$ Z = \frac{1.1477 - 0.9775}{0.3263} = 0.52 $$

The next step is to compare the calculated value to a standard normal distribution. For a two-tailed test with α = .05, the critical values in a standard normal distribution are 1.96 in the upper tail and -1.96 in the lower tail. The calculated value falls between those two, so the decision is to not reject H0.
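If you prefer an exact probability to a critical-value comparison, NORM.S.DIST gives you one. For the two-tailed probability of the calculated z:

=2*(1-NORM.S.DIST(0.52,TRUE))

That returns about .60 — nowhere near .05, which is consistent with not rejecting H0.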

Worksheet Functions for Correlation

Excel provides two worksheet functions for calculating correlation — CORREL and PEARSON — and they do exactly the same thing in exactly the same way! Why Excel offers both, I don’t know, but there you have it.

The others are RSQ, COVARIANCE.P, and COVARIANCE.S. RSQ calculates the coefficient of determination (the square of the correlation coefficient). COVARIANCE.S determines covariance the way I show you in the earlier section “Understanding Correlation”: It uses N-1. COVARIANCE.P uses N.

CORREL and PEARSON

Figure 15-4 shows the data for the Sahutsket SAT-GPA example, along with the Function Arguments dialog box for CORREL.


FIGURE 15-4: The Function Arguments dialog box for CORREL, along with data.

To use this function, follow these steps:

  1. Type the data into cell arrays and select a cell for CORREL’s answer.

    I’ve entered the SAT data into C3:C22 and the GPA data into D3:D22, and selected F15. I’ve defined SAT as the name of C3:C22 and GPA as the name of D3:D22. (Read Chapter 2 to see how to do this.)

  2. From the Statistical Functions menu, select CORREL to open its Function Arguments dialog box.
  3. In the Function Arguments dialog box, enter the appropriate values for the arguments.

    In the Array1 box, I entered SAT — the name I assigned to the cell range (C3:C22) that holds the scores for one of the variables.

    In the Array2 box, I entered GPA — the name I assigned to the cell range (D3:D22) that holds the scores for the other variable.

    With values entered for each argument, the answer, 0.81662505, appears in the dialog box.

  4. Click OK to put the answer into the selected cell.

    Selecting PEARSON instead of CORREL gives you exactly the same answer, and you use it exactly the same way.
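When you finish, here’s what the Formula bar shows (assuming you used the defined names):

=CORREL(SAT,GPA)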

RSQ

If you have to quickly calculate the coefficient of determination (r2), RSQ is the function for you. I see no particular need for this function because it’s easy enough to use CORREL and then square the answer.

Here’s what the Excel Formula bar looks like after you fill in the RSQ Function Arguments dialog box for this example:

=RSQ(GPA,SAT)

In terms of the dialog box, the only difference between this one and CORREL (and PEARSON) is that the boxes you fill in are called Known_y’s and Known_x’s rather than Array1 and Array2.

COVARIANCE.P and COVARIANCE.S

As far as calculating correlations goes, I see no need for these functions. Just for completeness, I’ll tell you that COVARIANCE.P calculates covariance like this:

$$ \mathrm{cov}_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{N} $$

and COVARIANCE.S calculates covariance like this:

$$ \mathrm{cov}_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{N-1} $$

The P in the first function tells you that it’s calculating the covariance for a population and the S in the second tells you it’s calculating the covariance for a sample (or more correctly, for estimating covariance in a population).

You use these functions the same way you use CORREL. After you fill in the Function Arguments dialog box for COVARIANCE.P for this example, the formula in the Formula bar is

=COVARIANCE.P(SAT,GPA)

If you want to use this function to calculate r, you divide the answer by the product of STDEV.P(SAT) and STDEV.P(GPA). I leave it to you to figure out how you’d use COVARIANCE.S to calculate r. I don’t know why you’d bother with all of this when you can just use CORREL.
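Here’s that calculation as a single worksheet formula, using the defined names:

=COVARIANCE.P(SAT,GPA)/(STDEV.P(SAT)*STDEV.P(GPA))

It returns the same answer as CORREL.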

Data Analysis Tool: Correlation

If you have to calculate a single correlation coefficient, you’ll find that Excel’s Correlation data analysis tool does the same thing CORREL does, although the output is in tabular form. This tool becomes useful when you have to calculate multiple correlations on a set of data.

For example, Figure 15-5 shows SAT, high school average, and GPA for 20 Sahutsket University students, along with the dialog box for the Correlation data analysis tool.


FIGURE 15-5: The Correlation data analysis tool dialog box, along with data for SAT, High School Average, and GPA.

The steps for using this tool are:

  1. Type the data into the worksheet and include labels for the columns.

    In this example, the data (including labels) are in C2:E22.

  2. Select Data | Data Analysis to open the Data Analysis dialog box.
  3. In the Data Analysis dialog box, scroll down the Analysis Tools list and select Correlation.
  4. Click OK to open the Correlation dialog box.

    This dialog box is shown in Figure 15-5.

  5. In the Input Range box, enter the cell range that holds all the data.

    I entered $C$2:$E$22. Note the $ signs for absolute referencing. Excel adds them when you select the cell range in the spreadsheet.

  6. To the right of Grouped By, select a radio button to indicate if the data are organized in columns or rows.

    I chose the Columns radio button.

  7. If the cell ranges include column headings, select the Labels check box.

    I included the headings in the ranges, so I selected the box.

  8. In the Output Options, select a radio button to indicate where you want the results.

    I selected New Worksheet Ply to put the results on a new page in the worksheet.

  9. Click OK.

    Because I selected New Worksheet Ply, a newly created page opens with the results.

Tabled output

Figure 15-6 shows the tool’s tabled output, after I expand the columns. The table is a correlation matrix.


FIGURE 15-6: The Correlation data analysis tool’s tabled output.

Each cell in the matrix represents the correlation of the variable in the row with the variable in the column. Cell B3 presents the correlation of SAT with High School Average, for example. Each cell in the main diagonal contains 1. This is because each main diagonal cell represents the correlation of a variable with itself.

It’s only necessary to fill in half the matrix. The cells above the main diagonal would contain the same values as the cells below the main diagonal.

What does this table tell you, exactly? Read on.

Multiple correlation

The correlation coefficients in this matrix combine to produce a multiple correlation coefficient. This is a number that summarizes the relationship between the dependent variable — GPA, in this example — and the two independent variables (SAT and High School Average).

To show you how these correlation coefficients combine, I abbreviate GPA as G, SAT as S, and High School Average as H. So rGS is the correlation coefficient for GPA and SAT, rGH is the correlation coefficient for GPA and High School Average, and rSH is the correlation coefficient for SAT and High School Average.

Here’s the formula that puts them all together:

$$ R_{G.SH} = \sqrt{\frac{r_{GS}^2 + r_{GH}^2 - 2\,r_{GS}\,r_{GH}\,r_{SH}}{1 - r_{SH}^2}} $$

The uppercase R on the left indicates that this is a multiple correlation coefficient, as opposed to the lowercase r that indicates a correlation between two variables. The subscript G.SH means that the multiple correlation is between GPA and the combination of SAT and High School Average.

This is the calculation that produces Multiple R in the Regression Statistics section of the Regression data analysis tool’s results. (See Chapter 14.)

For this example, you evaluate the formula with the three correlation coefficients from the matrix in Figure 15-6; the result is the multiple correlation of GPA with SAT and High School Average combined.

Because I use the same data to show you multiple regression in Chapter 14, this value (with some additional decimal places) is in Figure 14-25, in cell B4.
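If you’d like to apply the formula directly to the data analysis tool’s output, here’s a sketch. It assumes the correlation matrix occupies A1:D4 of the output sheet, with the variables in the order SAT, High School Average, GPA, as in Figure 15-6 — so B4 holds rGS, C4 holds rGH, and B3 holds rSH:

=SQRT((B4^2+C4^2-2*B4*C4*B3)/(1-B3^2))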

If you square this number, you get the multiple coefficient of determination. In Chapter 14, I tell you about R Square, and that’s what this is. It’s another item in the Regression Statistics that the Regression data analysis tool calculates. You also find it in LINEST’s results, although it’s not labeled.

For this example, you can see that result in Figure 14-3, in cell H5 (the LINEST results), and in Figure 14-25, cell B5 (the Regression data analysis tool report).

Partial correlation

GPA and SAT are associated with High School Average (in the example). Each one’s association with High School Average might somehow hide the true correlation between them.

What would their correlation be if you could remove that association? Another way to say it is this: What would be the GPA-SAT correlation if you could hold High School Average constant?

One way to hold High School Average constant is to find the GPA-SAT correlation for a sample of students who have one High School Average — 87, for example. In a sample like that, the correlation of each variable with High School Average is zero. This usually isn’t feasible in the real world, however.

Another way is to find the partial correlation between GPA and SAT. This is a statistical way of removing each variable’s association with High School Average in your sample. You use the correlation coefficients in the correlation matrix to do this:

$$ r_{GS.H} = \frac{r_{GS} - r_{GH}\,r_{SH}}{\sqrt{1 - r_{GH}^2}\,\sqrt{1 - r_{SH}^2}} $$

Once again, G stands for GPA, S for SAT, and H for High School Average. The subscript GS.H means that the correlation is between GPA and SAT with High School Average “partialled out.”

For this example, you plug the three correlation coefficients from the Figure 15-6 matrix into this formula to get the partial correlation.
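As a worksheet formula against that same assumed matrix layout (B4 holds rGS, C4 holds rGH, B3 holds rSH), the partial correlation is

=(B4-C4*B3)/(SQRT(1-C4^2)*SQRT(1-B3^2))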

Semipartial correlation

It’s also possible to remove the correlation with High School Average from just SAT without removing it from GPA. This is called semipartial correlation. The formula for this one also uses the correlation coefficients from the correlation matrix:

$$ r_{G(S.H)} = \frac{r_{GS} - r_{GH}\,r_{SH}}{\sqrt{1 - r_{SH}^2}} $$

The subscript G(S.H) means the correlation is between GPA and SAT with High School Average partialled out of SAT only.

To apply this formula to the example, you use those same three correlation coefficients — this time removing High School Average from SAT only.
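In worksheet terms, with the same assumed layout, the only change from the partial correlation formula is a shorter denominator:

=(B4-C4*B3)/SQRT(1-B3^2)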

remember Some statistics textbooks refer to semipartial correlation as part correlation.

Data Analysis Tool: Covariance

You use the Covariance data analysis tool the same way you use the Correlation data analysis tool. I won’t go through the steps again. Instead, I just show you the tabled output in Figure 15-7. The data are from Figure 15-5.


FIGURE 15-7: The Covariance data analysis tool’s tabled output for SAT, High School Average, and GPA.

The table is a covariance matrix. Each cell in the matrix shows the covariance of the variable in the row with the variable in the column (calculated the way COVARIANCE.P would do it, by using N in the formula). Cell C4 shows the covariance of GPA with High School Average. The main diagonal in this matrix presents the variance of each variable (which is equivalent to the covariance of a variable with itself). In this case, the variance is what you compute if you use VARP (the older name for VAR.P).

Again, it’s only necessary to fill half the matrix. Cells above the main diagonal would hold the same values as the cells below the main diagonal.

As is the case with COVARIANCE.P and COVARIANCE.S, I don’t see why you’d use this tool. I just include it for completeness.

Testing Hypotheses About Correlation

Excel has no worksheet function for testing hypotheses about r. As I point out earlier, you perform a t-test whose formula is

$$ t = \frac{r\sqrt{N-2}}{\sqrt{1 - r^2}} $$

With 0.817 stored in cell H12, I used this formula to calculate t:

=H12*SQRT(20-2)/SQRT(1-H12^2)

I then used the answer (6.011 and some additional decimal places) as input to TDIST (along with 18 df and 1 tail) to find that the one-tailed probability of the result is way less than .05.
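In Formula bar terms, with the t formula’s answer in a cell — say, H13, a hypothetical location — that probability is

=TDIST(H13,18,1)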

Worksheet functions: FISHER, FISHERINV

Excel handles the rather complex transformations that enable you to test hypotheses about the difference between two correlation coefficients. FISHER transforms r to z. FISHERINV does the reverse. Just to refresh your memory, you use the transformed values in the formula

$$ Z = \frac{z_1 - z_2}{\sigma_{z_1 - z_2}} $$

in which the denominator is

$$ \sigma_{z_1 - z_2} = \sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}} $$

In the example I discuss earlier (Sahutsket versus Farshimmelt), the correlation coefficients were .817 and .752, and I did a two-tailed test. The first step is to transform each correlation. I’ll walk you through the steps for using FISHER to transform .817:

  1. Select a cell for FISHER’s answer.

    I selected B3 for the transformed value.

  2. From the Statistical Functions menu, select FISHER to open its Function Arguments dialog box.

    The FISHER Function Arguments dialog box appears, as shown in Figure 15-8.

  3. In the Function Arguments dialog box, type the appropriate value for the argument.

    In the x box, I typed .817, the correlation coefficient. The answer, 1.147728, appears in the dialog box.

  4. Click OK to put the answer into the selected cell.

FIGURE 15-8: The FISHER Function Arguments dialog box.

I selected B4 to store the transformation of .752. Next, I used this formula to calculate Z

=(B3-B4)/SQRT((1/(20-3))+(1/(24-3)))

Finally, I used NORM.S.INV to find the critical value of z for rejecting H0 with a two-tailed α of .05. Because the result of the formula (0.521633) is less than that critical value (1.96), the decision is to not reject H0.
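For the record, here’s that critical-value formula. A two-tailed α of .05 puts .025 in the upper tail, so you ask NORM.S.INV for the z that leaves .975 below it:

=NORM.S.INV(0.975)

The answer is 1.96 (well, 1.959964).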
