The basic idea of correlation is how sensitive the value of one entity or metric is to the change in another value of an entity or metric. Let's take a very simple, tangible example as a starting point. There is usually an observable correlation between a person's income level and their education. Census data in the United States typically shows that more educated people earn more money. Look at the data set plotted in Figure 3.1.
The first thing to realize is that a scatter plot best shows correlation visually, between two sets of variables. Here we can see that there is a positive correlation, that is, one where as the value of one variable increases there is an increase in the value of another variable. The correlation is actually quite strong in this case (mainly because I made up the data set). When we work through Model Builder 3.1 we will see that there is a .87 correlation between education level and income in this example. Another way of thinking about what this means is to use extreme cases. If the correlation is equal to 1, the two variables are completely linked, and for every unit of increase in one of the variables there will be a proportional increase in the other variable. Conversely, if the correlation is equal to 0, it means that for every unit of increase in one of the variables there should be no expectation of proportional movement in the other variable.
The formula for correlation (usually represented by r) is shown in equation 3.1:
In this equation, σx represents the standard deviation of the X values and μx represents the mean of the X values (careful: μ has different meanings elsewhere in this text and in finance: it is often used to represent a “drift” return for assets).
It is crucial to keep in mind that correlation measures only the linear relationship between two variables. Quadratic and other “power” relationships between variables may give results indicating no correlation between data sets. When beginning initial research on the relationship between different variables, it is a good idea to graph the scatter-plots and inspect them to see if the graph “looks like” a quadratic relationship. If so, performing regressions of one variable against the root of another may be appropriate.
=(C4-AVERAGE($C$4:$C$33))*(D4-AVERAGE($D$4:$D$33))
Copy and paste this over the range E4:E33. Also, we should label E3 “Numerator” since the prior calculation is part of a larger calculation.
=(C4-AVERAGE($C$4:$C$33))^2
=(D4-AVERAGE($D$4:$D$33))^2
Also, label both columns with “Denominator 1” in F3 and “Denominator 2” in G3. So far the sheet should look like Figure 3.2.
=SQRT(F35*G35)
=E35/E36
=CORREL(C4:C33,D4:D33)
The final calculations should look the same as Figure 3.3.