The Pearson correlation coefficient

You have probably already heard about the Pearson coefficient, since it is the most popular and most widely applied measure of correlation.

This is probably related to its ease of calculation and interpretation. We compute the Pearson correlation coefficient as follows:

r(X, Y) = cov(X, Y) / (sd(X) * sd(Y))

We find, on the numerator, an index named covariance between X and Y, which we will cover in a second. On the denominator, we see the product of the standard deviations of X and Y. The covariance is in some way a raw Pearson correlation value, meaning that within the formula it is the term intended to express the linear relationship between the two variables. We can see this by looking at the covariance formula:

cov(X, Y) = Σ (X - mean(X)) * (Y - mean(Y)) / (n - 1)

This is familiar to you, isn't it? We actually saw this nearly 15 minutes ago, when looking at the variance.

As you see, we are dealing again with a difference from the mean. Nevertheless, something more is introduced here: the product between the differences for variables X and Y. Why do we do that? Because here we are not just interested in each variable's own variability, but in their joint variability, and multiplying their differences is what lets us capture that.

Moreover, if you take a second to think about it, each of these elements, that is, each product of differences, will have a sign based on how the variables are behaving, and this sign will express whether they are moving in the same direction or not. For instance, if both variables are lower than their means, this will produce two negative differences, and therefore a positive product, which will indicate that both are moving in the same direction. Finally, the sum of these quantities will have a sign of its own, which will summarize the overall direction of the relationship.

I can see from your face that we need a numerical example. Take the following pairs of numbers:

X Y
2 4
5 3
4 2
6 4
7 3

 

To compute the covariance for these numbers, we first compute the mean of both variables:

mean X = 4.8
mean Y = 3.2
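
By the way, if you want to follow along in R, you can store the two toy variables in a couple of vectors and verify the means (a minimal sketch; the names x and y are mine):

x <- c(2, 5, 4, 6, 7)
y <- c(4, 3, 2, 4, 3)
mean(x)
[1] 4.8
mean(y)
[1] 3.2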

We then compute the differences from the mean for each of them:

X Y (X-mean(X)) (Y-mean(Y))
2 4 -2.8 0.8
5 3 0.2 -0.2
4 2 -0.8 -1.2
6 4 1.2 0.8
7 3 2.2 -0.2

 

We can already start thinking about these numbers: what is the most common occurrence? A negative difference for X paired with a positive difference for Y, or vice versa. This expresses an inverse behavior, that is, a mirrored behavior of one variable in relation to the other.

This should now be confirmed by the products and the final covariance:

X Y (X-mean(X)) (Y-mean(Y)) (X-mean(X))*(Y-mean(Y))
2 4 -2.8 0.8 -2.24
5 3 0.2 -0.2 -0.04
4 2 -0.8 -1.2 0.96
6 4 1.2 0.8 0.96
7 3 2.2 -0.2 -0.44

 

The last column, which shows a negative sign most of the time, sums to -0.8. This is the numerator of our previously introduced covariance formula.

We now have to divide it by the sample size minus one, which here is 5 - 1 = 4. Let's compute the covariance then:

-0.8 / 4 = -0.20
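
Here is the same computation in R, using the x and y vectors defined above; base R's cov() function computes exactly this quantity:

# sum of the products of the differences, divided by n - 1
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
[1] -0.2

# the base R equivalent
cov(x, y)
[1] -0.2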

As we would have expected, a negative covariance shows up here, resulting from the mostly negative products of differences, and it expresses an inverse relationship between the two variables. But how much is that -0.2? Is it a lot or just a small amount? -0.2 out of what? From these questions, the Pearson coefficient was born. This coefficient solves the problem of covariance interpretation by dividing the covariance by the product of the standard deviations of the two variables. I am not going to show it to you formally, but given the definition of standard deviation, dividing the covariance by this product leads to a ratio that can range from -1 to 1.
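
We can check this on the same toy vectors: dividing the covariance by the product of the standard deviations gives exactly what base R's cor() function returns:

cov(x, y) / (sd(x) * sd(y))   # about -0.124
cor(x, y)                     # the same value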

The ratio can therefore be easily interpreted as follows: 

  • ratio > 0 implies a direct linear relationship/dependence between variables
  • ratio = 0 implies the absence of a linear relationship/dependence between variables
  • ratio < 0 implies an inverse linear relationship/dependence between variables
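
To make these three cases concrete, here is a small illustrative sketch (the vectors below are made up for the purpose, not taken from our data):

a <- 1:10
cor(a, 2 * a + 3)    # exactly 1: a perfect direct linear relationship
cor(a, -2 * a + 3)   # exactly -1: a perfect inverse linear relationship
cor(a, rnorm(10))    # typically close to 0: random noise unrelated to a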

How near the ratio is to unity will then express the intensity of this direct/inverse relationship. Nice, now let's move on to our real data and compute the correlation coefficient between time and cash flows. We can use the cor function. This function can compute different types of correlation, but if you do not specify one, it computes the Pearson coefficient by default:

cor(x = cash_flow_report$y, y = cash_flow_report$cash_flow)
Error in cor(x = cash_flow_report$y, y = cash_flow_report$cash_flow) :
'x' must be numeric

Uh, not what we were looking for.

You can take this as a debugging exercise: any idea what is going on? The console is warning us about some problem with the format of our x variable, which is the sequence of cash flow reporting dates.

This means that it is not able to compute the mean and the differences from the mean for these dates. We therefore have to transform this variable a bit before passing it to the cor function. How would you do this? Yeah... you can check on Google...

Did you find anything? Too much, you say? That is why the expression "training on the job" still means something, I guess... Anyway, an elegant way to transform a sequence of dates into a progression of numbers is to compute the difference in days between the oldest date recorded and all the others. This means assigning a value of 0 to the oldest date, for every record where it appears, and a value equal to the number of days in between for every other date.

First, we need to find the oldest date and assign it to a vector:

oldest <- min(cash_flow_report$y)

Yeah, it is that simple: the min of a sequence of dates is the oldest one. And which is the oldest date?

oldest
[1] "2014-03-31"

Then, we add a column to our data, named delays, where the difference between each date and the oldest one is computed. To do that, we employ the difftime function available within base R. This function just requires you to specify the dates for which to compute the difference and the unit of measurement in which you want the result to be expressed:

cash_flow_report %>% 
  mutate(delays = difftime(y, oldest, units = "days")) -> cash_flow_report_mutation

Let's have a look at the results, employing the head() function:

head(cash_flow_report_mutation)

x y cash_flow delays
1 north_america 2014-03-31 100955.81 0 days
2 south_america 2014-03-31 111817.48 0 days
3 asia 2014-03-31 132019.06 0 days
4 europe 2014-03-31 91369.00 0 days
5 north_africa 2014-03-31 109863.53 0 days
6 middle_east 2014-03-31 94753.52 0 days

Uhm, it seems it actually created the variable, computing the difference in days between the y variable and the oldest date, which for these first records is the same date, hence the 0 days.
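
If you are curious about the kind of variable we have just created, a quick check of its class (assuming the cash_flow_report_mutation object from above):

class(cash_flow_report_mutation$delays)
[1] "difftime"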

We are now going to actually compute the correlation, with one final caution: the delays variable is of the difftime class and needs to be transformed into a fully numerical variable before being passed to the cor function. We are going to apply the as.numeric() function for that:

cor(x = as.numeric(cash_flow_report_mutation$delays),
    y = cash_flow_report_mutation$cash_flow)

And here we are: -0.1148544.

This means that a weak inverse relationship is observed between the passing of time and the volume of cash flows. Unfortunately, the number is pretty small, and we cannot therefore consider it a strong hint of some kind of general negative trend involving our sales. Nevertheless, we should be aware that we have only checked for a linear relationship. Things could change when we take a look at the distance correlation.
