Correlation

The most useful summary statistic for exploring the relationship between variables is correlation. This measure expresses the level of dependence between two variables, that is, how much the variation of one is associated with the variation of the other.

To avoid one of the most common misconceptions, I want to warn you immediately against the correlation-causation fallacy: correlation does not imply causation.

This means that finding evidence that one variable is related to another never automatically implies that the first is the cause of the second, or vice versa.

You could, for instance, find a strong correlation between the amount of precipitation in a country and the number of people winning the Nobel Prize in the same country, but this should not lead you to conclude that the former causes the latter, or at least I hope it would not. There is even a website entirely dedicated to the concept: http://www.tylervigen.com/spurious-correlations. It is from the great Tyler Vigen.

Take a look, for instance, at this one:

Would you think that any kind of causation is going on here? Hopefully not; nevertheless, a sound 95% correlation coefficient was observed here. 
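As a minimal sketch of how such a spurious correlation can arise, consider two series that have nothing to do with each other except that both happen to grow over time. The numbers and the names series_a and series_b are made up purely for illustration, and the pearson helper is a plain implementation of the usual formula rather than code from any particular library.

```python
import math


def pearson(x, y):
    """Pearson coefficient: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical yearly observations: two unrelated quantities that both
# trend upwards, each with its own small, independent wiggle.
years = range(20)
series_a = [10 + 2 * t + (t % 3) for t in years]
series_b = [500 + 7 * t - 3 * (t % 4) for t in years]

r = pearson(series_a, series_b)
print(f"Pearson coefficient between the two series: {r:.3f}")
```

The coefficient comes out very close to 1 even though neither series was built from the other: the shared upward trend does all the work, which is exactly why a high coefficient alone says nothing about causation.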

That said, it should also be noted that the opposite generally holds: causation implies correlation, since two variables linked by a causal mechanism will usually show some form of association, although that association need not be linear.

But this is not the whole story: we talk about correlation, but there is actually more than one type, since alongside linear correlation there are non-linear forms, such as quadratic or exponential. There is not enough space to go into details here; nevertheless, starting from this consideration, we are going to measure the relationship between our variables with two different coefficients:

  • The Pearson coefficient, able to detect only linear correlation
  • The distance correlation, able to also detect non-linear correlations
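To make the difference between the two coefficients concrete, here is a minimal sketch in plain Python. It assumes the standard sample definitions: Pearson as scaled covariance, and distance correlation computed from double-centered pairwise distance matrices. The function names and the toy data are my own choices for illustration, not the code used in the rest of the analysis, and the quadratic-time implementation is only meant for small samples.

```python
import math


def pearson(x, y):
    """Pearson coefficient: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def _centered_distances(v):
    """Pairwise |v_i - v_j| matrix with row, column and grand means removed."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [d[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def _mean_product(a, b):
    """Mean of the element-wise product of two square matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation, a value between 0 and 1."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = _mean_product(a, b)                        # squared distance covariance
    dvar2 = _mean_product(a, a) * _mean_product(b, b)  # product of squared distance variances
    if dvar2 <= 0:
        return 0.0  # at least one variable is constant
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar2))


# Toy data: x is symmetric around zero, so x**2 has no linear relation to x.
x = [(i - 50) / 50 for i in range(101)]
linear = [2 * v + 1 for v in x]
quadratic = [v ** 2 for v in x]

r_linear = pearson(x, linear)
r_quadratic = pearson(x, quadratic)
d_linear = distance_correlation(x, linear)
d_quadratic = distance_correlation(x, quadratic)

print("linear:    pearson =", round(r_linear, 3), " distance =", round(d_linear, 3))
print("quadratic: pearson =", round(r_quadratic, 3), " distance =", round(d_quadratic, 3))
```

On the linear relationship both coefficients agree. On the quadratic one the Pearson coefficient is essentially zero, while the distance correlation stays clearly above zero, which is the reason for looking at both.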

We are going to look at correlation only for time and cash flows, since the geographic area is a categorical variable, and a correlation coefficient cannot be computed directly between a continuous and a categorical variable. To be more precise, we could do something similar working with dummy variables, ANOVA, and regressions, but we do not have time for all that fun.

Nevertheless, we will address the relationship between the geographic area and cash flows with some graphical EDA.
