Correlations tell us how well two variables relate to each other. As with summary statistics, calculating the correlations between variables in our dataset is a fast and easy way to acquire an initial understanding of our data.
Let us use correlations to investigate a few of the relationships in our head to head battle data:
1. Rating and Result. Be sure to use the numeric version of the Result column in your calculation:
> #use cor(x,y) to calculate the correlation between two variables
> #remember only to use numeric values when calculating correlations
> #How is the performance rating of the Shu army related to the outcome of a head to head battle?
> corRatingResultHeadToHead <- cor(subsetHeadToHead$Rating, numericResultHeadToHead)
> #display the value of the correlation
> corRatingResultHeadToHead
[1] 0.9495232
2. ShuSoldiersEngaged and WeiSoldiersEngaged:
> #How is the number of Shu soldiers engaged in a head to head battle correlated with the number of Wei soldiers engaged?
> corShuWeiSoldiersHeadToHead <- cor(subsetHeadToHead$ShuSoldiersEngaged, subsetHeadToHead$WeiSoldiersEngaged)
> #display the value of the correlation
> corShuWeiSoldiersHeadToHead
[1] 0.7653596
3. All of the numeric variables in the dataset:
> #use cor(data) to calculate the correlation between all numeric variables in a dataset
> #How are all of our numeric battle data correlated with one another?
> corHeadToHead <- cor(subsetHeadToHead)
> #display the correlations
> corHeadToHead
We calculated just a few correlations to get an idea of how they can be derived in R. This entailed using the cor() function in two different ways.
Correlations range in value from negative one (-1) to positive one (1). A value of negative one means that two variables are perfectly negatively correlated. That is, a high value in one is associated with a low value in the other, and vice versa. On the other hand, a correlation of positive one indicates that two variables are perfectly positively correlated. As such, high values in one are associated with high values in the other, and vice versa. Further, a correlation of zero indicates that two variables are perfectly uncorrelated. This means that their values do not associate with one another. Of course, these extreme correlational values are rare. Most correlations will fall somewhere between negative one and zero or zero and positive one.
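To make the range concrete, here is a minimal sketch using made-up vectors. The names a, bPositive, bNegative, and bNone are hypothetical and are not part of the battle data. They were chosen so that the three landmark values appear exactly.

```r
# Hypothetical toy vectors, not taken from the battle data
a         <- c(1, 2, 3, 4, 5)
bPositive <- 2 * a + 1          # rises in lockstep with a
bNegative <- 10 - 3 * a         # falls in lockstep with a
bNone     <- c(2, 1, 3, 1, 2)   # picked so that it has no linear relationship with a

cor(a, bPositive)   # 1: perfectly positively correlated
cor(a, bNegative)   # -1: perfectly negatively correlated
cor(a, bNone)       # 0 (up to floating-point error): uncorrelated
```

Real data will rarely produce these exact values, but they are useful reference points when judging the correlations in between.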
Here are a few examples that demonstrate how to interpret correlations:
A small positive correlation between A and B suggests a relatively weak positive relationship exists between the variables. If A were to decrease by a certain amount, we would only expect a small decrease in B.
A large negative correlation between A and B suggests a relatively strong negative relationship exists between the variables. If A were to increase, we would expect B to decrease by a proportionally similar amount.
A correlation of zero between A and B suggests that the variables are uncorrelated. Therefore, movements in A would not be expected to associate with movements in B.
An important final note on correlation is that it should never be interpreted as causation. Correlation merely tells us that our variables tend to move with each other in a certain way. Yet, we cannot determine which, if either, of the correlated variables causes the change in the other. Therefore, correlations inform us about what is occurring between our variables, but cannot tell us why it is happening.
The cor(x,y) function is used to calculate the correlation between two variables, x and y. For instance, to calculate the correlation between variable A and variable B, we would use the following code:
> cor(A, B)
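As a rough illustration of what cor(x,y) computes, the sketch below compares it with the covariance of two variables divided by the product of their standard deviations. This is the Pearson correlation, which cor() uses by default. The vectors A and B hold invented values for demonstration only.

```r
# Hypothetical toy vectors standing in for variables A and B
A <- c(2, 4, 5, 7, 9)
B <- c(1, 3, 2, 6, 8)

# Built-in correlation (Pearson by default)
rBuiltIn <- cor(A, B)

# The same quantity from its definition:
# covariance scaled by both standard deviations
rManual <- cov(A, B) / (sd(A) * sd(B))

all.equal(rBuiltIn, rManual)   # TRUE
```

Note that both vectors must have the same length, because each value in A is paired with the value at the same position in B.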
We looked directly at two correlations. First, we found the correlation between the performance rating of the Shu army and the outcome of head to head battles to be 0.95. This correlation suggests that victory or defeat in a given head to head battle was strongly associated with Zhuge Liang's rating of the army's performance in that conflict.
Next, we calculated the correlation between the number of Shu and Wei soldiers engaged in head to head battles. Here, we found a relatively strong positive correlation of 0.77. This suggests that the number of soldiers that one army engages in combat is highly related to the size of the opposing army. This is logical, because we would expect an army's size in a given battle to be closely related to (but not necessarily determined by or equal to) the size of the opposing army.
The same correlation function can be used in a different way. Instead of providing x and y variables to calculate a single correlation via cor(x, y), we can calculate all of the possible correlations in a dataset using cor(data). For example, to find the correlations for all of the numeric variables in dataset A, we would use the following code:
> cor(A)
This use of the cor() function yields a correlation table, similar to the one that we generated for our head to head dataset.
To read a value from this table, match a row name on the left-hand side with a column name across the top. At the intersection, you will find the correlation between the two variables. For instance, if you traced from ShuSoldiersEngaged on the left to WeiSoldiersEngaged on the top, you would encounter the correlation of 0.77 that we had previously calculated using cor(x,y).
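The table lookup can also be done in code. The following is a small sketch that assumes a toy data frame named toyBattles. It borrows the column names from our battle data, but the values are invented for illustration and will not reproduce the 0.77 reported earlier.

```r
# Made-up values for illustration only; not the book's battle data
toyBattles <- data.frame(
  ShuSoldiersEngaged = c(100, 250, 400, 320, 150),
  WeiSoldiersEngaged = c(120, 200, 450, 300, 180),
  Rating             = c(60, 85, 40, 75, 90)
)

# Matrix of all pairwise correlations
corToy <- cor(toyBattles)

# Read one cell by row name and column name
corToy["ShuSoldiersEngaged", "WeiSoldiersEngaged"]

# The cell matches the two-variable form of cor()
all.equal(
  corToy["ShuSoldiersEngaged", "WeiSoldiersEngaged"],
  cor(toyBattles$ShuSoldiersEngaged, toyBattles$WeiSoldiersEngaged)
)
```

Because the table is symmetric, reading from WeiSoldiersEngaged on the left to ShuSoldiersEngaged on the top gives the same value, and every variable has a correlation of one with itself along the diagonal.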
A critical limitation of the cor(data) technique is that only numeric variables in a dataset can be correlated. You probably noticed that several NA values were reported in the correlation table of our head to head dataset. These occur because our SuccessfullyExecuted and Result columns consisted of nonnumeric data. Therefore, they could not be correlated and R returned NA values. To correlate nonnumeric values, as we did with Result in step 1, they must first be recoded as numeric.
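One possible way to recode a nonnumeric column is sketched below. The toyHeadToHead data frame, its values, and the "Victory" and "Defeat" labels are assumptions made for illustration, so the actual labels in your dataset may differ. The sketch also keeps only the numeric columns before calling cor(data). This is a cautious habit because, depending on your version of R, passing nonnumeric columns to cor() may stop with an error instead of returning NA values.

```r
# Made-up data; the "Victory"/"Defeat" labels are assumed for illustration
toyHeadToHead <- data.frame(
  Rating = c(90, 35, 80, 20, 70),
  Result = c("Victory", "Defeat", "Victory", "Defeat", "Victory"),
  stringsAsFactors = FALSE
)

# Recode the text outcome as 1 (victory) or 0 (defeat)
numericResult <- ifelse(toyHeadToHead$Result == "Victory", 1, 0)
cor(toyHeadToHead$Rating, numericResult)

# Store the recoded column, then keep only numeric columns for cor(data)
toyHeadToHead$NumericResult <- numericResult
numericColumns <- sapply(toyHeadToHead, is.numeric)
cor(toyHeadToHead[, numericColumns, drop = FALSE])
```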
You may run into NA values in other aspects of your R work. When these occur, it is a good idea to check your data to make sure that they are in the proper format for the function or calculation that you wish to employ.
What is the difference between cor(x,y) and cor(data)?
a. The cor(x, y) variation calculates all of the correlations in a dataset, whereas cor(data) calculates a single correlation between two variables.
b. The cor(x, y) variation calculates a single correlation between two variables, whereas cor(data) calculates all of the correlations in a dataset.
c. The cor(x, y) variation calculates all of the correlations between two datasets, whereas cor(data) calculates all of the correlations in a given dataset.
d. The cor(x, y) variation calculates all correlations between two variables, whereas cor(data) calculates all correlations for a given variable.
A and B.
a. A and B are negatively correlated. For every one unit increase in A, B will decrease by 0.25 units.
b. A and B are negatively correlated. For every one unit decrease in A, B will decrease by 0.25 units.
c. A and B are negatively correlated. We would expect an increase in A to be accompanied by a proportionally small increase in B.
d. A and B are negatively correlated. We would expect an increase in A to be accompanied by a proportionally small decrease in B.
You may have noticed that all of the points in our head to head combat dataset have a value of Y for SuccessfullyExecuted, which prevents us from correlating it with other variables. This is because the Shu forces can engage in head to head combat at will. Without some variation in the values for execution, a correlation is incalculable.
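The effect of a column with no variation can be reproduced with a small sketch. The rating and executed vectors below are invented, and coding Y as 1 is an assumption made for illustration.

```r
# Made-up ratings paired with an execution flag that never changes
rating   <- c(70, 85, 60, 90)
executed <- c(1, 1, 1, 1)   # every battle coded as successfully executed

# A standard deviation of zero confirms there is no variation
sd(executed)

# cor() warns that the standard deviation is zero and returns NA
constantCor <- suppressWarnings(cor(rating, executed))
is.na(constantCor)   # TRUE
```

The correlation divides by the standard deviation of each variable, so a constant column leaves nothing to divide by.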
In contrast, our surround, ambush, and fire attack methods greatly depend on successful execution. Try correlating the Rating column with the SuccessfullyExecuted column in each of these battle methods. Then, interpret your findings.
Afterwards, use cor(data) to visualize all of the correlations in your datasets. Interpret these correlations and take note of any that stand out as expected or unexpected. By investigating correlations, you are becoming ever more aware of your data.