Time for action - correlating variables

Correlations tell us how well two variables relate to each other. As with summary statistics, calculating the correlations between variables in our dataset is a fast and easy way to acquire an initial understanding of our data.

Let us use correlations to investigate a few of the relationships in our head to head battle data:

  1. Calculate the correlation between Rating and Result. Be sure to use the numeric version of the Result column in your calculation:
    > #use cor(x,y) to calculate the correlation between two variables
    > #remember only to use numeric values when calculating correlations
    > #How is the performance rating of the Shu army related to the outcome of a head to head battle?
    > corRatingResultHeadToHead <- cor(subsetHeadToHead$Rating, numericResultHeadToHead)
    
  2. Display the value of your correlation in the R console:
    > #display the value of the correlation
    > corRatingResultHeadToHead
    [1] 0.9495232
    
  3. Calculate the correlation between ShuSoldiersEngaged and WeiSoldiersEngaged:
    > #How is the number of Shu soldiers engaged in a head to head battle correlated with the number of Wei soldiers engaged?
    > corShuWeiSoldiersHeadToHead <- cor(subsetHeadToHead$ShuSoldiersEngaged, subsetHeadToHead$WeiSoldiersEngaged)
    
  4. Display the value of your correlation in the R console:
    > #display the value of the correlation
    > corShuWeiSoldiersHeadToHead
    [1] 0.7653596
    
  5. Calculate the correlations between (almost) all of the variables in the dataset:
    > #use cor(data) to calculate the correlation between all numeric variables in a dataset
    > #How are all of our numeric battle data correlated with one another?
    > corHeadToHead <- cor(subsetHeadToHead)
    
  6. Display the values of your correlations in the R console, by using the following:
    > #display the correlations
    > corHeadToHead
    

What just happened?

We calculated just a few correlations to get an idea of how they can be derived in R. This entailed using the cor() function in two different ways.

Interpreting correlations

Correlations range in value from negative one (-1) to positive one (1). A value of negative one means that two variables are perfectly negatively correlated. That is, a high value in one is associated with a low value in the other, and vice versa. On the other hand, a correlation of positive one indicates that two variables are perfectly positively correlated. As such, high values in one are associated with high values in the other, and vice versa. Further, a correlation of zero indicates that two variables are perfectly uncorrelated. This means that their values do not associate with one another. Of course, these extreme correlational values are rare. Most correlations will fall somewhere between negative one and zero or zero and positive one.

Here are a few examples that demonstrate how to interpret correlations:

  • A correlation of 0.12 between A and B suggests a relatively weak positive relationship exists between the variables. If A were to decrease by a certain amount, we would only expect a small decrease in B.
  • A correlation of -0.87 between A and B suggests a relatively strong negative relationship exists between the variables. If A were to increase, we would expect B to decrease by a proportionally similar amount.
  • A correlation of 0.00001 between A and B suggests that the variables are uncorrelated. Therefore, movements in A would not be expected to associate with movements in B.
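The boundary values described above can be checked on small made-up vectors. This is only a minimal sketch: the vectors a, bPositive, bNegative, and bNone are hypothetical and are not part of the battle data.

```r
# Hypothetical vectors, invented purely for illustration
a         <- c(1, 2, 3, 4, 5)
bPositive <- 2 * a + 10        # rises in lockstep with a
bNegative <- -3 * a + 7        # falls in lockstep with a
bNone     <- c(2, 1, 4, 1, 2)  # no linear trend against a

# A perfect positive linear relationship yields a correlation of 1
cor(a, bPositive)

# A perfect negative linear relationship yields a correlation of -1
cor(a, bNegative)

# No linear relationship yields a correlation of 0
cor(a, bNone)
```

Note that bPositive changes by 2 units and bNegative by 3 units for every unit of a, yet their correlations with a are 1 and -1. The correlation describes how consistently two variables move together along a line, not how many units one changes when the other changes.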

Note

An important final note on correlation is that it should never be interpreted as causation. Correlation merely tells us that our variables tend to move with each other in a certain way. Yet, we cannot determine which, if either, of the correlated variables causes the change in the other. Therefore, correlations inform us about what is occurring between our variables, but cannot tell us why it is happening.

cor(x, y)

The cor(x,y) function is used to calculate the correlation between two variables, x and y. For instance, to calculate the correlation between variable A and variable B, we would use the following code:

> cor(A, B)

We looked directly at two correlations. First, we found the correlation between the performance rating of the Shu army and the outcome of head to head battles to be 0.95. This correlation suggests that victory or defeat in a given head to head battle was very strongly associated with Zhuge Liang's rating of the army's performance in that conflict.

Next, we calculated the correlation between the number of Shu and Wei soldiers engaged in head to head battles. Here, we found a relatively strong positive correlation of 0.77. This suggests that the number of soldiers that one army engages in combat is highly related to the size of the opposing army. This is logical, because we would expect an army's size in a given battle to be closely related to (but not necessarily determined by or equal to) the size of the opposing army.
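The following sketch shows the general pattern of recoding a text column and passing it to cor(x, y). It does not use the real battle data: the battles data frame, its values, the "Victory" and "Defeat" labels, and the 1/0 coding are all assumptions made for illustration.

```r
# Hypothetical stand-in for a head to head subset; every value is invented
battles <- data.frame(
  Rating = c(80, 35, 90, 20, 75, 40),
  Result = c("Victory", "Defeat", "Victory", "Defeat", "Victory", "Defeat"),
  stringsAsFactors = FALSE
)

# Recode the text outcome as 1 (victory) or 0 (defeat); this coding is an assumption
numericResult <- ifelse(battles$Result == "Victory", 1, 0)

# cor(x, y) expects two numeric vectors of equal length
corRatingResult <- cor(battles$Rating, numericResult)

# display the value of the correlation
corRatingResult
```

Because the invented victories carry the higher ratings, the resulting correlation is positive. With real data, the sign and size would depend entirely on the recorded values.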

cor(data)

The same correlation function can be used in a different way. Instead of providing x and y variables to calculate a single correlation via cor(x, y), we can calculate all of the possible correlations in a dataset using cor(data). For example, to find the correlations for all of the numeric variables in dataset A, we would use the following code:

> cor(A)

This use of the cor() function yields a correlation table, similar to the one that we generated for our head to head dataset.


To read a value from this table, match a row name on the left-hand side with a column name across the top. At the intersection, you will find the correlation between the two variables. For instance, if you traced from ShuSoldiersEngaged on the left to WeiSoldiersEngaged on the top, you would encounter the correlation of 0.77 that we had previously calculated using cor(x,y).
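Rather than tracing the table by eye, a single value can also be pulled out by row and column name. The sketch below assumes a small hypothetical data frame named battles with invented values; only the column names are borrowed from our dataset.

```r
# Hypothetical numeric battle data; every value is invented
battles <- data.frame(
  ShuSoldiersEngaged = c(1000, 2500, 4000, 1500, 3000),
  WeiSoldiersEngaged = c(1200, 2000, 4500, 1800, 2600),
  Rating             = c(60, 85, 40, 70, 90)
)

# cor(data) returns a matrix with one row and one column per variable
corBattles <- cor(battles)

# Look up one correlation: row name first, column name second
corBattles["ShuSoldiersEngaged", "WeiSoldiersEngaged"]

# The same value results from correlating the two columns directly
cor(battles$ShuSoldiersEngaged, battles$WeiSoldiersEngaged)
```

The table is symmetric, so swapping the row and column names returns the same value, and every variable has a correlation of 1 with itself along the diagonal. Wrapping the table in round(corBattles, 2) can make it easier to scan.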

NA values

A critical limitation of the cor(data) technique is that only numeric variables in a dataset can be correlated. You probably noticed that several NA values were reported in the correlation table of our head to head dataset. These occur because our SuccessfullyExecuted and Result columns consisted of nonnumeric data. Therefore, they could not be correlated, and R returned NA values. To correlate nonnumeric values, as we did with Result in step 1, they must first be recoded as numeric.

Note

See the Quantifying categorical variables section of this chapter for a demonstration of how to recode nonnumeric data in numeric form.

You may run into NA values in other aspects of your R work. When they occur, it is a good idea to check your data to make sure that they are in the proper format for the function or calculation that you wish to employ.
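One way to check the format of your data before calling cor(data) is to test each column with is.numeric() and correlate only the columns that pass. The sketch below uses a hypothetical battles data frame with invented values. Keep in mind that some versions of R stop with an error, rather than returning NA values, when cor() is handed a nonnumeric column, so filtering first is the safer habit.

```r
# Hypothetical data frame mixing numeric and text columns; every value is invented
battles <- data.frame(
  Rating             = c(80, 35, 90, 20),
  ShuSoldiersEngaged = c(1000, 2500, 4000, 1500),
  Result             = c("Victory", "Defeat", "Victory", "Defeat"),
  stringsAsFactors = FALSE
)

# TRUE for each column that holds numeric data, FALSE otherwise
numericColumns <- sapply(battles, is.numeric)
numericColumns

# Keep only the numeric columns, then correlate them
corNumeric <- cor(battles[numericColumns])
corNumeric
```

Indexing with battles[numericColumns] always returns a data frame, even when only one column is numeric.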

Pop quiz

  1. What is the key difference between cor(x,y) and cor(data)?

    a. The cor(x, y) variation calculates all of the correlations in a dataset, whereas cor(data) calculates a single correlation between two variables.

    b. The cor(x, y) variation calculates a single correlation between two variables, whereas cor(data) calculates all of the correlations in a dataset.

    c. The cor(x, y) variation calculates all of the correlations between two datasets, whereas cor(data) calculates all of the correlations in a given dataset.

    d. The cor(x, y) variation calculates all correlations between two variables, whereas cor(data) calculates all correlations for a given variable.

  2. Interpret a correlation of -0.25 between the variables A and B.

    a. A and B are negatively correlated. For every one unit increase in A, B will decrease by 0.25 units.

    b. A and B are negatively correlated. For every one unit decrease in A, B will decrease by 0.25 units.

    c. A and B are negatively correlated. We would expect an increase in A to be accompanied by a proportionally small increase in B.

    d. A and B are negatively correlated. We would expect an increase in A to be accompanied by a proportionally small decrease in B.

Have a go hero

You may have noticed that all of the points in our head to head combat dataset have a value of Y for SuccessfullyExecuted, which prevents us from correlating it with other variables. The value never varies because the Shu forces can engage in head to head combat at will, and without some variation in the values for execution, a correlation is incalculable.
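The reason is arithmetic: a correlation divides by the standard deviation of each variable, and a column that never varies has a standard deviation of zero. The sketch below illustrates this with hypothetical vectors; the values and the recoding of Y as 1 are assumptions.

```r
# Hypothetical data in which every execution value is identical
rating   <- c(80, 35, 90, 20)
executed <- c(1, 1, 1, 1)  # every "Y" recoded as 1 (assumed coding)

# A column with no variation has a standard deviation of zero
sd(executed)

# cor() returns NA here; it would normally also warn that the standard
# deviation is zero, so the warning is suppressed for a clean console
corConstant <- suppressWarnings(cor(rating, executed))
corConstant
```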

In contrast, our surround, ambush, and fire attack methods greatly depend on successful execution. Try correlating the Rating column with the SuccessfullyExecuted column in each of these battle methods. Then, interpret your findings.

Afterwards, use cor(data) to visualize all of the correlations in your datasets. Interpret these correlations and take note of any that stand out as expected or unexpected. By investigating correlations, you are becoming ever more aware of your data.
