Within the R ecosystem, there are different packages offering ways to represent correlations between variables in a dataset.
In a way, the powerful plot() function, as seen in the previous recipe, can also be useful for correlation spotting, particularly when plotting all variables against one another (refer to the previous recipe for more details).
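As a reminder of the base-graphics approach, here is a minimal sketch. It assumes the built-in iris data (introduced later in this recipe) purely for illustration; the variable name numeric_vars is made up for this example.

```r
# Base R only: calling plot() on a data frame with several numeric columns
# draws a scatter plot matrix, like pairs() does
numeric_vars <- iris[, 1:4]   # the four measurement columns of the built-in iris data
plot(numeric_vars)
```

Each off-diagonal panel plots one variable against another, so strongly related pairs show up as narrow, elongated clouds of points.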
Nevertheless, among the different alternatives, the one I think may give you a quicker and deeper understanding of the relationships in your data is the pairs.panels() function provided by the psych package by William Revelle.
In order to use the pairs.panels() function, we first need to install and load the psych package:
install.packages("psych")
library(psych)
To test the pairs.panels() functionality, we will use the Iris dataset.
The Iris dataset is one of the most used datasets in R tutorials and learning sessions, and it is derived from a 1936 paper by Ronald Fisher, named The use of multiple measurements in taxonomic problems.
Data was observed on 50 samples of each of three species of the iris flower: Iris setosa, Iris versicolor, and Iris virginica.
On each sample, four features were recorded: sepal length, sepal width, petal length, and petal width.
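The structure of the dataset can be checked directly from the R console. The following sketch uses only base R and the copy of iris that ships with the datasets package; the variable names are chosen for this example.

```r
# iris is available in every R session through the datasets package
str(iris)                                # overall structure: observations and variables

species_names  <- levels(iris$Species)   # the species recorded in the data
species_counts <- table(iris$Species)    # number of samples per species
feature_names  <- names(iris)[1:4]       # the measured features

print(species_names)
print(species_counts)
print(feature_names)
```

The first four columns are numeric measurements, while Species is a factor.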
In the following example, we will look for correlations between these variables by calling pairs.panels():
pairs.panels(iris, hist.col = "white", ellipses = FALSE)
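The pairs.panels() function produces quite a comprehensive plot, showing in one picture the following things: the bivariate scatter plot of each pair of variables below the diagonal, the histogram of each variable on the diagonal, and the correlation coefficient of each pair above the diagonal.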
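The correlation coefficients reported in the plot can be cross-checked numerically with base R's cor() function. This is a minimal sketch restricted to the four numeric columns of iris, so it leaves out the Species column that the call above also passes to pairs.panels(); cor_matrix is a name chosen for this example.

```r
# Pearson correlations between the four numeric measurements of iris,
# rounded to two digits
cor_matrix <- round(cor(iris[, 1:4]), 2)
print(cor_matrix)
```

Reading the matrix next to the plot makes it easier to connect each coefficient with the shape of the corresponding scatter plot.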
The pairs.panels() function allows you to customize the output; some customizations are purely aesthetic, while others affect the computations behind the panel visualization.
Part of the first group is the hist.col argument, which will set the color of the distribution plots produced by the function.
It is also possible to change the method used to compute correlations through the method argument.
The following methods are available: pearson (the default), spearman, and kendall.
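To see the effect of the method argument, one option is to compute rank-based correlations with base R and then draw the same panels with that method. The sketch below assumes the psych package may not be installed, so the plotting call is guarded; numeric_vars and spearman_matrix are names chosen for this example.

```r
numeric_vars <- iris[, 1:4]

# Rank-based (Spearman) correlations computed with base R, for comparison with the plot
spearman_matrix <- round(cor(numeric_vars, method = "spearman"), 2)
print(spearman_matrix)

# Draw the panels with Spearman coefficients, only if psych is available
if (requireNamespace("psych", quietly = TRUE)) {
  psych::pairs.panels(numeric_vars, method = "spearman", hist.col = "white")
}
```

Spearman and Kendall coefficients are based on ranks, so they can be a reasonable choice when relationships are monotonic but not linear.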
We can also specify whether correlation ellipses, also called confidence or error ellipses, should be added to our plot through, as you may have guessed, the ellipses argument.
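The difference is easiest to judge by drawing the plot both ways. The following sketch again guards the calls in case psych is not installed; the has_psych and numeric_vars names are assumptions made for this example.

```r
numeric_vars <- iris[, 1:4]
has_psych <- requireNamespace("psych", quietly = TRUE)

if (has_psych) {
  # With correlation ellipses drawn over each scatter plot
  psych::pairs.panels(numeric_vars, ellipses = TRUE, hist.col = "white")

  # Without ellipses, as in the call used earlier in this recipe
  psych::pairs.panels(numeric_vars, ellipses = FALSE, hist.col = "white")
}
```

When run from a script rather than an interactive session, the two plots are written as separate pages of the default graphics device.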