Scatterplot matrix with GGally

A scatterplot matrix is a grid of scatterplots, often used to describe the relationships between several variables at once. These plots vary in complexity, from simply plotting correlations between variables to histograms and kernel density plots of distributions that incorporate information from other variables. If you are already familiar with the graphics package, you can use the pairs() function to generate a basic scatterplot matrix. Alternatively, the GGally package, a helper package for ggplot2, can generate scatterplot matrices and other, more complex matrix figures in the ggplot2 style. It contains templates for different plots that can be combined into a plot matrix, a parallel coordinate plot function, and a function for making network plots. The main function in this package is ggpairs(), which generates a scatterplot matrix from ggplot2 graphs, and its use is quite straightforward. We will see examples with the iris dataset, which we have already used in the book.
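For comparison, the base-graphics route mentioned above can be sketched with pairs(). This is a minimal sketch rather than code from the text: restricting the matrix to the four numeric columns and the choice of colors are assumptions made for illustration.

```r
# Base-graphics scatterplot matrix of the iris measurements.
# Assumption: only the four numeric columns are plotted; Species is used for color.
numeric_cols <- iris[, 1:4]

# Map each species (a factor with three levels) to an illustrative color
species_colors <- c("red", "darkgreen", "blue")[as.integer(iris$Species)]

pairs(numeric_cols, col = species_colors, pch = 19)
```

The ggpairs() function from GGally, described next, builds the same kind of grid but chooses a plot type for each panel according to the variables involved.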

A basic call to ggpairs() simply passes the dataset to the function, optionally along with typical ggplot2 arguments such as color and alpha, as in the following example:

library(GGally)
ggpairs(iris, color='Species', alpha=0.4)

Used in this way, the function generates a scatterplot matrix with every column of the dataset as a variable and selects a suitable default plot type for each panel depending on the type of the variables involved. You can see the plot we obtained in Figure 7.10:


Figure 7.10: A scatterplot matrix of the iris dataset with default settings
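The call above follows the interface of the GGally version this text was written against. As far as we are aware, newer GGally releases (from around version 1.0) ignore extra arguments such as color and alpha with a warning, and expect aesthetics to be supplied through the mapping argument built with ggplot2::aes(). A hedged sketch of the equivalent call, assuming such a release is installed:

```r
library(GGally)
library(ggplot2)  # provides aes()

# Assumption: a recent GGally release, where aesthetics are passed via `mapping`
# instead of as bare `color = 'Species'` and `alpha = 0.4` arguments.
p <- ggpairs(iris, mapping = aes(color = Species, alpha = 0.4))

# Draw the plot matrix
print(p)
```

If your installed version accepts the original call without a warning, there is no need to switch; the two forms are meant to produce the same kind of figure.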

As illustrated in the resulting plot, the data from the three flower species is represented in different colors. The variable names appear along the diagonal of the matrix, and the relationships between variables are described in the remaining subplots. The plot matrix can be divided into two areas, a lower and an upper part, which lie below and above the diagonal, respectively. For the continuous variables, Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width, the panels show the relationship between each pair of variables. In the lower part, the data is represented as points, while in the upper part, it is summarized by the correlation coefficient. Since we have split the data into groups according to the values of the Species column, both representations are broken down by species. The categorical variable, Species, is represented as a histogram in the lower part and as a boxplot in the upper one.

A very useful feature of the ggpairs() function is that you can choose which representation to include in the lower and upper parts of the matrix depending on the type of the variables represented. To do so, use the upper or lower argument and provide a list containing the desired plot types. For each variable combination, only one plot type can be selected. The following table summarizes the available plot options for each variable combination:

Argument | Variable combination | Plots available
continuous | continuous versus continuous | "points", "smooth", "density", "cor", "blank"
discrete | discrete versus discrete | "facetbar", "ratio", "blank"
combo | continuous versus discrete | "box", "dot", "facethist", "facetdensity", "denstrip", "blank"

The "blank" option is available for every combination and leaves the corresponding panels empty. So, for instance, if you wanted a density plot for the combinations of continuous variables in the upper panels, you would use the following code:

ggpairs(iris, upper=list(continuous="density"), color='Species')

You can modify the lower plots in the same way:

ggpairs(iris, upper=list(continuous="density"), lower=list(continuous="smooth"))

You can see this last plot in Figure 7.11:


Figure 7.11: A scatterplot matrix of the iris dataset with density plots in the upper area and smooth lines in the lower area for continuous variables

As the generated plot shows, in this case we have removed the coloring of the observations by species. As a consequence, the smooth lines in the lower panels are fitted as if all the data came from a single group. Notice also that the panels combining the Species variable with the other variables did not change. These panels fall into the combo category, since they combine a categorical variable with a continuous one. So, if we also want to modify these panels, for instance, to show density plots, and to obtain a separate smooth line for each flower species, we can use the following code:

ggpairs(iris, upper=list(continuous="density"), lower=list(continuous="smooth", combo="facetdensity"), color="Species")

You can see the resulting plot in Figure 7.12:


Figure 7.12: A scatterplot matrix of the iris dataset with density plots in the upper area and smooth lines in the lower area for continuous variables and density plots in the lower panels for combo variables
