Layers in ggplot2

As discussed in the previous section, layers are central to creating a plot with ggplot2. They are combined with a coordinate system and other transformations to generate the final plot. But what exactly are layers? In the grammar of graphics as implemented in ggplot2, layers are responsible for the objects that we see in the graph. Each layer can come from a different dataset, have a different geometry, and have a different aesthetic mapping. As you can see in Figure 3.1, a layer is composed of several components: the data, aesthetic mapping, geom, stat, and position adjustment. Not all of these components need to be specified explicitly; a minimal layer can be created just by providing the data, the aesthetic mapping, and the geom that defines the type of plot to be generated. The geometry is a particularly important component, since no visualization is possible without specifying it.
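As a minimal sketch of these components, the following code builds the same scatterplot twice: once with the usual shortcut and once with an explicit call to layer() that names the geom, stat, and position adjustment. The mtcars dataset ships with R and is used here only for illustration; it is not one of the datasets from this book.

```r
library(ggplot2)

# Minimal layer: data + aesthetic mapping + geom.
p_minimal <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

# The same layer with every component written out explicitly.
p_explicit <- ggplot() +
  layer(
    data = mtcars,
    mapping = aes(x = wt, y = mpg),
    geom = "point",
    stat = "identity",
    position = "identity"
  )

# Each plot object stores its layers in a list; both have exactly one.
length(p_minimal$layers)
length(p_explicit$layers)
```

In everyday use, the geom_point() shortcut is preferable; the explicit form is shown only to make the five components visible.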

Data

The data component represents the actual data shown in the plot. Here, ggplot2 has a major restriction compared with other plotting packages in R: the data must be a data frame. This means that even if you have your observations as vectors, for instance, you first need to combine them in a data frame and then create the plot. The reason for this is to ensure that the data used in the plots can be easily traced back, even by people other than the author. Moreover, structuring the data in data frames encourages the user to keep it organized, thus reducing the possibility of mistakes.
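The step from vectors to a data frame can be sketched as follows. The vectors height and weight and their values are hypothetical and serve only as an illustration.

```r
library(ggplot2)

# Hypothetical observations stored as two separate vectors.
height <- c(1.62, 1.75, 1.80, 1.68)
weight <- c(58, 72, 81, 65)

# Combine the vectors in a data frame before plotting.
measurements <- data.frame(height = height, weight = weight)

p <- ggplot(measurements, aes(x = height, y = weight)) +
  geom_point()
```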

You should also keep in mind that when you create a plot object, the data is copied into the object, so if you later change something in the data, the change will not appear in the plot unless you create the plot object again. This is particularly important since ggplot2 objects can be saved in variables or stored in a workspace, so when you save and load ggplot2 objects, check that the plot actually reflects any changes in the data.
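A small sketch of this behavior, using a hypothetical data frame df: the plot object keeps the data it was created with, and it picks up the new values only when it is created again.

```r
library(ggplot2)

df <- data.frame(x = 1:3, y = c(2, 4, 6))
p <- ggplot(df, aes(x = x, y = y)) + geom_point()

# Change the data after the plot object has been created.
df$y <- c(20, 40, 60)

# The plot object still holds the values it was created with.
p$data$y

# Creating the plot object again picks up the modified data.
p_updated <- ggplot(df, aes(x = x, y = y)) + geom_point()
p_updated$data$y
```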

Aesthetic mapping

As you saw in the examples presented previously in this chapter, aesthetic mapping is provided via the aes() function, which can be used, for instance, for x and y mapping, color mapping, or size and shape mapping. All the mapped variables should be present in the data provided. If the mapping is performed within a geom or stat function and refers to a dataset other than the one given to ggplot(), that data should also be specified within the call to that function.
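The two places where a mapping can be given can be sketched like this, again using the built-in mtcars dataset purely as an assumption for illustration. In the first plot, the mapping is set in ggplot() and shared by the layers; in the second, both the data and the mapping are supplied inside the geom function.

```r
library(ggplot2)

# Mapping (and data) set once in ggplot(): every layer inherits them.
p_global <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Mapping set inside the geom: the data is supplied in the same call.
p_local <- ggplot() +
  geom_point(data = mtcars,
             mapping = aes(x = wt, y = mpg, color = factor(cyl)))

# ggplot2 stores the color aesthetic under the name "colour".
names(p_global$mapping)
names(p_local$layers[[1]]$mapping)
```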

Tip

Aesthetic mapping to x and y

When working with aesthetic mapping, keep in mind that even the mapping to the x and y variables in the plot is a part of aesthetic mapping and, for this reason, it must be included in the aes() function.

When discussing scales, we described how scales are used for mapping aesthetic arguments such as the x-y position in the axes. In the case of color, we can have continuous mapping, where different color levels are mapped to a continuous variable, or discrete mapping, where the levels of a categorical variable are mapped to different colors. You can see examples of such scales in Figure 3.4. When colors are mapped to a categorical variable, the continuous scale of colors is used to select specific color values, one for each level of the variable. By default, these colors are selected as equally spaced from the so-called color wheel, represented in Figure 3.12:

Figure 3.12: This is a color wheel used to select equally spaced colors for the mapping to categorical covariates. Three equally spaced colors assigned by default in ggplot2 are also shown on the outside of the wheel

The default color scheme is selected using the scale_color_hue() function, which uses the hue_pal function from the scales package to assign the selected color. Calling this function allows you to find the actual color used in the plot by default; this can turn out to be quite useful if you need to reproduce a color assigned by default in a plot. For instance, if you want to know the first three colors of the series, you can use the following code:

library(scales)
scales::hue_pal()(3)

The output will be as follows:

 [1] "#F8766D" "#00BA38" "#619CFF"

In Figure 3.12, you can see where these colors are located with respect to the color wheel and how they are actually equally spaced starting from the twelve o'clock position on the wheel.
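As a quick check of this correspondence, the following sketch compares the colors actually used in a plot with the values returned by hue_pal(). It assumes a version of ggplot2 that provides layer_data(), and it uses the built-in mtcars dataset, where factor(cyl) has three levels.

```r
library(ggplot2)
library(scales)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Colors assigned to the points by the default discrete color scale.
used_colors <- unique(layer_data(p)$colour)

# The first three default colors from the scales package.
default_colors <- hue_pal()(3)

# Both sets should contain the same three colors.
setequal(used_colors, default_colors)
```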

The aesthetic attributes that you can map to variables depend on the geom function used. In the following table, you can find the arguments, mandatory and optional, associated with the most important geom functions. As you can easily imagine, the x and/or y arguments are often mandatory, but the optional arguments are also interesting since those are the arguments you can use to personalize your plot and to shape it in the best way to describe the data you have. Among such arguments, you will find, for instance, the fill argument we used to color the internal part of the histograms or the alpha argument we used for transparency. You can use this table as a reference to quickly search for such arguments and to have a look at the possible alternatives that could be provided by different functions:

Main geom functions | Mandatory aesthetic | Optional aesthetic
--- | --- | ---
geom_abline |  | alpha, color, linetype, size
geom_area | x, ymax (ymin fixed to 0) | alpha, color, fill, linetype, size
geom_bar | x | alpha, color, fill, linetype, size, weight
geom_boxplot | lower, middle, upper, ymax, ymin | x, alpha, color, fill, linetype, shape, size, weight
geom_density | x, y | alpha, color, fill, linetype, size, weight
geom_dotplot | x, y | alpha, color, fill
geom_histogram | x | alpha, color, fill, linetype, size, weight
geom_hline |  | alpha, color, linetype, size
geom_jitter | x, y | alpha, color, fill, shape, size
geom_line | x, y | alpha, color, linetype, size
geom_point | x, y | alpha, color, fill, shape, size
geom_ribbon | x, ymax, ymin | alpha, color, fill, linetype, size
geom_smooth | x, y | alpha, color, fill, linetype, size, weight
geom_text | label, x, y | alpha, angle, color, family, fontface, hjust, lineheight, size, vjust

Geometric

The geometry attributes define the actual type of plot that will be applied to the data provided with the ggplot() function. These attributes are provided using functions with the general form geom_x, where x can be replaced by the specified geometry, such as, for instance, histogram or point. Additionally, different data can be used by providing a new dataset to the geom function. It is also possible to combine different geometries by combining different functions with the + operator, for instance, geom_point() + geom_smooth().
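The following sketch illustrates both ideas: combining geometries with the + operator and giving one geom its own dataset. The mtcars dataset and the subset named heavy are assumptions made only for this illustration.

```r
library(ggplot2)

# Two geometries combined with +; both inherit the data and mapping.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm")

# A third layer drawn from a different dataset: only the heaviest cars.
heavy <- subset(mtcars, wt > 5)
p_highlight <- p +
  geom_point(data = heavy, color = "red", size = 3)

length(p_highlight$layers)
```

The last layer still inherits the x and y mapping from ggplot(), so the new dataset only needs to contain the wt and mpg columns.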

Tip

How to find the names of geom functions

When using ggplot2, you may not remember all the names of the geom functions, particularly if you need a special functionality you are not familiar with. You can use the tables provided in this book or have a look at the ggplot2 website, but one useful trick is the apropos() function available in R, which searches for functions by a string contained in the function name. For instance, the following code lists all the functions with the "geom" string in their name:

apropos("geom")

Of course, a similar approach could also be used to search for the "stat" and "coord" functions.
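For example, since apropos() accepts a regular expression, the search can be restricted to names that start with a given prefix. This is a small sketch; the exact list returned depends on the packages you have loaded and on your ggplot2 version.

```r
library(ggplot2)

# Functions whose names start with "stat_" or "coord_".
stat_functions <- apropos("^stat_")
coord_functions <- apropos("^coord_")

head(stat_functions)
head(coord_functions)
```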

In the following table, you will find a reference to the most important geom functions available in ggplot2, with a short description indicating the actual plot generated by the function. For each function, you can also find the default statistical transformation it executes. Pay attention to this argument because, if you need a different statistical transformation, you will have to change it. You will find more details about statistical transformations in the next section.

Here's the table we talked about:

Main geom functions | Default stat | Description
--- | --- | ---
geom_abline | abline | This is a line specified by the slope and intercept.
geom_area | identity | This is an area plot, which is a continuous analogue of a stacked bar chart. It is a special case of geom_ribbon.
geom_bar | bin | These are bars with bases on the x axis.
geom_blank | identity | This is blank and doesn't draw anything.
geom_boxplot | boxplot | This is a box-and-whiskers plot.
geom_density | density | This is a smooth density estimate calculated by stat_density.
geom_dotplot | bindot | This is a dot plot (the width of a dot corresponds to the bin width and each dot represents one observation).
geom_errorbar | identity | These add error bars to plots by coupling with other geometries.
geom_errorbarh | identity | These are horizontal error bars.
geom_histogram | bin | This is a histogram.
geom_hline | hline | This is a horizontal line.
geom_jitter | identity | These are points jittered (usually to reduce overplotting).
geom_line | identity | These connect observations ordered by the x value.
geom_path | identity | These connect observations in their original order.
geom_point | identity | This represents observations as points, as in a scatterplot.
geom_pointrange | identity | This is an interval represented by a vertical line, with a point in the middle.
geom_ribbon | identity | This is a ribbon of the y range with continuous x values.
geom_smooth | smooth | These add a smoothed conditional mean.
geom_text | identity | These are textual annotations.
geom_tile | identity | This is a tile plane with rectangles.
geom_vline | vline | This is a vertical line.

Stat

A statistical transformation, or stat, is a statistical manipulation applied to the data, usually to summarize it. A simple example is the stat_bin() transformation, which summarizes the data in bins, typically for representation in a histogram. The general structure of these functions is "stat_x", where x can be replaced by the statistical transformation.

As you have seen in the previous table, each geometry comes with a default statistical transformation that is applied to the data. You might wonder why your data should be statistically manipulated if you only need a typical x-y plot. Among the statistical transformations, however, there is also the identity transformation, which simply means that the data is left unchanged. This is usually the transformation applied to geoms for which an actual transformation is not needed. With this approach, there is always a connection between the geom and stat arguments in ggplot2, which keeps the code structure coherent and at the same time provides high flexibility, since you can always change the default stat argument and generate new plots.
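As a sketch of changing the default stat, suppose the data has already been summarized, so the bar heights are known and no counting is needed. The data frame counts below is hypothetical. Setting stat = "identity" in geom_bar() tells ggplot2 to plot the values as they are.

```r
library(ggplot2)

# Hypothetical pre-summarized data: one row per category.
counts <- data.frame(category = c("a", "b", "c"), n = c(5, 3, 8))

# Replace the default stat of geom_bar() with the identity transformation,
# so the bar heights are taken directly from the n column.
p <- ggplot(counts, aes(x = category, y = n)) +
  geom_bar(stat = "identity")

class(p$layers[[1]]$stat)[1]
```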

What a statistical transformation does is take the data provided for the plot, apply the transformation, and return a new dataset, which is then used in the plot (as mentioned, the stat_identity() function leaves the data unchanged). Depending on the stat applied, this new dataset can contain new variables produced by the transformation. In the following table, you will find a summary of the main stat functions with a short description and a list of the new variables created in the transformation. These new variables are interesting since they can also be mapped to aesthetic attributes in the plot. We will see a few examples of how to do this in Chapter 4, Advanced Plotting Techniques.

Main stat functions | New variables created | Description
--- | --- | ---
stat_bin | count, density, ncount, ndensity | These split data into bins for histograms.
stat_bindot | x, y, binwidth, count, ncount, density, ndensity | These split data into bins for dot plots.
stat_boxplot | width, ymin, lower, notchlower, middle, notchupper, upper, ymax | This calculates the components of a box-and-whiskers plot.
stat_density | density, count, scaled | These calculate the kernel density estimate for a density plot (geom_density).
stat_function | x, y | These superimpose a function to the plot.
stat_identity |  | These plot data without any statistical transformation.
stat_quantile | quantile | These calculate continuous quantiles.
stat_smooth | y, ymin, ymax, se | These add a smoother line.
stat_sum | n, prop | This is the sum of unique values. This is useful for plotting on scatterplots.
stat_summary | fun.data, fun.ymin, fun.y, fun.ymax | These summarise y values at every unique x value.
stat_unique |  | These remove duplicates.
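To see the new variables for yourself, you can inspect the dataset produced by a stat. The sketch below does this for stat_bin() through geom_histogram(), using the built-in mtcars dataset as an illustration. It assumes a recent version of ggplot2: layer_data() returns the computed data for a layer, and after_stat() (available from ggplot2 3.3.0; older versions used a different notation) maps a computed variable to an aesthetic.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# Dataset returned by the stat, including the newly created variables.
computed <- layer_data(p)
names(computed)

# The counts over all bins add up to the number of observations.
sum(computed$count)

# Map the computed density, instead of the default count, to the y axis.
# after_stat() is an assumption about the installed ggplot2 version.
p_density <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), bins = 10)
```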

Position adjustment

Position adjustments are used to adjust the position of each geom. These adjustments do not refer to formatting the legend, axes, titles, and other similar components of the plot; they apply only to the elements in the plot area, such as bars in a bar plot and points in a scatterplot, and they can be applied to continuous and categorical data. As for the stat function, even in this case, we have the position_identity() function which does not adjust the position and which is used if there is no need for any adjustment.

Position adjustment of categorical data

These kinds of adjustments are more commonly used. They are often applied to bar plots in order to adjust the position of the bars. We have already seen an example of such an adjustment in Figure 2.7 in Chapter 2, Getting Started. There are different kinds of adjustments available:

  • Dodge: It is done using the position_dodge() function. In this adjustment, the bars in a bar plot are placed next to each other for each category.
  • Fill: It is done with the position_fill() function. In this adjustment, the objects are stacked on top of each other and standardized to have the same height. In a bar plot, bars of the same category are stacked upon one another and the total heights are equalized, so the bars represent proportions and not absolute counts.
  • Stack: It is done with the position_stack() function. It is the same as fill but without the height standardization. Stacking is the default behavior in many area plots, such as bar plots.

We will now recreate the same plots as in Figure 2.7 of Chapter 2, Getting Started, by using the ggplot() function and the position adjustment functions, so you will have a reference on how to use the functions we have just introduced. The position adjustment is specified within the geom function to which it should be applied. To do that, you can simply pass the desired position to the position argument of the geom function. This is the easiest way to use position adjustment and, in this case, you will use the default specification of each position_x function. The following code shows this:

ggplot(data=myMovieData, aes(x=Type,fill=factor(Short))) + geom_bar(position="stack")
ggplot(data=myMovieData, aes(x=Type,fill=factor(Short))) + geom_bar(position="dodge")
ggplot(data=myMovieData, aes(x=Type,fill=factor(Short))) + geom_bar(position="fill")

If, on the other hand, you want to provide specifications different to the default values, you can use position functions as in the following example:

ggplot(data=myMovieData, aes(x=Type,fill=factor(Short))) + geom_bar(position=position_dodge(width = 0.5))

Position adjustment of continuous data

There is only one position adjustment for continuous data, and that is jittering. We have already seen an example of jittering in Figure 2.12 of Chapter 2, Getting Started. Jittering as a position adjustment is performed by the position_jitter() function. However, since jitter is the default position adjustment in the geom_jitter() function, in most cases, if you want to jitter the data, you can simply use the geom_jitter() function. On the other hand, if you need to specify parameters different to the default values, then you will need to use the position_jitter() function.
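As a sketch of the second case, the following code jitters the points only horizontally by passing non-default values to position_jitter(). The mtcars dataset and the chosen width are assumptions made for illustration, and layer_data() is assumed to be available in your ggplot2 version.

```r
library(ggplot2)

# Jittering is random; setting a seed first helps make it reproducible.
set.seed(42)

# Jitter by at most 0.2 units along x and not at all along y.
p <- ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point(position = position_jitter(width = 0.2, height = 0))

# Positions after the adjustment: x is shifted, y is unchanged.
jittered <- layer_data(p)
range(jittered$x - mtcars$cyl)
```

Keeping height at 0 is a deliberate choice here: the y values carry the measurement of interest, so only the x positions are spread out to reduce overplotting.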
