Chapter 14. Exploring and Visualizing

Once you’ve imported your data and cleaned and transformed it into a suitable state, you get to start asking questions like “what does it all mean?” The two main tools at your disposal are summary statistics and plots. (Modeling comes later, because you need to understand your data before you can model it properly.) R is well served by a comprehensive set of functions for calculating statistics, and a choice of three different graphics systems.

Chapter Goals

After reading this chapter, you should:

Be able to calculate a range of summary statistics on numeric data
Be able to draw standard plots in R’s three plotting systems
Be able to manipulate those plots in simple ways

Summary Statistics

We’ve already come across many of the functions for calculating summary statistics, so this section is partly a recap. Most are fairly obvious in their naming and their usage; for example, mean and median calculate their respective measures of location. There isn’t a function for the mode, but it can be calculated from the results of the table function, which gives counts of each element. (If you haven’t already, have a go at Exercise 13-3 now.)

In the following examples, the obama_vs_mccain dataset contains the fractions of people voting for Obama and McCain in the 2008 US presidential elections, along with some contextual background information on demographics:

data(obama_vs_mccain, package = "learningr")
obama <- obama_vs_mccain$Obama
mean(obama)

## [1] 51.29

median(obama)

## [1] 51.38

The table function doesn’t make a great deal of sense for the obama variable (or many numeric variables) since each value is unique. By combining it with cut, we can see how many values fall into different bins:

table(cut(obama, seq.int(0, 100, 10)))

##
##   (0,10]  (10,20]  (20,30]  (30,40]  (40,50]  (50,60]  (60,70]  (70,80]
##        0        0        0        8       16       16        9        1
##  (80,90] (90,100]
##        0        1

var and sd calculate the variance and standard deviation, respectively. Slightly less common is the mad function for calculating the median absolute deviation:

var(obama)

## [1] 123.1

sd(obama)

## [1] 11.09

mad(obama)

## [1] 11.49
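
To see what mad is doing under the hood, here is a minimal sketch on a small made-up vector (the values are hypothetical, not from the election data). By default mad multiplies the raw median absolute deviation by a constant of 1.4826, so that the result is comparable to a standard deviation for normally distributed data:

```r
# Made-up values with one obvious outlier
x <- c(2, 4, 4, 5, 7, 9, 30)

# Raw median absolute deviation: the median distance from the median
raw_mad <- median(abs(x - median(x)))
raw_mad                      # 2

# mad() scales the raw value by 1.4826 unless told otherwise
mad(x)                       # 2.9652
mad(x, constant = 1)         # 2, the unscaled version

# The outlier inflates sd far more than it inflates mad
sd(x) > mad(x)               # TRUE
```

The comparison on the last line is the reason to reach for mad: it is much less sensitive to extreme values than sd.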

There are several functions for getting the extremes of numeric data. min and max are the most obvious, giving the smallest and largest values of all their inputs, respectively. pmin and pmax (the “parallel” equivalents) calculate the smallest and largest values at each point across several vectors of the same length. Meanwhile, the range function gives the minimum and maximum in a single function call:

min(obama)

## [1] 32.54

with(obama_vs_mccain, pmin(Obama, McCain))

##  [1] 38.74 37.89 44.91 38.86 36.91 44.71 38.22  6.53 36.93 48.10 46.90
## [12] 26.58 35.91 36.74 48.82 44.39 41.55 41.15 39.93 40.38 36.47 35.99
## [23] 40.89 43.82 43.00 49.23 47.11 41.60 42.65 44.52 41.61 41.78 36.03
## [34] 49.38 44.50 46.80 34.35 40.40 44.15 35.06 44.90 44.75 41.79 43.63
## [45] 34.22 30.45 46.33 40.26 42.51 42.31 32.54

range(obama)

## [1] 32.54 92.46
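
The difference between min and pmin is easier to see on tiny inputs. The two vectors below are invented purely for illustration:

```r
# Two short, made-up vectors of the same length
a <- c(1, 8, 3)
b <- c(4, 2, 6)

min(a, b)     # 1: a single value, the smallest of all six numbers
pmin(a, b)    # 1 2 3: the smaller value at each position
pmax(a, b)    # 4 8 6: the larger value at each position
range(a, b)   # 1 8: overall minimum and maximum together
```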

cummin and cummax provide the smallest and largest values so far in a vector. Similarly, cumsum and cumprod provide sums and products of the values to date. These functions make most sense when the input has been ordered in a useful way:

cummin(obama)

##  [1] 38.74 37.89 37.89 37.89 37.89 37.89 37.89 37.89 37.89 37.89 37.89
## [12] 37.89 35.91 35.91 35.91 35.91 35.91 35.91 35.91 35.91 35.91 35.91
## [23] 35.91 35.91 35.91 35.91 35.91 35.91 35.91 35.91 35.91 35.91 35.91
## [34] 35.91 35.91 35.91 34.35 34.35 34.35 34.35 34.35 34.35 34.35 34.35
## [45] 34.22 34.22 34.22 34.22 34.22 34.22 32.54

cumsum(obama)

##  [1]   38.74   76.63  121.54  160.40  221.34  275.00  335.59  428.05
##  [9]  489.96  540.87  587.77  659.62  695.53  757.38  807.23  861.16
## [17]  902.71  943.86  983.79 1041.50 1103.42 1165.22 1222.55 1276.61
## [25] 1319.61 1368.84 1415.95 1457.55 1512.70 1566.83 1623.97 1680.88
## [33] 1743.76 1793.46 1837.96 1889.34 1923.69 1980.44 2034.91 2097.77
## [41] 2142.67 2187.42 2229.21 2272.84 2307.06 2374.52 2427.15 2484.49
## [49] 2527.00 2583.22 2615.76

cumprod(obama)

##  [1] 3.874e+01 1.468e+03 6.592e+04 2.562e+06 1.561e+08 8.377e+09 5.076e+11
##  [8] 4.693e+13 2.905e+15 1.479e+17 6.937e+18 4.984e+20 1.790e+22 1.107e+24
## [15] 5.519e+25 2.976e+27 1.237e+29 5.089e+30 2.032e+32 1.173e+34 7.261e+35
## [22] 4.487e+37 2.572e+39 1.391e+41 5.980e+42 2.944e+44 1.387e+46 5.769e+47
## [29] 3.182e+49 1.722e+51 9.841e+52 5.601e+54 3.522e+56 1.750e+58 7.789e+59
## [36] 4.002e+61 1.375e+63 7.801e+64 4.249e+66 2.671e+68 1.199e+70 5.367e+71
## [43] 2.243e+73 9.785e+74 3.349e+76 2.259e+78 1.189e+80 6.817e+81 2.898e+83
## [50] 1.629e+85 5.302e+86
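
As a sketch of the "ordered in a useful way" point, suppose we have some hypothetical daily sales figures, stored in date order. The cumulative functions then have a natural reading as running totals and records to date:

```r
# Hypothetical sales for five consecutive days, in date order
daily_sales <- c(5, 3, 8, 2, 7)

cumsum(daily_sales)   # 5 8 16 18 25: total sales up to each day
cummax(daily_sales)   # 5 5 8 8 8: best single day so far
cummin(daily_sales)   # 5 3 3 2 2: worst single day so far
```

The states in obama_vs_mccain are stored alphabetically, which is why the cumulative results above are of limited interest.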

The quantile function provides, as you might expect, quantiles (median, min, and max are special cases). It defaults to the median, minimum, maximum, and lower and upper quartiles, and in an impressive feat of overengineering, it gives a choice of nine different calculation algorithms:

quantile(obama)

##    0%   25%   50%   75%  100%
## 32.54 42.75 51.38 57.34 92.46

quantile(obama, type = 5)    #to reproduce SAS results

##    0%   25%   50%   75%  100%
## 32.54 42.63 51.38 57.34 92.46

quantile(obama, c(0.9, 0.95, 0.99))

##   90%   95%   99%
## 61.92 65.17 82.16

IQR wraps quantile to give the interquartile range (the 75th percentile minus the 25th percentile):

IQR(obama)

## [1] 14.58

fivenum provides a faster, greatly simplified alternative to quantile. You only get one algorithm, and only the default quantiles can be calculated. It has a niche use where speed matters:

fivenum(obama)

## [1] 32.54 42.75 51.38 57.34 92.46
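
One caveat worth knowing: fivenum returns Tukey's hinges, which are not always identical to the quartiles from quantile's default algorithm. The two agree for the obama vector above, but a small made-up example shows that they can differ:

```r
# A tiny, made-up input with an even number of values
x <- 1:4

fivenum(x)            # 1.00 1.50 2.50 3.50 4.00 (hinges)
unname(quantile(x))   # 1.00 1.75 2.50 3.25 4.00 (default type 7 quartiles)
```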

There are some shortcuts for calculating multiple statistics at once. You’ve already met the summary function, which accepts vectors or data frames:

summary(obama_vs_mccain)

##         State        Region       Obama          McCain
##  Alabama   : 1   IV     : 8   Min.   :32.5   Min.   : 6.53
##  Alaska    : 1   I      : 6   1st Qu.:42.8   1st Qu.:40.39
##  Arizona   : 1   III    : 6   Median :51.4   Median :46.80
##  Arkansas  : 1   V      : 6   Mean   :51.3   Mean   :47.00
##  California: 1   VIII   : 6   3rd Qu.:57.3   3rd Qu.:55.88
##  Colorado  : 1   VI     : 5   Max.   :92.5   Max.   :65.65
##  (Other)   :45   (Other):14
##     Turnout      Unemployment      Income        Population
##  Min.   :50.8   Min.   :3.40   Min.   :19534   Min.   :  563626
##  1st Qu.:61.0   1st Qu.:5.05   1st Qu.:23501   1st Qu.: 1702662
##  Median :64.9   Median :5.90   Median :25203   Median : 4350606
##  Mean   :64.1   Mean   :6.01   Mean   :26580   Mean   : 6074128
##  3rd Qu.:68.0   3rd Qu.:7.25   3rd Qu.:28978   3rd Qu.: 6656506
##  Max.   :78.0   Max.   :9.40   Max.   :40846   Max.   :37341989
##  NA's   :4
##     Catholic      Protestant       Other      Non.religious     Black
##  Min.   : 6.0   Min.   :26.0   Min.   :0.00   Min.   : 5    Min.   : 0.4
##  1st Qu.:12.0   1st Qu.:46.0   1st Qu.:2.00   1st Qu.:12    1st Qu.: 3.1
##  Median :21.0   Median :54.0   Median :3.00   Median :15    Median : 7.4
##  Mean   :21.7   Mean   :53.8   Mean   :3.29   Mean   :16    Mean   :11.1
##  3rd Qu.:29.0   3rd Qu.:62.0   3rd Qu.:4.00   3rd Qu.:19    3rd Qu.:15.2
##  Max.   :46.0   Max.   :80.0   Max.   :8.00   Max.   :34    Max.   :50.7
##  NA's   :2      NA's   :2      NA's   :2      NA's   :2
##      Latino      Urbanization
##  Min.   : 1.2   Min.   :   1
##  1st Qu.: 4.3   1st Qu.:  46
##  Median : 8.2   Median : 101
##  Mean   :10.3   Mean   : 386
##  3rd Qu.:12.1   3rd Qu.: 221
##  Max.   :46.3   Max.   :9856
##

The cor function calculates correlations between numeric vectors. As you would expect, there was an almost perfect negative correlation between the fraction of people voting for Obama and the fraction of people voting for McCain. (The slight imperfection is caused by voters for independent candidates.) The cancor function (short for “canonical correlation”) provides extra details, and the cov function calculates covariances:

with(obama_vs_mccain, cor(Obama, McCain))

## [1] -0.9981

with(obama_vs_mccain, cancor(Obama, McCain))

## $cor
## [1] 0.9981
##
## $xcoef
##         [,1]
## [1,] 0.01275
##
## $ycoef
##          [,1]
## [1,] -0.01287
##
## $xcenter
## [1] 51.29
##
## $ycenter
## [1] 47

with(obama_vs_mccain, cov(Obama, McCain))

## [1] -121.7
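
The correlation and covariance are closely related: the (Pearson) correlation is the covariance divided by the product of the two standard deviations. A quick check on two short made-up vectors:

```r
# Made-up vectors with a strong negative relationship
x <- c(1, 2, 3, 4, 5)
y <- c(9, 7, 6, 3, 1)

# Rescale the covariance by both standard deviations
r_by_hand <- cov(x, y) / (sd(x) * sd(y))

all.equal(r_by_hand, cor(x, y))   # TRUE
```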

The Three Plotting Systems

Over its lifetime, R has accumulated three different plotting systems. base graphics are the oldest system, having been around as long as R itself. base graphs are easy to get started with, but they require a lot of fiddling and magic incantations to polish, and are very hard to extend to new graph types.

To remedy some of the limitations of base, the grid graphics system was developed to allow more flexible plotting. grid lets you draw things at a very low level, specifying where to draw each point, line, or rectangle. While this is wonderful, none of us have time to write a couple of hundred lines of code each time we want to draw a scatterplot.

The second plotting system, lattice, is built on top of the grid system, providing high-level functions for all the common plot types. It has two standout features that aren’t available in base graphics. First, the results of each plot are saved into a variable, rather than just being drawn on the screen. This means that you can draw something, edit it, and draw it again; groups of related plots are easier to draw, and plots can be saved between sessions. The second great feature is that plots can contain multiple panels in a lattice,^[49] so you can split up your data into categories and compare the differences between groups. This solves the plotting equivalent of the split-apply-combine problem that we discussed in Chapter 9.

The ggplot2 system, also built on top of grid, is the most modern of the three plotting systems. The “gg” stands for “grammar of graphics,”^[50] which aims to break down graphs into component chunks. The result is that code for a ggplot looks a bit like the English way of articulating what you want in the graph.

The three systems are, sadly, mostly incompatible (there are ways to combine base and grid graphics, but they should be considered a last resort). The good news is that you can do almost everything you want in ggplot2, so learning all three systems is mostly overkill. There are a couple of rare use cases where ggplot2 isn’t appropriate: it does more calculation than other graphics systems, so for quick and dirty plots of very large datasets it can be more convenient to use another system. Also, many plotting packages are based on one of the other two systems, so using those packages requires a little knowledge of base or lattice.

The following examples demonstrate all three systems; if you are pushed for time, then just take note of the ggplot2 parts. Due to space constraints, this chapter can only give a taste of some of the possibilities on offer. Fortunately, there are three excellent and easy to read books on graph drawing in R, namely R Graphics, ggplot2, and Lattice, by the authors of the grid, ggplot2, and lattice systems, respectively.^[51]

Scatterplots

Perhaps the most common of all plots is the scatterplot, used for exploring the relationships between two continuous variables. The obama_vs_mccain dataset has lots of numeric variables that we can compare, but we’ll start by asking, “Does voter income affect turnout at the polls?”

Take 1: base Graphics

The base graphic function to draw a scatterplot is simply plot. The best-practice code style these days is to keep all the variables you want for a plot together inside a data frame (or possibly several), rather than having them scattered in individual vectors. Unfortunately, plot predates^[52] this idea, so we have to wrap it in a call to with to access the columns.

Although plot will simply ignore missing values, for tidiness let’s remove the rows with missing Turnout values:

obama_vs_mccain <- obama_vs_mccain[!is.na(obama_vs_mccain$Turnout), ]

We can then create a simple scatterplot, shown in Figure 14-1:

with(obama_vs_mccain, plot(Income, Turnout))

Figure 14-1. A simple scatterplot using base graphics

plot has many arguments for customizing the output, some of which are more intuitive than others. col changes the color of the points. It accepts any of the named colors returned by colors, or an HTML-style hex value like "#123456". You can change the shape of the points with the pch argument (short for “plot character”).^[53] Figure 14-2 shows an updated scatterplot, changing the point color to violet and the point shape to filled-in circles:

with(obama_vs_mccain, plot(Income, Turnout, col = "violet", pch = 20))

Figure 14-2. Setting color and point shape using base graphics

Log scales are possible by setting the log argument. log = "x" means use a logarithmic x-scale, log = "y" means use a logarithmic y-scale, and log = "xy" makes both scales logarithmic. Figures 14-3 and 14-4 display some options for log-scaled axes:

with(obama_vs_mccain, plot(Income, Turnout, log = "y"))
#Fig. 14-3

with(obama_vs_mccain, plot(Income, Turnout, log = "xy"))
#Fig. 14-4

Figure 14-3. Log y-scale using base graphics

Figure 14-4. Log x- and y-scales using base graphics

We can see that there is a definite positive correlation between income and turnout, and it’s stronger on the log-log scale. A further question is, “Does the relationship hold across all of the USA?” To answer this, we can split the data up into the 10 Standard Federal Regions given in the Region column, and plot each of the subsets in a “matrix” in one figure. The layout function is used to control the layout of the multiple plots in the matrix. Don’t feel obliged to spend a long time trying to figure out the meaning of the next code chunk; it only serves to show that drawing multiple related plots together in base graphics is possible. Sadly, the code invariably looks like it fell out of the proverbial ugly tree, so this technique should only be used as a last resort. Figure 14-5 shows the result:

par(mar = c(3, 3, 0.5, 0.5), oma = rep.int(0, 4), mgp = c(2, 1, 0))
regions <- levels(obama_vs_mccain$Region)
plot_numbers <- seq_along(regions)
layout(matrix(plot_numbers, ncol = 5, byrow = TRUE))
for(region in regions)
{
  regional_data <- subset(obama_vs_mccain, Region == region)
  with(regional_data,  plot(Income, Turnout))
}

Figure 14-5. Multiple plots in the same figure using base graphics

Take 2: lattice Graphics

The lattice equivalent of plot is xyplot. It uses a formula interface to specify the variables for the x and y coordinates. Formulae will be discussed in more depth later (see “Formulae”), but for now just note that you need to type yvar ~ xvar. Conveniently, xyplot (and other lattice functions) takes a data argument that tells it which data frame to look for variables in. Figure 14-6 shows the lattice equivalent of Figure 14-1:

library(lattice)
xyplot(Turnout ~ Income, obama_vs_mccain)

Figure 14-6. A simple scatterplot using lattice graphics

Many of the options for changing plot features are the same as those in base graphics. Figure 14-7 changes the color and point shape, mimicking Figure 14-2:

xyplot(Turnout ~ Income, obama_vs_mccain, col = "violet", pch = 20)

Figure 14-7. Setting color and point shape using lattice graphics

Axis scales, however, are specified in a different way. lattice plots take a scales argument, which must be a list. The contents of this list must be name = value pairs; for example, log = TRUE sets a log scale for both axes. The scales list can also take further (sub)list arguments named x and y that specify settings for only those axes. Don’t panic, it isn’t as complicated as it sounds. Figures 14-8 and 14-9 show examples of scaled axes:

xyplot(
  Turnout ~ Income,
  obama_vs_mccain,
  scales = list(log = TRUE)            #both axes log scaled (Fig. 14-8)
)

xyplot(
  Turnout ~ Income,
  obama_vs_mccain,
  scales = list(y = list(log = TRUE))  #y-axis log scaled (Fig. 14-9)
)

Figure 14-8. Log x- and y-scales using lattice graphics

Figure 14-9. Log y-scale using lattice graphics
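
The examples above cover both axes and the y-axis alone. For completeness, here is a sketch of the remaining case, a log-scaled x-axis only. It uses R's built-in mtcars data frame as a stand-in (an assumption made so that the snippet runs without the learningr package):

```r
library(lattice)

# mtcars is a built-in dataset, used here only as a stand-in
p <- xyplot(
  mpg ~ hp,
  mtcars,
  scales = list(x = list(log = TRUE))  # x-axis log scaled, y-axis left linear
)

# lattice plots are objects; printing one draws it
class(p)   # "trellis"
```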

The formula interface makes splitting the data by region vastly easier. All we have to do is to append a | (that’s a “pipe” character; the same one that is used for logical “or”) and the variable that we want to split by, in this case Region. Using the argument relation = "same" means that each panel shares the same axes. Axis ticks for each panel are drawn on alternating sides of the plot when the alternating argument is TRUE (the default), or just the left and bottom otherwise. The output is shown in Figure 14-10; notice the improvement over Figure 14-5:

xyplot(
  Turnout ~ Income | Region,
  obama_vs_mccain,
  scales = list(
    log         = TRUE,
    relation    = "same",
    alternating = FALSE
  ),
  layout = c(5, 2)
)

Figure 14-10. Multiple plots in the same figure using lattice graphics

Another benefit is that lattice plots are stored in variables (as opposed to base plots, which are just drawn in a window), so we can sequentially update them. Figure 14-11 shows a lattice plot that is updated in Figure 14-12:

(lat1 <- xyplot(
  Turnout ~ Income | Region,
  obama_vs_mccain
))
#Fig. 14-11

(lat2 <- update(lat1, col = "violet", pch = 20))
#Fig. 14-12

Figure 14-11. This plot is stored as a variable that is reused in Figure 14-12

Figure 14-12. This plot reuses a lattice variable from Figure 14-11

Take 3: ggplot2 Graphics

ggplot2 (the “2” is because it took a couple of attempts to get it right) takes many of the good ideas in lattice and builds on them. So, splitting plots up into panels is easy, and sequentially building plots is also possible. Beyond that, ggplot2 has a few special tricks of its own. Most importantly, its “grammatical” nature means that it consists of small building blocks, so it’s easier to create brand new plot types, if you feel so inclined.

The syntax is very different from other plotting code, so mentally prepare yourself to look at something new. Each plot is constructed with a call to the ggplot function, which takes a data frame as its first argument and an aesthetic as its second. In practice, that means passing the columns for the x and y variables to the aes function. We then add a geom to tell the plot to display some points. Figure 14-13 shows the result:

library(ggplot2)
ggplot(obama_vs_mccain, aes(Income, Turnout)) +
  geom_point()

Figure 14-13. A simple scatterplot using ggplot2 graphics

ggplot2 recognizes the commands from base for changing the color and shape of the points, but also has its own set of more human-readable names. In Figure 14-14, “shape” replaces “pch,” and color can be specified using either “color” or “colour”:

ggplot(obama_vs_mccain, aes(Income, Turnout)) +
  geom_point(color = "violet", shape = 20)

Figure 14-14. Setting color and point shape using ggplot2 graphics

To set a log scale, we add a scale for each axis, as seen in Figure 14-15. The breaks argument specifies the locations of the axis ticks. It is optional, but used here to replicate the behavior of the base and lattice examples:

ggplot(obama_vs_mccain, aes(Income, Turnout)) +
  geom_point() +
  scale_x_log10(breaks = seq(2e4, 4e4, 1e4)) +
  scale_y_log10(breaks = seq(50, 75, 5))

Figure 14-15. Log scales using ggplot2 graphics

To split the plot into individual panels, we add a facet. Like the lattice plots, facets take a formula argument. Figure 14-16 demonstrates the facet_wrap function. For easy reading, the x-axis ticks have been rotated by 30 degrees and right-justified using the theme function:

ggplot(obama_vs_mccain, aes(Income, Turnout)) +
  geom_point() +
  scale_x_log10(breaks = seq(2e4, 4e4, 1e4)) +
  scale_y_log10(breaks = seq(50, 75, 5)) +
  facet_wrap(~ Region, ncol = 4) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Figure 14-16. Multiple plots in the same figure using ggplot2 graphics

To split by multiple variables, we would specify a formula like ~ var1 + var2 + var3. For the special case of splitting by exactly two variables, facet_grid provides an alternative that puts one variable in rows and one in columns.
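
As a rough sketch of the two-variable case, the following uses the built-in mtcars data frame (a stand-in chosen so that the snippet is self-contained; the column choices are arbitrary) to put one variable in rows and another in columns:

```r
library(ggplot2)

# mtcars is a built-in dataset, used here only for illustration
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  facet_grid(cyl ~ gear)   # rows split by cyl, columns split by gear

# Printing p draws a grid of panels, one per cyl/gear combination
```

The formula reads as rows ~ columns, so swapping the two sides transposes the grid of panels.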

As with lattice, ggplots can be stored in variables and added to sequentially. The next example redraws Figure 14-13 and stores it as a variable. As usual, wrapping the expression in parentheses makes it auto-print:

(gg1 <- ggplot(obama_vs_mccain, aes(Income, Turnout)) +
  geom_point()
)

Figure 14-17 shows the output. We can then update it as follows, with the result shown in Figure 14-18:

(gg2 <- gg1 +
  facet_wrap(~ Region, ncol = 5) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
)

Figure 14-17. This plot is stored as a variable that is reused in Figure 14-18

Figure 14-18. This plot reuses a ggplot2 variable from Figure 14-17

Line Plots

For exploring how a continuous variable changes over time, a line plot often provides more insight than a scatterplot, since it displays the connections between sequential values. These next examples examine a year in the life of the crab in the crab_tag dataset, and how deep in the North Sea it went.

In base, line plots are created in the same way as scatterplots, except that they take the argument type = "l". To avoid any dimensional confusion^[54] we plot the depth as a negative number rather than using the absolute values given in the dataset.

Ranges in the plot default to the ranges of the data (plus a little bit more; see the xaxs section of the ?par help page for the exact details). To get a better sense of perspective, we’ll manually set the y-axis limit to run from the deepest point that the crab went in the sea up to sea level, by passing a ylim argument. Figure 14-19 displays the resulting line plot:

with(
  crab_tag$daylog,
  plot(Date, -Max.Depth, type = "l", ylim = c(-max(Max.Depth), 0))
)

Figure 14-19. A line plot using base graphics

At the moment, this only shows half the story. The Max.Depth column is the deepest point in the sea that the crab reached on a given day. We also need to add a line for the Min.Depth to see the shallowest point on each day. Additional lines can be drawn on an existing plot using the lines function. The equivalent for scatterplots is points. Figure 14-20 shows the additional line:

with(
  crab_tag$daylog,
  lines(Date, -Min.Depth, col = "blue")
)

Figure 14-20. Adding a second line using base graphics

Line plots in lattice follow a similar pattern to base. They use xyplot, as with scatterplots, and require the same type = "l" argument. Specifying multiple lines is blissfully easy using the formula interface. Notice the + in the formula used to create the plot in Figure 14-21:

xyplot(-Min.Depth + -Max.Depth ~ Date, crab_tag$daylog, type = "l")

Figure 14-21. A line plot using lattice graphics

In ggplot2, swapping a scatterplot for a line plot is as simple as swapping geom_point for geom_line (Figure 14-22 shows the result):

ggplot(crab_tag$daylog, aes(Date, -Min.Depth)) +
  geom_line()

Figure 14-22. A line plot using ggplot2 graphics

There’s a little complication with drawing multiple lines, however. When you specify aesthetics in the call to ggplot, you specify them for every geom. That is, they are “global” aesthetics for the plot. In this case, we want to specify the maximum depth in one line and the minimum depth in another, as shown in Figure 14-23. One solution to this is to specify the y-aesthetic inside each call to geom_line:

ggplot(crab_tag$daylog, aes(Date)) +
  geom_line(aes(y = -Max.Depth)) +
  geom_line(aes(y = -Min.Depth))

Figure 14-23. Two lines with separate geoms using ggplot2 graphics

This is a bit clunky, though, as we have to call geom_line twice, and actually it isn’t a very idiomatic solution. The “proper” ggplot2 way of doing things, shown in Figure 14-24, is to melt the data to long form and then group the lines:

library(reshape2)
crab_long <- melt(
  crab_tag$daylog,
  id.vars      = "Date",
  measure.vars = c("Min.Depth", "Max.Depth")
)
ggplot(crab_long, aes(Date, -value, group = variable)) +
  geom_line()

Figure 14-24. Two lines with grouping using ggplot2 graphics

In this case, where there are only two lines, there is an even better solution that doesn’t require any data manipulation. geom_ribbon plots two lines, and the contents in between. For prettiness, we pass the color and fill arguments to the geom, specifying the color of the lines and the bit in between. Figure 14-25 shows the result:

ggplot(crab_tag$daylog, aes(Date, ymin = -Min.Depth, ymax = -Max.Depth)) +
  geom_ribbon(color = "black", fill = "white")

Figure 14-25. A ribbon plot using ggplot2 graphics

Whichever system you use to draw the plot, the behavior of the crab is clear. In September it lives in shallow waters for the mating season, then it spends a few months migrating into deeper territory. Through winter, spring, and summer it happily sits on the North Sea seabed (except for an odd, brief trip to the surface at the start of June—dodgy data, or a narrow escape from a fishing boat?), then it apparently falls off a cliff in mid-July, before making its way back to shallow climes for another round of rumpy-pumpy, at which point it is caught.

Histograms

If you want to explore the distribution of a continuous variable, histograms are the obvious choice.^[55]

For the next examples we’ll return to the obama_vs_mccain dataset, this time looking at the distribution of the percentage of votes for Obama. In base, the hist function draws a histogram, as shown in Figure 14-26. Like plot, it doesn’t have a data argument, so we have to wrap it inside a call to with:

with(obama_vs_mccain, hist(Obama))

Figure 14-26. A histogram using base graphics

The number of breaks is calculated by default by Sturges’s algorithm. It is good practice to experiment with the width of bins in order to get a more complete understanding of the distribution. This can be done in a variety of ways: you can pass hist a single number to specify the number of bins, or a vector of bin edges, or the name of a different algorithm for calculating the number of bins ("scott" and "fd" are currently supported on top of the default of "sturges"), or a function that calculates one of the first two options. It’s really flexible. In the following examples, the results of which are shown in Figures 14-27 to 14-31, the main argument creates a main title above the plot. It works for the plot function too:

with(obama_vs_mccain,
  hist(Obama, 4, main = "An exact number of bins")
)
#Fig. 14-27

Figure 14-27. Specifying histogram breaks using an exact number of bins with base graphics

with(obama_vs_mccain,
  hist(Obama, seq.int(0, 100, 5), main = "A vector of bin edges")
)
#Fig. 14-28

Figure 14-28. Specifying histogram breaks using a vector of bin edges with base graphics

with(obama_vs_mccain,
  hist(Obama, "FD", main = "The name of a method")
)
#Fig. 14-29

Figure 14-29. Specifying histogram breaks using the name of a method with base graphics

with(obama_vs_mccain,
  hist(Obama, nclass.scott, main = "A function for the number of bins")
)
#Fig. 14-30

Figure 14-30. Specifying histogram breaks using a function for the number of bins with base graphics

binner <- function(x)
{
  seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = 50)
}
with(obama_vs_mccain,
  hist(Obama, binner, main = "A function for the bin edges")
)
#Fig. 14-31

Figure 14-31. Specifying histogram breaks using a function for the bin edges with base graphics

The freq argument controls whether the histogram shows counts or probability densities in each bin. It defaults to TRUE if and only if the bins are equally spaced. Figure 14-32 shows the output:

with(obama_vs_mccain, hist(Obama, freq = FALSE))

Figure 14-32. A probability density histogram using base graphics
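
To check what the two scalings mean numerically, hist can be called with plot = FALSE, in which case it returns the bin calculations without drawing anything. The data below are made up for illustration:

```r
# Ten made-up values
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9)

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(x, breaks = seq(0, 10, 2), plot = FALSE)

h$counts    # 3 5 1 0 1: number of values in each bin
h$density   # 0.15 0.25 0.05 0.00 0.05: counts / (n * bin width)

# Densities are scaled so that the bar areas sum to one
sum(h$density * diff(h$breaks))   # 1
```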

lattice histograms behave in a similar manner to base ones, except for the usual benefits of taking a data argument, allowing easy splitting into panels, and saving plots as a variable. The breaks argument behaves in the same way as with hist. Figures 14-33 and 14-34 show lattice histograms and the specification of breaks:

histogram(~ Obama, obama_vs_mccain)
#Fig. 14-33

histogram(~ Obama, obama_vs_mccain, breaks = 10)
#Fig. 14-34

Figure 14-33. Histogram using lattice graphics

Figure 14-34. Specifying histogram breaks using lattice graphics

lattice histograms support counts, probability densities, and percentage y-axes via the type argument, which takes the string "count", "density", or "percent". Figure 14-35 shows the "percent" style:

histogram(~ Obama, obama_vs_mccain, type = "percent")

Figure 14-35. A percentage-scaled histogram using lattice graphics

ggplot2 histograms are created by adding a histogram geom. Bin specification is simple: just pass a numeric bin width to geom_histogram. The rationale is to force you to manually experiment with different numbers of bins, rather than settling for the default. Figure 14-36 shows the usage:

ggplot(obama_vs_mccain, aes(Obama)) +
  geom_histogram(binwidth = 5)

Figure 14-36. A histogram using ggplot2 graphics

You can choose between counts and densities by passing the special names ..count.. or ..density.. to the y-aesthetic. Figure 14-37 demonstrates the use of ..density..:

ggplot(obama_vs_mccain, aes(Obama, ..density..)) +
  geom_histogram(binwidth = 5)

Figure 14-37. A probability density histogram using ggplot2 graphics
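
If you are running a recent ggplot2 release, note that the double-dot names have been superseded by the after_stat function, and the older notation may produce a deprecation warning. A hedged sketch of the newer spelling, using made-up percentages rather than the election data:

```r
library(ggplot2)

# Made-up percentages, standing in for the Obama column
votes <- data.frame(pct = c(33, 41, 42, 45, 48, 51, 52, 55, 57, 61, 66, 92))

# after_stat(density) plays the same role as ..density..
p <- ggplot(votes, aes(pct, after_stat(density))) +
  geom_histogram(binwidth = 5)
```

Which spelling you need depends on the installed version, so treat this as an alternative rather than a replacement for the code above.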

Box Plots

If you want to explore the distribution of lots of related variables, you could draw lots of histograms. For example, if you wanted to see the distribution of Obama votes by US region, you could use latticing/faceting to draw 10 histograms. This is just about feasible, but it doesn’t scale much further. If you need a hundred histograms, the space requirements can easily overwhelm the largest monitor. Box plots (sometimes called box and whisker plots) are a more space-efficient alternative that make it easy to compare many distributions at once. You don’t get as much detail as with a histogram or kernel density plot, but simple higher-or-lower and narrower-or-wider comparisons can easily be made.

The base function for drawing box plots is called boxplot; it is heavily inspired by lattice, insofar as it uses a formula interface and has a data argument. Figure 14-38 shows the usage:

boxplot(Obama ~ Region, data = obama_vs_mccain)

Figure 14-38. A box plot using base graphics

This type of plot is often clearer if we reorder the box plots from smallest to largest, in some sense. The reorder function changes the order of a factor’s levels, based upon some numeric score. In Figure 14-39 we score the Region levels by the median Obama value for each region:

ovm <- within(
  obama_vs_mccain,
  Region <- reorder(Region, Obama, median)
)
boxplot(Obama ~ Region, data = ovm)

Figure 14-39. Ordering boxes using base graphics
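
To see exactly what reorder does to the factor levels, here is a minimal sketch on a small invented data frame (the group names and values are hypothetical):

```r
# Three made-up groups with two values each
scores <- data.frame(
  group = factor(c("a", "a", "b", "b", "c", "c")),
  value = c(10, 12, 1, 3, 5, 7)
)

levels(scores$group)   # "a" "b" "c": alphabetical by default

# Re-rank the levels by the median value within each group
scores$group <- reorder(scores$group, scores$value, median)

levels(scores$group)   # "b" "c" "a": medians are 2, 6, and 11
```

Only the level order changes; the data values themselves stay in their original rows.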

The switch from base to lattice is very straightforward. In this simplest case, we can make a straight swap of boxplot for bwplot (“bw” is short for “b (box) and w (whisker),” in case you hadn’t figured it out). Since we are plotting the reordered ovm data, notice the similarity of Figure 14-40 to Figure 14-39:

bwplot(Obama ~ Region, data = ovm)

Figure 14-40. A box plot using lattice graphics

The ggplot2 equivalent box plot, shown in Figure 14-41, just requires that we add a geom_boxplot:

ggplot(ovm, aes(Region, Obama)) +
  geom_boxplot()

Figure 14-41. A box plot using ggplot2 graphics

Bar Charts

Bar charts (a.k.a. bar plots) are the natural way of displaying numeric variables^[56] split by a categorical variable. In the next examples, we look at the distribution of religious identification across the US states. Data for Alaska and Hawaii are not included in the dataset, so we can remove those records:

ovm <- ovm[!(ovm$State %in% c("Alaska", "Hawaii")), ]

In base, bar charts are created with the barplot function. As with the plot function, there is no argument to specify a data frame, so we need to wrap it in a call to with. The first argument to barplot contains the lengths of the bars. If that is a named vector (which it won’t be if you are doing things properly and accessing data from inside a data frame), then those names are used for the bar labels. Otherwise, as we do here, you need to pass an argument called names.arg to specify the labels. By default the bars are vertical, but in order to make the state names readable we want horizontal bars, which can be generated with horiz = TRUE.
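
The two ways of labeling bars can be compared on a tiny made-up data frame (fruit is invented for illustration). This is only a sketch; pdf(NULL) is used so that the plots are drawn to a null device rather than to a window or file:

```r
# Toy data, invented for illustration
fruit <- data.frame(
  name  = c("apple", "banana", "cherry"),
  count = c(3, 7, 5)
)
pdf(NULL)  # null device, so nothing is written to disk
# Option 1: unnamed bar lengths, with the labels passed via names.arg
mids1 <- with(fruit, barplot(count, names.arg = name, horiz = TRUE))
# Option 2: a named vector, whose names become the bar labels
named_counts <- setNames(fruit$count, fruit$name)
mids2 <- barplot(named_counts, horiz = TRUE)
dev.off()
```

Both calls draw the same chart. barplot invisibly returns the positions of the bar midpoints, which is handy if you later want to add text or points on top of the bars.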

To display the state names in full, we also need to do some fiddling with the plot parameters, via the par function. For historical reasons, most of the parameter names are abbreviations rather than human-readable values, so the code can look quite terse. It’s a good idea to read the ?par help page before you modify a base plot.

The las parameter (short for “label axis style”) controls whether labels are horizontal, vertical, parallel, or perpendicular to the axes. Plots are usually more readable if you set las = 1, for horizontal. The mar parameter is a numeric vector of length 4, giving the width of the plot margins at the bottom/left/top/right of the plot. We want a really wide lefthand side to fit the state names. Figure 14-42 shows the output of the following code:

par(las = 1, mar = c(3, 9, 1, 1))
with(ovm, barplot(Catholic, names.arg = State, horiz = TRUE))
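
One thing worth knowing when changing plot parameters is that par returns the previous values of whatever it changes, so they can be restored once you are done. A minimal sketch (the variable names are arbitrary, and pdf(NULL) just provides a throwaway device):

```r
pdf(NULL)  # null device, so nothing is written to disk
# par() invisibly returns the old values of the parameters it sets
old_par <- par(las = 1, mar = c(3, 9, 1, 1))
las_during <- par("las")  # 1 while the new settings are active
# ... draw the plots that need the wide left margin here ...
par(old_par)              # restore the previous settings
las_after <- par("las")   # same as before the change
dev.off()
```

Restoring the old values stops a wide margin set up for one plot from leaking into every later plot on the same device.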

Simple bar charts like this are fine, but more interesting are bar charts of several variables at once. We can visualize the split of religions by state by plotting the Catholic, Protestant, Non.religious, and Other columns. For plotting multiple variables, we must place them into a matrix, one in each row (rbind is useful for this).

The column names of this matrix are used for the names of the bars; if there are no column names we must pass a names.arg like we did in the last example. With a matrix, barplot stacks the bars for each variable by default (beside = FALSE); passing beside = TRUE would instead draw them next to each other. Since we are examining the split between the variables, a stacked bar chart is the appropriate choice here, so the code states beside = FALSE explicitly, as illustrated in Figure 14-43:

religions <- with(ovm, rbind(Catholic, Protestant, Non.religious, Other))
colnames(religions) <- ovm$State
par(las = 1, mar = c(3, 9, 1, 1))
barplot(religions, horiz = TRUE, beside = FALSE)

Figure 14-42. A bar chart using base graphics

Figure 14-43. A stacked bar chart using base graphics
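
The shape of the matrix that barplot expects is easier to see on a toy example. The data frame here is made up for illustration; the point is that each row of the matrix is a variable and each column is a bar:

```r
# Toy data, invented for illustration
df <- data.frame(
  place = c("North", "South", "East"),
  a     = c(10, 20, 30),
  b     = c(40, 30, 20)
)
m <- with(df, rbind(a, b))  # one row per variable, one column per bar
colnames(m) <- df$place     # column names become the bar labels
dim(m)                      # 2 rows (variables) by 3 columns (bars)
pdf(NULL)                   # null device, so nothing is written to disk
mids_stacked <- barplot(m, beside = FALSE)  # one midpoint per stacked bar
mids_beside  <- barplot(m, beside = TRUE)   # one midpoint per individual bar
dev.off()
```

With beside = FALSE there are three bars, each split into two segments; with beside = TRUE there are six bars arranged in three groups of two.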

The lattice equivalent of barplot, shown in Figure 14-44, is barchart. The formula interface is the same as the one we saw with scatterplots, yvar ~ xvar:

barchart(State ~ Catholic, ovm)

Figure 14-44. A bar chart using lattice graphics

Extending this to multiple variables just requires a tweak to the formula, and passing stack = TRUE to make a stacked plot (see Figure 14-45):

barchart(
  State ~ Catholic + Protestant + Non.religious + Other,
  ovm,
  stack = TRUE
)

Figure 14-45. A stacked bar chart using lattice graphics

ggplot2 requires a tiny bit of work be done to the data to replicate this plot. We need the data in long form, so we must first melt the columns that we need (melt comes from the reshape2 package, which must be loaded first):

religions_long <- melt(
  ovm,
  id.vars = "State",
  measure.vars = c("Catholic", "Protestant", "Non.religious", "Other")
)
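
To make the wide-to-long step concrete without any extra packages, here is a rough base-R analogue on a made-up data frame. It uses stack rather than melt, so the output columns are called values and ind rather than value and variable; treat it as an illustration of the change in shape only:

```r
# Toy data, invented for illustration
wide <- data.frame(
  id = c("p", "q"),
  a  = c(1, 2),
  b  = c(3, 4)
)
# stack() turns the chosen columns into one values column plus an
# indicator column (ind) recording which column each value came from
long <- data.frame(
  id = rep(wide$id, times = 2),  # repeat the ids once per stacked column
  stack(wide[, c("a", "b")])
)
long
nrow(long)  # one row per id per stacked column
```

The long form has one row for every combination of identifier and measured variable, which is the layout that lets a single column be mapped to the fill aesthetic.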

Like base, ggplot2 defaults to vertical bars; adding coord_flip swaps this. Finally, since we already have the lengths of each bar in the dataset (without further calculation) we must pass stat = "identity" to the geom. Bars are stacked by default, as shown in Figure 14-46:

ggplot(religions_long, aes(State, value, fill = variable)) +
  geom_bar(stat = "identity") +
  coord_flip()

Figure 14-46. A stacked bar chart using ggplot2 graphics

To avoid the bars being stacked, we would have to pass the argument position = "dodge" to geom_bar. Figure 14-47 shows this:

ggplot(religions_long, aes(State, value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip()

Figure 14-47. A dodged bar chart using ggplot2 graphics

The other possibility for that argument is position = "fill", which creates stacked bars that are all the same height, ranging from 0 to 100%. Try it!
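
The idea behind position = "fill" is that the segments of each bar are rescaled to proportions of that bar's total. A base-R sketch of the same arithmetic on a made-up matrix (this is not how ggplot2 implements it, just the calculation it amounts to):

```r
# Toy matrix, invented for illustration: rows are variables, columns are bars
m <- rbind(a = c(10, 20), b = c(30, 20))
colnames(m) <- c("North", "South")
# margin = 2 rescales each column so that it sums to 1
props <- prop.table(m, margin = 2)
props
colSums(props)  # every bar now has the same total height
```

Because every column sums to 1, differences in the overall totals disappear and only the relative split within each bar remains visible.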

Other Plotting Packages and Systems

There are many packages that contain plotting capabilities based on one or more of the three systems. For example, the vcd package has lots of plots for visualizing categorical data, such as mosaic plots and association plots. plotrix has loads of extra plot types, and there are specialist plots scattered in many other packages.

latticeExtra and GGally extend the lattice and ggplot2 packages, and grid provides access to the underlying framework that supports both these systems.

You may have noticed that all the plots covered so far are static. There have in fact been a number of attempts to provide dynamic and interactive plots.^[57] There is no perfect solution yet, but there are many interesting and worthy packages that attempt this.

gridSVG lets you write grid-based plots (lattice or ggplot2) to SVG files. These can be made interactive, but it requires some knowledge of JavaScript. playwith allows pointing and clicking to interact with base or lattice plots. iplots provides a whole extra system of plots with even more interactivity. It isn’t easily extensible, but the common plot types are there, and you can explore the data very quickly via mouse interaction. googleVis provides an R wrapper around Google Chart Tools, creating plots that can be displayed in a browser. rggobi provides an interface to GGobi (for visualizing high-dimensional data), and rgl provides an interface to OpenGL for interactive 3D plots. The animation package lets you make animated GIFs or SWF animations.

The rCharts package provides wrappers to half a dozen JavaScript plotting libraries using a lattice syntax. It isn’t yet available via CRAN, so you’ll need to install it from GitHub:

library(devtools)
install_github("ramnathv/rCharts")  # repository given as "username/repo"

Summary

There are loads of summary statistics that can be calculated.
R has three plotting systems: base, lattice, and ggplot2.
All the common plot types are supported in every system.
There is some support in R for dynamic and interactive plotting.

Test Your Knowledge: Quiz

Question 14-1: What is the difference between the min and pmin functions?
Question 14-2: How would you change the shape of the points in a base plot?
Question 14-3: How do you specify the x and y variables in a lattice plot?
Question 14-4: What is a ggplot2 aesthetic?
Question 14-5: Name as many plot types as you can think of for exploring the distribution of a continuous variable.

Test Your Knowledge: Exercises

Exercise 14-1

In the obama_vs_mccain dataset, find the (Pearson) correlation between the percentage of unemployed people within the state and the percentage of people that voted for Obama. [5]
Draw a scatterplot of the two variables, using a graphics system of your choice. (For bonus points, use all three systems.) [10] for one plot, [30] for all three

Exercise 14-2

In the alpe_d_huez2 dataset, plot the distributions of fastest times, split by whether or not the rider (allegedly) used drugs. Display this using a) histograms and b) box plots. [10]

Exercise 14-3

The gonorrhoea dataset contains gonorrhoea infection rates in the US by year, age, ethnicity, and gender. Explore how infection rates change with age. Is there a time trend? Do ethnicity and gender affect the infection rates? [30]

^[49]There are several terms for this. Edward Tufte called the idea “small multiples” in Envisioning Information; Bill Cleveland and Rick Becker of Bell Labs coined the term “trellising”; Deepayan Sarkar renamed it “latticing” in the lattice package to avoid a Bell Labs trademark; and Leland Wilkinson named it “faceting,” a term that is used in ggplot2.

^[50]The concept was devised by Leland Wilkinson in the book of the same name. The book is brilliant, but not suitable for bedtime reading, being densely packed with equations.

^[51]Books on plotting have the advantage that even if you can’t be bothered to read them, they are pretty to flick through.

^[52]Predates as in “comes before,” not “hunts and eats.”

^[53]Read the ?points help page and try plot(1:25, pch = 1:25, bg = "blue") to see the different shapes.

^[54]Insert your own Australia/upside-down joke here.

^[55]Dataviz purists often note that kernel density plots generally give a “better” representation of the underlying distribution. The downside is that every time you show them to a non-statistician, you have to spend 15 minutes explaining what a kernel density plot is.

^[56]More specifically, they have to be counts or lengths or other numbers that can be compared to zero. For log-scaled things where the bar would extend to minus infinity, you want a dot plot instead.

^[57]Dynamic means animation; interactive means point and click to change them.
