Chapter 16. Advanced graphics

 

This chapter covers

  • Trellis graphs and the lattice package
  • The grammar of graphs via ggplot2
  • Interactive graphics

 

In previous chapters, we created a wide variety of both general and specialized graphs (and had lots of fun in the process). Most were produced using R’s base graphics system. Given the diversity of methods available in R, it may not surprise you to learn that there are actually four separate and complete graphics systems currently available.

In addition to base graphics, we have graphics systems provided by the grid, lattice, and ggplot2 packages. Each is designed to expand on the capabilities of, and correct for deficiencies in, R’s base graphics system.

The grid graphics system provides low-level access to graphic primitives, giving programmers a great deal of flexibility in the creation of graphic output. The lattice package provides an intuitive approach for examining multivariate relationships through conditional 1-, 2-, or 3-dimensional graphs called trellis graphs. The ggplot2 package provides a method of creating innovative graphs based on a comprehensive graphical “grammar.”

In this chapter, we’ll start with an overview of the four graphic systems. Then we’ll focus on graphs that can be generated with the lattice and ggplot2 packages. These packages greatly expand the range and quality of the graphs you can produce in R.

We’ll end the chapter by considering interactive graphics. Interacting with graphs in real time can help you understand your data more thoroughly and develop greater insights into the relationships among variables. Here, we’ll focus on the functionality offered by the iplots, playwith, latticist, and rggobi packages.

16.1. The four graphic systems in R

As stated earlier, four primary graphical systems are available in R. The base graphic system in R, written by Ross Ihaka, is included in every R installation. Most of the graphs produced in previous chapters rely on base graphics functions.

The grid graphics system, written by Paul Murrell (2006), is implemented through the grid package. Grid graphics offer a lower-level alternative to the standard graphics system. The user can create arbitrary rectangular regions on graphics devices, define coordinate systems for each region, and use a rich set of drawing primitives to control the arrangement and appearance of graphic elements.

This flexibility makes grid graphics a valuable tool for software developers. But the grid package doesn’t provide functions for producing statistical graphics or complete plots. Because of this, the package is rarely used directly by data analysts.

The lattice package, written by Deepayan Sarkar (2008), implements trellis graphics as outlined by Cleveland (1985, 1993) and described on the Trellis website (http://netlib.bell-labs.com/cm/ms/departments/sia/project/trellis/). Built using the grid package, the lattice package has grown beyond Cleveland’s original approach to visualizing multivariate data, and now provides a comprehensive alternative system for creating statistical graphics in R.

The ggplot2 package, written by Hadley Wickham (2009a), provides a system for creating graphs based on the grammar of graphics described by Wilkinson (2005) and expanded by Wickham (2009b). The intention of the ggplot2 package is to provide a comprehensive, grammar-based system for generating graphs in a unified and coherent manner, allowing users to create new and innovative data visualizations.

Access to the four systems differs, as outlined in table 16.1. Base graphic functions are automatically available. To access grid and lattice functions, you must load the package explicitly (for example, library(lattice)). To access ggplot2 functions, you have to download and install the package (install.packages("ggplot2")) before first use, and then load it (library(ggplot2)).

Table 16.1. Access to graphic systems

System     Included in base installation?    Must be explicitly loaded?
base       Yes                               No
grid       Yes                               Yes
lattice    Yes                               Yes
ggplot2    No                                Yes
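
The access rules in table 16.1 can be sketched as follows. This is only an illustration: the has_ggplot2 flag is a name introduced here, and the installation step is left as a comment so that the snippet has no side effects.

```r
# Base graphics functions such as plot() and hist() need no library() call.

# grid and lattice ship with R but must be loaded explicitly.
library(grid)
library(lattice)

# ggplot2 is a contributed package: install it once, then load it.
# install.packages("ggplot2")   # uncomment on first use
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)  # hypothetical flag name
if (has_ggplot2) {
  library(ggplot2)
} else {
  message("ggplot2 is not installed; run install.packages(\"ggplot2\") first.")
}
```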

Because our attention is primarily focused on practical data analyses, we won’t elaborate on the grid package in this chapter. (If you’re interested, refer to Dr. Murrell’s Grid website [www.stat.auckland.ac.nz/~paul/grid/grid.html] for details on this package.) Instead, we’ll explore the lattice and ggplot2 packages in some detail. Each allows you to create unique and useful graphs that aren’t easily created in other ways.

16.2. The lattice package

The lattice package provides a comprehensive graphical system for visualizing univariate and multivariate data. In particular, many users turn to the lattice package because of its ability to easily generate trellis graphs.

A trellis graph displays the distribution of a variable or the relationship between variables, separately for each level of one or more other variables. Consider the following question: How do the heights of singers in the New York Choral Society vary by their vocal parts?

Data on the heights and voice parts of choral members is provided in the singer dataset contained in the lattice package. In the following code

library(lattice)
histogram(~height | voice.part, data = singer,
    main="Distribution of Heights by Voice Pitch",
    xlab="Height (inches)")

height is the dependent variable, voice.part is called the conditioning variable, and a histogram is created for each of the eight voice parts. The graph is shown in figure 16.1. It appears that tenors and basses tend to be taller than altos and sopranos.

Figure 16.1. Trellis graph of singer heights by voice pitch

In trellis graphs, a separate panel is created for each level of the conditioning variable. If more than one conditioning variable is specified, a panel is created for each combination of factor levels. The panels are arranged into an array to facilitate comparisons. A label is provided for each panel in an area called the strip. As you’ll see, the user has control over the graph displayed in each panel, the format and placement of the strip, the arrangement of the panels, the placement and content of legends, and many other graphic features.

The lattice package provides a wide variety of functions for producing univariate (dot plots, kernel density plots, histograms, bar charts, box plots), bivariate (scatter plots, strip plots, parallel box plots), and multivariate (3D plots, scatter plot matrices) graphs.

Each high-level graphing function follows the format

graph_function(formula, data=, options)

where:

  • graph_function is one of the functions listed in the second column of table 16.2.
    Table 16.2. Graph types and corresponding functions in the lattice package

    Graph type                  Function                              Formula examples
    3D contour plot             contourplot()                         z~x*y
    3D level plot               levelplot()                           z~y*x
    3D scatter plot             cloud()                               z~x*y|A
    3D wireframe graph          wireframe()                           z~y*x
    Bar chart                   barchart()                            x~A or A~x
    Box plot                    bwplot()                              x~A or A~x
    Dot plot                    dotplot()                             ~x|A
    Histogram                   histogram()                           ~x
    Kernel density plot         densityplot()                         ~x|A*B
    Parallel coordinates plot   parallelplot() (formerly parallel())  dataframe
    Scatter plot                xyplot()                              y~x|A
    Scatter plot matrix         splom()                               dataframe
    Strip plots                 stripplot()                           A~x or x~A
    Note: In these formulas, lowercase letters represent numeric variables and uppercase letters represent categorical variables.
  • formula specifies the variable(s) to display and any conditioning variables.
  • data specifies a data frame.
  • options are comma-separated parameters used to modify the content, arrangement, and annotation of the graph. See table 16.3 for a description of common options.
    Table 16.3. Common options for lattice high-level graphing functions

    Options              Description
    aspect               A number specifying the aspect ratio (height/width) for the graph in each panel.
    col, pch, lty, lwd   Vectors specifying the colors, symbols, line types, and line widths to be used in plotting, respectively.
    groups               Grouping variable (factor).
    index.cond           List specifying the display order of the panels.
    key (or auto.key)    List used to supply legend(s) for grouping variable(s); auto.key also accepts TRUE.
    layout               Two-element numeric vector specifying the arrangement of the panels (number of columns, number of rows). If desired, a third element can be added to indicate the number of pages.
    main, sub            Character vectors specifying the main title and subtitle.
    panel                Function used to generate the graph in each panel.
    scales               List providing axis annotation information.
    strip                Function used to customize panel strips.
    split, position      Numeric vectors used to place more than one graph on a page.
    type                 Character vector specifying one or more plotting options for scatter plots (p=points, l=lines, r=regression line, smooth=loess fit, g=grid, and so on).
    xlab, ylab           Character vectors specifying horizontal and vertical axis labels.
    xlim, ylim           Two-element numeric vectors giving the minimum and maximum values for the horizontal and vertical axes, respectively.

Let lowercase letters represent numeric variables and uppercase letters represent categorical variables (factors). The formula in a high-level graphing function typically takes the form

y ~ x | A * B

where variables on the left side of the vertical bar are called the primary variables and variables on the right are the conditioning variables. Primary variables map variables to the axes in each panel. Here, y~x describes the variables to place on the vertical and horizontal axes, respectively. For single-variable plots, replace y~x with ~x. For 3D plots, replace y~x with z~x*y. Finally, for multivariate plots (scatter plot matrix or parallel coordinates plot) replace y~x with a data frame. Note that conditioning variables are always optional.

Following this logic, ~x|A displays numeric variable x for each level of factor A. y~x|A*B displays the relationship between numeric variables y and x separately for every combination of factor A and B levels. A~x displays categorical variable A on the vertical axis and numeric variable x on the horizontal axis. ~x displays numeric variable x alone. Other examples are shown in table 16.2.

To gain a quick overview of lattice graphs, try running the code in listing 16.1. The graphs are based on the automotive data (mileage, weight, number of gears, number of cylinders, and so on) included in the mtcars data frame. You may want to vary the formulas and view the results. (The resulting output has been omitted to save space.)

Listing 16.1. lattice plot examples
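
As a minimal sketch of the kind of calls such a listing could contain, the code below applies several formulas from table 16.2 to mtcars. The copy named cars, the derived factors gear_f and cyl_f, the object names, and the titles are all assumptions chosen for illustration.

```r
library(lattice)

# Work on a copy so the original mtcars data frame is left untouched.
cars <- mtcars
cars$gear_f <- factor(cars$gear, levels = c(3, 4, 5),
                      labels = c("3 gears", "4 gears", "5 gears"))
cars$cyl_f  <- factor(cars$cyl, levels = c(4, 6, 8),
                      labels = c("4 cylinders", "6 cylinders", "8 cylinders"))

# Each call returns a trellis object; printing it draws the graph.
p_density <- densityplot(~ mpg | cyl_f, data = cars,
                         main = "Density Plot by Number of Cylinders",
                         xlab = "Miles per Gallon")
p_box     <- bwplot(cyl_f ~ mpg | gear_f, data = cars,
                    main = "Box Plots by Cylinders and Gears",
                    xlab = "Miles per Gallon", ylab = "Cylinders")
p_scatter <- xyplot(mpg ~ wt | cyl_f * gear_f, data = cars,
                    main = "Scatter Plots by Cylinders and Gears",
                    xlab = "Car Weight", ylab = "Miles per Gallon")
p_cloud   <- cloud(mpg ~ wt * qsec | cyl_f, data = cars,
                   main = "3D Scatter Plots by Cylinders")
p_splom   <- splom(cars[c("mpg", "disp", "hp", "drat", "wt")],
                   main = "Scatter Plot Matrix for mtcars Data")

print(p_density)   # repeat for p_box, p_scatter, p_cloud, and p_splom
```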

High-level plotting functions in the lattice package produce graphic objects that can be saved and manipulated. For example,

library(lattice)
mygraph <- densityplot(~height|voice.part, data=singer)

creates a trellis density plot and saves it as object mygraph. But no graph is displayed. Issuing the statement plot(mygraph) (or simply mygraph) will display the graph.

It’s easy to modify lattice graphs through the use of options. Common options are given in table 16.3. You’ll see examples of many of these later in the chapter.

You can issue these options in the high-level function calls or within the panel functions discussed in section 16.2.2.

You can also use the update() function to modify a lattice graphic object. Continuing the singer example, the following

update(mygraph, col="red", pch=16, cex=.8, jitter=.05, lwd=2)

would redraw the graph using red curves and symbols (col="red"), filled dots (pch=16), smaller (cex=.8) and more highly jittered points (jitter=.05), and curves of double thickness (lwd=2). Now that we’ve reviewed the general structure of a high-level lattice function, let’s look at conditioning variables in more detail.
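
The save-then-modify workflow can be sketched as follows. The object name newgraph and the title text are assumptions; the point is that update() returns a modified copy rather than changing mygraph in place.

```r
library(lattice)

mygraph <- densityplot(~ height | voice.part, data = singer)
class(mygraph)          # a "trellis" object; nothing has been drawn yet

# update() returns a modified copy; the original object is unchanged.
newgraph <- update(mygraph,
                   main = "Singer Heights by Voice Part",
                   xlab = "Height (inches)",
                   layout = c(4, 2))      # 4 columns, 2 rows

print(mygraph)          # draws the original version
print(newgraph)         # draws the retitled, rearranged version
```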

16.2.1. Conditioning variables

As you’ve seen, one of the most powerful features of lattice graphs is the ability to add conditioning variables. If one conditioning variable is present, a separate panel is created for each level. If two conditioning variables are present, a separate panel is created for each combination of levels for the two variables. It’s rarely useful to include more than two conditioning variables.

Typically, conditioning variables are factors. But what if you want to condition on a continuous variable? One approach would be to transform the continuous variable into a discrete variable using R’s cut() function. Alternatively, the lattice package provides functions for transforming a continuous variable into a data structure called a shingle. Specifically, the continuous variable is divided up into a series of (possibly) overlapping ranges. For example, the function

myshingle <- equal.count(x, number=#, overlap=proportion)

will take continuous variable x and divide it up into # intervals, with proportion overlap, and equal numbers of observations in each range, and return it as the variable myshingle (of class shingle). Printing or plotting this object (for example, plot(myshingle)) will display the shingle’s intervals.
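
The two approaches can be compared in a short sketch. Horsepower (mtcars$hp) is used here only as an example variable, and the names hp_factor and hp_shingle are assumptions.

```r
library(lattice)

# Approach 1: cut() produces a factor with non-overlapping bins of equal width.
hp_factor <- cut(mtcars$hp, breaks = 3)
table(hp_factor)

# Approach 2: equal.count() produces a shingle whose ranges hold roughly
# equal numbers of observations and may overlap (here by about 10 percent).
hp_shingle <- equal.count(mtcars$hp, number = 3, overlap = 0.1)
class(hp_shingle)
levels(hp_shingle)      # lower and upper limits of each range
plot(hp_shingle)        # displays the ranges graphically
```

Either object can then appear to the right of the vertical bar in a lattice formula.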

Once a continuous variable has been converted to a shingle, you can use it as a conditioning variable. For example, let’s use the mtcars dataset to explore the relationship between miles per gallon and car weight conditioned on engine displacement. Because engine displacement is a continuous variable, first let’s convert it to a shingle variable with three levels:

displacement <- equal.count(mtcars$disp, number=3, overlap=0)

Next, use this variable in the xyplot() function:

xyplot(mpg~wt|displacement, data=mtcars,
   main = "Miles per Gallon vs. Weight by Engine Displacement",
   xlab = "Weight", ylab = "Miles per Gallon",
   layout=c(3, 1), aspect=1.5)

The results are shown in figure 16.2. Note that we’ve also used options to modify the layout of the panels (three columns and one row) and the aspect ratio (height/width) in order to make comparisons among the three groups easier.

Figure 16.2. Trellis plot of mpg versus car weight conditioned on engine displacement. Because engine displacement is a continuous variable, it has been converted to three nonoverlapping shingles with equal numbers of observations.

You can see that the labels in the panel strips of figure 16.1 and figure 16.2 differ. The representation in figure 16.2 indicates the continuous nature of the conditioning variable, with the darker color indicating the range of values for the conditioning variable in the given panel. In the next section, we’ll use panel functions to customize the output further.

16.2.2. Panel functions

Each of the high-level plotting functions in table 16.2 employs a default function to draw the panels. These default functions follow the naming convention panel.graph_function, where graph_function is the high-level function. For example,

xyplot(mpg~wt|displacement, data=mtcars)

could also have been written as

xyplot(mpg~wt|displacement, data=mtcars, panel=panel.xyplot)

This is a powerful feature because it allows you to replace the default panel function with a customized function of your own design. You can incorporate one or more of the 50+ default panel functions in the lattice package into your customized function as well. Customized panel functions give you a great deal of flexibility in designing an output that meets your needs. Let’s look at some examples.

In the previous section, you plotted gas mileage by automobile weight, conditioned on engine displacement. What if you wanted to include regression lines, rug plots, and grid lines? You can do this by creating your own panel function (see the following listing). The resulting graph is provided in figure 16.3.

Figure 16.3. Trellis plot of mpg versus car weight conditioned on engine displacement. A custom panel function has been used to add regression lines, rug plots, and grid lines.

Listing 16.2. xyplot with custom panel function

Here, we’ve wrapped four separate building block functions into our own mypanel() function and applied it within xyplot() through the panel= option. The panel.xyplot() function generates the scatter plot using a filled circle (pch=19). The panel.rug() function adds rug plots to both the x and y axes of each panel. panel.rug(x, FALSE) or panel.rug(FALSE, y) would have added rugs to just the horizontal or vertical axis, respectively. The panel.grid() function adds horizontal and vertical grid lines (using negative numbers forces them to line up with the axis labels). Finally, the panel.lmline() function adds a regression line that’s rendered as red (col="red"), dashed (lty=2) lines, of standard thickness (lwd=1). Each default panel function has its own structure and options. See the help page on each (for example, help(panel.abline)) for further details.
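
A panel function matching this description might look like the sketch below. It is reconstructed from the prose rather than copied from the listing, so the titles and the object name p are assumptions.

```r
library(lattice)

# Three non-overlapping shingles with equal numbers of observations.
displacement <- equal.count(mtcars$disp, number = 3, overlap = 0)

# Custom panel function built from four lattice building blocks.
mypanel <- function(x, y) {
  panel.xyplot(x, y, pch = 19)                        # filled-circle scatter plot
  panel.rug(x, y)                                     # rugs on both axes
  panel.grid(h = -1, v = -1)                          # grid aligned with axis labels
  panel.lmline(x, y, col = "red", lwd = 1, lty = 2)   # dashed red regression line
}

p <- xyplot(mpg ~ wt | displacement, data = mtcars,
            layout = c(3, 1), aspect = 1.5,
            main = "Miles per Gallon vs. Weight by Engine Displacement",
            xlab = "Weight", ylab = "Miles per Gallon",
            panel = mypanel)
print(p)
```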

As a second example, we’ll graph the relationship between gas mileage and engine displacement (considered as a continuous variable), conditioned on type of automobile transmission. In addition to creating separate panels for automatic and manual transmission engines, we’ll add smoothed fit lines and horizontal mean lines. The code is given in the following listing.

Listing 16.3. xyplot with custom panel function and additional options
library(lattice)
mtcars$transmission <- factor(mtcars$am, levels=c(0,1),
                              labels=c("Automatic",  "Manual"))

panel.smoother <- function(x, y) {
                    panel.grid(h=-1, v=-1)
                    panel.xyplot(x, y)
                    panel.loess(x, y)
                    panel.abline(h=mean(y), lwd=2, lty=2, col="green")
                  }

xyplot(mpg~disp|transmission,data=mtcars,
       scales=list(cex=.8, col="red"),
       panel=panel.smoother,
       xlab="Displacement", ylab="Miles per Gallon",
       main="MPG vs Displacement by Transmission Type",
       sub = "Dotted lines are Group Means", aspect=1)

The graph produced by this code is provided in figure 16.4.

Figure 16.4. Trellis graph of mpg versus engine displacement conditioned on transmission type. Smoothed lines (loess), grids, and group mean levels have been added.

There are several things to note in this new code. The panel.xyplot() function plots the individual points, and the panel.loess() function plots nonparametric fit lines in each panel. The panel.abline() function adds horizontal reference lines at the mean mpg value for each level of the conditioning variable. (If we had replaced h=mean(y) with h=mean(mtcars$mpg), a single reference line would have been drawn at the mean mpg value for the entire sample.) The scales= option renders scale annotations in red and at 80 percent of their default size.

In the previous example, we could have used scales=list(x=list(), y=list()) to specify separate options for the horizontal and vertical axes. See help(xyplot) for details on the many scale options available. In the next section, you’ll learn how to superimpose data from groups of observations, rather than presenting them in separate panels.
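
As a hedged illustration of per-axis scale settings, the sketch below gives the horizontal and vertical axes different annotation options. The particular values (rotation, colors, tick count) and the copy named cars are assumptions chosen only to make the difference visible.

```r
library(lattice)

cars <- mtcars
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("Automatic", "Manual"))

p <- xyplot(mpg ~ disp | transmission, data = cars,
            scales = list(
              x = list(cex = 0.8, col = "red", rot = 45),          # horizontal axis
              y = list(cex = 1.0, col = "blue", tick.number = 4)), # vertical axis
            xlab = "Displacement", ylab = "Miles per Gallon")
print(p)
```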

16.2.3. Grouping variables

When you include a conditioning variable in a lattice graph formula, a separate panel is produced for each level of that variable. If you want to superimpose the results for each level instead, you can specify the variable as a group variable.

Let’s say that you want to display the distribution of gas mileage for cars with manual and automatic transmissions using kernel density plots. You can superimpose these plots using this code:

library(lattice)
mtcars$transmission <- factor(mtcars$am, levels=c(0, 1),
                              labels=c("Automatic", "Manual"))
densityplot(~mpg, data=mtcars,
            group=transmission,
            main="MPG Distribution by Transmission Type",
            xlab="Miles per Gallon",
            auto.key=TRUE)

The resulting graph is presented in figure 16.5. By default, the group= option superimposes the plots from each level of the grouping variable. Points are plotted as open circles, lines are solid, and level information is distinguished by color. As you can see, the colors are difficult to differentiate when printed in grayscale. Later you’ll see how to change these defaults.

Figure 16.5. Kernel density plots for miles per gallon grouped by transmission type. Jittered points are provided on the horizontal axis.

Note that legends or keys aren’t produced by default. The option auto.key=TRUE will create a rudimentary legend and place it above the graph. You can make limited changes to this automated key by specifying options in a list. For example,

auto.key=list(space="right", columns=1, title="Transmission")

would move the legend to the right of the graph, present the key values in a single column, and add a legend title.

If you want to exert greater control over the legend, you can use the key= option. An example is given in listing 16.4. The resulting graph is provided in figure 16.6.

Figure 16.6. Kernel density plots for miles per gallon grouped by transmission type. Graphical parameters have been modified and a customized legend has been added. The custom legend specifies color, shape, line type, character size, and title.

Listing 16.4. Kernel density plot with a group variable and customized legend

Here, the plotting symbols, line types, and colors are specified as vectors. The first element of each vector will be applied to the first level of the group variable, the second element to the second level, and so forth. A list object is created to hold the legend options. These options place the legend below the graph in two columns, and include the level names, point symbols, line types, and colors. The legend title is rendered slightly larger than the text for the symbols.

The same plot symbols, line types, and colors are specified within the densityplot() function. Additionally, the line width and jitter are increased to improve the appearance of the graph. Finally, the key is set to use the previously defined list. This approach to specifying a legend for the grouping variable allows a great deal of flexibility. In fact, you can create more than one legend and place them in different areas of the graph (not shown here).
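
A sketch consistent with this description follows. It is an illustration rather than the listing itself: the specific colors, symbols, sizes, jitter amount, and the names cars, key_colors, key_ltys, key_pchs, and key.trans are assumptions.

```r
library(lattice)

cars <- mtcars
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("Automatic", "Manual"))

# One element per level of the group variable (Automatic, Manual).
key_colors <- c("red", "blue")
key_ltys   <- c(1, 2)
key_pchs   <- c(16, 17)

# Legend options: below the graph, two columns, title slightly larger than text.
key.trans <- list(title = "Transmission",
                  space = "bottom", columns = 2,
                  text = list(levels(cars$transmission)),
                  points = list(pch = key_pchs, col = key_colors),
                  lines = list(col = key_colors, lty = key_ltys),
                  cex.title = 1, cex = 0.9)

p <- densityplot(~ mpg, data = cars,
                 groups = transmission,
                 main = "MPG Distribution by Transmission Type",
                 xlab = "Miles per Gallon",
                 pch = key_pchs, lty = key_ltys, col = key_colors,
                 lwd = 2, jitter.amount = 0.005,
                 key = key.trans)
print(p)
```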

Before completing this section, let’s consider an example that includes group and conditioning variables in a single plot. The CO2 data frame, included with the base R installation, describes a study of cold tolerance of the grass species Echinochloa crus-galli.

The data describe carbon dioxide uptake rates (uptake) for 12 plants (Plant), at seven ambient carbon dioxide concentrations (conc). Six plants were from Quebec and six plants were from Mississippi. Three plants from each location were studied under chilled conditions and three plants were studied under nonchilled conditions. In this example, Plant is the group variable and both Type (Quebec/Mississippi) and Treatment (chilled/nonchilled) are conditioning variables. The following code produces the plot in figure 16.7.

Figure 16.7. xyplot showing the impact of ambient carbon dioxide concentrations on carbon dioxide uptake for 12 plants in two treatment conditions and two types. Plant is the group variable and Treatment and Type are the conditioning variables.

Listing 16.5. xyplot with group and conditioning variables and customized legend
library(lattice)
colors <- "darkgreen"
symbols <- c(1:12)
linetype <- c(1:3)

key.species <- list(title="Plant",
                    space="right",
                    text=list(levels(CO2$Plant)),
                    points=list(pch=symbols, col=colors))

xyplot(uptake~conc|Type*Treatment, data=CO2,
       group=Plant,
       type="o",
       pch=symbols, col=colors, lty=linetype,
       main="Carbon Dioxide Uptake\nin Grass Plants",
       ylab=expression(paste("Uptake ",
              bgroup("(", italic(frac("umol","m"^2)), ")"))),
       xlab=expression(paste("Concentration ",
               bgroup("(", italic(frac(mL,L)), ")"))),
       sub = "Grass Species: Echinochloa crus-galli",
       key=key.species)

Note the use of \n to give you a two-line title and the use of the expression() function to add mathematical notation to the axis labels. Here, color is suppressed as a group differentiator by specifying a single color in the col= option. In this case, adding 12 different colors is overkill and distracts from the goal of easily visualizing the relationships in each panel. Clearly, there’s something different about the Mississippi grasses in the chilled condition.

Up to this point, you’ve been modifying graphic elements in your charts through options passed to either the high-level graph function (for example, xyplot(pch=17)) or within the panel functions that they use (for example, panel.xyplot(pch=17)). But such changes are in effect only for the duration of the function call. In the next section, we’ll review a method for changing graphical parameters that persists for the duration of the interactive session or batch execution.

16.2.4. Graphic parameters

In chapter 3, you learned how to view and set default graphics parameters using the par() function. Although this works for graphs produced with R’s native graphic system, lattice graphs are unaffected by these settings. Instead, the graphic defaults used by lattice functions are contained in a large list object that can be accessed with the trellis.par.get() function and modified through the trellis.par.set() function. The show.settings() function can be used to display the current graphic settings visually.

As an example, let’s change the default symbol used for superimposed points (that is, points in a graph that includes a group variable). The default is an open circle. We’ll give each group their own symbol instead.

First, view the current defaults and save them into a list called mysettings:

> show.settings()
> mysettings <- trellis.par.get()

Next, look at the defaults that are specific to superimposed symbols:

> mysettings$superpose.symbol
$alpha
[1] 1 1 1 1 1 1 1

$cex
[1] 0.8 0.8 0.8 0.8 0.8 0.8 0.8

$col
[1] "#0080ff"   "#ff00ff"   "darkgreen" "#ff0000"   "orange"    "#00ff00"
[7] "brown"

$fill
[1] "#CCFFFF" "#FFCCFF" "#CCFFCC" "#FFE5CC" "#CCE6FF" "#FFFFCC" "#FFCCCC"

$font
[1] 1 1 1 1 1 1 1

$pch
[1] 1 1 1 1 1 1 1

Here you see that the symbol used for each level of a group variable is an open circle (pch=1). Seven levels are defined, after which symbols recycle.

Finally, issue the following statements:

mysettings$superpose.symbol$pch <- c(1:10)
trellis.par.set(mysettings)
show.settings()

Lattice graphs now use symbol 1 (open circle) for the first level of a group variable, symbol 2 (open triangle) for the second, and so on. Additionally, symbols have been defined for 10 levels of a grouping variable, rather than 7. The changes will remain in effect until all graphic devices are closed. You can change any graphic setting in this manner.
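
The same change can be made for a single component and then undone, as in the sketch below. It assumes the name=value form of trellis.par.set(); the names old, new_pch, and p are introduced here for illustration.

```r
library(lattice)

# Keep a copy of the current settings for superimposed symbols.
old <- trellis.par.get("superpose.symbol")

# Change only the plotting symbols, using the name = value form.
trellis.par.set(superpose.symbol = list(pch = 1:10))
new_pch <- trellis.par.get("superpose.symbol")$pch
new_pch                                   # now symbols 1 through 10

# Grouped plots pick up the new symbols automatically.
p <- xyplot(mpg ~ wt, data = mtcars, groups = factor(cyl), auto.key = TRUE)
print(p)

# Restore the saved settings when finished.
trellis.par.set(superpose.symbol = old)
```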

16.2.5. Page arrangement

In chapter 3 you learned how to place more than one graph on a page using the par() function. Because lattice functions don’t recognize par() settings, you’ll need a different approach. The easiest method involves saving your lattice graphs as objects, and using the plot() function with either the split= or position= option specified.

The split option divides a page up into a specified number of rows and columns and places graphs into designated cells of the resulting matrix. The format for the split option is

split=c(placement column, placement row,
    total number of columns, total number of rows)

For example, the following code

library(lattice)
graph1 <- histogram(~height|voice.part, data=singer,
                    main="Heights of Choral Singers by Voice Part")
graph2 <- densityplot(~height, data=singer, group=voice.part,
                      plot.points=FALSE, auto.key=list(columns=4))
plot(graph1, split=c(1, 1, 1, 2))
plot(graph2, split=c(1, 2, 1, 2), newpage=FALSE)

places the first graph directly above the second graph. Specifically, the first plot() statement divides the page up into one column and two rows and places the graph in the first column and first row (counting top-down and left-right). The second plot() statement divides the page up in the same way, but places the graph in the first column and second row. Because the plot() function starts a new page by default, you suppress this action by including the newpage=FALSE option. (I’ve omitted the graph to save space.)

You can gain more control of sizing and placement by using the position= option. Consider the following code:

library(lattice)
graph1 <- histogram(~height|voice.part, data=singer,
                    main="Heights of Choral Singers by Voice Part")
graph2 <- densityplot(~height, data=singer, group=voice.part,
                      plot.points=FALSE, auto.key=list(columns=4))
plot(graph1, position=c(0, .3, 1, 1))
plot(graph2, position=c(0, 0, 1, .3), newpage=FALSE)

Here, position=c(xmin, ymin, xmax, ymax), where the x-y coordinate system for the page is a rectangle with dimensions ranging from 0 to 1 on both the x and y axes, and the origin (0,0) at the bottom left. (Again, the resulting graph is omitted to save space.)
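
For more than two graphs, the split= values can be computed in a loop. The sketch below fills a 2 x 2 page row by row; it passes the cell as column first, then row, followed by the number of columns and rows. The plot choices and the names plots, n_col, n_row, cell_col, and cell_row are assumptions.

```r
library(lattice)

plots <- list(
  histogram(~ mpg, data = mtcars, main = "Histogram"),
  densityplot(~ mpg, data = mtcars, main = "Density"),
  bwplot(~ mpg, data = mtcars, main = "Box plot"),
  xyplot(mpg ~ wt, data = mtcars, main = "Scatter plot"))

n_col <- 2
n_row <- 2
for (i in seq_along(plots)) {
  cell_col <- (i - 1) %% n_col + 1     # 1, 2, 1, 2
  cell_row <- (i - 1) %/% n_col + 1    # 1, 1, 2, 2
  # more = TRUE keeps the page open for the next graph.
  print(plots[[i]],
        split = c(cell_col, cell_row, n_col, n_row),
        more = i < length(plots))
}
```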

You can also change the order of the panels in a lattice graph. The index.cond= option in a high-level lattice graph function specifies the order of the conditioning variable levels. For the voice.part factor, the levels are

> levels(singer$voice.part)
[1] "Bass 2"    "Bass 1"    "Tenor 2"   "Tenor 1"   "Alto 2"    "Alto 1"
[7] "Soprano 2" "Soprano 1"

Adding index.cond=list(c(2, 4, 6, 8, 1, 3, 5, 7)) would place the "1" voice parts together, followed by "2" voice parts. When there are two conditioning variables, include two vectors in the list. In listing 16.5, adding index.cond=list(c(1, 2), c(2, 1)) would reverse the order of treatments in figure 16.7.
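
A brief sketch of the reordering for the singer data follows; the layout and title are assumptions added so that each row of panels holds one group of voice parts.

```r
library(lattice)

levels(singer$voice.part)    # default panel order follows these levels

# Panels for the "1" voice parts first, then the "2" voice parts.
p <- histogram(~ height | voice.part, data = singer,
               index.cond = list(c(2, 4, 6, 8, 1, 3, 5, 7)),
               layout = c(4, 2),
               main = "Heights of Choral Singers by Voice Part",
               xlab = "Height (inches)")
print(p)
```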

To learn more about lattice graphs, take a look at the excellent text by Sarkar (2008) and its supporting website at http://lmdvr.r-forge.r-project.org. The Trellis Graphics User’s Manual (http://cm.bell-labs.com/cm/ms/departments/sia/doc/trellis.user.pdf) is also an excellent source of information.

In the next section, we’ll explore a second comprehensive alternative to R’s native graphic system. This one is based on the ggplot2 package.

16.3. The ggplot2 package

The ggplot2 package implements a system for creating graphics in R based on a comprehensive and coherent grammar. This provides a consistency to graph creation often lacking in R, and allows the user to create graph types that are innovative and novel.

The simplest approach for creating graphs in ggplot2 is through the qplot() or quick plot function. The format is

qplot(x, y, data=, color=, shape=, size=, alpha=, geom=, method=, formula=,
    facets=, xlim=, ylim=, xlab=, ylab=, main=, sub=)

where the parameters/options are defined in table 16.4.

Table 16.4. qplot options

Option

Description

alpha Alpha transparency for overlapping elements expressed as a fraction between 0 (complete transparency) and 1 (complete opacity).
color,shape, size, fill Associates the levels of variable with symbol color, shape, or size. For line plots, color associates levels of a variable with line color. For density and box plots, fill associates fill colors with a variable. Legends are drawn automatically.
data Specifies a data frame.
facets Creates a trellis graph by specifying conditioning variables. Its value is expressed as rowvar ~ colvar (see the example in figure 16.10). To create trellis graphs based on a single conditioning variable, use rowvar-. or .-colvar.
geom Specifies the geometric objects that define the graph type. The geom option is expressed as a character vector with one or more entries. geom values include "point", "smooth", "boxplot", "line", "histogram", "density", "bar", and "jitter".
main, sub Character vectors specifying the title and subtitle.
method, formula If geom="smooth", a loess fit line and confidence limits are added by default. When the number of observations is greater than 1,000, a more efficient smoothing algorithm is employed. Methods include "lm" for regression, "gam" for generalized additive models, and "rlm" for robust regression. The formula parameter gives the form of the fit. For example, to add simple linear regression lines, you’d specify geom="smooth", method="lm", formula=y-x. Changing the formula to y-poly(x,2) would produce a quadratic fit. Note that the formula uses the letters x and y, not the names of the variables. For method="gam", be sure to load the mgcv package. For method="rml", load the MASS package.
x, y Specifies the variables placed on the horizontal and vertical axis. For univariate plots (for example, histograms), omit y.
xlab, ylab Character vectors specifying horizontal and vertical axis labels.
xlim, ylim Two-element numeric vectors giving the minimum and maximum values for the horizontal and vertical axes, respectively.
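
As a quick illustration of several options in table 16.4, the following sketch draws overlapping density curves of gas mileage, one per cylinder count. It omits y (a univariate plot), maps fill to a factor, and sets alpha to a constant by wrapping it in I() so that it isn't treated as a variable. The cars copy and its cylinder column are names introduced here for illustration only. Recent ggplot2 releases mark qplot() as deprecated, so you may see a warning when running it.

```r
library(ggplot2)

# Work on a copy so the built-in mtcars data frame is left untouched
# ("cars" and "cylinder" are illustrative names, not from the text).
cars <- mtcars
cars$cylinder <- factor(cars$cyl)

# Univariate plot: y is omitted, fill varies by cylinder count,
# and I(0.5) sets a constant 50% transparency for every curve.
p <- qplot(mpg, data = cars, geom = "density",
           fill = cylinder, alpha = I(0.5),
           xlab = "Miles per Gallon",
           main = "Density of gas mileage by number of cylinders")
print(p)
```

Because the curves overlap, the alpha setting is what keeps all three distributions visible at once.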

To see how qplot() works, let’s review some examples. The following code creates box plots of gas mileage by number of cylinders. The actual data points are superimposed (and jittered to reduce overlap). Box plot colors vary by number of cylinders.

library(ggplot2)
mtcars$cylinder <- as.factor(mtcars$cyl)
qplot(cylinder, mpg, data=mtcars, geom=c("boxplot", "jitter"),
      fill=cylinder,
      main="Box plots with superimposed data points",
      xlab="Number of Cylinders",
      ylab="Miles per Gallon")

The graph is displayed in figure 16.8.

Figure 16.8. Box plots of auto mileage by number of cylinders. Data points are superimposed and jittered.

As a second example, let’s create a scatter plot of gas mileage by car weight and use color and symbol shape to differentiate cars with automatic transmissions from those with manual transmissions. Additionally, we’ll add separate regression lines and confidence bands for each transmission type.

library(ggplot2)
transmission <- factor(mtcars$am, levels=c(0, 1),
                       labels=c("Automatic",  "Manual"))
qplot(wt,mpg, data=mtcars,
      color=transmission, shape=transmission,
      geom=c("point", "smooth"),
      method="lm", formula=y~x,
      xlab="Weight", ylab="Miles Per Gallon",
      main="Regression Example")

The resulting graph is provided in figure 16.9. This is a useful type of graph, not easily created using other packages.

Figure 16.9. Scatter plot between auto mileage and car weight, with separate regression lines and confidence bands by engine transmission type (manual, automatic)
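
Depending on your ggplot2 version, qplot() may reject or ignore the method= and formula= arguments. If that happens, a graph like figure 16.9 can be approximated with the full ggplot() interface. The following is a minimal sketch under that assumption, not the code used to produce the figure; the cars copy is a name introduced here for illustration.

```r
library(ggplot2)

# Work on a copy so the built-in mtcars data frame is left unchanged
# ("cars" is an illustrative name).
cars <- mtcars
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("Automatic", "Manual"))

# Color and shape are mapped once in ggplot(), so both layers share them.
p <- ggplot(cars, aes(x = wt, y = mpg,
                      color = transmission, shape = transmission)) +
  geom_point() +
  # One linear fit and confidence band per transmission type
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Weight", y = "Miles Per Gallon", title = "Regression Example")
print(p)
```

Because transmission is a factor mapped to color, ggplot2 groups the data by it, which is what yields a separate regression line for each transmission type.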

As a third example, we’ll create a faceted (trellis) graph. Each facet (panel) displays the scatter plot between gas mileage and car weight. Row facets are defined by the transmission type, whereas column facets are defined by the number of cylinders present. The size of each data point represents the car’s horsepower rating.

library(ggplot2)
mtcars$cyl <- factor(mtcars$cyl, levels=c(4, 6, 8),
                     labels=c("4 cylinders", "6 cylinders", "8 cylinders"))
mtcars$am <- factor(mtcars$am, levels=c(0, 1),
                    labels=c("Automatic",  "Manual"))
qplot(wt,mpg, data=mtcars, facets=am~cyl, size=hp)

The graph is displayed in figure 16.10. Note how simple it is to create a complex graph (actually a bubble chart). You may want to try adding shape and color options to the function call and see how the resulting graph is affected.

Figure 16.10. Scatter plot between auto mileage and car weight, faceted by transmission type (manual, automatic) and number of cylinders (4, 6, or 8). Symbol size represents horsepower.
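
One way to carry out the suggested experiment is sketched below, using ggplot() with facet_grid() in place of the facets= option. Mapping color and shape to the number of gears is an arbitrary choice made here for illustration (the text doesn't name a variable), and cars is a hypothetical copy of mtcars.

```r
library(ggplot2)

# Illustrative copy of mtcars with labeled factors for the facets
cars <- mtcars
cars$cyl <- factor(cars$cyl, levels = c(4, 6, 8),
                   labels = c("4 cylinders", "6 cylinders", "8 cylinders"))
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))
cars$gear <- factor(cars$gear)   # assumption: gears chosen for color/shape

# Rows are transmission types, columns are cylinder counts;
# point size still represents horsepower.
p <- ggplot(cars, aes(x = wt, y = mpg, size = hp,
                      color = gear, shape = gear)) +
  geom_point() +
  facet_grid(am ~ cyl)
print(p)
```

Each added aesthetic produces its own legend entry, so the graph now encodes five variables in addition to the two on the axes.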

We’ll end this section by revisiting the singer data with which we began the chapter. This code produces the graph in figure 16.11:

library(ggplot2)
data(singer, package="lattice")
qplot(height, data=singer, geom=c("density"),
      facets=voice.part~., fill=voice.part)

Figure 16.11. Faceted density plots for singer heights by voice part

Comparing the distribution of heights is easier in this format than in the format presented in figure 16.1. (Once again, this looks better when displayed in color.)

We’ve only scratched the surface of this powerful graphical system. Interested readers are referred to Wickham (2009) and the ggplot2 website (http://had.co.nz/ggplot2/) for more information. We’ll end this chapter with a review of interactive graphics and R functions that support them.

16.4. Interactive graphs

The base installation of R provides limited interactivity with graphs. You can modify graphs by issuing additional program statements, but there’s little that you can do to change them or gather new information from them using the mouse. However, there are contributed packages that greatly enhance your ability to interact with the graphs you create. In this section, we’ll focus on functions provided by the playwith, latticist, iplots, and rggobi packages. Be sure to install them before first use.

16.4.1. Interacting with graphs: identifying points

Before getting to the specialized packages, let’s review a function in the base R installation that allows you to identify and label points in scatter plots. Using the identify() function, you can label selected points in a scatter plot with their row number or row name using your mouse. Identification continues until you select Stop or right-click on the graph. For example, after issuing the following statements

plot(mtcars$wt, mtcars$mpg)
identify(mtcars$wt, mtcars$mpg, labels=row.names(mtcars))

the cursor will change from a pointer to a crosshair. Clicking on scatter plot points will label them until you select Stop from the Graphics Device menu or right-click on the graph and select Stop from the context menu.

Many graphic functions in contributed packages (including functions from the car package discussed in chapter 8) employ this method for labeling points. Unfortunately, the identify() function doesn’t work with lattice or ggplot2 graphs.
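
The identify() function also returns the row indices of the points you clicked, so the selection can be reused after labeling. The sketch below saves those indices and looks up the corresponding rows. The picked variable is a name introduced here, and the interactive() guard is an addition so that the code also runs in batch mode, where no mouse input is possible.

```r
# Draw the scatter plot, then capture the indices of the clicked points.
plot(mtcars$wt, mtcars$mpg)

if (interactive()) {
  # Returns an integer vector with one index per identified point
  picked <- identify(mtcars$wt, mtcars$mpg, labels = row.names(mtcars))
} else {
  picked <- integer(0)   # batch mode: nothing can be selected
}

# Show selected variables for the identified cars
mtcars[picked, c("mpg", "wt", "cyl")]
```

If no points are selected, the last statement simply returns a data frame with zero rows.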

16.4.2. playwith

The playwith package provides a GTK+ graphical user interface that allows users to edit and interact with R plots. You can install the playwith package on any platform using install.packages("playwith", dependencies=TRUE). On platforms running Mac OS X and Linux, it’s best to also install the JGR graphic user interface (see appendix A), and run playwith from within this GUI.

The playwith() function allows users to identify and label points, view all variable values for an observation, zoom and pan, add annotations (text, arrows, lines, rectangles, titles, labels), change visual elements (colors, text sizes, and so on), apply previously saved styles, and output the resulting graph in a variety of formats. This is easily demonstrated with an example. After running the following code

library(playwith)
library(lattice)

playwith(
   xyplot(mpg~wt|factor(cyl)*factor(am),
          data=mtcars, subscripts=TRUE,
          type=c("r", "p"))
)

the window in figure 16.12 will appear on the screen. Try out the buttons on the left, as well as the menu items. The GUI is fairly self-explanatory. Unlike the identify() function, playwith() works with lattice and ggplot2 graphs as well as base R graphs. Some options in the Theme menu only work properly with base graphics. Additionally, some features work with ggplot2 graphs (such as annotating) and some don’t (such as identifying points).

Figure 16.12. The playwith window. The user can edit the graph using the mouse with this GTK+ GUI.

To learn more about the playwith package, visit the project website at http://code.google.com/p/playwith/.

16.4.3. latticist

The latticist package lets you explore a data set using lattice displays. It provides a graphic user interface to the graphs described in section 16.2, but it can also be used to create displays from the vcd package (see chapter 11, section 11.4). If desired, latticist can also be integrated with playwith. For example, executing the following code

library(latticist)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$gear <- factor(mtcars$gear)
latticist(mtcars, use.playwith=TRUE)

will bring up the interface in figure 16.13.

Figure 16.13. playwith window with latticist functionality. The user can create lattice and vcd graphs interactively.

In addition to having the playwith functionality (point identification, annotation, zooming, panning, styles), the user can now create lattice graphs by selecting from drop-down menus and buttons. To learn more about the latticist package, see http://code.google.com/p/latticist/.

A similar interface is available for ggplot2 graphs, through Plot Builder, a plug-in for Deducer, a popular GUI for R (see appendix A). Because it can’t be run from the R console, we won’t discuss it here. If you’re interested, visit the Deducer website at www.deducer.org.

16.4.4. Interactive graphics with the iplots package

Whereas playwith and latticist allow you to interact with a single graph, the iplots package takes interaction in a different direction. This package provides interactive mosaic plots, bar plots, box plots, parallel plots, scatter plots, and histograms that can be linked together and color brushed. This means that you can select and identify observations using the mouse, and highlighting observations in one graph will automatically highlight the same observations in all other open graphs. You can also use the mouse to obtain information about graphic objects such as points, bars, lines, and box plots.

The iplots package is implemented through Java and the primary functions are listed in table 16.5.

Table 16.5. iplots functions

Function Description

ibar() Interactive bar chart
ibox() Interactive box plot
ihist() Interactive histogram
imap() Interactive map
imosaic() Interactive mosaic plot
ipcp() Interactive parallel coordinates plot
iplot() Interactive scatter plot

To understand how iplots works, execute the code provided in listing 16.6.

Listing 16.6. iplots demonstration
library(iplots)
attach(mtcars)
cylinders <- factor(cyl)
gears <- factor(gear)
transmission <- factor(am)
ihist(mpg)
ibar(gears)
iplot(mpg, wt)
ibox(mtcars[c("mpg", "wt", "qsec", "disp", "hp")])
ipcp(mtcars[c("mpg", "wt", "qsec", "disp", "hp")])
imosaic(transmission, cylinders)
detach(mtcars)

Six windows containing graphs will open. Rearrange them on the desktop so that each is visible (each can be resized if necessary). A portion of the display is provided in figure 16.14.

Figure 16.14. An iplots demonstration created by listing 16.6. Only four of the six windows are displayed to save room. In these graphs, the user has clicked on the three-gear bar in the bar chart window.

Now try the following:

  • Click on the three-gear bar in the Barchart (gears) window. The bar will turn red. In addition, all cars with three gears will be highlighted in the other graphic windows.
  • Mouse down and drag to select a rectangular region of points in the Scatter plot (wt vs mpg) window. These points will be highlighted and the corresponding observations in every other graphics window will also turn red.
  • Hold down the Ctrl key and move the mouse pointer over a point, bar, box plot, or line in one of the graphs. Details about that object will appear in a pop-up window.
  • Right-click on any object and note the options that are offered in the context menu. For example, you can right-click on the Boxplot (mpg) window and change the graph to a parallel coordinates plot (PCP).
  • You can drag to select more than one object (point, bar, and so on) or use Shift-click to select noncontiguous objects. Try selecting both the three- and five-gear bars in the Barchart (gears) window.

The functions in the iplots package allow you to explore the variable distributions and relationships among variables in subgroups of observations that you select interactively. This can provide insights that would be difficult and time-consuming to obtain in other ways. For more information on the iplots package, visit the project website at http://rosuda.org/iplots/.

16.4.5. rggobi

For our final example of interactivity, we’ll actually look beyond the R platform to the open source GGobi application (www.ggobi.org). GGobi is a comprehensive program for the visual and dynamic exploration of high-dimensional data and is freely available for Windows, Mac OS X, and Linux platforms. It offers a number of attractive features, including linked interactive scatter plots, bar charts, parallel coordinate plots, time series plots, scatter plot matrices, and 3D rotation; brushing and identification; multivariate transformation methods; and sophisticated exploratory support, including guided and manual 1D and 2D tours. Happily, the rggobi package provides a seamless interface between GGobi and R.

The first step in using GGobi is to download and install the appropriate software for your platform (www.ggobi.org/downloads/). Then install the rggobi package within R using install.packages("rggobi", dependencies=TRUE).

Once you’ve installed both, you can use the ggobi() function to run GGobi from within R. This gives you sophisticated interactive graphics access to all of your R data. To see this in action, execute the following code:

library(rggobi)
g <- ggobi(mtcars)

The GGobi interface will open and allow you to explore the mtcars dataset in a highly interactive fashion. To learn more, review the introduction, tutorial, manual, and video guides available on the GGobi website. A comprehensive overview is also provided in Cook and Swayne (2008).

16.5. Summary

In this chapter, we reviewed several packages that provide access to advanced graphical methods. We started with the lattice package, designed to provide a system for creating trellis graphs, followed by the ggplot2 package, based on a comprehensive grammar of graphics. Both packages are designed to provide you with a complete and comprehensive alternative to the native graphics provided with R. Each offers methods of creating attractive and meaningful visualizations of data that are difficult to generate in other ways.

We then explored several packages for dynamically interacting with graphs, including playwith, latticist, iplots, and rggobi. These packages allow you to interact directly with data in graphs, leading to a greater intimacy with your data and expanded opportunities for developing insights.

You should now have a firm grasp of the many ways that R allows you to create visual representations of data. If a picture is worth a thousand words, and R provides a thousand ways to create a picture, then R must be worth a million words (or something to that effect). These resources are a testament to the hard and selfless work of the initial R development team and the thousands of hours of work contributed by package authors.
