Chapter 5
Gapminder1 is an independent foundation based in Stockholm, Sweden. Its mission is “to debunk devastating myths about the world by offering free access to a fact-based world view.” They provide free online tools, data, and videos “to better understand the changing world.” The initial development of Gapminder was the Trendalyzer software, used by Hans Rosling in several sequences of his documentary “The Joy of Stats.”
The information visualization technique used by Trendalyzer is an interactive bubble chart. By default it shows five variables: two numeric variables on the vertical and horizontal axes, bubble size and color, and a time variable that may be manipulated with a slider. The software uses brushing and linking techniques for displaying the numeric value of a highlighted country.
This software was acquired by Google® in 2007, and is now available as a Motion Chart gadget and as the Public Data Explorer.
In this chapter, time will be used as a complementary variable that adds information to a graph comparing several variables. We will illustrate this approach with the evolution of the relationship between Gross National Income (GNI) and carbon dioxide (CO2) emissions for a set of countries extracted from the World Bank Open Data database. We will try several solutions to display the relationship between CO2 emissions and GNI over the years, using time as a complementary variable. The final method will produce an animated plot resembling the Trendalyzer solution.
The first solution is a Motion Chart produced with the googleVis package (Gesmann and Castillo 2011), an interface between R and the Google Visualisation API. Its gvisMotionChart function makes it easy to produce a Motion Chart that can be displayed in a browser with Flash enabled (Figure 5.1).
load('data/CO2.RData')
library(googleVis)
pgvis <- gvisMotionChart(CO2data, idvar='Country.Name', timevar='Year')
Although the gvisMotionChart is quite easy to use, the global appearance and behavior are completely determined by Google API2. Moreover, you should carefully read their Terms of Use before using it for public distribution.
Our next attempt is to display the entire data in a panel with a scatterplot using country names as the grouping factor. Points of each country are connected with polylines to reveal the time evolution (Figure 5.2).
## lattice version
library(lattice)
xyplot(GNI.capita ~ CO2.capita, data=CO2data,
       xlab="Carbon dioxide emissions (metric tons per capita)",
       ylab="GNI per capita, PPP (current international $)",
       groups=Country.Name, type='b')
## ggplot2 version
library(ggplot2)
ggplot(data=CO2data, aes(x=CO2.capita, y=GNI.capita,
                         color=Country.Name)) +
    xlab("Carbon dioxide emissions (metric tons per capita)") +
    ylab("GNI per capita, PPP (current international $)") +
    geom_point() + geom_path() + theme_bw()
Three improvements can be added to this graphical result:
The Country.Name categorical variable will be encoded with a qualitative palette, namely the first five colors of the Set1 palette3 from the RColorBrewer package (Neuwirth 2011). Because there are more countries than colors, some colors must be repeated to cover all the levels of Country.Name. The resulting palette has non-unique colors, so some countries will share the same color. This is not a problem because the curves will be labeled, and countries with the same color will be displayed far enough apart.
library(RColorBrewer)
nCountries <- nlevels(CO2data$Country.Name)
pal <- brewer.pal(n=5, 'Set1')
pal <- rep(pal, length = nCountries)
Adjacent colors of this palette are chosen to be easily distinguishable. Therefore, colors must be assigned to countries so that nearby lines are encoded with adjacent colors of the palette.
A simple approach is to calculate the annual average of the variable to be represented along the x-axis (CO2.capita), and extract colors from the palette according to the order of this value.
## Rank of average values of CO2 per capita
CO2mean <- aggregate(CO2.capita ~ Country.Name, data=CO2data, FUN=mean)
palOrdered <- pal[rank(CO2mean$CO2.capita)]
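As a small illustration of the rank-based assignment, the following sketch uses a made-up vector of per-country averages (the country names and values are hypothetical) and hard-codes the first five Set1 colors so that it runs without RColorBrewer. It is only meant to show how rank selects palette entries, not to reproduce the figures.

```r
## First five colors of the Set1 palette, hard-coded for self-containment
pal5 <- c('#E41A1C', '#377EB8', '#4DAF4A', '#984EA3', '#FF7F00')
## Hypothetical averages for seven countries (illustrative values only)
avgCO2 <- c(A=9.1, B=1.2, C=5.6, D=17.3, E=3.4, F=7.8, G=0.6)
## Recycle the palette to cover every country
palToy <- rep(pal5, length.out=length(avgCO2))
## rank() gives the position of each country in the sorted averages,
## so countries with similar averages receive adjacent palette entries
palToyOrdered <- palToy[rank(avgCO2)]
data.frame(country=names(avgCO2), avg=avgCO2, color=palToyOrdered)
```

With tied averages, rank returns fractional values by default; in that case ties.method='first' would keep the indices whole.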
A more sophisticated solution is to use the ordered results of a hierarchical clustering of the time evolution of the CO2 per capita values (Figure 5.3). The data is extracted from the original CO2 data.frame.
CO2capita <- CO2data[, c('Country.Name', 'Year', 'CO2.capita')]
CO2capita <- reshape(CO2capita, idvar='Country.Name', timevar='Year',
                     direction='wide')
hCO2 <- hclust(dist(CO2capita[, -1]))
oldpar <- par(mar=c(0, 2, 0, 0) + .1)
plot(hCO2, labels=CO2capita$Country.Name,
     xlab='', ylab='', sub='', main='')
par(oldpar)
The colors of the palette are assigned to each country with match. Given the country names in alphabetical order (the levels of Country.Name), it returns the position of each name within the vector of country names ordered according to the hierarchical clustering.
idx <- match(levels(CO2data$Country.Name),
             CO2capita$Country.Name[hCO2$order])
palOrdered <- pal[idx]
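To make the role of match more concrete, here is a minimal sketch with a hypothetical data set of four countries and three years (all names and values are invented). It follows the same steps as the text: reshape to wide format, cluster, and look up each factor level in the clustering order.

```r
## Hypothetical long-format data: four countries, three years
toy <- data.frame(
    Country.Name=factor(rep(c('A', 'B', 'C', 'D'), each=3)),
    Year=rep(2000:2002, times=4),
    CO2.capita=c(1.0, 1.1, 1.2,    # A: low emitter
                 9.0, 9.2, 9.1,    # B: high emitter
                 1.3, 1.2, 1.4,    # C: similar to A
                 8.8, 9.0, 9.3))   # D: similar to B
## One row per country, one column per year
wide <- reshape(toy, idvar='Country.Name', timevar='Year',
                direction='wide')
hc <- hclust(dist(wide[, -1]))
## Country names in the left-to-right order of the dendrogram
clustOrder <- wide$Country.Name[hc$order]
## Position of each factor level within that order
idxToy <- match(levels(toy$Country.Name), clustOrder)
setNames(idxToy, levels(toy$Country.Name))
```

In this toy case the countries with similar trajectories (A and C, B and D) end up with consecutive positions, which is what lets similar curves receive adjacent palette colors.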
It must be highlighted that this palette links colors with the levels of Country.Name (country names in alphabetical order), which is exactly what the groups argument provides. The following code produces a curve for each country using different colors to distinguish them.
## simpleTheme encapsulates the palette in a new theme for xyplot
myTheme <- simpleTheme(pch=19, cex=0.6, col=palOrdered)
pCO2.capita <- xyplot(GNI.capita ~ CO2.capita,
                      xlab="Carbon dioxide emissions (metric tons per capita)",
                      ylab="GNI per capita, PPP (current international $)",
                      groups=Country.Name, data=CO2data,
                      par.settings=myTheme,
                      type='b')
gCO2.capita <- ggplot(data=CO2data, aes(x=CO2.capita, y=GNI.capita,
                                        color=Country.Name)) +
    geom_point() + geom_path() +
    ## guide='none' suppresses the legend
    scale_color_manual(values=palOrdered, guide='none') +
    xlab('CO2 emissions (metric tons per capita)') +
    ylab('GNI per capita, PPP (current international $)') +
    theme_bw()
This result can be improved with labels displaying the years to show the time evolution. A panel function with panel.text to print the year labels and panel.superpose to display the lines for each group is a solution. In the panel function, subscripts is a vector with the integer indices representing the rows of the data.frame to be displayed in the panel.
xyplot(GNI.capita ~ CO2.capita,
       xlab="Carbon dioxide emissions (metric tons per capita)",
       ylab="GNI per capita, PPP (current international $)",
       groups=Country.Name, data=CO2data,
       par.settings=myTheme,
       type='b',
       panel=function(x, y, ..., subscripts, groups){
           panel.text(x, y, ...,
                      labels=CO2data$Year[subscripts],
                      pos=2, cex=0.5, col='gray')
           panel.superpose(x, y, subscripts, groups, ...)
       }
       )
The same result is obtained with clearer code by combining +.trellis and glayer_ (both from the latticeExtra package) with panel.text. Using glayer_ instead of glayer, we ensure that the labels are printed below the lines.
pCO2.capita <- pCO2.capita +
    glayer_(panel.text(..., labels=CO2data$Year[subscripts],
                       pos=2, cex=0.5, col='gray'))
gCO2.capita <- gCO2.capita + geom_text(aes(label=Year),
                                       colour='gray',
                                       size=2.5,
                                       hjust=0, vjust=0)
The common solution to link each curve with the group value is to add a legend. However, a legend can be confusing with too many items. In addition, the reader must carry out a complex task: Choose the line, memorize its color, search for it in the legend, and read the country name.
A better approach is to label each line using nearby text with the same color encoding. A suitable method is to place the labels close to the end of each line (Figure 5.4). Labels are placed with the panel.pointLabel function from the maptools package. This function uses optimization routines to find locations without overlaps.
library(maptools)
## group.value provides the country name; group.number is the
## index of each country to choose the color from the palette.
pCO2.capita +
    glayer(panel.pointLabel(mean(x), mean(y),
                            labels=group.value,
                            col=palOrdered[group.number],
                            cex=.8,
                            fontface=2, fontfamily='Palatino'))
However, this solution does not solve the overlapping between labels and lines. The package directlabels (Hocking 2013) includes a wide repertoire of positioning methods to cope with this problem. The main function, direct.label, is able to determine a suitable method for each plot, although the user can choose a different method from the collection or even define a custom method. For the pCO2.capita object, I have obtained the best results with extreme.grid (Figure 5.5).
library(directlabels)
direct.label(pCO2.capita, method='extreme.grid')
direct.label(gCO2.capita, method='extreme.grid')
Time can be used as a conditioning variable (as shown in previous sections) to display subsets of the data in different panels. Figure 5.6 is produced with the same code as in Figure 5.2, now including |factor(Year) in the lattice version and facet_wrap(~ Year) in the ggplot2 version.
xyplot(GNI.capita ~ CO2.capita | factor(Year), data=CO2data,
       xlab="Carbon dioxide emissions (metric tons per capita)",
       ylab="GNI per capita, PPP (current international $)",
       groups=Country.Name, type='b',
       auto.key=list(space='right'))
ggplot(data=CO2data, aes(x=CO2.capita, y=GNI.capita,
                         colour=Country.Name)) +
    facet_wrap(~ Year) + geom_point(pch=19) +
    xlab('CO2 emissions (metric tons per capita)') +
    ylab('GNI per capita, PPP (current international $)') +
    theme_bw()
Because the grouping variable, Country.Name, has many levels, the legend is not very useful. Once again, point labeling is recommended (Figure 5.7).
xyplot(GNI.capita ~ CO2.capita | factor(Year), data=CO2data,
       xlab="Carbon dioxide emissions (metric tons per capita)",
       ylab="GNI per capita, PPP (current international $)",
       groups=Country.Name, type='b',
       par.settings=myTheme) +
    glayer(panel.pointLabel(x, y, labels=group.value,
                            col=palOrdered[group.number], cex=0.7))
Instead of using simple points, we can display circles of different radii to encode a new variable. This new variable is CO2.PPP, the ratio of CO2 emissions to the Gross Domestic Product with purchasing power parity (PPP) estimations.
To use this numeric variable as an additional grouping factor, its range must be divided into different classes. The typical solution is to use cut to coerce the numeric variable into a factor whose levels correspond to uniform intervals, which could be unrelated to the data distribution. The classInt package (R. Bivand 2013) provides several methods to partition data into classes based on natural groups in the data distribution.
Although the functions of this package are mainly intended to create color palettes for maps, the results can also be associated with point sizes. cex.key defines the sequence of sizes (to be displayed in the legend), and the findCols function assigns each CO2.PPP value to its class and therefore to one of these sizes.
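The code that follows relies on an object named intervals. As a rough sketch of the difference between uniform and data-driven classes, the example below uses invented, skewed values in place of CO2data$CO2.PPP. The base R part runs on its own; the classInt part is only executed if the package is installed, and its n=4 and style='fisher' settings are illustrative assumptions rather than the values used for the figures.

```r
## Hypothetical skewed values standing in for CO2data$CO2.PPP
z <- c(0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.30, 0.90, 1.50)
## Uniform intervals: most values fall into the first class
uniform <- cut(z, breaks=4)
table(uniform)
## Data-driven alternative in base R: quartile breaks
quartiles <- cut(z, breaks=quantile(z, probs=seq(0, 1, 0.25)),
                 include.lowest=TRUE)
table(quartiles)
## With classInt installed, classIntervals() returns the kind of object
## that findCols() expects; n and style are illustrative choices
if (requireNamespace('classInt', quietly=TRUE)) {
    toyIntervals <- classInt::classIntervals(z, n=4, style='fisher')
    print(classInt::findCols(toyIntervals))
}
```

An intervals object for the real data could presumably be built the same way from CO2data$CO2.PPP.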
nInt <- length(intervals$brks) - 1
cex.key <- seq(0.5, 1.8, length=nInt)
idx <- findCols(intervals)
CO2data$cexPoints <- cex.key[idx]
The graphic will display information on two variables (GNI.capita and CO2.capita in the vertical and horizontal axes, respectively) with a conditioning variable (Year) and two grouping variables (Country.Name, and CO2.PPP through cexPoints) (Figure 5.8).
ggplot(data=CO2data, aes(x=CO2.capita, y=GNI.capita,
                         colour=Country.Name)) +
    facet_wrap(~ Year) + geom_point(aes(size=cexPoints), pch=19) +
    xlab('Carbon dioxide emissions (metric tons per capita)') +
    ylab('GNI per capita, PPP (current international $)') +
    theme_bw()
The auto.key mechanism of the lattice version is not able to cope with two grouping variables. Therefore, the legend, whose main components are the labels (intervals) and the point sizes (cex.key), should be defined manually (Figure 5.9).
op <- options(digits=2)
tab <- print(intervals)
options(op)
key <- list(space='right',
            title=expression(CO[2]/GNI.PPP),
            cex.title=1,
            ## Labels of the key are the intervals strings
            text=list(labels=names(tab), cex=0.85),
            ## Points sizes are defined with cex.key
            points=list(col='black', pch=19,
                        cex=cex.key, alpha=0.7))
xyplot(GNI.capita ~ CO2.capita | factor(Year), data=CO2data,
       xlab="Carbon dioxide emissions (metric tons per capita)",
       ylab="GNI per capita, PPP (current international $)",
       groups=Country.Name, key=key, alpha=0.7,
       col=palOrdered, cex=CO2data$cexPoints) +
    glayer(panel.pointLabel(x, y, labels=group.value,
                            col=palOrdered[group.number], cex=0.7))
The final solution to display this multivariate time series is with animation via the function grid.animate of the gridSVG package. We will mimic the Trendalyzer/Motion Chart solution, using traveling bubbles of different colors and with radius proportional to CO2.PPP.
The first step is to draw the initial state of the bubbles. Their colors are again defined by the palOrdered palette, although the adjustcolor function is used for a lighter fill color. Because there will not be a legend, there is no need to define class intervals, and thus the radius is directly proportional to the value of CO2data$CO2.PPP.
library(gridSVG)
xyplot(GNI.capita ~ CO2.capita, data=CO2data,
       xlab="Carbon dioxide emissions (metric tons per capita)",
       ylab="GNI per capita, PPP (current international $)",
       subset=Year==2000, groups=Country.Name,
       ## The limits of the graphic are defined
       ## with the entire dataset
       xlim=extendrange(CO2data$CO2.capita),
       ylim=extendrange(CO2data$GNI.capita),
       panel=function(x, y, ..., subscripts, groups) {
           color <- palOrdered[groups[subscripts]]
           radius <- CO2data$CO2.PPP[subscripts]
           ## Size of labels
           cex <- 1.1*sqrt(radius)
           ## Bubbles
           grid.circle(x, y, default.units="native",
                       r=radius*unit(.25, "inch"),
                       name=trellis.grobname("points", type="panel"),
                       gp=gpar(col=color,
                               ## Fill color lighter than border
                               fill=adjustcolor(color, alpha=.5),
                               lwd=2))
           ## Country labels
           grid.text(label=groups[subscripts],
                     x=unit(x, 'native'),
                     ## Labels above each bubble
                     y=unit(y, 'native') + 1.5 * radius * unit(.25, 'inch'),
                     name=trellis.grobname('labels', type='panel'),
                     gp=gpar(col=color, cex=cex))
       })
From this initial state, grid.animate creates a collection of animated graphical objects with the result of animUnit. This function produces a set of values that will be interpreted by grid.animate as intermediate states of a feature of the graphical object. Thus, the bubbles will travel across the values defined by x_points and y_points, while their labels will use x_points and y_labels.
The use of rep=TRUE ensures that the animation will be repeated indefinitely.
## Duration in seconds of the animation
duration <- 20
nCountries <- nlevels(CO2data$Country.Name)
years <- unique(CO2data$Year)
nYears <- length(years)
## Intermediate positions of the bubbles
x_points <- animUnit(unit(CO2data$CO2.capita, 'native'),
                     id=rep(seq_len(nCountries), each=nYears))
y_points <- animUnit(unit(CO2data$GNI.capita, 'native'),
                     id=rep(seq_len(nCountries), each=nYears))
## Intermediate positions of the labels
y_labels <- animUnit(unit(CO2data$GNI.capita, 'native') +
                     1.5 * CO2data$CO2.PPP * unit(.25, 'inch'),
                     id=rep(seq_len(nCountries), each=nYears))
## Intermediate sizes of the bubbles
size <- animUnit(CO2data$CO2.PPP * unit(.25, 'inch'),
                 id=rep(seq_len(nCountries), each=nYears))
grid.animate(trellis.grobname("points", type="panel", row=1, col=1),
             duration=duration,
             x=x_points,
             y=y_points,
             r=size,
             rep=TRUE)
grid.animate(trellis.grobname("labels", type="panel", row=1, col=1),
             duration=duration,
             x=x_points,
             y=y_labels,
             rep=TRUE)
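The id vectors built with rep(seq_len(nCountries), each=nYears) implicitly assume that the rows of the data are sorted by country and then by year, and that every country is observed in every year. The following sketch, using a small invented data set, shows one way to enforce and check that ordering before calling animUnit; the object names are hypothetical.

```r
## Hypothetical data: three countries, four years, deliberately shuffled
set.seed(1)
toy <- expand.grid(Year=2000:2003, Country.Name=c('A', 'B', 'C'))
toy$Country.Name <- factor(toy$Country.Name)
toy <- toy[sample(nrow(toy)), ]
## Sort by country, then by year, before building the id vector
toy <- toy[order(toy$Country.Name, toy$Year), ]
nC <- nlevels(toy$Country.Name)
nY <- length(unique(toy$Year))
id <- rep(seq_len(nC), each=nY)
## Each id must correspond to exactly one country ...
stopifnot(all(id == as.integer(toy$Country.Name)))
## ... and years must increase within each id
stopifnot(all(tapply(toy$Year, id, function(y) !is.unsorted(y))))
```

If either check fails on the real data, the bubbles would jump between countries or move backward in time during the animation.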
A bit of interactivity can be added with the grid.hyperlink function. For example, the following code adds the corresponding Wikipedia link to a mouse click on each bubble.
countries <- unique(CO2data$Country.Name)
URL <- paste('http://en.wikipedia.org/wiki/', countries, sep='')
grid.hyperlink(trellis.grobname('points', type='panel', row=1, col=1),
               URL, group=FALSE)
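Country names containing spaces produce URLs with embedded spaces. A possible refinement, shown here as a sketch with invented example names, is to replace spaces with underscores, which is the convention Wikipedia uses in article addresses. Whether every World Bank country name matches a Wikipedia article title is an assumption that should be checked case by case.

```r
## Hypothetical country names, some with spaces
toyCountries <- c('Spain', 'United States', 'Russian Federation')
## Replace spaces with underscores before building the address
toyURL <- paste0('http://en.wikipedia.org/wiki/',
                 gsub(' ', '_', toyCountries, fixed=TRUE))
toyURL
```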
Finally, the time information: The year is printed in the lower right corner, using the visibility attribute of an animated textGrob object to show and hide the values.
visibility <- matrix("hidden", nrow=nYears, ncol=nYears)
diag(visibility) <- "visible"
yearText <- animateGrob(garnishGrob(textGrob(years, .9, .15,
                                             name="year",
                                             gp=gpar(cex=2, col="grey")),
                                    visibility="hidden"),
                        duration=20,
                        visibility=visibility,
                        rep=TRUE)
grid.draw(yearText)
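The structure of the visibility matrix is easier to see with a reduced example. This sketch builds it for three hypothetical years and adds dimension names purely for display; because the matrix is symmetric, it reads the same whether rows are taken as animation steps and columns as year labels or the other way round.

```r
## Three hypothetical years instead of the full sequence
toyYears <- 2000:2002
n <- length(toyYears)
vis <- matrix('hidden', nrow=n, ncol=n,
              dimnames=list(step=seq_len(n), year=toyYears))
## Only the label matching the current step is shown
diag(vis) <- 'visible'
vis
```

At each step exactly one year label is visible and all the others are hidden.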
The SVG file produced with grid.export is available at the website of the book (Figure 5.10). Because this animation does not trace the paths, Figure 5.5 provides this information as a static complement.
grid.export("figs/bubbles.svg")
Now, sit down in your favorite easy chair and watch the masterful video “200 Countries, 200 Years, 4 Minutes”4. After that, you are ready to open the SVG file of traveling bubbles. It is easier to follow: a short time period and fewer than twenty countries.
2 You should read the Google API Terms of Service before using googleVis: https://developers.google.com/terms/.
4 http://www.gapminder.org/videos/200-years-that-changed-the-world-bbc/