Chapter 9. Advanced Looping

R’s looping capability goes far beyond the three standard-issue loops seen in the last chapter. It gives you the ability to apply functions to each element of a vector, list, or array, so you can write pseudo-vectorized code where normal vectorization isn’t possible. Other loops let you calculate summary statistics on chunks of data.

Chapter Goals

After reading this chapter, you should:

  • Be able to apply a function to every element of a list or vector, or to every row or column of a matrix
  • Be able to solve split-apply-combine problems
  • Be able to use the plyr package

Replication

Cast your mind back to Chapter 4 and the rep function. rep repeats its input several times. Another related function, replicate, calls an expression several times. Mostly, they do exactly the same thing. The difference occurs when random number generation is involved. Pretend for a moment that the uniform random number generation function, runif, isn’t vectorized. rep will repeat the same random number several times, but replicate gives a different number each time (for historical reasons, the order of the arguments is annoyingly back to front):

rep(runif(1), 5)
## [1] 0.04573 0.04573 0.04573 0.04573 0.04573
replicate(5, runif(1))
## [1] 0.5839 0.3689 0.1601 0.9176 0.5388

replicate comes into its own in more complicated examples: its main use is in Monte Carlo analyses, where you repeat an analysis a known number of times, and each iteration is independent of the others.

This next example estimates a person’s time to commute to work via different methods of transport. It’s a little bit complicated, but that’s on purpose because that’s when replicate is most useful.

The time_for_commute function uses sample to randomly pick a mode of transport (car, bus, train, or bike), then uses rnorm or rlnorm to find a normally or lognormally[25] distributed travel time (with parameters that depend upon the mode of transport):

time_for_commute <- function()
{
  #Choose a mode of transport for the day
  mode_of_transport <- sample(
    c("car", "bus", "train", "bike"),
    size = 1,
    prob = c(0.1, 0.2, 0.3, 0.4)
  )
  #Find the time to travel, depending upon mode of transport
  time <- switch(
    mode_of_transport,
    car   = rlnorm(1, log(30), 0.5),
    bus   = rlnorm(1, log(40), 0.5),
    train = rnorm(1, 30, 10),
    bike  = rnorm(1, 60, 5)
  )
  names(time) <- mode_of_transport
  time
}

The presence of a switch statement makes this function very hard to vectorize. That means that to find the distribution of commuting times, we need to repeatedly call time_for_commute to generate data for each day. replicate gives us instant vectorization:

replicate(5, time_for_commute())
##  bike   car train   bus  bike
## 66.22 35.98 27.30 39.40 53.81
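
As a further sketch of the Monte Carlo pattern, here is a smaller, self-contained simulation that estimates the probability of two dice totaling seven. The helper roll_two_dice, the seed, and the number of repetitions are assumptions chosen for illustration; they are not part of the commuting example.

```r
set.seed(42)  # arbitrary seed, only so that reruns give the same draws

# Hypothetical helper: roll two fair dice and return their total
roll_two_dice <- function()
{
  sum(sample(6, size = 2, replace = TRUE))
}

# Each call is independent of the others, which is the situation
# replicate is designed for
totals <- replicate(10000, roll_two_dice())

# Proportion of simulated rolls that total seven (the exact value is 1/6)
estimated_p_seven <- mean(totals == 7)
estimated_p_seven
```

The estimate should land close to 1/6, and it gets closer as the number of repetitions grows.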

Looping Over Lists

By now, you should have noticed that an awful lot of R is vectorized. In fact, your default stance should be to write vectorized code. It’s often cleaner to read, and usually gives you performance benefits when compared to a loop. In some cases, though, trying to achieve vectorization means contorting your code in unnatural ways. In those cases, the apply family of functions can give you pretend vectorization,[26] without the pain.

The simplest and most commonly used family member is lapply, short for “list apply.” lapply takes a list and a function as inputs, applies the function to each element of the list in turn, and returns another list of results. Recall our prime factorization list from Chapter 5:

prime_factors <- list(
  two   = 2,
  three = 3,
  four  = c(2, 2),
  five  = 5,
  six   = c(2, 3),
  seven = 7,
  eight = c(2, 2, 2),
  nine  = c(3, 3),
  ten   = c(2, 5)
)
head(prime_factors)
## $two
## [1] 2
##
## $three
## [1] 3
##
## $four
## [1] 2 2
##
## $five
## [1] 5
##
## $six
## [1] 2 3
##
## $seven
## [1] 7

Trying to find the unique values in each list element is difficult to do in a vectorized way. We could write a for loop to examine each element, but that’s a little bit clunky:

unique_primes <- vector("list", length(prime_factors))
for(i in seq_along(prime_factors))
{
  unique_primes[[i]] <- unique(prime_factors[[i]])
}
names(unique_primes) <- names(prime_factors)
unique_primes
## $two
## [1] 2
##
## $three
## [1] 3
##
## $four
## [1] 2
##
## $five
## [1] 5
##
## $six
## [1] 2 3
##
## $seven
## [1] 7
##
## $eight
## [1] 2
##
## $nine
## [1] 3
##
## $ten
## [1] 2 5

lapply makes this so much easier, eliminating the nasty boilerplate code for worrying about lengths and names:

lapply(prime_factors, unique)
## $two
## [1] 2
##
## $three
## [1] 3
##
## $four
## [1] 2
##
## $five
## [1] 5
##
## $six
## [1] 2 3
##
## $seven
## [1] 7
##
## $eight
## [1] 2
##
## $nine
## [1] 3
##
## $ten
## [1] 2 5

When the return value from the function is the same size each time, and you know what that size is, you can use a variant of lapply called vapply. vapply stands for “list apply that returns a vector.” As before, you pass it a list and a function, but vapply takes a third argument that is a template for the return values. Rather than returning a list, it simplifies the result to be a vector or an array:

vapply(prime_factors, length, numeric(1))
##   two three  four  five   six seven eight  nine   ten
##     1     1     2     1     2     1     3     2     2

If the output does not fit the template, then vapply will throw an error. This makes it less flexible than lapply, since the output must be the same size for each element and must be known in advance.
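
To make the role of the template concrete, here is a minimal sketch using a made-up list called measurements. The first call uses a length-two template and gets a matrix back; the second deliberately breaks the template, with tryCatch used only to capture the error message rather than halt the script.

```r
# A small list of numeric vectors with differing lengths (made-up values)
measurements <- list(a = c(3, 1, 2), b = c(10, 20), c = 5)

# range always returns two numbers, so the template is numeric(2);
# the result is a matrix with one column per list element
ranges <- vapply(measurements, range, numeric(2))
ranges

# rev returns as many values as it is given, so the result for "a"
# (three values) does not fit a length-two template and vapply
# signals an error
mismatch <- tryCatch(
  vapply(measurements, rev, numeric(2)),
  error = function(e) conditionMessage(e)
)
mismatch
```

Failing loudly like this is usually what you want in scripts and packages, where a silently changing output shape is harder to debug.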

There is another function that lies in between lapply and vapply: namely sapply, which stands for “simplifying list apply.” Like the two other functions, sapply takes a list and a function as inputs. It does not need a template, but will try to simplify the result to an appropriate vector or array if it can:

sapply(prime_factors, unique)  #returns a list
## $two
## [1] 2
##
## $three
## [1] 3
##
## $four
## [1] 2
##
## $five
## [1] 5
##
## $six
## [1] 2 3
##
## $seven
## [1] 7
##
## $eight
## [1] 2
##
## $nine
## [1] 3
##
## $ten
## [1] 2 5
sapply(prime_factors, length)  #returns a vector
##   two three  four  five   six seven eight  nine   ten
##     1     1     2     1     2     1     3     2     2
sapply(prime_factors, summary) #returns an array
##         two three four five  six seven eight nine  ten
## Min.      2     3    2    5 2.00     7     2    3 2.00
## 1st Qu.   2     3    2    5 2.25     7     2    3 2.75
## Median    2     3    2    5 2.50     7     2    3 3.50
## Mean      2     3    2    5 2.50     7     2    3 3.50
## 3rd Qu.   2     3    2    5 2.75     7     2    3 4.25
## Max.      2     3    2    5 3.00     7     2    3 5.00

For interactive use, this is wonderful because you usually automatically get the result in the form that you want. This function does require some care if you aren’t sure about what your inputs might be, though, since the result is sometimes a list and sometimes a vector. This can trip you up in some subtle ways. Our previous length example returned a vector, but look what happens when you pass it an empty list:

sapply(list(), length)
## list()

If the input list has length zero, then sapply always returns a list, regardless of the function that is passed. So if your data could be empty, and you know the type and length of the return value, it is safer to use vapply:

vapply(list(), length, numeric(1))
## numeric(0)

Although these functions are primarily designed for use with lists, they can also accept vector inputs. In this case, the function is applied to each element of the vector in turn. The source function is used to read and evaluate the contents of an R file. (That is, you can use it to run an R script.) Unfortunately it isn’t vectorized, so if we want to run all the R scripts in a directory, then we need to pass the vector of filenames to lapply.

In this next example, dir returns the names of files in a given directory, defaulting to the current working directory. (Recall that you can find this with getwd.) The argument pattern = "\\.R$" means “only return filenames that end with .R” (the backslash is doubled because it must itself be escaped inside an R string):

r_files <- dir(pattern = "\\.R$")
lapply(r_files, source)

You may have noticed that in all of our examples, the functions passed to lapply, vapply, and sapply have taken just one argument. There is a limitation in these functions in that you can only pass one vectorized argument (more on how to circumvent that later), but you can pass other scalar arguments to the function. To do this, just pass in named arguments to the lapply (or sapply or vapply) call, and they will be passed to the inner function. For example, rep.int takes two arguments, but the times argument is allowed to be a single (scalar) number, so you’d type:

complemented <- c(2, 3, 6, 18)        #See http://oeis.org/A000614
lapply(complemented, rep.int, times = 4)
## [[1]]
## [1] 2 2 2 2
##
## [[2]]
## [1] 3 3 3 3
##
## [[3]]
## [1] 6 6 6 6
##
## [[4]]
## [1] 18 18 18 18

What if the vector argument isn’t the first one? In that case, we have to create our own function to wrap the function that we really wanted to call. You can do this on a separate line, but it is common to include the function definition within the call to lapply:

rep4x <- function(x) rep.int(4, times = x)
lapply(complemented, rep4x)
## [[1]]
## [1] 4 4
##
## [[2]]
## [1] 4 4 4
##
## [[3]]
## [1] 4 4 4 4 4 4
##
## [[4]]
##  [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

This last code chunk can be made a little simpler by passing an anonymous function to lapply. This is the trick we saw in Chapter 5, where we don’t bother with a separate assignment line and just pass the function to lapply without giving it a name:

lapply(complemented, function(x) rep.int(4, times = x))
## [[1]]
## [1] 4 4
##
## [[2]]
## [1] 4 4 4
##
## [[3]]
## [1] 4 4 4 4 4 4
##
## [[4]]
##  [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

Very, very occasionally, you may want to loop over every variable in an environment, rather than in a list. There is a dedicated function, eapply, for this, though in recent versions of R you can also use lapply:

env <- new.env()
env$molien <- c(1, 0, 1, 0, 1, 1, 2, 1, 3) #See http://oeis.org/A008584
env$larry <- c("Really", "leery", "rarely", "Larry")
eapply(env, length)
## $molien
## [1] 9
##
## $larry
## [1] 4
lapply(env, length) #same
## $molien
## [1] 9
##
## $larry
## [1] 4

rapply is a recursive version of lapply that allows you to loop over nested lists. This is a niche requirement, and code is often simpler if you flatten the data first using unlist.
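
For the cases where you do need it, the following sketch shows rapply on a small, made-up nested list, next to the flatten-first alternative. The how argument controls whether the nesting is kept ("list") or the results are flattened ("unlist", the default).

```r
# A made-up nested list: some elements are themselves lists
nested <- list(
  a = 1:3,
  b = list(
    c = 4:5,
    d = list(e = 6)
  )
)

# how = "list" keeps the nesting and applies the function to every leaf
doubled <- rapply(nested, function(x) x * 2, how = "list")
doubled$b$d$e

# how = "unlist" flattens the per-leaf results into a single vector
leaf_sums <- rapply(nested, sum, how = "unlist")
leaf_sums

# Flattening first is simpler when the grouping into leaves does not matter
total <- sum(unlist(nested))
total
```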

Looping Over Arrays

lapply, and its friends vapply and sapply, can be used on matrices and arrays, but their behavior often isn’t what we want. The three functions treat the matrices and arrays as though they were vectors, applying the target function to each element one at a time (moving down columns). More commonly, when we want to apply a function to an array, we want to apply it by row or by column. This next example uses the matlab package, which gives some functionality ported from the rival language.

To run the next example, you first need to install the matlab package:

install.packages("matlab")
library(matlab)
## Attaching package: 'matlab'
## The following object is masked from 'package:stats':
##
## reshape
## The following object is masked from 'package:utils':
##
## find, fix
## The following object is masked from 'package:base':
##
## sum

Note

When you load the matlab package, it overrides some functions in the base, stats, and utils packages to make them behave like their MATLAB counterparts. After these examples that use the matlab package, you may wish to restore the usual behavior by unloading the package. Call detach("package:matlab") to do this.

The magic function creates a magic square—an n-by-n square matrix of the numbers from 1 to n^2, where each row and each column has the same total:

(magic4 <- magic(4))
##      [,1] [,2] [,3] [,4]
## [1,]   16    2    3   13
## [2,]    5   11   10    8
## [3,]    9    7    6   12
## [4,]    4   14   15    1

A classic problem requiring us to apply a function by row is calculating the row totals. This can be achieved using the rowSums function that we saw briefly in Chapter 5:

rowSums(magic4)
## [1] 34 34 34 34

But what if we want to calculate a different statistic for each row? It would be cumbersome to try to provide a function for every such possibility.[27] The apply function provides the row/column-wise equivalent of lapply, taking a matrix, a dimension number, and a function as arguments. The dimension number is 1 for “apply the function across each row,” or 2 for “apply the function down each column” (or bigger numbers for higher-dimensional arrays):

apply(magic4, 1, sum)      #same as rowSums
## [1] 34 34 34 34
apply(magic4, 1, toString)
## [1] "16, 2, 3, 13" "5, 11, 10, 8" "9, 7, 6, 12"  "4, 14, 15, 1"
apply(magic4, 2, toString)
## [1] "16, 5, 9, 4"  "2, 11, 7, 14" "3, 10, 6, 15" "13, 8, 12, 1"
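
Two details of apply are easy to miss, so here is a short sketch using made-up data rather than the magic square. When the function returns more than one value per row, each row’s result becomes a column of the output; and a dimension number of 3 loops over the slices of a three-dimensional array.

```r
# A small matrix with made-up values: 2 rows, 3 columns
m <- matrix(c(1, 4, 2, 5, 3, 6), nrow = 2)
m

# range returns two values per row, so the result is a matrix.
# Each row's result is stored as a column of the output.
row_ranges <- apply(m, 1, range)
row_ranges

# A 2-by-3-by-2 array: dimension 3 indexes the two slices
a <- array(1:12, dim = c(2, 3, 2))
slice_totals <- apply(a, 3, sum)
slice_totals
```

If you want one row of output per row of input, wrapping the first call in t transposes the result.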

apply can also be used on data frames, though the mixed-data-type nature means that this is less common (for example, you can’t sensibly calculate a sum or a product when there are character columns):

(baldwins <- data.frame(
  name             = c("Alec", "Daniel", "Billy", "Stephen"),
  date_of_birth    = c(
    "1958-Apr-03", "1960-Oct-05", "1963-Feb-21", "1966-May-12"
  ),
  n_spouses        = c(2, 3, 1, 1),
  n_children       = c(1, 5, 3, 2),
  stringsAsFactors = FALSE
))
##      name date_of_birth n_spouses n_children
## 1    Alec   1958-Apr-03         2          1
## 2  Daniel   1960-Oct-05         3          5
## 3   Billy   1963-Feb-21         1          3
## 4 Stephen   1966-May-12         1          2
apply(baldwins, 1, toString)
## [1] "Alec, 1958-Apr-03, 2, 1"    "Daniel, 1960-Oct-05, 3, 5"
## [3] "Billy, 1963-Feb-21, 1, 3"   "Stephen, 1966-May-12, 1, 2"
apply(baldwins, 2, toString)
##                                                 name
##                       "Alec, Daniel, Billy, Stephen"
##                                        date_of_birth
## "1958-Apr-03, 1960-Oct-05, 1963-Feb-21, 1966-May-12"
##                                            n_spouses
##                                         "2, 3, 1, 1"
##                                           n_children
##                                         "1, 5, 3, 2"

When applied to a data frame by column, apply behaves identically to sapply (remember that data frames can be thought of as nonnested lists where the elements are of the same length):

sapply(baldwins, toString)
##                                                 name
##                       "Alec, Daniel, Billy, Stephen"
##                                        date_of_birth
## "1958-Apr-03, 1960-Oct-05, 1963-Feb-21, 1966-May-12"
##                                            n_spouses
##                                         "2, 3, 1, 1"
##                                           n_children
##                                         "1, 5, 3, 2"

Of course, simply printing a dataset in different forms isn’t that interesting. Using sapply combined with range, on the other hand, is a great way to quickly determine the extent of your data:

sapply(baldwins, range)
##      name      date_of_birth n_spouses n_children
## [1,] "Alec"    "1958-Apr-03" "1"       "1"
## [2,] "Stephen" "1966-May-12" "3"       "5"
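
Notice that the numbers in that output are quoted: a matrix can hold only one type, so the numeric ranges were converted to strings. One way around this, sketched here with a cut-down copy of the data frame so that the example stands alone, is to pick out the numeric columns first. The variable names are just for illustration.

```r
# A cut-down, self-contained copy of the baldwins data frame
baldwins <- data.frame(
  name             = c("Alec", "Daniel", "Billy", "Stephen"),
  n_spouses        = c(2, 3, 1, 1),
  n_children       = c(1, 5, 3, 2),
  stringsAsFactors = FALSE
)

# Find the numeric columns, then take the range of only those
is_numeric_column <- vapply(baldwins, is.numeric, logical(1))
numeric_ranges <- sapply(baldwins[is_numeric_column], range)
numeric_ranges
```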

Multiple-Input Apply

One of the drawbacks of lapply is that it only accepts a single vector to loop over. Another is that inside the function that is called on each element, you don’t have access to the name of that element.

The function mapply, short for “multiple argument list apply,” lets you pass in as many vectors as you like, solving the first problem. A common usage is to pass in a list in one argument and the names of that list in another, solving the second problem. One little annoyance is that in order to accommodate an arbitrary number of vector arguments, the order of the arguments has been changed. For mapply, the function is passed as the first argument:

msg <- function(name, factors)
{
  ifelse(
    length(factors) == 1,
    paste(name, "is prime"),
    paste(name, "has factors", toString(factors))
  )
}
mapply(msg, names(prime_factors), prime_factors)
##                         two                       three
##              "two is prime"            "three is prime"
##                        four                        five
##     "four has factors 2, 2"             "five is prime"
##                         six                       seven
##      "six has factors 2, 3"            "seven is prime"
##                       eight                        nine
## "eight has factors 2, 2, 2"     "nine has factors 3, 3"
##                         ten
##      "ten has factors 2, 5"

By default, mapply behaves in the same way as sapply, simplifying the output if it thinks it can. You can turn this behavior off (so it behaves more like lapply) by passing the argument SIMPLIFY = FALSE.
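
The difference is easiest to see side by side. This sketch loops over two made-up numeric vectors in parallel, first with the default simplification and then with it turned off:

```r
# Two made-up vectors of equal length, looped over in parallel
bases     <- c(2, 3, 10)
exponents <- c(3, 2, 1)

# Default behavior: results are simplified to a vector, like sapply
simplified <- mapply(function(b, e) b ^ e, bases, exponents)
simplified

# SIMPLIFY = FALSE always returns a list, like lapply
as_list <- mapply(
  function(b, e) b ^ e,
  bases,
  exponents,
  SIMPLIFY = FALSE
)
as_list
```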

Instant Vectorization

The function Vectorize is a wrapper to mapply that takes a function that usually accepts a scalar input, and returns a new function that accepts vectors. This next function is not vectorized because of its use of switch, which requires a scalar input:

baby_gender_report <- function(gender)
{
  switch(
    gender,
    male   = "It's a boy!",
    female = "It's a girl!",
    "Um..."
  )
}

If we pass a vector into the function, it will throw an error:

genders <- c("male", "female", "other")
baby_gender_report(genders)
## Error: EXPR must be a length 1 vector

While it is theoretically possible to rewrite the function completely so that it is inherently vectorized, it is easier to use the Vectorize function:

vectorized_baby_gender_report <- Vectorize(baby_gender_report)
vectorized_baby_gender_report(genders)
##           male         female          other
##  "It's a boy!" "It's a girl!"        "Um..."
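
For comparison, here is one way that a hand-vectorized rewrite might look. This is only a sketch: the name baby_gender_report2 and the lookup-vector approach are assumptions for illustration, and other designs would work just as well.

```r
# Hypothetical rewrite of baby_gender_report that accepts vectors directly
baby_gender_report2 <- function(gender)
{
  messages <- c(male = "It's a boy!", female = "It's a girl!")
  # Indexing by name handles the whole vector in one step;
  # names that are not in the lookup give NA
  report <- messages[gender]
  report[is.na(report)] <- "Um..."
  names(report) <- gender
  report
}

genders <- c("male", "female", "other")
baby_gender_report2(genders)
```

A rewrite like this avoids the hidden loop inside mapply, but it has to be redesigned for each function, whereas Vectorize is a one-line change.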

Split-Apply-Combine

A really common problem when investigating data is how to calculate some statistic on a variable that has been split into groups. Here are some scores on the classic road safety awareness computer game, Frogger:

(frogger_scores <- data.frame(
  player = rep(c("Tom", "Dick", "Harry"), times = c(2, 5, 3)),
  score  = round(rlnorm(10, 8), -1)
))
##    player score
## 1     Tom  2250
## 2     Tom  1510
## 3    Dick  1700
## 4    Dick   410
## 5    Dick  3720
## 6    Dick  1510
## 7    Dick  4500
## 8   Harry  2160
## 9   Harry  5070
## 10  Harry  2930

If we want to calculate the mean score for each player, then there are three steps. First, we split the dataset by player:

(scores_by_player <- with(
  frogger_scores,
  split(score, player)
))
## $Dick
## [1] 1700  410 3720 1510 4500
##
## $Harry
## [1] 2160 5070 2930
##
## $Tom
## [1] 2250 1510

Next we apply the (mean) function to each element:

(list_of_means_by_player <- lapply(scores_by_player, mean))
## $Dick
## [1] 2368
##
## $Harry
## [1] 3387
##
## $Tom
## [1] 1880

Finally, we combine the result into a single vector:

(mean_by_player <- unlist(list_of_means_by_player))
##  Dick Harry   Tom
##  2368  3387  1880

The last two steps can be condensed into one by using vapply or sapply, but split-apply-combine is such a common task that we need something easier. That something is the tapply function, which performs all three steps in one go:

with(frogger_scores, tapply(score, player, mean))
##  Dick Harry   Tom
##  2368  3387  1880

There are a few other wrapper functions to tapply, namely by and aggregate. They perform the same function with a slightly different interface.
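
To give a feel for those interfaces, here is a brief sketch. The scores are the ones printed earlier, typed in by hand so that the results are reproducible; the variable names mean_scores and max_scores are just for illustration.

```r
# The Frogger scores shown earlier, entered by hand for reproducibility
frogger_scores <- data.frame(
  player = rep(c("Tom", "Dick", "Harry"), times = c(2, 5, 3)),
  score  = c(2250, 1510, 1700, 410, 3720, 1510, 4500, 2160, 5070, 2930)
)

# aggregate takes a formula (score split by player) and returns a data frame
mean_scores <- aggregate(score ~ player, data = frogger_scores, FUN = mean)
mean_scores

# by takes the data frame, the grouping variable, and a function that
# receives each group's rows as a data frame
max_scores <- by(
  frogger_scores,
  frogger_scores$player,
  function(d) max(d$score)
)
max_scores
```

aggregate is handy when you want the answer as a data frame, while by is convenient when the function needs several columns of each group at once.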

Tip

SQL fans may note that split-apply-combine is the same as a GROUP BY operation.

The plyr Package

The *apply family of functions are mostly wonderful, but they have three drawbacks that stop them being as easy to use as they could be. Firstly, the names are a bit obscure. The “l” in lapply for lists makes sense, but after using R for nine years, I still don’t know what the “t” in tapply stands for.

Secondly, the arguments aren’t entirely consistent. Most of the functions take a data object first and a function argument second, but mapply swaps the order, and tapply takes the function for its third argument. The data argument is sometimes X and sometimes object, and the simplification argument is sometimes simplify and sometimes SIMPLIFY.

Thirdly, the form of the output isn’t as controllable as it could be. Getting your results as a data frame—or discarding the result—takes a little bit of effort.

This is where the plyr package comes in handy. The package contains a set of functions named **ply, where the blanks (asterisks) denote the form of the input and output, respectively. So, llply takes a list input, applies a function to each element, and returns a list, making it a drop-in replacement for lapply:

library(plyr)
llply(prime_factors, unique)
## $two
## [1] 2
##
## $three
## [1] 3
##
## $four
## [1] 2
##
## $five
## [1] 5
##
## $six
## [1] 2 3
##
## $seven
## [1] 7
##
## $eight
## [1] 2
##
## $nine
## [1] 3
##
## $ten
## [1] 2 5

laply takes a list and returns an array, mimicking sapply. In the case of an empty input, it does the smart thing and returns an empty logical vector (unlike sapply, which returns an empty list):

laply(prime_factors, length)
## [1] 1 1 2 1 2 1 3 2 2
laply(list(), length)
## logical(0)

raply replaces replicate (not rapply!), but there are also rlply and rdply functions that let you return the result in list or data frame form, and an r_ply function that discards the result (useful for drawing plots):

raply(5, runif(1))  #array output
## [1] 0.009415 0.226514 0.133015 0.698586 0.112846
rlply(5, runif(1))  #list output
## [[1]]
## [1] 0.6646
##
## [[2]]
## [1] 0.2304
##
## [[3]]
## [1] 0.613
##
## [[4]]
## [1] 0.5532
##
## [[5]]
## [1] 0.3654
rdply(5, runif(1))  #data frame output
##   .n     V1
## 1  1 0.9068
## 2  2 0.0654
## 3  3 0.3788
## 4  4 0.5086
## 5  5 0.3502
r_ply(5, runif(1))  #discarded output
## NULL

Perhaps the most commonly used function in plyr is ddply, which takes data frames as inputs and outputs and can be used as a replacement for tapply. Its big strength is that it makes it easy to make calculations on several columns at once. Let’s add a level column to the Frogger dataset, denoting the level the player reached in the game:

frogger_scores$level <- floor(log(frogger_scores$score))

There are several different ways of calling ddply. All methods take a data frame, the name of the column(s) to split by, and the function to apply to each piece. The column is passed without quotes, but wrapped in a call to the . function.

For the function, you can either use colwise to tell ddply to call the function on every column (that you didn’t mention in the second argument), or use summarize and specify manipulations of specific columns:

ddply(
  frogger_scores,
  .(player),
  colwise(mean) #call mean on every column except player
)
##   player score level
## 1   Dick  2368 7.200
## 2  Harry  3387 7.333
## 3    Tom  1880 7.000
ddply(
  frogger_scores,
  .(player),
  summarize,
  mean_score = mean(score), #call mean on score
  max_level  = max(level)   #... and max on level
)
##   player mean_score max_level
## 1   Dick       2368         8
## 2  Harry       3387         8
## 3    Tom       1880         7

colwise is quicker to specify, but you have to do the same thing with each column, whereas summarize is more flexible but requires more typing.

There is no direct replacement for mapply, though the m*ply functions allow looping with multiple arguments. Likewise, there is no replacement for vapply or rapply.

Summary

  • The apply family of functions provide cleaner code for looping.
  • Split-apply-combine problems, where you manipulate data split into groups, are really common.
  • The plyr package is a syntactically cleaner replacement for many apply functions.

Test Your Knowledge: Quiz

Question 9-1
Name as many members of the apply family of functions as you can.
Question 9-2
What is the difference between lapply, vapply, and sapply?
Question 9-3
How might you loop over tree-like data?
Question 9-4
Given some height data, how might you calculate a mean height by gender?
Question 9-5
In the plyr package, what do the asterisks mean in a name like **ply?

Test Your Knowledge: Exercises

Exercise 9-1

Loop over the list of children in the celebrity Wayans family. How many children does each of the first generation of Wayanses have?

wayans <- list(
  "Dwayne Kim" = list(),
  "Keenen Ivory" = list(
     "Jolie Ivory Imani",
     "Nala",
     "Keenen Ivory Jr",
     "Bella",
     "Daphne Ivory"
  ),
  Damon = list(
    "Damon Jr",
    "Michael",
    "Cara Mia",
    "Kyla"
  ),
  Kim = list(),
  Shawn = list(
    "Laila",
    "Illia",
    "Marlon"
  ),
  Marlon = list(
    "Shawn Howell",
    "Arnai Zachary"
  ),
  Nadia = list(),
  Elvira = list(
    "Damien",
    "Chaunté"
  ),
  Diedre = list(
    "Craig",
    "Gregg",
    "Summer",
    "Justin",
    "Jamel"
  ),
  Vonnie = list()
)

[5]

Exercise 9-2

state.x77 is a dataset that is supplied with R. It contains information about the population, income, and other factors for each US state. You can see its values by typing its name, just as you would with datasets that you create yourself:

state.x77

  1. Inspect the dataset using the method that you learned in Chapter 3.
  2. Find the mean and standard deviation of each column.

    [10]

Exercise 9-3

Recall the time_for_commute function from earlier in the chapter. Calculate the 75th-percentile commute time by mode of transport:

commute_times <- replicate(1000, time_for_commute())
commute_data <- data.frame(
  time = commute_times,
  mode = names(commute_times)
)

[5]



[25] Lognormal distributions occasionally throw out very big numbers, thus approximating rush hour gridlock.

[26] Since the vectorization happens at the R level rather than by calling internal C code, you don’t get the performance benefits of the vectorization, only more readable code.

[27] Though the matrixStats package tries to do exactly that.
