R’s looping capability goes far beyond the three standard-issue loops seen in the last chapter. It gives you the ability to apply functions to each element of a vector, list, or array, so you can write pseudo-vectorized code where normal vectorization isn’t possible. Other loops let you calculate summary statistics on chunks of data.
After reading this chapter, you should:
Be able to use the plyr package
Cast your mind back to Chapter 4 and the rep function. rep repeats its input several times. Another related function, replicate, calls an expression several times. Mostly, they do exactly the same thing. The difference occurs when random number generation is involved. Pretend for a moment that the uniform random number generation function, runif, isn’t vectorized. rep will repeat the same random number several times, but replicate gives a different number each time (for historical reasons, the order of the arguments is annoyingly back to front):
rep(runif(1), 5)
## [1] 0.04573 0.04573 0.04573 0.04573 0.04573
replicate(5, runif(1))
## [1] 0.5839 0.3689 0.1601 0.9176 0.5388
replicate comes into its own in more complicated examples: its main use is in Monte Carlo analyses, where you repeat an analysis a known number of times, and each iteration is independent of the others.
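As a minimal sketch of that Monte Carlo pattern, replicate can estimate the expected value of the larger of two dice rolls. The dice game, the function name roll_two_dice, and the fixed seed are assumptions made purely for illustration:

```r
# Hypothetical example: one independent trial of the simulation
roll_two_dice <- function()
{
  max(sample(6, size = 2, replace = TRUE))
}

set.seed(42)  # assumption: fixed seed so the run is reproducible
rolls <- replicate(10000, roll_two_dice())
mean(rolls)   # should land close to the exact value, 161 / 36
```

Each call to roll_two_dice is independent of the others, which is exactly the situation where replicate is a natural fit.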
This next example estimates a person’s time to commute to work via different methods of transport. It’s a little bit complicated, but that’s on purpose because that’s when replicate is most useful.
The time_for_commute function uses sample to randomly pick a mode of transport (car, bus, train, or bike), then uses rnorm or rlnorm to find a normally or lognormally[25] distributed travel time (with parameters that depend upon the mode of transport):
time_for_commute <- function()
{
  #Choose a mode of transport for the day
  mode_of_transport <- sample(
    c("car", "bus", "train", "bike"),
    size = 1,
    prob = c(0.1, 0.2, 0.3, 0.4)
  )
  #Find the time to travel, depending upon mode of transport
  time <- switch(
    mode_of_transport,
    car   = rlnorm(1, log(30), 0.5),
    bus   = rlnorm(1, log(40), 0.5),
    train = rnorm(1, 30, 10),
    bike  = rnorm(1, 60, 5)
  )
  #Label the result with the mode of transport, then return it
  names(time) <- mode_of_transport
  time
}
The presence of a switch statement makes this function very hard to vectorize. That means that to find the distribution of commuting times, we need to repeatedly call time_for_commute to generate data for each day. replicate gives us instant vectorization:
replicate(5, time_for_commute())
##  bike   car train   bus  bike
## 66.22 35.98 27.30 39.40 53.81
By now, you should have noticed that an awful lot of R is vectorized. In fact, your default stance should be to write vectorized code. It’s often cleaner to read, and usually gives you performance benefits when compared to a loop. In some cases, though, trying to achieve vectorization means contorting your code in unnatural ways. In those cases, the apply family of functions can give you pretend vectorization,[26] without the pain.
The simplest and most commonly used family member is lapply, short for “list apply.” lapply takes a list and a function as inputs, applies the function to each element of the list in turn, and returns another list of results. Recall our prime factorization list from Chapter 5:
prime_factors <- list(
  two   = 2,
  three = 3,
  four  = c(2, 2),
  five  = 5,
  six   = c(2, 3),
  seven = 7,
  eight = c(2, 2, 2),
  nine  = c(3, 3),
  ten   = c(2, 5)
)
head(prime_factors)
## $two
## [1] 2
##
## $three
## [1] 3
##
## $four
## [1] 2 2
##
## $five
## [1] 5
##
## $six
## [1] 2 3
##
## $seven
## [1] 7
Trying to find the unique values in each list element is difficult to do in a vectorized way. We could write a for loop to examine each element, but that’s a little bit clunky:
unique_primes <- vector("list", length(prime_factors))
for(i in seq_along(prime_factors))
{
  unique_primes[[i]] <- unique(prime_factors[[i]])
}
names(unique_primes) <- names(prime_factors)
unique_primes
## $two
## [1] 2
##
## $three
## [1] 3
##
## $four
## [1] 2
##
## $five
## [1] 5
##
## $six
## [1] 2 3
##
## $seven
## [1] 7
##
## $eight
## [1] 2
##
## $nine
## [1] 3
##
## $ten
## [1] 2 5
lapply makes this so much easier, eliminating the nasty boilerplate code for worrying about lengths and names:
lapply(prime_factors, unique)
## $two
## [1] 2
##
## $three
## [1] 3
##
## $four
## [1] 2
##
## $five
## [1] 5
##
## $six
## [1] 2 3
##
## $seven
## [1] 7
##
## $eight
## [1] 2
##
## $nine
## [1] 3
##
## $ten
## [1] 2 5
When the return value from the function is the same size each time, and you know what that size is, you can use a variant of lapply called vapply. vapply stands for “list apply that returns a vector.” As before, you pass it a list and a function, but vapply takes a third argument that is a template for the return values. Rather than returning a list, it simplifies the result to be a vector or an array:
vapply(prime_factors, length, numeric(1))
##   two three  four  five   six seven eight  nine   ten
##     1     1     2     1     2     1     3     2     2
If the output does not fit the template, then vapply will throw an error. This makes it less flexible than lapply, since the output must be the same size for each element and must be known in advance.
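The template check can be seen in a small sketch. The list factors below is a cut-down stand-in for prime_factors, redefined here so that the example runs on its own, and the message returned by the error handler is made up for illustration:

```r
# A small stand-in for the chapter's prime_factors list
factors <- list(four = c(2, 2), six = c(2, 3))

# unique() returns one value for "four" but two for "six",
# so the numeric(1) template cannot be satisfied and vapply fails
result <- tryCatch(
  vapply(factors, unique, numeric(1)),
  error = function(e) "vapply failed: output did not fit the template"
)
result

# A length-2 template works when every result really has two values
ranges <- vapply(factors, range, numeric(2))
ranges
```

When the template has length greater than one, the result is a matrix with one column per input element.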
There is another function that lies in between lapply and vapply: namely sapply, which stands for “simplifying list apply.” Like the two other functions, sapply takes a list and a function as inputs. It does not need a template, but will try to simplify the result to an appropriate vector or array if it can:
sapply(prime_factors, unique) #returns a list
## $two
## [1] 2
##
## $three
## [1] 3
##
## $four
## [1] 2
##
## $five
## [1] 5
##
## $six
## [1] 2 3
##
## $seven
## [1] 7
##
## $eight
## [1] 2
##
## $nine
## [1] 3
##
## $ten
## [1] 2 5
sapply(prime_factors, length) #returns a vector
##   two three  four  five   six seven eight  nine   ten
##     1     1     2     1     2     1     3     2     2
sapply(prime_factors, summary) #returns an array
##         two three four five  six seven eight nine  ten
## Min.      2     3    2    5 2.00     7     2    3 2.00
## 1st Qu.   2     3    2    5 2.25     7     2    3 2.75
## Median    2     3    2    5 2.50     7     2    3 3.50
## Mean      2     3    2    5 2.50     7     2    3 3.50
## 3rd Qu.   2     3    2    5 2.75     7     2    3 4.25
## Max.      2     3    2    5 3.00     7     2    3 5.00
For interactive use, this is wonderful because you usually automatically get the result in the form that you want. This function does require some care if you aren’t sure about what your inputs might be, though, since the result is sometimes a list and sometimes a vector. This can trip you up in some subtle ways. Our previous length example returned a vector, but look what happens when you pass it an empty list:
sapply(list(), length)
## list()
If the input list has length zero, then sapply always returns a list, regardless of the function that is passed. So if your data could be empty, and you know the return value, it is safer to use vapply:
vapply(list(), length, numeric(1))
## numeric(0)
Although these functions are primarily designed for use with lists, they can also accept vector inputs. In this case, the function is applied to each element of the vector in turn. The source function is used to read and evaluate the contents of an R file. (That is, you can use it to run an R script.) Unfortunately it isn’t vectorized, so if we want to run all the R scripts in a directory, we need to pass the filenames to lapply.
In this next example, dir returns the names of files in a given directory, defaulting to the current working directory. (Recall that you can find this with getwd.) The argument pattern = "\\.R$" means “only return filenames that end with .R”:
r_files <- dir(pattern = "\\.R$")
lapply(r_files, source)
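The same idea can be sketched for a directory other than the working directory. The temporary directory, the two tiny scripts, and the variable names demo_a and demo_b are all assumptions invented for this example:

```r
# Assumption: a throwaway directory holding two tiny scripts
script_dir <- file.path(tempdir(), "scripts_demo")
dir.create(script_dir, showWarnings = FALSE)
writeLines("demo_a <- 1", file.path(script_dir, "first.R"))
writeLines("demo_b <- 2", file.path(script_dir, "second.R"))

# full.names = TRUE returns paths that source can open
# regardless of the current working directory
r_files <- dir(script_dir, pattern = "\\.R$", full.names = TRUE)

# invisible() hides the list of return values from source
invisible(lapply(r_files, source))
```

By default, source evaluates each script in the global environment, so demo_a and demo_b are available there afterward.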
You may have noticed that in all of our examples, the functions passed to lapply, vapply, and sapply have taken just one argument. There is a limitation in these functions in that you can only pass one vectorized argument (more on how to circumvent that later), but you can pass other scalar arguments to the function. To do this, just pass in named arguments to the lapply (or sapply or vapply) call, and they will be passed to the inner function. For example, rep.int takes two arguments, but the times argument is allowed to be a single (scalar) number, so you’d type:
complemented <- c(2, 3, 6, 18) #See http://oeis.org/A000614
lapply(complemented, rep.int, times = 4)
## [[1]]
## [1] 2 2 2 2
##
## [[2]]
## [1] 3 3 3 3
##
## [[3]]
## [1] 6 6 6 6
##
## [[4]]
## [1] 18 18 18 18
What if the vector argument isn’t the first one? In that case, we have to create our own function to wrap the function that we really wanted to call. You can do this on a separate line, but it is common to include the function definition within the call to lapply:
rep4x <- function(x) rep.int(4, times = x)
lapply(complemented, rep4x)
## [[1]]
## [1] 4 4
##
## [[2]]
## [1] 4 4 4
##
## [[3]]
## [1] 4 4 4 4 4 4
##
## [[4]]
## [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
This last code chunk can be made a little simpler by passing an anonymous function to lapply. This is the trick we saw in Chapter 5, where we don’t bother with a separate assignment line and just pass the function to lapply without giving it a name:
lapply(complemented, function(x) rep.int(4, times = x))
## [[1]]
## [1] 4 4
##
## [[2]]
## [1] 4 4 4
##
## [[3]]
## [1] 4 4 4 4 4 4
##
## [[4]]
## [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
Very, very occasionally, you may want to loop over every variable in an environment, rather than in a list. There is a dedicated function, eapply, for this, though in recent versions of R you can also use lapply:
env <- new.env()
env$molien <- c(1, 0, 1, 0, 1, 1, 2, 1, 3) #See http://oeis.org/A008584
env$larry <- c("Really", "leery", "rarely", "Larry")
eapply(env, length)
## $molien
## [1] 9
##
## $larry
## [1] 4
lapply(env, length) #same
## $molien
## [1] 9
##
## $larry
## [1] 4
rapply is a recursive version of lapply that allows you to loop over nested lists. This is a niche requirement, and code is often simpler if you flatten the data first using unlist.
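A short sketch comparing the two approaches follows. The nested list is hypothetical, and multiplying by ten is just a placeholder for whatever function you need to apply:

```r
# Hypothetical nested list, for illustration only
nested <- list(a = 1:3, b = list(c = 4:5, d = list(e = 6)))

# rapply visits every leaf, however deeply it is nested
via_rapply <- rapply(nested, function(x) x * 10, how = "unlist")
via_rapply

# Flattening first gives the same values with plainer code
via_unlist <- unlist(nested) * 10
via_unlist
```

The flattened version only works because the function here is itself vectorized; rapply remains useful when you need to keep the nested shape (how = "list") or treat each leaf separately.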
lapply, and its friends vapply and sapply, can be used on matrices and arrays, but their behavior often isn’t what we want. The three functions treat the matrices and arrays as though they were vectors, applying the target function to each element one at a time (moving down columns). More commonly, when we want to apply a function to an array, we want to apply it by row or by column. This next example uses the matlab package, which gives some functionality ported from the rival language.
To run the next example, you first need to install the matlab package:
install.packages("matlab")
library(matlab)
## Attaching package: 'matlab'
## The following object is masked from 'package:stats':
##
##     reshape
## The following object is masked from 'package:utils':
##
##     find, fix
## The following object is masked from 'package:base':
##
##     sum
When you load the matlab package, it overrides some functions in the base, stats, and utils packages to make them behave like their MATLAB counterparts. After these examples that use the matlab package, you may wish to restore the usual behavior by unloading the package. Call detach("package:matlab") to do this.
The magic function creates a magic square: an n-by-n square matrix of the numbers from 1 to n^2, where each row and each column has the same total:
(magic4 <- magic(4))
##      [,1] [,2] [,3] [,4]
## [1,]   16    2    3   13
## [2,]    5   11   10    8
## [3,]    9    7    6   12
## [4,]    4   14   15    1
A classic problem requiring us to apply a function by row is calculating the row totals. This can be achieved using the rowSums function that we saw briefly in Chapter 5:
rowSums(magic4)
## [1] 34 34 34 34
But what if we want to calculate a different statistic for each row? It would be cumbersome to try to provide a function for every such possibility.[27] The apply function provides the row/column-wise equivalent of lapply, taking a matrix, a dimension number, and a function as arguments. The dimension number is 1 for “apply the function across each row,” or 2 for “apply the function down each column” (or bigger numbers for higher-dimensional arrays):
apply(magic4, 1, sum) #same as rowSums
## [1] 34 34 34 34
apply(magic4, 1, toString)
## [1] "16, 2, 3, 13" "5, 11, 10, 8" "9, 7, 6, 12"  "4, 14, 15, 1"
apply(magic4, 2, toString)
## [1] "16, 5, 9, 4"  "2, 11, 7, 14" "3, 10, 6, 15" "13, 8, 12, 1"
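Two further cases are worth a quick sketch: a function that returns more than one value per row, and a dimension number bigger than 2. The small matrix m and array arr are made up for illustration and do not need the matlab package:

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns

# range returns two values per row, so the result is a matrix
# with one column for each row of the input
row_ranges <- apply(m, 1, range)
dim(row_ranges)              # 2 2

# A 3-D array: dimension 3 applies the function to each 2-by-3 slice
arr <- array(1:12, dim = c(2, 3, 2))
slice_sums <- apply(arr, 3, sum)
slice_sums                   # 21 57
```

Note that when applying by row, the results come back as columns, so you may need t() to return to the original orientation.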
apply can also be used on data frames, though the mixed-data-type nature means that this is less common (for example, you can’t sensibly calculate a sum or a product when there are character columns):
(baldwins <- data.frame(
  name             = c("Alec", "Daniel", "Billy", "Stephen"),
  date_of_birth    = c(
    "1958-Apr-03", "1960-Oct-05", "1963-Feb-21", "1966-May-12"
  ),
  n_spouses        = c(2, 3, 1, 1),
  n_children       = c(1, 5, 3, 2),
  stringsAsFactors = FALSE
))
##      name date_of_birth n_spouses n_children
## 1    Alec   1958-Apr-03         2          1
## 2  Daniel   1960-Oct-05         3          5
## 3   Billy   1963-Feb-21         1          3
## 4 Stephen   1966-May-12         1          2
apply(baldwins, 1, toString)
## [1] "Alec, 1958-Apr-03, 2, 1"    "Daniel, 1960-Oct-05, 3, 5"
## [3] "Billy, 1963-Feb-21, 1, 3"   "Stephen, 1966-May-12, 1, 2"
apply(baldwins, 2, toString)
##                                                 name
##                       "Alec, Daniel, Billy, Stephen"
##                                        date_of_birth
## "1958-Apr-03, 1960-Oct-05, 1963-Feb-21, 1966-May-12"
##                                            n_spouses
##                                         "2, 3, 1, 1"
##                                           n_children
##                                         "1, 5, 3, 2"
When applied to a data frame by column, apply behaves identically to sapply (remember that data frames can be thought of as nonnested lists where the elements are of the same length):
sapply(baldwins, toString)
##                                                 name
##                       "Alec, Daniel, Billy, Stephen"
##                                        date_of_birth
## "1958-Apr-03, 1960-Oct-05, 1963-Feb-21, 1966-May-12"
##                                            n_spouses
##                                         "2, 3, 1, 1"
##                                           n_children
##                                         "1, 5, 3, 2"
Of course, simply printing a dataset in different forms isn’t that interesting. Using sapply combined with range, on the other hand, is a great way to quickly determine the extent of your data:
sapply(baldwins, range)
##      name      date_of_birth n_spouses n_children
## [1,] "Alec"    "1958-Apr-03" "1"       "1"
## [2,] "Stephen" "1966-May-12" "3"       "5"
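Because the character columns force the whole result to be character, the numeric ranges above come back as strings. One possible workaround, sketched here with a cut-down stand-in for baldwins (the data frame people and the helper vector is_num are names invented for this example), is to select the numeric columns first:

```r
# Self-contained stand-in for the baldwins data frame
people <- data.frame(
  name             = c("Alec", "Daniel", "Billy", "Stephen"),
  n_spouses        = c(2, 3, 1, 1),
  n_children       = c(1, 5, 3, 2),
  stringsAsFactors = FALSE
)

# Pick out the numeric columns, then take the range of each
is_num <- vapply(people, is.numeric, logical(1))
numeric_ranges <- sapply(people[is_num], range)
numeric_ranges
```

The result is now a numeric matrix, so it can be used in further calculations without conversion.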
One of the drawbacks of lapply is that it only accepts a single vector to loop over. Another is that inside the function that is called on each element, you don’t have access to the name of that element.
The function mapply, short for “multiple argument list apply,” lets you pass in as many vectors as you like, solving the first problem. A common usage is to pass in a list in one argument and the names of that list in another, solving the second problem. One little annoyance is that in order to accommodate an arbitrary number of vector arguments, the order of the arguments has been changed. For mapply, the function is passed as the first argument:
msg <- function(name, factors)
{
  ifelse(
    length(factors) == 1,
    paste(name, "is prime"),
    paste(name, "has factors", toString(factors))
  )
}
mapply(msg, names(prime_factors), prime_factors)
##                         two                       three
##              "two is prime"            "three is prime"
##                        four                        five
##     "four has factors 2, 2"             "five is prime"
##                         six                       seven
##      "six has factors 2, 3"            "seven is prime"
##                       eight                        nine
## "eight has factors 2, 2, 2"     "nine has factors 3, 3"
##                         ten
##      "ten has factors 2, 5"
By default, mapply behaves in the same way as sapply, simplifying the output if it thinks it can. You can turn this behavior off (so it behaves more like lapply) by passing the argument SIMPLIFY = FALSE.
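The effect of SIMPLIFY can be sketched with a toy function that glues its two inputs together. The inputs are arbitrary, and the use of the MoreArgs argument in the last call goes slightly beyond what the text above describes:

```r
# With equal-length results, mapply simplifies to a matrix
simplified <- mapply(function(x, y) c(x, y), 1:3, 4:6)
simplified

# SIMPLIFY = FALSE always returns a list, like lapply
unsimplified <- mapply(function(x, y) c(x, y), 1:3, 4:6, SIMPLIFY = FALSE)
unsimplified

# MoreArgs holds arguments that stay fixed across all the calls
repeated <- mapply(rep, times = 1:3, MoreArgs = list(x = "a"))
repeated
```

In the last call the results have different lengths, so mapply cannot simplify them and returns a list even without SIMPLIFY = FALSE.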
The function Vectorize is a wrapper to mapply that takes a function that usually accepts a scalar input, and returns a new function that accepts vectors. This next function is not vectorized because of its use of switch, which requires a scalar input:
baby_gender_report <- function(gender)
{
  switch(
    gender,
    male   = "It's a boy!",
    female = "It's a girl!",
    "Um..."
  )
}
If we pass a vector into the function, it will throw an error:
genders <- c("male", "female", "other")
baby_gender_report(genders)
While it is theoretically possible to do a complete rewrite of the function so that it is inherently vectorized, it is easier to use the Vectorize function:
vectorized_baby_gender_report <- Vectorize(baby_gender_report)
vectorized_baby_gender_report(genders)
##           male         female          other
##  "It's a boy!" "It's a girl!"        "Um..."
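For comparison, here is a sketch of the same result obtained by calling vapply directly rather than through Vectorize. The function is redefined so that the example is self-contained, and choosing vapply over Vectorize here is only one option, not a recommendation from the text:

```r
# Redefined here so that the sketch runs on its own
baby_gender_report <- function(gender)
{
  switch(
    gender,
    male   = "It's a boy!",
    female = "It's a girl!",
    "Um..."
  )
}

genders <- c("male", "female", "other")

# vapply calls the scalar function once per element and
# guarantees a character vector with one string per input
reports <- vapply(genders, baby_gender_report, character(1))
reports
```

Vectorize is more convenient when you want a reusable vectorized function; the vapply form makes the expected return type explicit.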
A really common problem when investigating data is how to calculate some statistic on a variable that has been split into groups. Here are some scores on the classic road safety awareness computer game, Frogger:
(frogger_scores <- data.frame(
  player = rep(c("Tom", "Dick", "Harry"), times = c(2, 5, 3)),
  score  = round(rlnorm(10, 8), -1)
))
##    player score
## 1     Tom  2250
## 2     Tom  1510
## 3    Dick  1700
## 4    Dick   410
## 5    Dick  3720
## 6    Dick  1510
## 7    Dick  4500
## 8   Harry  2160
## 9   Harry  5070
## 10  Harry  2930
If we want to calculate the mean score for each player, then there are three steps. First, we split the dataset by player:
(scores_by_player <- with(
  frogger_scores,
  split(score, player)
))
## $Dick
## [1] 1700  410 3720 1510 4500
##
## $Harry
## [1] 2160 5070 2930
##
## $Tom
## [1] 2250 1510
Next, we apply a function (mean) to each element:
(list_of_means_by_player <- lapply(scores_by_player, mean))
## $Dick
## [1] 2368
##
## $Harry
## [1] 3387
##
## $Tom
## [1] 1880
Finally, we combine the result into a single vector:
(mean_by_player <- unlist(list_of_means_by_player))
##  Dick Harry   Tom
##  2368  3387  1880
The last two steps can be condensed into one by using vapply or sapply, but split-apply-combine is such a common task that we need something easier. That something is the tapply function, which performs all three steps in one go:
with(frogger_scores, tapply(score, player, mean))
##  Dick Harry   Tom
##  2368  3387  1880
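The condensed two-step version mentioned above can be sketched as follows. The scores are fixed, made-up values (rather than random draws) so that the result is reproducible:

```r
# Hypothetical fixed scores, standing in for frogger_scores
scores <- data.frame(
  player = c("Tom", "Tom", "Dick", "Dick", "Harry"),
  score  = c(100, 200, 300, 500, 700)
)

# Split, then apply and combine in a single vapply call
by_player <- with(scores, split(score, player))
two_step  <- vapply(by_player, mean, numeric(1))

# tapply does all three steps at once
one_step <- with(scores, tapply(score, player, mean))

two_step
one_step
```

Both give the same means per player; tapply returns a one-dimensional array rather than a plain vector, which rarely matters in practice.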
There are a few other wrapper functions to tapply, namely by and aggregate. They perform the same function with a slightly different interface.
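A brief sketch of those two interfaces, again using fixed, made-up scores so the example is reproducible:

```r
# Hypothetical fixed scores, standing in for frogger_scores
scores <- data.frame(
  player = c("Tom", "Tom", "Dick", "Dick", "Harry"),
  score  = c(100, 200, 300, 500, 700)
)

# aggregate takes a formula and returns a data frame
mean_scores <- aggregate(score ~ player, data = scores, FUN = mean)
mean_scores

# by passes each group's rows (as a data frame) to the function
means_by <- by(scores, scores$player, function(d) mean(d$score))
means_by
```

aggregate is handy when you want the result as a data frame; by is useful when the function needs several columns from each group at once.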
The *apply family of functions are mostly wonderful, but they have three drawbacks that stop them being as easy to use as they could be. Firstly, the names are a bit obscure. The “l” in lapply for lists makes sense, but after using R for nine years, I still don’t know what the “t” in tapply stands for.
Secondly, the arguments aren’t entirely consistent. Most of the functions take a data object first and a function argument second, but mapply swaps the order, and tapply takes the function for its third argument. The data argument is sometimes X and sometimes object, and the simplification argument is sometimes simplify and sometimes SIMPLIFY.
Thirdly, the form of the output isn’t as controllable as it could be. Getting your results as a data frame—or discarding the result—takes a little bit of effort.
This is where the plyr package comes in handy. The package contains a set of functions named **ply, where the blanks (asterisks) denote the form of the input and output, respectively. So, llply takes a list input, applies a function to each element, and returns a list, making it a drop-in replacement for lapply:
library(plyr)
llply(prime_factors, unique)
## $two
## [1] 2
##
## $three
## [1] 3
##
## $four
## [1] 2
##
## $five
## [1] 5
##
## $six
## [1] 2 3
##
## $seven
## [1] 7
##
## $eight
## [1] 2
##
## $nine
## [1] 3
##
## $ten
## [1] 2 5
laply takes a list and returns an array, mimicking sapply. In the case of an empty input, it does the smart thing and returns an empty logical vector (unlike sapply, which returns an empty list):
laply(prime_factors, length)
## [1] 1 1 2 1 2 1 3 2 2
laply(list(), length)
## logical(0)
raply replaces replicate (not rapply!), but there are also rlply and rdply functions that let you return the result in list or data frame form, and an r_ply function that discards the result (useful for drawing plots):
raply(5, runif(1)) #array output
## [1] 0.009415 0.226514 0.133015 0.698586 0.112846
rlply(5, runif(1)) #list output
## [[1]]
## [1] 0.6646
##
## [[2]]
## [1] 0.2304
##
## [[3]]
## [1] 0.613
##
## [[4]]
## [1] 0.5532
##
## [[5]]
## [1] 0.3654
rdply(5, runif(1)) #data frame output
##   .n     V1
## 1  1 0.9068
## 2  2 0.0654
## 3  3 0.3788
## 4  4 0.5086
## 5  5 0.3502
r_ply(5, runif(1)) #discarded output
## NULL
Perhaps the most commonly used function in plyr is ddply, which takes data frames as inputs and outputs and can be used as a replacement for tapply. Its big strength is that it makes it easy to make calculations on several columns at once. Let’s add a level column to the Frogger dataset, denoting the level the player reached in the game:
frogger_scores$level <- floor(log(frogger_scores$score))
There are several different ways of calling ddply. All methods take a data frame, the name of the column(s) to split by, and the function to apply to each piece. The column is passed without quotes, but wrapped in a call to the . function.
For the function, you can either use colwise to tell ddply to call the function on every column (that you didn’t mention in the second argument), or use summarize and specify manipulations of specific columns:
ddply(
  frogger_scores,
  .(player),
  colwise(mean) #call mean on every column except player
)
##   player score level
## 1   Dick  2368 7.200
## 2  Harry  3387 7.333
## 3    Tom  1880 7.000
ddply(
  frogger_scores,
  .(player),
  summarize,
  mean_score = mean(score), #call mean on score
  max_level  = max(level)   #... and max on level
)
##   player mean_score max_level
## 1   Dick       2368         8
## 2  Harry       3387         8
## 3    Tom       1880         7
colwise is quicker to specify, but you have to do the same thing with each column, whereas summarize is more flexible but requires more typing.
There is no direct replacement for mapply, though the m*ply functions allow looping with multiple arguments. Likewise, there is no replacement for vapply or rapply.
Name as many members of the apply family of functions as you can.
What is the difference between lapply, vapply, and sapply?
In the plyr package, what do the asterisks mean in a name like **ply?
Loop over the list of children in the celebrity Wayans family. How many children does each of the first generation of Wayanses have?
wayans <- list(
  "Dwayne Kim" = list(),
  "Keenen Ivory" = list(
    "Jolie Ivory Imani",
    "Nala",
    "Keenen Ivory Jr",
    "Bella",
    "Daphne Ivory"
  ),
  Damon = list(
    "Damon Jr",
    "Michael",
    "Cara Mia",
    "Kyla"
  ),
  Kim = list(),
  Shawn = list(
    "Laila",
    "Illia",
    "Marlon"
  ),
  Marlon = list(
    "Shawn Howell",
    "Arnai Zachary"
  ),
  Nadia = list(),
  Elvira = list(
    "Damien",
    "Chaunté"
  ),
  Diedre = list(
    "Craig",
    "Gregg",
    "Summer",
    "Justin",
    "Jamel"
  ),
  Vonnie = list()
)
[5]
state.x77 is a dataset that is supplied with R. It contains information about the population, income, and other factors for each US state. You can see its values by typing its name, just as you would with datasets that you create yourself:
state.x77
Find the mean and standard deviation of each column.
[10]
Recall the time_for_commute function from earlier in the chapter. Calculate the 75th-percentile commute time by mode of transport:
commute_times <- replicate(1000, time_for_commute())
commute_data <- data.frame(
  time = commute_times,
  mode = names(commute_times)
)
[5]
[25] Lognormal distributions occasionally throw out very big numbers, thus approximating rush hour gridlock.
[26] Since the vectorization happens at the R level rather than by calling internal C code, you don’t get the performance benefits of the vectorization, only more readable code.
[27] Though the matrixStats package tries to do exactly that.