Previously, we talked about using a for loop to repeatedly evaluate an expression with an iterator over a vector or list. In practice, however, the for loop is almost the last choice, because when the iterations are independent of one another, an alternative is much cleaner and easier to write and read.
For example, the following code uses for to create a list of three independent, normally distributed random vectors whose lengths are specified by the vector len:
len <- c(3, 4, 5)
# first, create a list in the environment.
x <- list()
# then use `for` to generate the random vector for each length
for (i in 1:3) {
  x[[i]] <- rnorm(len[i])
}
x
## [[1]]
## [1]  1.4572245  0.1434679 -0.4228897
##
## [[2]]
## [1] -1.4202269 -0.7162066 -1.6006179 -1.2985130
##
## [[3]]
## [1] -0.6318412  1.6784430  0.1155478  0.2905479 -0.7363817
The preceding example is simple, but the code is quite verbose compared to the implementation with lapply:
lapply(len, rnorm)
## [[1]]
## [1] -0.3258354 -1.4658116 -0.1461097
##
## [[2]]
## [1] -0.1715198  0.5215857 -0.3178271 -0.3967798
##
## [[3]]
## [1] -0.2047106 -1.2009772  1.4859955  0.1940920  0.3758798
The lapply version is much simpler. It applies rnorm() to each element in len and puts each result into a list.
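Because both versions call rnorm() once per length, in the same order, they draw the same values when started from the same seed. The following is a minimal sketch of that check; the seed value 123 and the names x_loop and x_apply are illustrative choices, not part of the original example.

```r
len <- c(3, 4, 5)

# for-loop version, starting from a fixed (arbitrary) seed
set.seed(123)
x_loop <- list()
for (i in seq_along(len)) {
  x_loop[[i]] <- rnorm(len[i])
}

# lapply version, restarted from the same seed
set.seed(123)
x_apply <- lapply(len, rnorm)

# compare the two lists element by element
identical(x_loop, x_apply)
```

The loop needs an empty list, an index, and an assignment per iteration, while the lapply call expresses the same computation in a single line.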
The preceding example works only because R allows us to pass functions around as ordinary objects. Functions in R are treated just like other objects and can be passed as arguments, just as we showed in the section on numeric methods. This feature greatly increases the flexibility of coding.
Each apply-family function is a so-called higher-order function that accepts a function as an argument. We will introduce this concept in detail later.
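To illustrate how a function can be treated as an ordinary object, the sketch below binds a function to a second name, passes it to another function, and stores several functions in a list. The helper apply_twice and the other names are hypothetical, introduced only for this illustration.

```r
# a function can be bound to another name like any other object
square <- function(x) x ^ 2
f <- square
f(3)

# a hypothetical higher-order function: it takes a function `fun`
# as an argument and calls it twice
apply_twice <- function(fun, x) fun(fun(x))
apply_twice(square, 3)

# functions can also be stored in a list and called one by one
funs <- list(sum = sum, mean = mean, max = max)
results <- lapply(funs, function(fun) fun(c(1, 2, 3, 4)))
```

Here results is a named list with one entry per function in funs, which is the same mechanism that lets lapply accept rnorm as an argument.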
The lapply() function, as we previously demonstrated, takes a vector and a function as its arguments. It simply applies the function to each element in the given vector and returns a list that contains all the results.
This function is useful when each iteration is independent of the others. In this case, we don't have to explicitly create an iterator.
It works not only with vectors but also with lists. Suppose we have a list of students:
students <- list(
  a1 = list(name = "James", age = 25,
    gender = "M", interest = c("reading", "writing")),
  a2 = list(name = "Jenny", age = 23,
    gender = "F", interest = c("cooking")),
  a3 = list(name = "David", age = 24,
    gender = "M", interest = c("running", "basketball")))
Now, we need to create a character vector in which each element is formatted as follows:
James, 25 year-old man, loves reading, writing.
Note that sprintf() is useful to format text by replacing the placeholders (for example, %s for string, %d for integer) with corresponding input arguments. Here is an example:
sprintf("Hello, %s! Your number is %d.", "Tom", 3)
## [1] "Hello, Tom! Your number is 3."
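As a small sketch of how the placeholders behave, the following calls use only sprintf() from base R; the input values are made up for illustration. Note that %d expects a whole number, so a value with a fractional part needs %f (or a similar format) instead.

```r
# sprintf is vectorized over its arguments
greetings <- sprintf("%s is %d years old", c("James", "Jenny"), c(25, 23))

# %.2f formats a number with two decimal places
rounded <- sprintf("%.2f", 3.14159)

# %d fails on a number with a fractional part; catch the error to show it
bad <- tryCatch(sprintf("%d", 2.5), error = function(e) "error")
```

Because sprintf() is vectorized, greetings holds one formatted string per pair of inputs.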
First, we can be sure that the iteration works over students and that each iteration is independent. In other words, the computation for James has nothing to do with that for Jenny, and so on. Therefore, we can use lapply to do the work:
lapply(students, function(s) {
  type <- switch(s$gender, "M" = "man", "F" = "woman")
  interest <- paste(s$interest, collapse = ", ")
  sprintf("%s, %d year-old %s, loves %s.", s$name, s$age, type, interest)
})
## $a1
## [1] "James, 25 year-old man, loves reading, writing."
##
## $a2
## [1] "Jenny, 23 year-old woman, loves cooking."
##
## $a3
## [1] "David, 24 year-old man, loves running, basketball."
The preceding code uses an anonymous function, that is, a function that is not assigned to a symbol. In other words, the function is only temporary and has no name. Of course, we can explicitly bind the function to a symbol, that is, give it a name, and use that name in lapply.
The code itself is quite straightforward. For each element s in students, the function decides the type of the student and pastes their interests together, separated by commas. It then puts the information into the format we want.
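The same work can be sketched with a named function. Here, format_student is a hypothetical name chosen for illustration, and the students list is shortened to two entries to keep the block self-contained. Since the stated goal was a character vector, the list returned by lapply is flattened with unlist() at the end.

```r
students <- list(
  a1 = list(name = "James", age = 25, gender = "M",
    interest = c("reading", "writing")),
  a2 = list(name = "Jenny", age = 23, gender = "F",
    interest = c("cooking")))

# bind the formatting function to a symbol (hypothetical name)
format_student <- function(s) {
  type <- switch(s$gender, "M" = "man", "F" = "woman")
  interest <- paste(s$interest, collapse = ", ")
  sprintf("%s, %d year-old %s, loves %s.", s$name, s$age, type, interest)
}

# pass the function by name, then flatten the list to a character vector
descriptions <- unlist(lapply(students, format_student))
```

Naming the function makes it reusable and testable on a single student, for example format_student(students$a1).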
Fortunately, much of how we use lapply also applies to the other apply-family functions, although their iterating mechanism or the type of their results may differ.
A list is not always the most convenient container for the results. Sometimes, we want them to be put in a simple vector or a matrix. The sapply function simplifies the result according to its structure.
Suppose we square each element of 1:10. If we do it with lapply, we will get a list of squared numbers. This result looks a bit heavy and redundant because the resulting list is actually a list of single-valued numeric vectors. Instead, we might want to keep the results as a vector:
sapply(1:10, function(i) i ^ 2)
##  [1]   1   4   9  16  25  36  49  64  81 100
If the applying function returns a multi-element vector each time, sapply will put the results into a matrix in which each returned vector occupies a column:
sapply(1:10, function(i) c(i, i ^ 2))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    2    3    4    5    6    7    8    9    10
## [2,]    1    4    9   16   25   36   49   64   81   100
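To make the difference in return types concrete, the sketch below runs the same squaring function through lapply and sapply and inspects the structure of each result. The variable names are illustrative.

```r
square <- function(i) i ^ 2

as_list <- lapply(1:10, square)    # a list of ten length-one vectors
as_vector <- sapply(1:10, square)  # simplified to a numeric vector

is.list(as_list)
is.numeric(as_vector)

# flattening the list by hand gives the values sapply returned
all.equal(unlist(as_list), as_vector)

# two values per iteration are simplified to a matrix
as_matrix <- sapply(1:10, function(i) c(i, i ^ 2))
dim(as_matrix)
```

Checking the type with is.list() or dim() is a quick way to see which shape sapply chose for a given input.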
Although sapply is very handy and smart, its smartness may sometimes become a risk. Suppose we have a list of input numbers:
x <- list(c(1, 2), c(2, 3), c(1, 3))
If we want to get a numeric vector of squared numbers for each element of x, sapply is easy to use because it automatically tries to simplify the data structure of the result:
sapply(x, function(x) x ^ 2)
##      [,1] [,2] [,3]
## [1,]    1    4    1
## [2,]    4    9    9
However, if the input data has some mistakes or corruption, sapply() will silently accept the input and may return an unexpected value. For example, let's assume that the third element of x has mistakenly got an additional element:
x1 <- list(c(1, 2), c(2, 3), c(1, 3, 3))
Then, sapply() finds that the result can no longer be simplified to a matrix and thus returns a list:
sapply(x1, function(x) x ^ 2)
## [[1]]
## [1] 1 4
##
## [[2]]
## [1] 4 9
##
## [[3]]
## [1] 1 9 9
If we use vapply() in the first place, the mistake will be spotted immediately. The vapply() function has an additional argument that specifies the template of the returned value from each iteration. In the following code, the template is numeric(2), which means each iteration should return a numeric vector of two elements. If the template is violated, the function stops with an error:
vapply(x1, function(x) x ^ 2, numeric(2))
## Error in vapply(x1, function(x) x^2, numeric(2)): values must be length 2,
##  but FUN(X[[3]]) result is length 3
For the original and correct input, vapply() returns exactly the same matrix as sapply() did:
vapply(x, function(x) x ^ 2, numeric(2))
##      [,1] [,2] [,3]
## [1,]    1    4    1
## [2,]    4    9    9
In conclusion, vapply is the safer version of sapply as it performs additional template checking. In practical use, if the template can be determined, it is better to use vapply() than sapply().
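The template does not have to be numeric. As a hedged sketch, the following uses character(1) and logical(1) templates on a small made-up list, and wraps a failing call in tryCatch() to show that a violated template raises an error rather than silently changing the result type. The people list and the variable names are assumptions for illustration.

```r
people <- list(
  list(name = "James", age = 25),
  list(name = "Jenny", age = 23))

# each iteration must return exactly one string
people_names <- vapply(people, function(p) p$name, character(1))

# each iteration must return exactly one logical value
over_24 <- vapply(people, function(p) p$age > 24, logical(1))

# a wrong template is reported as an error instead of being ignored
result <- tryCatch(
  vapply(people, function(p) p$name, numeric(1)),
  error = function(e) "template violated")
```

In a script, the error would normally be left uncaught so that bad input stops the computation early; tryCatch() is used here only to keep the example running.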
While lapply() and sapply() both iterate over one vector, mapply() iterates over multiple vectors. In other words, mapply is a multivariate version of sapply:
mapply(function(a, b, c) a * b + b * c + a * c,
  a = c(1, 2, 3), b = c(5, 6, 7), c = c(-1, -2, -3))
## [1] -1 -4 -9
The iterating function is allowed to return not only scalar values but also multi-element vectors. Then, mapply() will simplify the result, just like sapply() does:
df <- data.frame(x = c(1, 2, 3), y = c(3, 4, 5))
df
##   x y
## 1 1 3
## 2 2 4
## 3 3 5
mapply(function(xi, yi) c(xi, yi, xi + yi), df$x, df$y)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    3    4    5
## [3,]    4    6    8
Map is the multivariate version of lapply and hence always returns a list:
Map(function(xi, yi) c(xi, yi, xi + yi), df$x, df$y)
## [[1]]
## [1] 1 3 4
##
## [[2]]
## [1] 2 4 6
##
## [[3]]
## [1] 3 5 8
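One way to see the relationship is that mapply() returns a list, just like Map(), when simplification is switched off through its SIMPLIFY argument. A minimal sketch, with illustrative input vectors and names:

```r
x <- c(1, 2, 3)
y <- c(3, 4, 5)
f <- function(xi, yi) c(xi, yi, xi + yi)

from_map <- Map(f, x, y)
from_mapply <- mapply(f, x, y, SIMPLIFY = FALSE)

# compare the two lists
identical(from_map, from_mapply)
```

Choosing Map() makes the intent explicit when a list is wanted regardless of what the function returns.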
The apply function applies a function on a given margin or dimension of a given matrix or array. For example, to calculate the sum of each row, which is the first dimension, we need to specify MARGIN = 1 so that sum is applied to a row (numeric vector) sliced from the matrix in each iteration:
mat <- matrix(c(1, 2, 3, 4), nrow = 2)
mat
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
apply(mat, 1, sum)
## [1] 4 6
To calculate the sum of each column, which is the second dimension, we need to specify MARGIN = 2 so that sum is applied to a column sliced from mat in each iteration:
apply(mat, 2, sum)
## [1] 3 7
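For plain sums, base R also provides rowSums() and colSums(), which give the same values as the two apply calls. This is a brief sketch for comparison, with illustrative variable names; apply remains the general tool when the function is something other than a simple sum or mean.

```r
mat <- matrix(c(1, 2, 3, 4), nrow = 2)

row_totals <- apply(mat, 1, sum)  # one sum per row
col_totals <- apply(mat, 2, sum)  # one sum per column

# compare with the dedicated base R functions
all.equal(row_totals, rowSums(mat))
all.equal(col_totals, colSums(mat))
```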
The apply function also supports array input and matrix output:
mat2 <- matrix(1:16, nrow = 4)
mat2
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
To build a matrix that shows the max and min value for each column, run the following code:
apply(mat2, 2, function(col) c(min = min(col), max = max(col)))
##     [,1] [,2] [,3] [,4]
## min    1    5    9   13
## max    4    8   12   16
To build a matrix that shows the max and min value for each row, run the following code:
apply(mat2, 1, function(row) c(min = min(row), max = max(row)))
##     [,1] [,2] [,3] [,4]
## min    1    2    3    4
## max   13   14   15   16
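The examples so far use matrices only. As a sketch of array input, the following applies sum over a three-dimensional array; the array arr and its dimensions are made up for illustration. Passing a vector of margins, such as c(1, 2), applies the function for every combination of those dimensions and yields a matrix.

```r
# a 2 x 3 x 4 array filled with 1..24
arr <- array(1:24, dim = c(2, 3, 4))

# sum of each 2 x 3 slice along the third dimension
slice_sums <- apply(arr, 3, sum)

# sum across the third dimension for each (row, column) pair
cell_sums <- apply(arr, c(1, 2), sum)
dim(cell_sums)
```

Here slice_sums has one element per slice, and cell_sums is a matrix with one entry per row and column position of arr.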