Using apply-family functions

Previously, we used a for loop to evaluate an expression repeatedly while iterating over a vector or list. In practice, however, a for loop is often the last choice: when the iterations are independent of one another, an alternative approach is much cleaner and easier to both write and read.

For example, the following code uses for to create a list of three independent, normally distributed random vectors whose lengths are specified by the vector len:

len <- c(3, 4, 5)
# first, create a list in the environment.
x <- list()
# then use `for` to generate the random vector for each length
for (i in seq_along(len)) {
  x[[i]] <- rnorm(len[i])
}
x
## [[1]]
## [1] 1.4572245 0.1434679 -0.4228897
##
## [[2]]
## [1] -1.4202269 -0.7162066 -1.6006179 -1.2985130
##
## [[3]]
## [1] -0.6318412  1.6784430  0.1155478  0.2905479 -0.7363817 

The preceding example is simple, but the code is quite verbose compared to the implementation with lapply:

lapply(len, rnorm)
## [[1]]
## [1] -0.3258354 -1.4658116 -0.1461097
##
## [[2]]
## [1] -0.1715198 0.5215857 -0.3178271 -0.3967798
##
## [[3]]
## [1] -0.2047106 -1.2009772  1.4859955  0.1940920  0.3758798 

The lapply version is much simpler. It applies rnorm() to each element of len and collects the results in a list.

The preceding example only works because R allows us to pass functions around as ordinary objects. Functions in R are treated just like any other object and can be passed as arguments, just as we showed in the section on numeric methods. This feature makes code considerably more flexible.
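One way to convince ourselves that the two versions do the same work is to fix the random seed before each of them. The following is a minimal sketch; the seed value 42 and the names x_loop and x_apply are arbitrary choices made for illustration, not part of the original example:

```r
len <- c(3, 4, 5)

# Loop version: preallocate the list, then fill it element by element.
set.seed(42)
x_loop <- vector("list", length(len))
for (i in seq_along(len)) {
  x_loop[[i]] <- rnorm(len[i])
}

# lapply version: same seed, so rnorm() is called with the same
# lengths in the same order.
set.seed(42)
x_apply <- lapply(len, rnorm)

identical(x_loop, x_apply)
## [1] TRUE
lengths(x_apply)
## [1] 3 4 5
```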

Each apply-family function is a so-called higher-order function that accepts a function as an argument. We will introduce this concept in detail later.
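As a small sketch of what passing a function as an argument looks like, consider the following code. The names apply_twice, add_one, and ops are hypothetical helpers introduced only for this illustration; they are not part of base R:

```r
# apply_twice is a hypothetical higher-order function: it takes a
# function f and a value x, and applies f to x two times.
apply_twice <- function(f, x) f(f(x))

add_one <- function(x) x + 1

apply_twice(add_one, 1)
## [1] 3
apply_twice(sqrt, 16)
## [1] 2

# Functions can also be stored in a list like any other object.
ops <- list(total = sum, average = mean)
ops$average(c(1, 2, 3))
## [1] 2
```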

lapply

The lapply() function, as we previously demonstrated, takes a vector and a function as its arguments. It simply applies the function to each element in the given vector and finally returns a list that contains all the results.

This function is useful when each iteration is independent of the others. In this case, we don't have to explicitly create an iterator.
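It may also help to know that lapply() forwards any further arguments to the function being applied. The following sketch uses a made-up vector named words to show this with substr(); unlist() is used here only to print the resulting list compactly:

```r
words <- c("apple", "banana", "cherry")

# Arguments after the function (start, stop) are passed on to substr()
# in every iteration; each word becomes its first argument.
initials <- lapply(words, substr, start = 1, stop = 1)

is.list(initials)
## [1] TRUE
unlist(initials)
## [1] "a" "b" "c"
```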

It works not only with vectors but also with lists. Suppose we have a list of students:

students <- list(
  a1 = list(name = "James", age = 25,
    gender = "M", interest = c("reading", "writing")),
  a2 = list(name = "Jenny", age = 23,
    gender = "F", interest = c("cooking")),
  a3 = list(name = "David", age = 24,
    gender = "M", interest = c("running", "basketball"))) 

Now, we need to create a character vector in which each element is formatted as follows:

James, 25 year-old man, loves reading, writing. 

Note that sprintf() is useful to format text by replacing the placeholders (for example, %s for string, %d for integer) with corresponding input arguments. Here is an example:

sprintf("Hello, %s! Your number is %d.", "Tom", 3)
## [1] "Hello, Tom! Your number is 3." 

First, note that we are iterating over students and that each iteration is independent. In other words, the computation for James has nothing to do with that for Jenny, and so on. Therefore, we can use lapply to do the work:

lapply(students, function(s) {
  type <- switch(s$gender, "M" = "man", "F" = "woman")
  interest <- paste(s$interest, collapse = ", ")
  sprintf("%s, %d year-old %s, loves %s.", s$name, s$age, type, interest)
})
## $a1
## [1] "James, 25 year-old man, loves reading, writing."
##
## $a2
## [1] "Jenny, 23 year-old woman, loves cooking."
##
## $a3
## [1] "David, 24 year-old man, loves running, basketball." 

The preceding code uses an anonymous function, that is, a function that is not assigned to a symbol. In other words, the function is only temporary and has no name. Of course, we can explicitly bind the function to a symbol, that is, give it a name, and use that name in lapply.

Either way, the code is quite straightforward. For each element s in students, the function decides the type of the student and pastes their interests together, separated by commas. It then puts the information into the format we want.
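The named-function variant could look like the following sketch. The name describe_student is chosen here for illustration, and the students list is a shortened copy so that the block runs on its own:

```r
# A shortened copy of the students list from above.
students <- list(
  a1 = list(name = "James", age = 25,
    gender = "M", interest = c("reading", "writing")),
  a2 = list(name = "Jenny", age = 23,
    gender = "F", interest = c("cooking")))

# describe_student is an illustrative name for the same function that
# was passed anonymously to lapply before.
describe_student <- function(s) {
  type <- switch(s$gender, "M" = "man", "F" = "woman")
  interest <- paste(s$interest, collapse = ", ")
  sprintf("%s, %d year-old %s, loves %s.", s$name, s$age, type, interest)
}

descriptions <- lapply(students, describe_student)
descriptions$a2
## [1] "Jenny, 23 year-old woman, loves cooking."
```

Naming the function makes it reusable and lets us try it on a single student, for example describe_student(students$a1), before applying it to the whole list.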

Fortunately, a major part of how we use lapply also works with other apply-family functions, but their iterating mechanism or the type of results may be different.

sapply

A list is not always the most convenient container for the results. Sometimes, we want them in a simple vector or a matrix. The sapply function simplifies the result according to its structure.

Suppose we square each element of 1:10. If we do it with lapply, we get a list of squared numbers. This result looks a bit heavy and redundant because the resulting list is actually a list of single-valued numeric vectors. Instead, we might want to keep the results as a vector:

sapply(1:10, function(i) i ^ 2)
##  [1]   1   4   9  16  25  36  49  64  81 100 
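To make the difference between the two containers concrete, here is a short sketch on a smaller input. The variable names squares_list and squares_vec are illustrative only:

```r
squares_list <- lapply(1:3, function(i) i ^ 2)
squares_vec <- sapply(1:3, function(i) i ^ 2)

is.list(squares_list)
## [1] TRUE
is.numeric(squares_vec)
## [1] TRUE

# Flattening the list by hand gives the same vector that sapply returns.
identical(unlist(squares_list), squares_vec)
## [1] TRUE
```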

If the applied function returns a multi-element vector each time, sapply will put the results into a matrix in which each returned vector occupies a column:

sapply(1:10, function(i) c(i, i ^ 2))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    2    3    4    5    6    7    8    9    10
## [2,]    1    4    9   16   25   36   49   64   81   100
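If one row per input is preferred, the matrix can be transposed with t(). The sketch below also names the elements of each returned vector, which sapply then uses as row names; the names value and square, and the variables res and by_row, are chosen here purely for illustration:

```r
res <- sapply(1:3, function(i) c(value = i, square = i ^ 2))

dim(res)
## [1] 2 3
rownames(res)
## [1] "value"  "square"

# After transposing, each input occupies a row and the names become
# column names.
by_row <- t(res)
by_row[, "square"]
## [1] 1 4 9
```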

vapply

Although sapply is very handy and smart, the smartness may sometimes become a risk. Suppose we have a list of input numbers:

x <- list(c(1, 2), c(2, 3), c(1, 3)) 

If we want the squares of the numbers in each element of x, sapply is easy to use because it automatically tries to simplify the data structure of the result:

sapply(x, function(x) x ^ 2)
##      [,1] [,2] [,3]
## [1,]    1    4    1
## [2,]    4    9    9

However, if the input data has some mistakes or corruption, sapply() will silently accept the input and may return an unexpected value. For example, let's assume that the third element of x has mistakenly got an additional element:

x1 <- list(c(1, 2), c(2, 3), c(1, 3, 3)) 

Then, sapply() finds that the result can no longer be simplified to a matrix and thus returns a list:

sapply(x1, function(x) x ^ 2)
## [[1]]
## [1] 1 4
##
## [[2]]
## [1] 4 9
##
## [[3]]
## [1] 1 9 9 

If we use vapply() in the first place, the mistake is spotted immediately. The vapply() function has an additional argument that specifies the template of the returned value from each iteration. In the following code, the template is numeric(2), which means each iteration should return a numeric vector of two elements. If the template is violated, the function stops with an error:

vapply(x1, function(x) x ^ 2, numeric(2))
## Error in vapply(x1, function(x) x^2, numeric(2)): values must be length 2,
##  but FUN(X[[3]]) result is length 3 

For the original and correct input, vapply() returns exactly the same matrix as sapply() did:

vapply(x, function(x) x ^ 2, numeric(2))
##      [,1] [,2] [,3]
## [1,]    1    4    1
## [2,]    4    9    9

In conclusion, vapply is the safer version of sapply as it performs additional template checking. In practical use, if the template can be determined, it is better to use vapply() than sapply().
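Another case where the template helps is empty input. In the following sketch, sapply() falls back to an empty list, whereas vapply() still returns a vector of the declared type. The variables empty and words are made up for this illustration:

```r
empty <- list()

# With nothing to simplify, sapply returns an empty list.
sapply(empty, length)
## list()

# vapply keeps the type promised by the template integer(1).
vapply(empty, length, integer(1))
## integer(0)

# A scalar template such as integer(1) yields a plain vector.
words <- c("a", "bb", "ccc")
vapply(words, nchar, integer(1), USE.NAMES = FALSE)
## [1] 1 2 3
```

Code that later assumes a numeric or integer vector, for example by calling sum() on the result, therefore behaves more predictably with vapply().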

mapply

While lapply() and sapply() both iterate over one vector, mapply() iterates over multiple vectors. In other words, mapply is a multivariate version of sapply:

mapply(function(a, b, c) a * b + b * c + a * c,
  a = c(1, 2, 3), b = c(5, 6, 7), c = c(-1, -2, -3))
## [1] -1 -4 -9 
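Two details of mapply() are worth a short sketch: shorter arguments are recycled to the length of the longest one, and arguments that should stay fixed across iterations can be supplied through MoreArgs. The inputs below are arbitrary illustrative values:

```r
# The single value 10 is recycled for each element of 1:3.
mapply(function(x, y) x + y, 1:3, 10)
## [1] 11 12 13

# p is not iterated over; the same p = 2 is used in every call.
mapply(function(x, p) x ^ p, 1:3, MoreArgs = list(p = 2))
## [1] 1 4 9
```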

The iterating function is allowed to return not only scalar values but multi-element vectors. Then, mapply() will simplify the result, just like sapply() does:

df <- data.frame(x = c(1, 2, 3), y = c(3, 4, 5))
df
##   x y
## 1 1 3
## 2 2 4
## 3 3 5
mapply(function(xi, yi) c(xi, yi, xi + yi), df$x, df$y)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    3    4    5
## [3,]    4    6    8

Map is the multivariate version of lapply and hence always returns a list:

Map(function(xi, yi) c(xi, yi, xi + yi), df$x, df$y)
## [[1]]
## [1] 1 3 4
##
## [[2]]
## [1] 2 4 6
##
## [[3]]
## [1] 3 5 8 
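The relationship between Map() and mapply() can be checked directly: calling mapply() with SIMPLIFY = FALSE switches off the simplification and gives the same list. The following sketch uses plain vectors x and y in place of the data frame columns, and the name f is illustrative:

```r
f <- function(xi, yi) c(xi, yi, xi + yi)
x <- c(1, 2, 3)
y <- c(3, 4, 5)

via_map <- Map(f, x, y)
via_mapply <- mapply(f, x, y, SIMPLIFY = FALSE)
identical(via_map, via_mapply)
## [1] TRUE

# If the first vector has names, the resulting list carries them.
scaled <- Map(function(v, k) v * k, c(a = 1, b = 2), c(10, 20))
names(scaled)
## [1] "a" "b"
```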

apply

The apply function applies a function on a given margin or dimension of a given matrix or array. For example, to calculate the sum of each row, which is the first dimension, we need to specify MARGIN = 1 so that sum is applied to a row (numeric vector) sliced from the matrix in each iteration:

mat <- matrix(c(1, 2, 3, 4), nrow = 2)
mat
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
apply(mat, 1, sum)
## [1] 4 6 

To calculate the sum of each column, which is the second dimension, we need to specify MARGIN = 2 so that sum is applied to a column sliced from mat in each iteration:

apply(mat, 2, sum)
## [1] 3 7 
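For the specific case of sums, base R also provides rowSums() and colSums(), which give the same results without going through apply(). The following sketch simply checks that equivalence on the same matrix:

```r
mat <- matrix(c(1, 2, 3, 4), nrow = 2)

rowSums(mat)
## [1] 4 6
colSums(mat)
## [1] 3 7

# Both agree with the apply() calls shown above.
all.equal(apply(mat, 1, sum), rowSums(mat))
## [1] TRUE
all.equal(apply(mat, 2, sum), colSums(mat))
## [1] TRUE
```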

The apply function also supports array input and matrix output:

mat2 <- matrix(1:16, nrow = 4)
mat2
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16

To build a matrix that shows the max and min value for each column, run the following code:

apply(mat2, 2, function(col) c(min = min(col), max = max(col)))
##     [,1] [,2] [,3] [,4]
## min    1    5    9   13
## max    4    8   12   16

To build a matrix that shows the max and min value for each row, run the following code:

apply(mat2, 1, function(row) c(min = min(row), max = max(row)))
##     [,1] [,2] [,3] [,4]
## min    1    2    3    4
## max   13   14   15   16
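The examples so far use only two-dimensional input. As a sketch of array input, the following code builds a small three-dimensional array (the name arr and its dimensions are arbitrary) and applies sum over the third margin and over the first two margins together:

```r
# A 2 x 3 x 4 array filled with 1 to 24.
arr <- array(1:24, dim = c(2, 3, 4))

# MARGIN = 3: sum each of the four 2 x 3 slices.
apply(arr, 3, sum)
## [1]  21  57  93 129

# MARGIN = c(1, 2): sum along the third dimension for every (row,
# column) pair, which gives a 2 x 3 matrix.
sums <- apply(arr, c(1, 2), sum)
dim(sums)
## [1] 2 3
sums[1, 1]
## [1] 40
```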