What You’ll Learn in This Hour:
How to perform iterative “looping” techniques in R
How to apply functions to complex data structures
How to calculate metrics “by” one or more variables
Throughout this book you have seen how to use, and even create, simple R functions. In this hour, we are going to use simple functions and code in a more “applied” fashion. This allows us to perform tasks repeatedly over sections of our data without the need to produce verbose, repetitive code.
Imagine we want to perform the same task multiple times, for example on each row of some dataset, df. We might first create a simple function, performAction, and then write a verbose R script such as this:
> performAction(df[1,]) # Perform action on first row
> performAction(df[2,]) # Perform action on second row
> performAction(df[3,]) # Perform action on third row
> performAction(df[4,]) # Perform action on fourth row
...
Writing code in this way can lead to large scripts that can be very difficult to manage; for example, if you need to change the name of the function, you need to do it in a variety of places. This code is also not overtly reusable because we’ll need to specify a call for each row in our data—if we try to apply this code to a different data structure, it may not have the same number of rows.
Instead of writing scripts in this manner, we can use a “loop.”
A loop is a programming structure that allows us to perform the same task in a repetitive manner. This hour covers two types of loops supported by R: the “for” loop and the “while” loop.
A “for” loop will perform the same action on each of a pre-specified set of inputs. For example, imagine we have a bag containing 100 potato chips and we have decided we’re going to eat every one. In this case, our “for” loop may be structured as follows:
For each of our 100 chips:
Reach into the bag
Remove a single potato chip
Eat the potato chip
This is a simple repetitive pattern. However, we do need to pre-specify the inputs over which we’re going to iterate. For example, if we didn’t know exactly how many potato chips were in the bag, we could not use this approach.
By contrast, a “while” loop allows us to perform the same action in a repeated manner until a condition is met. For example, if we had a bag of potato chips and we wanted to eat the contents, we may write a “while” loop as follows:
While there are still chips left in the bag:
Reach into the bag
Remove a single potato chip
Eat the potato chip
Again, this is a simple structure and will work well in our case. However, we need to be sure no one hands us a bag with an infinite number of potato chips, in which case we’ll never “leave” the loop and just keep on eating.
The for function in R allows us to implement a “for” loop. The structure of the loop is as follows:
for (variable in set_of_values) {
# do this
}
The variable defined will iteratively take each value of the set_of_values, and the body of the “for” loop will then be executed. Here’s an example:
> for (i in 1:5) {
+   cat("\nHello")   # Say Hello
+ }
Hello
Hello
Hello
Hello
Hello
In this very simple example, i is iteratively set to each value in vector 1:5 and then the body of the loop is executed. The result is to print the message “Hello” five times.
Note: Using Curly Brackets
In this example, we are using curly brackets to encapsulate the body of code. As with writing functions, we can omit these if the body of code is a single line; therefore, this example could be rewritten as follows:
> for (i in 1:5) cat("\nHello")   # Say Hello
As a convention, and as good practice, we will use curly brackets throughout this hour.
In the last example, we set i to each value in vector 1:5. If we use i in the body of the loop, we can more easily see this process:
> for (i in 1:5) {
+   cat("\ni has been set to the value of", i)
+ }
i has been set to the value of 1
i has been set to the value of 2
i has been set to the value of 3
i has been set to the value of 4
i has been set to the value of 5
Let’s look at a slightly different example, this time involving a set of character values over which to iterate:
> for (let in LETTERS[1:5]) {
+   cat("\nThe Letter", let)
+ }
The Letter A
The Letter B
The Letter C
The Letter D
The Letter E
For loops are often used to iterate over data sources, performing actions on groupings within that data. Let’s use the internal airquality dataset for this example, which contains air quality measurements for New York from May to September 1973:
> head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
The Month column stores the month number (May = 5 to September = 9). We can generate a vector of unique month values using the unique function as follows:
> unique(airquality$Month)
[1] 5 6 7 8 9
What if we wanted to report the average Ozone value for each month? Without a loop, we might write code like this:
> # Perform summary for Month 5
> ozoneValues <- airquality$Ozone [ airquality$Month == 5 ] # Subset the data
> theMean <- round(mean(ozoneValues, na.rm = TRUE), 2) # Calculate the mean
> cat("\nAverage Ozone for month 5 =", theMean) # Print the message
Average Ozone for month 5 = 23.62
>
> # Perform summary for Month 6
> ozoneValues <- airquality$Ozone [ airquality$Month == 6 ] # Subset the data
> theMean <- round(mean(ozoneValues, na.rm = TRUE), 2) # Calculate the mean
> cat("\nAverage Ozone for month 6 =", theMean) # Print the message
Average Ozone for month 6 = 29.44
>
> # Perform summary for Month 7
> ozoneValues <- airquality$Ozone [ airquality$Month == 7 ] # Subset the data
> theMean <- round(mean(ozoneValues, na.rm = TRUE), 2) # Calculate the mean
> cat("\nAverage Ozone for month 7 =", theMean) # Print the message
Average Ozone for month 7 = 59.12
Note that the only varying aspect between these sections of code is the Month value itself. Using a for loop, we could iterate over each (unique) month value, calculating summaries specific to that month, as follows:
> for (M in unique(airquality$Month)) {
+   ozoneValues <- airquality$Ozone [ airquality$Month == M ]   # Subset the data
+   theMean <- round(mean(ozoneValues, na.rm = TRUE), 2)        # Calculate and round the mean
+   cat("\nAverage Ozone for month", M, "=", theMean)           # Print the message
+ }
Average Ozone for month 5 = 23.62
Average Ozone for month 6 = 29.44
Average Ozone for month 7 = 59.12
Average Ozone for month 8 = 59.96
Average Ozone for month 9 = 31.45
In this example, we are iterating over the unique values of Month. We use the iterator variable M to subset the data, saving the result each time as ozoneValues. We then calculate the mean based on this vector and report the result.
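The loop above only prints its results. As a hedged variation on the same idea, the following sketch stores each monthly mean in a pre-allocated vector so the values can be reused later. The names months and monthlyMeans are chosen for this illustration and do not appear elsewhere in this hour.

```r
# Sketch: collect the monthly Ozone means in a named vector instead of printing them.
# 'months' and 'monthlyMeans' are illustrative names.
months <- unique(airquality$Month)
monthlyMeans <- numeric(length(months))      # Pre-allocate the result vector
names(monthlyMeans) <- months                # Label each element with its month number

for (i in seq_along(months)) {
  ozoneValues <- airquality$Ozone[airquality$Month == months[i]]   # Subset the data
  monthlyMeans[i] <- round(mean(ozoneValues, na.rm = TRUE), 2)     # Store the mean
}

monthlyMeans                                 # Print the stored results
```

Iterating over positions with seq_along, rather than over the month values themselves, gives us an index we can use to fill the result vector.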
It is possible to perform “nested” loop operations, where we iterate over more than one set of values. For example, let’s again loop through sections of the airquality dataset, but this time report the average values of the Ozone, Wind, and Solar.R columns. We could extend the last loop as follows:
> for (M in unique(airquality$Month)) {
+
+   cat("\nMonth =", M, "\n=========")                 # Write Month Number
+   subData <- airquality [ airquality$Month == M, ]   # Subset the data
+
+   theMean <- round(mean(subData$Ozone, na.rm = TRUE), 2)     # Calculate the mean
+   cat("\nAverage Ozone =\t", theMean)                        # Print the message
+
+   theMean <- round(mean(subData$Wind, na.rm = TRUE), 2)      # Calculate the mean
+   cat("\nAverage Wind =\t", theMean)                         # Print the message
+
+   theMean <- round(mean(subData$Solar.R, na.rm = TRUE), 2)   # Calculate the mean
+   cat("\nAverage Solar.R =\t", theMean)                      # Print the message
+
+ }
Month = 5
=========
Average Ozone = 23.62
Average Wind = 11.62
Average Solar.R = 181.3
Month = 6
=========
Average Ozone = 29.44
Average Wind = 10.27
Average Solar.R = 190.17
Month = 7
=========
Average Ozone = 59.12
Average Wind = 8.94
Average Solar.R = 216.48
Month = 8
=========
Average Ozone = 59.96
Average Wind = 8.79
Average Solar.R = 171.86
Month = 9
=========
Average Ozone = 31.45
Average Wind = 10.18
Average Solar.R = 167.43
Tip: Tab Characters
Note the use of \t in the preceding example. This allows us to insert a “tab” symbol when printing text in this way. For this example, it left-aligns the numeric mean values produced. If we wanted to (more correctly) right-align these numeric values, we could additionally call the format function to convert the numeric values to a nicely formatted character output.
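As a rough illustration of the format suggestion in this tip, the sketch below right-aligns the month 5 summaries. The width of 8 and the nsmall = 2 setting are arbitrary choices made for this example, and the variable names are illustrative only.

```r
# Sketch: right-align the month 5 summaries with format().
# The width of 8 and nsmall = 2 are arbitrary choices for this illustration.
subData <- airquality[airquality$Month == 5, ]    # Data for month 5 only

ozoneMean <- round(mean(subData$Ozone, na.rm = TRUE), 2)
windMean  <- round(mean(subData$Wind, na.rm = TRUE), 2)
solarMean <- round(mean(subData$Solar.R, na.rm = TRUE), 2)

# format() returns character strings, padded on the left to the requested width
ozoneText <- format(ozoneMean, nsmall = 2, width = 8)
windText  <- format(windMean, nsmall = 2, width = 8)
solarText <- format(solarMean, nsmall = 2, width = 8)

cat("\nAverage Ozone   =", ozoneText)
cat("\nAverage Wind    =", windText)
cat("\nAverage Solar.R =", solarText)
cat("\n")
```

Because every formatted value has the same width and the same number of decimal places, the digits line up on the right when printed.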
We could instead iterate over values of Month and then iterate over the columns within Month using a nested loop, as follows:
> for (M in unique(airquality$Month)) {
+
+   cat("\nMonth =", M, "\n=========")                 # Write Month Number
+   subData <- airquality [ airquality$Month == M, ]   # Subset the data
+
+   for (column in c("Ozone", "Wind", "Solar.R")) {    # Iterate over columns
+     theMean <- round(mean(subData[[column]], na.rm = TRUE), 2)   # Calculate the mean
+     cat("\nAverage", column, "=\t", theMean)         # Print the message
+   }
+
+ }
Month = 5
=========
Average Ozone = 23.62
Average Wind = 11.62
Average Solar.R = 181.3
Month = 6
=========
Average Ozone = 29.44
Average Wind = 10.27
Average Solar.R = 190.17
Month = 7
=========
Average Ozone = 59.12
Average Wind = 8.94
Average Solar.R = 216.48
Month = 8
=========
Average Ozone = 59.96
Average Wind = 8.79
Average Solar.R = 171.86
Month = 9
=========
Average Ozone = 31.45
Average Wind = 10.18
Average Solar.R = 167.43
Note: Referencing Columns
Note that we used the double square brackets notation here as opposed to the $ syntax in the more verbose example. This is because we can’t parameterize values used by $, as shown in this example:
> airquality$Wind[1:5] # The Wind column
[1] 7.4 8.0 12.6 11.5 14.3
> airquality$"Wind"[1:5] # Also works
[1] 7.4 8.0 12.6 11.5 14.3
> whichColumn <- "Wind" # set value of whichColumn
> airquality$whichColumn # Reference using whichColumn
NULL
We must therefore use a double square bracket notation (or alternatively the [ , whichColumn] notation) that was introduced in Hour 4, “Multi-Mode Data Structures.”
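To make the alternatives concrete, here is a small sketch that retrieves the same five Wind values in three ways. The object names viaDollar, viaDouble, and viaSingle are illustrative only.

```r
# Sketch: three ways to reference the Wind column of airquality
whichColumn <- "Wind"                           # Column name held in a variable

viaDollar <- airquality$Wind[1:5]               # Name typed literally after $
viaDouble <- airquality[[whichColumn]][1:5]     # Double square brackets with a variable
viaSingle <- airquality[1:5, whichColumn]       # [ , column] notation with a variable

identical(viaDollar, viaDouble)                 # Do the first two approaches agree?
identical(viaDollar, viaSingle)                 # Does the third approach agree?
```

Only the last two forms accept a column name stored in a variable, which is what makes them suitable for use inside a loop.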
Note: Loop Performance
Later, in Hour 18, “Code Efficiency,” we will look again at loops and discuss performance and efficiency gains.
Looping through data frames in this way is generally not recommended. As we will see shortly, and again in Hour 12, “Efficient Data Handling in R,” there are many simpler, faster ways to loop through columns or rows in a data frame. However, the for loop is a much more widely applicable programming concept that can help clean up repetitive, unmaintainable code.
The while function in R allows us to implement a “while” loop. The structure of the “while” loop is as follows:
while (condition) {
# do this
}
The result is that the loop will iterate constantly until the condition is no longer TRUE. Of course, if the condition is always TRUE, the loop will never stop iterating, so we need to exercise caution.
Let’s look at a simple example:
> index <- 1                # Set value of index to 1
> while(index < 6) {
+   cat("\nHello")          # Write a message
+   index <- index + 1      # Update the value of index
+ }
Hello
Hello
Hello
Hello
Hello
Here, we initially set the value of index to 1. Then, we iteratively write a simple message and increment index. The loop continues to iterate until the condition (index < 6) is no longer true.
We can see this more clearly by improving the message produced:
> index <- 1                                            # Set value of index to 1
> while(index < 6) {
+   cat("\nSetting the value of index from", index)     # Write a message
+   index <- index + 1                                  # Update the value of index
+   cat(" to", index)                                   # Write a message
+ }
Setting the value of index from 1 to 2
Setting the value of index from 2 to 3
Setting the value of index from 3 to 4
Setting the value of index from 4 to 5
Setting the value of index from 5 to 6
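Because a “while” loop runs until its condition fails, one common precaution is to add a second condition that caps the number of iterations. The following is a minimal sketch of that idea rather than a prescribed pattern; the names value, iterations, and maxIterations are hypothetical.

```r
# Sketch: halve a value until it drops below 1, with a safety cap on iterations.
# 'value', 'iterations' and 'maxIterations' are illustrative names.
value <- 100
iterations <- 0
maxIterations <- 1000       # Upper limit that guards against a never-ending loop

while (value >= 1 && iterations < maxIterations) {
  value <- value / 2                # The work done in each iteration
  iterations <- iterations + 1      # Count how many times we have looped
}

cat("\nStopped after", iterations, "iterations with value", value, "\n")
```

If the main condition could never become FALSE (for example, because of a coding mistake), the iteration cap would still end the loop.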
The majority of functions in R are relatively simple and designed to work with single-mode structures. Consider, for example, the median function, which can be used to calculate the median of a numeric data object (typically a vector). Let’s have a look at the arguments of the function and a simple example:
> args(median)
function (x, na.rm = FALSE)
NULL
> median( airquality$Wind ) # Median of Wind column
[1] 9.7
We can see that median has two arguments (x and na.rm), which can be used to specify the values for which the median is to be calculated, and a logical value specifying whether missing values should be removed before calculating the median.
What if we wanted to apply this function in a more sophisticated way? Here are some examples:
The median of rows or columns of a matrix
The median of each element of a list
The median of some variable for each level of one or more grouping variables (for example, median sales by age group)
As you have seen earlier in this hour, the loop structure provides a way to iteratively call a function (for example, on subsections of a data object). Although we could apply a function using loops, much of our code would be needed just to reference the subsections of the data we need given the values over which we’re iterating (as you saw previously).
Instead, R provides a set of functions (the “apply” family of functions) that offer a more natural structure for applying simple functions to data structures in a more sophisticated way.
In R, many functions could be considered part of the “apply” family of functions. Let’s start by looking at the set of functions in R of the form “xapply,” where x is an optional letter, using the apropos function:
> apropos("^[a-z]?apply$") # Find "apply", optionally preceded by one lowercase letter
[1] "apply" "eapply" "lapply" "mapply" "rapply"
[6] "sapply" "tapply" "vapply"
Note: Other Functions in the “apply” Family
We could conceivably include functions such as by and aggregate in the “apply” family given their aims and usage. We’ll cover aggregate in Hour 11, “Data Manipulation and Transformation,” but will not cover by in this book, given the numerous better ways of performing the tasks that the by function enables.
Tip: Regular Expressions
As seen in the apropos call, the regular expression capabilities of R are very useful for looking for patterns in vectors of characters.
The call to apropos returns eight functions, which are listed in Table 9.1.
For now, let’s focus on the first four functions listed in Table 9.1 (apply, lapply, sapply, and tapply).
The apply function allows us to apply a function over dimensions of a data object. Acceptable inputs to apply include any object that has a “dimension” (for example, matrices, data frames, and arrays). The arguments to the apply function are as follows:
> args(apply)
function (X, MARGIN, FUN, ...)
NULL
Table 9.2 details the arguments of the apply function.
The second argument, the “Margin,” specifies the “dimension number” over which to apply the function, as described in Table 9.3.
We typically specify the margin as a single integer value or vector of integer values.
Note: Named Dimensions
If your structure has dimension names assigned, a character vector can be provided instead of the (more commonly used) vector of integers.
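As a hedged illustration of this note, the sketch below builds a small matrix whose dimnames list is itself named, which is the condition under which apply accepts a character margin. The matrix scores and the dimension names student and test are invented for this example.

```r
# Sketch: a small matrix whose dimnames list carries names ("student" and "test").
# The data and names here are hypothetical.
scores <- matrix(c(5, 1, 2, 6, 7, 3), nrow = 3,
                 dimnames = list(student = c("a", "b", "c"),
                                 test    = c("t1", "t2")))

byNumber <- apply(scores, 2, max)         # Margin given as an integer
byName   <- apply(scores, "test", max)    # Margin given as a dimension name

all.equal(byNumber, byName)               # Compare the two results
```

Referring to a margin by name can make a call easier to read when working with arrays that have several dimensions.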
The apply function is best described with a simple example. First, let’s create a structure that has dimensions:
> myMat <- matrix(rpois(20, 3), nrow = 4) # Create a simple matrix
> myMat # Print myMat
[,1] [,2] [,3] [,4] [,5]
[1,] 5 6 4 2 2
[2,] 1 7 3 1 6
[3,] 2 3 0 3 4
[4,] 2 2 4 3 4
> dim(myMat) # Dimensions of myMat
[1] 4 5
Now let’s use our first call to apply. In this example, we’ll calculate the maximum of each column (dimension 2) of our matrix:
> apply(myMat, 2, max) # Column Maxima
[1] 5 7 4 3 6
The result is a vector that holds the maximum of each column (for example, we see that the maximum of the values in the second column is 7).
Note: The Use of Random Numbers
In this and the following sections, we use functions such as rpois to generate random samples. Since these are random draws, your results will not necessarily match those shown here if you run the same code.
The apply function operates by “breaking apart” the structure based on the margin(s) provided and then applying the function to each “piece” of the partitioned structure. In this example, the matrix is split into separate columns with the max function applied to each column, as illustrated in Figure 9.1.
Now let’s look at another simple example—this time we’ll calculate the minimum of each row (dimension 1) of our matrix:
> apply(myMat, 1, min) # Row Minima
[1] 2 1 0 2
Again, the result is a vector, this time containing the minimum of each row of the matrix (so the minimum value in row 3 is 0). This time, the apply function “breaks apart” the structure by rows and applies the min function to each “piece” of the structure, as illustrated in Figure 9.2.
In these simple examples, we specified a single margin in each call (1 for rows or 2 for columns). We can, instead, use multiple margins, as shown here:
> myMat
[,1] [,2] [,3] [,4] [,5]
[1,] 5 6 4 2 2
[2,] 1 7 3 1 6
[3,] 2 3 0 3 4
[4,] 2 2 4 3 4
> apply(myMat, c(1, 2), median) # Median by row AND column
[,1] [,2] [,3] [,4] [,5]
[1,] 5 6 4 2 2
[2,] 1 7 3 1 6
[3,] 2 3 0 3 4
[4,] 2 2 4 3 4
In this example, we’ve calculated the median value by row and column by specifying two values for the margin (1 and 2). This calculates the median of each cell of the matrix (that is, the median of “5” is “5”) and thus returns exactly the same matrix that we started with. This process is visualized in Figure 9.3.
Although this is not of any practical use, it does further illustrate the way the apply function works.
Although using multiple margins may not be useful for two-dimensional structures (that is, matrices or data frames), when we deal with structures with a higher number of dimensions it can be useful. To illustrate this, let’s create a three-dimensional array:
> myArray <- array( rpois(18, 3), dim = c(3, 3, 2)) # Create array
> myArray # Print myArray
, , 1
[,1] [,2] [,3]
[1,] 2 2 4
[2,] 4 3 1
[3,] 4 1 1
, , 2
[,1] [,2] [,3]
[1,] 0 6 3
[2,] 4 3 1
[3,] 1 5 1
> dim(myArray) # Dimensions of myArray
[1] 3 3 2
Now, there are three dimensions over which we could apply our functions. Let’s try to apply a function over dimension 3 of the array:
> apply(myArray, 3, min)
[1] 1 0
Here, the array is first broken apart based on dimension 3, resulting in two two-dimensional (3 × 3) structures. The min function is then applied to each of the two structures, as illustrated in Figure 9.4.
Instead, we could provide multiple margins. For example, let’s apply the max function, this time over dimensions 1 and 2:
> apply(myArray, c(1, 2), max)
[,1] [,2] [,3]
[1,] 2 6 4
[2,] 4 3 1
[3,] 4 5 1
This time the structure is “collapsed” over the third dimension, producing a matrix of outputs. This process is illustrated in Figure 9.5.
Let’s return to our matrix example, but this time insert a missing value:
> myMat[2, 2] <- NA # Add a missing value in cell 2, 2
> myMat # Print the matrix
[,1] [,2] [,3] [,4] [,5]
[1,] 5 6 4 2 2
[2,] 1 NA 3 1 6
[3,] 2 3 0 3 4
[4,] 2 2 4 3 4
Now, let’s once again apply a function. For example, let’s calculate the maximum of each column (dimension 2) of the matrix:
> apply(myMat, 2, max) # Maximum of each column
[1] 5 NA 4 3 6
This time, our output contains a missing value. The reason for this is that when the second column is passed into the max function, the missing value causes the max function to return an NA value. This is illustrated in Figure 9.6.
We can also see this behavior directly by calculating the maximum of the second column:
> max(myMat[,2]) # Maximum of 2nd column
[1] NA
As you saw earlier, functions such as max have an na.rm argument, which allows us to specify that missing values are removed before performing the calculation:
> max(myMat[,2], na.rm = TRUE) # Maximum of 2nd column
[1] 6
If we want to call a function but also pass additional arguments, we can take advantage of the ellipsis argument to apply, as follows:
> args(apply) # Ellipsis is 4th argument
function (X, MARGIN, FUN, ...)
NULL
> apply(myMat, 2, max, na.rm = TRUE) # Maximum of each column
[1] 5 6 4 3 6
As you can see, the max function is now called with the argument na.rm set to TRUE, so the maximum of the (nonmissing) values of column 2 is now reported.
We can pass as many additional arguments as we need. For example, let’s calculate the quantiles of a slightly larger matrix using the quantile function:
> biggerMat <- matrix( rpois(300, 3), ncol = 3) # Create a 100 x 3 matrix
> head(biggerMat) # First few rows
[,1] [,2] [,3]
[1,] 4 2 3
[2,] 5 3 5
[3,] 4 7 1
[4,] 5 3 3
[5,] 3 3 4
[6,] 1 5 4
> apply(biggerMat, 2, quantile) # Column quantiles
[,1] [,2] [,3]
0% 0 0 0
25% 2 2 2
50% 3 3 3
75% 4 4 4
100% 8 8 8
Now, let’s artificially add a number of missing values. Because the data then contains missing values, we need to pass the extra na.rm argument to quantile:
> biggerMat [ sample( 1:300, 50) ] <- NA # Randomly add some missings
> head(biggerMat) # First few rows
[,1] [,2] [,3]
[1,] 4 2 NA
[2,] 5 3 NA
[3,] 4 7 1
[4,] 5 3 3
[5,] NA NA 4
[6,] 1 NA 4
> apply(biggerMat, 2, quantile, na.rm = TRUE) # Column quantiles
[,1] [,2] [,3]
0% 0 0 0
25% 2 2 1
50% 3 3 3
75% 4 4 4
100% 8 8 8
The quantile function has an argument, probs, that allows us to specify that a different set of quantiles is returned. Let’s additionally pass the probs argument to specify some new quantiles:
> apply(biggerMat, 2, quantile,
+ probs = c(0, .05, .5, .95, 1), na.rm = TRUE) # Column quantiles
[,1] [,2] [,3]
0% 0 0.00 0
5% 0 1.00 1
50% 3 3.00 3
95% 6 6.15 6
100% 8 8.00 8
So far in this hour, we have used simple functions to illustrate the use of the apply function (for example, row minima, column maxima). There are in fact several utility functions designed for this very purpose, for example rowMeans, colMeans, rowSums, and colSums. However, we can also create our own functions and “apply” those over dimensions instead.
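To show how these utility functions relate to apply, the following sketch compares the two approaches on a small fixed matrix. The matrix m is created purely for this illustration, so the results do not depend on random draws.

```r
# Sketch: utility functions versus the equivalent apply() calls.
# 'm' is a fixed 3 x 4 matrix created only for this illustration.
m <- matrix(1:12, nrow = 3)

rowViaApply <- apply(m, 1, mean)    # Row means using apply
rowViaUtil  <- rowMeans(m)          # Row means using the utility function

colViaApply <- apply(m, 2, sum)     # Column sums using apply
colViaUtil  <- colSums(m)           # Column sums using the utility function

all.equal(rowViaApply, rowViaUtil)  # Compare the row results
all.equal(colViaApply, colViaUtil)  # Compare the column results
```

The utility functions are usually the more concise choice for these specific summaries, whereas apply remains the general tool for any other function.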
Consider the matrix we created earlier (shown here as it was before we inserted the missing value):
> myMat
[,1] [,2] [,3] [,4] [,5]
[1,] 5 6 4 2 2
[2,] 1 7 3 1 6
[3,] 2 3 0 3 4
[4,] 2 2 4 3 4
Let’s imagine we want to count the number of values in each column that are greater than 3. There isn’t currently a function in R that will return “the number of values greater than 3,” so let’s create one:
> above3 <- function(vec) {
+ sum(vec > 3)
+ }
> above3( c(1, 6, 5, 1, 2, 3) ) # Try out our function
[1] 2
In the same way as before, we can now “apply” this function across dimensions of our matrix. So to calculate the number of values in each column that are greater than 3, we use the following code:
> apply(myMat, 2, above3) # Number of values > 3 in each column
[1] 1 2 2 0 3
In this example, we created the function above3 and “applied” it to our structure. If we wanted to use above3 for other uses, this is fine. However, if this is only something we want to do once, we can define the function directly in the apply call (so it is never created as an R object in our session). To achieve this, we replace the function object with the definition as follows:
> apply(myMat, 2, function(vec) {
+ sum(vec > 3)
+ })
[1] 1 2 2 0 3
Tip: One-Line Function Definitions
As before, we can omit the {} (curly brackets) if our function can be defined on a single line. As such, the preceding code could be rewritten as follows:
> apply(myMat, 2, function(vec) sum(vec > 3))
[1] 1 2 2 0 3
As a convention, we will use the curly brackets consistently throughout this hour.
As shown earlier, if we want to pass additional arguments, we can list them after the function call. We can do the same for the functions we write. For example, let’s update our function with a second argument to control the threshold value for counting:
> aboveN <- function(vec, N) {
+ sum(vec > N)
+ }
> someValues <- c(1, 6, 5, 1, 2, 3)
> aboveN( someValues, N = 3 ) # Number > 3
[1] 2
> aboveN( someValues, N = 5 ) # Number > 5
[1] 1
If we “apply” this function to columns of our matrix, we need to additionally pass the N argument:
> myMat # Print the matrix
[,1] [,2] [,3] [,4] [,5]
[1,] 5 6 4 2 2
[2,] 1 7 3 1 6
[3,] 2 3 0 3 4
[4,] 2 2 4 3 4
> apply(myMat, 2, aboveN, N = 3) # Number > 3
[1] 1 2 2 0 3
> apply(myMat, 2, aboveN, N = 4) # Number > 4
[1] 1 2 0 0 1
If, instead, we want to define the function directly in the apply call, we would need to list the additional arguments after the definition itself:
> apply(myMat, 2, function(vec, N) {
+ sum(vec > N)
+ }, N = 3)
[1] 1 2 2 0 3
Throughout this hour, we have used single-mode structures (matrices and arrays) as sample inputs to the apply function. However, because we can use any structure that has a dimension, we could also use apply with data frames. As an example, let’s “apply” the median function to columns of the airquality data frame:
> head(airquality) # First few rows
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
> apply(airquality, 2, median, na.rm = TRUE) # Median of each column
Ozone Solar.R Wind Temp Month Day
31.5 205.0 9.7 79.0 7.0 16.0
This command returns the median of each column (although, perhaps the “median Month” and “median Day” are not that useful). Now let’s consider a second example, this time using the iris data frame:
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> apply(iris, 2, median, na.rm = TRUE)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
NA NA NA NA NA
Warning messages:
1: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
argument is not numeric or logical: returning NA
2: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
argument is not numeric or logical: returning NA
3: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
argument is not numeric or logical: returning NA
4: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
argument is not numeric or logical: returning NA
5: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
argument is not numeric or logical: returning NA
This time, the output returns missing values along with a number of warning messages—but why is this?
When we apply functions over dimensions of single-mode structures (for example, matrices and arrays), we know the “mode” of data being passed to our function is the same each time it is called (that is, if we have a numeric matrix, we know that each column will necessarily be numeric).
By comparison, a data frame is a multi-mode structure, so each column may (or may not) be of the same mode. When we call “apply,” R will first break the data and store it in a single-mode structure—at this point, all the data is coerced to a single mode, which may or may not be a suitable input to the function.
With the airquality example, the apply function first structures the data into a single-mode (numeric) object and then applies the median function to each (numeric) column. With the iris data frame, the Species column is not a numeric column, so when the data is structured into a single-mode object, the resulting data is no longer numeric. We can see this in the following call, where we query the class of each column of the data:
> apply(iris, 2, class)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
"character" "character" "character" "character" "character"
So, when R then attempts to apply the median function to each column, the missing values and warning messages are produced.
So, in summary, we can use apply with data frames, but we have to take care that data over which we’re “applying” can be adequately combined into a single mode. For example, if we wanted to calculate the median of all numeric columns of iris, we could use this approach:
> # Apply median function over the first 4 columns of iris
> apply(iris[,-5], 2, median, na.rm = TRUE)
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.80 3.00 4.35 1.30
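The coercion described above can be checked directly. The sketch below assumes, as the R documentation for apply describes, that a data frame is converted with as.matrix before the function is applied. It also shows one possible way to pick out numeric columns without hard-coding their positions; vapply is another member of the “apply” family, and the names numericCols and irisMedians are illustrative only.

```r
# Sketch: inspect the single-mode structure that a data frame is coerced to
is.numeric(as.matrix(airquality))   # Every column is numeric, so the matrix is numeric
is.character(as.matrix(iris))       # The Species column forces a character matrix

# Identify the numeric columns programmatically rather than by position.
# 'numericCols' and 'irisMedians' are illustrative names.
numericCols <- vapply(iris, is.numeric, logical(1))
irisMedians <- apply(iris[, numericCols], 2, median, na.rm = TRUE)
irisMedians
```

Selecting columns by type in this way avoids relying on the non-numeric column being in position 5.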
The lapply function applies functions to each element of a list and always returns a list structure as its output. For example, let’s create a list of numeric vectors and calculate the median of each element. First, we’ll create the list:
> myList <- list(P1 = rpois(10, 1), P3 = rpois(10, 3), P5 = rpois(10, 5))
> myList
$P1
[1] 1 2 2 2 1 0 0 1 1 4
$P3
[1] 0 1 4 0 2 3 2 2 1 6
$P5
[1] 5 4 9 6 6 4 6 5 3 5
To use the lapply function, we simply pass the list and the function to apply (there is no “margin” here because the data is already “split” into list elements):
> lapply(myList, median)
$P1
[1] 1
$P3
[1] 2
$P5
[1] 5
In the preceding example, the lapply call itself was actually a lot simpler (and more concise) than the code used to create the sample list. In a slight departure, let’s quickly look at a simple function that creates lists (which we could then use as examples in lapply). This function is called split.
The split function divides a data structure into separate parts based on one or more grouping variables. The output from a split is a list. As a first example, let’s split the Wind column from airquality based on levels of Month. We can achieve that by calling split with the Wind column as the first input and the “grouping” column (Month) as the second argument. Note that the output is a list:
> spWind <- split(airquality$Wind, airquality$Month)
> spWind
$`5`
[1] 7.4 8.0 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 6.9 9.7 9.2
[14] 10.9 13.2 11.5 12.0 18.4 11.5 9.7 9.7 16.6 9.7 12.0 16.6 14.9
[27] 8.0 12.0 14.9 5.7 7.4
$`6`
[1] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 10.9 9.2 8.0
[14] 13.8 11.5 14.9 20.7 9.2 11.5 10.3 6.3 1.7 4.6 6.3 8.0 8.0
[27] 10.3 11.5 14.9 8.0
$`7`
[1] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 8.6 14.3 14.9
[14] 14.9 14.3 6.9 10.3 6.3 5.1 11.5 6.9 9.7 11.5 8.6 8.0 8.6
[27] 12.0 7.4 7.4 7.4 9.2
Given that this structure is a list, it is a suitable input to the lapply function. Let’s calculate the median value of each element of spWind:
> lapply(spWind, median)
$`5`
[1] 11.5
$`6`
[1] 9.7
$`7`
[1] 8.6
$`8`
[1] 8.6
$`9`
[1] 10.3
This result is, therefore, the median Wind value for each level of Month, or the “median Wind by Month.”
Note: Nested Calls to lapply and split
In the preceding example, we separated the split and lapply calls for clarity. We could, of course, combine them into a single call, as follows:
> lapply(split(airquality$Wind, airquality$Month), median)
Or
> with(airquality, lapply(split(Wind, Month), median))
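Because lapply always returns a list, a result such as the “median Wind by Month” can be awkward to use in further calculations. One option, sketched here as an aside rather than as part of the original example, is to flatten the list with unlist. The names medianWind and medianWindVec are illustrative.

```r
# Sketch: flatten the list returned by lapply into a named numeric vector.
# 'medianWind' and 'medianWindVec' are illustrative names.
medianWind <- with(airquality, lapply(split(Wind, Month), median))

medianWindVec <- unlist(medianWind)   # One element per month, named by month number
medianWindVec
```

The element names of the list (the month numbers) are carried over as the names of the resulting vector.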
In the preceding example, we split a vector based on levels specified in another vector. The split function can also be used to divide data frames. For example, let’s split our airquality data based on Month:
> spAir <- split(airquality, airquality$Month) # Split the data
> length(spAir) # Length of list
[1] 5
> names(spAir) # Element names
[1] "5" "6" "7" "8" "9"
> head(spAir[[1]]) # First element
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
As you can see, this creates a list of length 5 where each element contains a data frame containing data for only one month. Now let’s use lapply to apply a function to each data frame stored in this list. We need to apply a function that will perform an operation on a data frame, so let’s return the first three rows in each element of the list using head:
> lapply(spAir, head, n = 3)
$`5`
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
$`6`
Ozone Solar.R Wind Temp Month Day
32 NA 286 8.6 78 6 1
33 NA 287 9.7 74 6 2
34 NA 242 16.1 67 6 3
$`7`
Ozone Solar.R Wind Temp Month Day
62 135 269 4.1 84 7 1
63 49 248 9.2 85 7 2
64 32 236 9.2 81 7 3
$`8`
Ozone Solar.R Wind Temp Month Day
93 39 83 6.9 81 8 1
94 9 24 13.8 81 8 2
95 16 77 7.4 82 8 3
$`9`
Ozone Solar.R Wind Temp Month Day
124 96 167 6.9 91 9 1
125 78 197 5.1 92 9 2
126 73 183 2.8 93 9 3
Perhaps instead we could lapply our own function to each data frame. For example, let’s create a function that calculates column medians for the Ozone, Solar.R, Wind, and Temp variables:
> lapply(spAir, function(df) {
+ apply(df[,1:4], 2, median, na.rm = TRUE)
+ })
$`5`
Ozone Solar.R Wind Temp
18.0 194.0 11.5 66.0
$`6`
Ozone Solar.R Wind Temp
23.0 188.5 9.7 78.0
$`7`
Ozone Solar.R Wind Temp
60.0 253.0 8.6 84.0
$`8`
Ozone Solar.R Wind Temp
52.0 197.5 8.6 82.0
$`9`
Ozone Solar.R Wind Temp
23.0 192.0 10.3 76.0
Here, each element of spAir
is passed into the function we defined as input: df
. Then, for each df
, we calculate the column medians of the first four columns.
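The same idea can be sketched with a named function instead of an anonymous one, which also lets us pass extra arguments through lapply. The helper name colMedians and its cols argument are assumptions made for this illustration; they are not defined elsewhere in this hour.

```r
# Split the data by Month, as in the text
spAir <- split(airquality, airquality$Month)

# Hypothetical helper: column medians for a chosen set of columns,
# ignoring missing values
colMedians <- function(df, cols = 1:4) {
  apply(df[, cols], 2, median, na.rm = TRUE)
}

# Each data frame is passed as the first (unnamed) input;
# cols is passed on as an additional named argument
medByMonth <- lapply(spAir, colMedians, cols = c("Wind", "Temp"))
medByMonth[["5"]]   # Medians of Wind and Temp for Month 5
```

Naming the function makes it easier to reuse and to test on a single data frame before applying it to the whole list.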
Note: Splitting on Multiple Variables
You’ve seen that the split
function can be used to divide data structures (such as vectors or data frames) into elements of a list based on values of another vector. We can split by more than one variable by passing a list of factors:
> split(airquality$Wind, list(airquality$Month, cut(airquality$Temp, 3)))
$`5.(56,69.7]`
[1] 7.4 11.5 14.3 14.9 8.6 13.8 20.1 8.6 9.7 9.2 10.9 13.2 11.5 12.0 18.4 11.5 9.7
[18] 9.7 9.7 12.0 16.6 14.9 8.0 12.0
$`6.(56,69.7]`
[1] 16.1 9.2
$`7.(56,69.7]`
numeric(0)
...
This could then be passed to lapply
to calculate summaries by more than one grouping variable.
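As a minimal sketch of that last step, the split result can be passed straight to lapply. The drop = TRUE argument to split is used here so that empty Month/Temp combinations are omitted; the object names windGroups and windMedians are chosen only for illustration.

```r
# Split Wind by Month and by three equal-width Temp bands
windGroups <- split(airquality$Wind,
                    list(airquality$Month, cut(airquality$Temp, 3)),
                    drop = TRUE)   # drop = TRUE omits empty combinations

# Median Wind for each Month/Temp combination that has data
windMedians <- lapply(windGroups, median)
windMedians[[1]]   # Month 5, lowest Temp band
```

Without drop = TRUE, the empty combinations would be kept as numeric(0) elements, and median would return NA for each of them.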
At the start of this section, we said that the lapply
function will apply a function to each element of a list. However, if we instead pass a vector to the lapply
function, it will convert it to a list using the as.list
function as follows:
> as.list(1:5)
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] 4
[[5]]
[1] 5
That means we can use lapply
to apply a function to each element of a vector. Let’s consider a simple example, where we apply the rnorm
function to values 1 to 5:
> lapply(1:5, rnorm)
[[1]]
[1] 0.8168998
[[2]]
[1] -0.8863575 -0.3315776
[[3]]
[1] 1.1207127 0.2987237 0.7796219
[[4]]
[1] 1.4557851 -0.6443284 -1.5531374 -1.5977095
[[5]]
[1] 1.8050975 -0.4816474 0.6203798 0.6121235 -0.1623110
This is equivalent to the following:
> list(
+ rnorm(1),
+ rnorm(2),
+ rnorm(3),
+ rnorm(4),
+ rnorm(5)
+ )
[[1]]
[1] 0.8118732
[[2]]
[1] 2.196834 2.049190
[[3]]
[1] 1.6324456 0.2542712 0.4911883
[[4]]
[1] -0.32408658 -1.66205024 1.76773385 0.02580105
[[5]]
[1] 1.1285108 -2.3803581 -1.0602656 0.9371405 0.8544517
Let’s add a second argument to rnorm
. For example, let’s specify a mean for the Normal distribution:
> lapply(1:5, rnorm, mean = 10)
[[1]]
[1] 11.46073
[[2]]
[1] 8.586901 10.567403
[[3]]
[1] 10.583188 8.693201 9.459614
[[4]]
[1] 11.947693 10.053590 10.351663 9.329023
[[5]]
[1] 10.277954 10.691171 10.823795 12.145065 7.653056
When the lapply
function (like all “apply” functions) passes the data to the function, the data is passed as the first input and is not named. So, the last example is equivalent to this:
> list(
+ rnorm(1, mean = 10),
+ rnorm(2, mean = 10),
+ rnorm(3, mean = 10),
+ rnorm(4, mean = 10),
+ rnorm(5, mean = 10)
+ )
[[1]]
[1] 10.14959
[[2]]
[1] 8.657469 10.553303
[[3]]
[1] 11.589963 9.413120 8.167623
[[4]]
[1] 10.888139 11.593488 10.516855 8.704328
[[5]]
[1] 10.054616 9.215351 8.950647 12.330512 11.402705
Let’s quickly remind ourselves of the arguments of rnorm
:
> args(rnorm)
function (n, mean = 0, sd = 1)
NULL
The first argument to rnorm
, the number of values to sample, is called n
. Although the lapply
function is not “naming” the first input, the order-based method for specifying arguments in a function means that it is this “n” input that accepts each of the values, 1 to 5. What if we, instead, specify the first argument (n
) as an extra parameter?
> lapply(1:5, rnorm, n = 5)
[[1]]
[1] 1.9426009 1.8262583 0.1884595 1.4762483 2.0212584
[[2]]
[1] 2.645383 3.043144 1.695631 4.477111 2.971221
[[3]]
[1] 4.867099 3.672042 2.692047 3.536524 3.824870
[[4]]
[1] 3.036099 3.144917 5.886947 3.608181 3.019367
[[5]]
[1] 5.687332 4.494956 7.157720 4.400202 4.305453
This produces a slightly different output, where each element of the list is a sample of five values from a Normal distribution. Here, the lapply
call is equivalent to the following:
> list(
+ rnorm(1, n = 5),
+ rnorm(2, n = 5),
+ rnorm(3, n = 5),
+ rnorm(4, n = 5),
+ rnorm(5, n = 5)
+ )
[[1]]
[1] 1.2239254 -0.1562233 1.4224185 -0.3247553 1.1410843
[[2]]
[1] 1.463952 1.688394 3.556110 1.551967 2.321124
[[3]]
[1] 1.769828 1.675941 4.261242 4.319232 2.919246
[[4]]
[1] 3.494910 3.947846 4.628861 6.180002 3.930983
[[5]]
[1] 6.544864 6.321452 5.322152 6.530955 4.578760
In this case, we are explicitly naming the “n
” input and setting it to 5, which explains why five samples are being returned in each list element. Therefore, the values we pass to the function (1 to 5) are instead used as the second input: the mean of the distribution from which to sample. In other words, this code returns the following:
Five samples from a Normal distribution with mean 1
Five samples from a Normal distribution with mean 2
Five samples from a Normal distribution with mean 3
Five samples from a Normal distribution with mean 4
Five samples from a Normal distribution with mean 5
As a natural extension, if we specify the n
and mean
inputs, then each value of 1 to 5 will move to the third argument (the standard deviation).
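A short sketch of that extension follows. The seed and the sample size of 1000 are arbitrary choices made so that the effect on the standard deviation is easy to see.

```r
set.seed(42)  # Arbitrary seed so the sketch is reproducible

# n and mean are both named, so each value of 1:5 is matched to the
# next unnamed argument of rnorm: sd
samples <- lapply(1:5, rnorm, n = 1000, mean = 10)

# The sample standard deviations should be roughly 1, 2, 3, 4, and 5
lapply(samples, sd)
```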
As you saw in Hour 4, data frames are structured as lists of vectors. Therefore, we can use lapply
to apply functions to each column of a data frame as follows:
> lapply(airquality, median, na.rm = TRUE)
$Ozone
[1] 31.5
$Solar.R
[1] 205
$Wind
[1] 9.7
$Temp
[1] 79
$Month
[1] 7
$Day
[1] 16
This is a similar process to using apply
to apply functions over columns of a data frame. The two primary differences are as follows:
The lapply
function always returns a list.
When using apply
, the structures are first put into a single-mode structure before processing, whereas the lapply
function processes each column separately, without first combining the columns.
The last point here can be illustrated by the following example, where we look at the class
of each column in our data frame:
> apply(airquality, 2, class)
Ozone Solar.R Wind Temp Month Day
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
> lapply(airquality, class)
$Ozone
[1] "integer"
$Solar.R
[1] "integer"
$Wind
[1] "numeric"
$Temp
[1] "integer"
$Month
[1] "integer"
$Day
[1] "integer"
Note that, by the time the class
function is applied in our first example, the apply
function has already structured the data into a single-mode structure (so all data is forced to be of the same mode). With lapply
, this coercion is not done, so we see instances of “numeric” (the “Wind” column) and “integer” column classes reported.
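The coercion is even more visible when a data frame mixes numeric and non-numeric columns. As a small illustration, consider the built-in iris data, which has four numeric columns and one factor column:

```r
# apply first converts the data frame to a matrix; because Species is
# not numeric, every column becomes character
apply(iris, 2, class)

# lapply works on each column as it is stored, so the original
# classes are reported
lapply(iris, class)
```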
The sapply
function is a simple wrapper for the lapply
function. In fact, the call to lapply
can be clearly seen on the second line of the sapply
function body:
> sapply
function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
{
FUN <- match.fun(FUN)
answer <- lapply(X = X, FUN = FUN, ...)
if (USE.NAMES && is.character(X) && is.null(names(answer))) names(answer) <- X
if (!identical(simplify, FALSE) && length(answer))
simplify2array(answer, higher = (simplify == "array"))
else answer
}
Therefore, as with lapply
, the sapply
function allows us to apply functions to elements of a list (or vector). The primary difference is that, whereas lapply
always returns a list, sapply
will (by default) attempt to simplify the return object using the simplify2array
function.
To illustrate this, let’s look back at an earlier example where we use lapply
and split
to calculate the median values of Wind
by Month
:
> lapply(split(airquality$Wind, airquality$Month), median)
$`5`
[1] 11.5
$`6`
[1] 9.7
$`7`
[1] 8.6
$`8`
[1] 8.6
$`9`
[1] 10.3
If we replace the lapply
function with the sapply
function, we get a simpler output (in this case, a named vector):
> sapply(split(airquality$Wind, airquality$Month), median)
5 6 7 8 9
11.5 9.7 8.6 8.6 10.3
For another example, let’s use sapply
to see the class of each column of the iris
data frame:
> sapply(iris, class)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
"numeric" "numeric" "numeric" "numeric" "factor"
The return values from sapply
can sometimes be rather unpredictable. That is because sapply
will attempt to simplify the return structure (which may result in a nicely formatted structure) but is often not able to simplify the return (in which case it stays as a list). Table 9.4 summarizes the return values, which depend on the number of values returned from the “applied” function.
Some examples showing the various return objects are provided here:
> myList <- list(P1 = rpois(5, 1), P3 = rpois(5, 3), P5 = rpois(5, 5))
>
> # Function that (always) returns a single value > vector output
> sapply(myList, median)
P1 P3 P5
1 3 4
> # Function that (always) returns 2 values > matrix output
> sapply(myList, range)
P1 P3 P5
[1,] 0 1 3
[2,] 3 4 6
> # Function that (always) returns 5 values > matrix output
> sapply(myList, quantile)
P1 P3 P5
0% 0 1 3
25% 0 3 4
50% 1 3 4
75% 2 3 5
100% 3 4 6
> # Function that can return a variable number of values > list output
> sapply(myList, function(X) X [ X > 2 ])
$P1
[1] 3
$P3
[1] 3 3 3 4
$P5
[1] 3 5 4 4 6
> # Function that can return a variable number of values
> # BUT it happens that the return values are of the same
> # length in this instance > simplification occurs
> sapply(myList, function(X) min(X):max(X))
P1 P3 P5
[1,] 0 1 3
[2,] 1 2 4
[3,] 2 3 5
[4,] 3 4 6
At this point, you may be wondering why we’d ever need to use lapply
given that sapply
returns a “simpler” output.
The key reason for using lapply
instead of sapply
is that you always know a list will be returned, whereas the returns from sapply
can be unpredictable, particularly when the function applied can return a variable number of values (as seen previously). When we write code, we need to be sure of the structure returned so we can write code to deal with that structure—for example, imagine writing a script where you expect the return output from an sapply
call to be a list, but then it is unexpectedly simplified to an array (as seen in the last example).
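Two ways of making the return structure predictable are sketched below. The first uses the simplify argument seen in the sapply function body; the second uses vapply, a base R function not covered in the text above, so treat it as a suggestion rather than part of this hour’s material. The seed is arbitrary.

```r
set.seed(1)  # Arbitrary seed for reproducibility
myList <- list(P1 = rpois(5, 1), P3 = rpois(5, 3), P5 = rpois(5, 5))

# simplify = FALSE switches off simplification, so a list always comes back
res <- sapply(myList, function(X) min(X):max(X), simplify = FALSE)
class(res)

# vapply states the expected return up front (here, one numeric value per
# element) and stops with an error if the function returns anything else
vapply(myList, median, FUN.VALUE = numeric(1))
```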
More generally, there may be times when you explicitly don’t want to try and simplify the output. Consider a situation where we have a list containing two matrices:
> matList <- list(
+ P3 = matrix( rpois(8, 3), nrow = 2),
+ P5 = matrix( rpois(8, 5), nrow = 2)
+ )
> matList
$P3
[,1] [,2] [,3] [,4]
[1,] 8 1 1 4
[2,] 4 2 8 2
$P5
[,1] [,2] [,3] [,4]
[1,] 5 4 3 2
[2,] 1 7 7 1
Now let’s use our lapply
and sapply
functions to extract the first row of each matrix:
> lapply(matList, head, 1)
$P3
[,1] [,2] [,3] [,4]
[1,] 8 1 1 4
$P5
[,1] [,2] [,3] [,4]
[1,] 5 4 3 2
> sapply(matList, head, 1)
P3 P5
[1,] 8 5
[2,] 1 4
[3,] 1 3
[4,] 4 2
As you can see, the lapply
function has returned a list, whereas the sapply
function has simplified the output by combining the results into a single (matrix) structure. If these two matrices were measurements on two different systems, we may want to ensure the results are analyzed separately, so combining them into a single structure is not desirable.
The tapply
function allows us to apply a function to elements of a vector, grouped by levels of one or more other variables. The primary arguments to tapply
are described in Table 9.5.
Let’s look at a simple example of tapply
used to calculate the median Wind
by Month
using the airquality
data:
> tapply(airquality$Wind, airquality$Month, median)
5 6 7 8 9
11.5 9.7 8.6 8.6 10.3
As you can see, in this case tapply
returns a named vector of values, containing the median Wind
values by Month
.
Note: Similarity to split + sapply
This is very similar to an earlier example using sapply
and split
:
> sapply(split(airquality$Wind, airquality$Month), median)
5 6 7 8 9
11.5 9.7 8.6 8.6 10.3
In fact, tapply
is primarily a wrapper for a call to the split
and sapply
(technically, lapply
with a simplify step) functions.
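The two approaches can be compared directly. In this sketch (object names are illustrative), the values and names match; the only difference is the structure that holds them.

```r
byTapply <- tapply(airquality$Wind, airquality$Month, median)
bySapply <- sapply(split(airquality$Wind, airquality$Month), median)

# Same values and same names; tapply returns a one-dimensional array,
# whereas sapply returns a plain named vector
identical(as.vector(byTapply), as.vector(bySapply))
identical(names(byTapply), names(bySapply))
```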
We can specify more than one grouping variable by which to process the data—this is achieved by providing a list of factors instead of a single factor. Let’s calculate the median Wind
by Month
and grouped Temp
(which we’ll create using the cut
function):
> tapply(airquality$Wind,
+ list(airquality$Month, cut(airquality$Temp, 3)), median)
(56,69.7] (69.7,83.3] (83.3,97]
5 11.50 8.0 NA
6 12.65 9.7 9.2
7 NA 9.2 7.4
8 NA 10.3 7.4
9 12.05 10.3 6.0
The return from this function is a matrix with the levels of the first grouping variable (Month
) set as the rows (dimension 1) and the levels of the second grouping variable (Temp
) in columns (dimension 2).
Caution: Missing Values in Return Structure
In the preceding example, a number of missing values have been returned. Usually when we see a missing value, it represents a value that “exists” but that we do not know. Consider the missing value for high temperature values in Month 5 in this example. It is difficult to know whether this value is generated because
There were Wind
values in Month 5 for high temperatures, but they contained missing values so we do not know the median value.
There were actually no values in Month 5 for high temperatures (that is, there is no data).
In fact, in this case, the latter is true—there were no days in Month 5 when the temperature went above 83.3 degrees Fahrenheit. So, this missing value represents a “lack” of data. However, care should be taken when interpreting the results.
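One way to tell the two cases apart, sketched here under the assumption that a simple count per cell is enough, is to tabulate the number of observations in each Month/Temp combination with table. The object names are illustrative.

```r
tempGroup <- cut(airquality$Temp, 3)

# Number of observations in each Month/Temp cell
counts <- table(airquality$Month, tempGroup)
counts

# Median Wind for the same cells
meds <- tapply(airquality$Wind, list(airquality$Month, tempGroup), median)

# Do the missing medians line up exactly with the empty cells?
all(is.na(meds) == (counts == 0))
```

A cell with a count of zero has no data at all, whereas a missing median in a cell with a nonzero count would point to missing Wind values.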
Let’s extend this example a little further, calculating the median Wind
by Month
, levels of Temp
, and levels of Solar.R
:
> tapply(airquality$Wind,
+ list(airquality$Month, cut(airquality$Temp, 3), cut(airquality$Solar.R, 2)),
+ median)
, , (6.67,170]
(56,69.7] (69.7,83.3] (83.3,97]
5 12.60 10.3 NA
6 9.20 8.0 NA
7 NA 8.6 11.45
8 NA 9.7 8.60
9 13.45 10.3 7.40
, , (170,334]
(56,69.7] (69.7,83.3] (83.3,97]
5 10.90 11.15 NA
6 16.10 12.65 9.2
7 NA 9.70 7.4
8 NA 10.90 8.0
9 12.05 10.30 4.6
This now creates a three-dimensional array of output, where each of our three grouping variables is aligned to a dimension.
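Because the result is an ordinary array, it can be inspected and subscripted one dimension at a time. A brief sketch, storing the result in an illustrative object called windArr:

```r
windArr <- tapply(airquality$Wind,
                  list(airquality$Month,
                       cut(airquality$Temp, 3),
                       cut(airquality$Solar.R, 2)),
                  median)

dim(windArr)        # Month levels, Temp bands, Solar.R bands
windArr["5", , 1]   # Month 5, all Temp bands, first Solar.R band
windArr[, , 2]      # The full Month-by-Temp matrix for the second band
```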
In the preceding example, we used the median
function, which always returns a single value, to illustrate the use of tapply
. If, instead, our function returns multiple values, the outputs from tapply
can be unexpected and, occasionally, highly complex. Let’s start with a simple example, this time calculating quantiles of Wind
values by Month
:
> tapply(airquality$Wind, airquality$Month, quantile)
$`5`
0% 25% 50% 75% 100%
5.70 8.90 11.50 14.05 20.10
$`6`
0% 25% 50% 75% 100%
1.7 8.0 9.7 11.5 20.7
$`7`
0% 25% 50% 75% 100%
4.1 6.9 8.6 10.9 14.9
$`8`
0% 25% 50% 75% 100%
2.3 6.6 8.6 11.2 15.5
$`9`
0% 25% 50% 75% 100%
2.800 7.550 10.300 12.325 16.600
We can see that, with multiple return values, no simplification is performed and a list is returned. This is the equivalent of the following:
> lapply(split(airquality$Wind, airquality$Month), quantile)
$`5`
0% 25% 50% 75% 100%
5.70 8.90 11.50 14.05 20.10
$`6`
0% 25% 50% 75% 100%
1.7 8.0 9.7 11.5 20.7
$`7`
0% 25% 50% 75% 100%
4.1 6.9 8.6 10.9 14.9
$`8`
0% 25% 50% 75% 100%
2.3 6.6 8.6 11.2 15.5
$`9`
0% 25% 50% 75% 100%
2.800 7.550 10.300 12.325 16.600
Now let’s extend this example to calculate the quantiles by Month
and (grouped) Temp
:
> tapply(airquality$Wind,
+ list(airquality$Month, cut(airquality$Temp, 3)), quantile)
(56,69.7] (69.7,83.3] (83.3,97]
5 Numeric,5 Numeric,5 NULL
6 Numeric,5 Numeric,5 Numeric,5
7 NULL Numeric,5 Numeric,5
8 NULL Numeric,5 Numeric,5
9 Numeric,5 Numeric,5 Numeric,5
The “simplification” process has now forced the outputs into a matrix, creating a “matrix of lists,” which is a particularly complex and unhelpful structure:
> X <- tapply(airquality$Wind,
+ list(airquality$Month, cut(airquality$Temp, 3)), quantile)
> class(X)
[1] "matrix"
> X
(56,69.7] (69.7,83.3] (83.3,97]
5 Numeric,5 Numeric,5 NULL
6 Numeric,5 Numeric,5 Numeric,5
7 NULL Numeric,5 Numeric,5
8 NULL Numeric,5 Numeric,5
9 Numeric,5 Numeric,5 Numeric,5
> X[1,1]
[[1]]
0% 25% 50% 75% 100%
7.400 9.700 11.500 13.925 20.100
As with sapply
, the returns from tapply
can sometimes be difficult to predict. Table 9.6 summarizes the return objects from tapply
based on the number of return values from a function and the number of grouping variables.
Given that tapply
may return unexpected (and/or highly complex) values, we recommend the use of lapply
and split
instead of tapply
, unless we can guarantee the number of return values from the function (so we can rely on the outputs).
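As a sketch of this recommendation, the quantiles of Wind by Month can be calculated with split and lapply, and then combined explicitly once we know the shape of each result. The use of do.call with rbind is one possible way to combine the pieces, not the only one; the object names are illustrative.

```r
# Quantiles of Wind by Month: always a list, one element per month
windQ <- lapply(split(airquality$Wind, airquality$Month), quantile)

# Because quantile returns five values every time here, the list can be
# bound into a matrix explicitly, with one row per month
windQMat <- do.call(rbind, windQ)
windQMat
```

The simplification step is now something we choose to do, rather than something that may or may not happen.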
Tip: The plyr Package
The plyr package was developed and is maintained by popular R package author, Hadley Wickham. It was first released to CRAN in 2008 and is still one of the most popular R packages on CRAN, with a huge number of packages depending on plyr functionality. The plyr package offers a more consistent “apply” syntax based on the input and output structures to which we apply a function. Functions follow the form [i][o]ply, where i
and o
represent the input and output formats, respectively. For example, the function llply
expects a list input and produces a list output:
> library(plyr) # Load the package (it must be installed first)
> air <- split(airquality, airquality$Month)
> llply(air, dim)
In addition to providing an alternative apply framework, plyr offers data manipulation functionality such as merging and aggregation. However, for those working with data frames, the dplyr package that you will be introduced to in Hour 12 provides a much more user-friendly approach to data manipulation and aggregation.
In this hour, we have looked at a number of ways we can apply simple functions to data structures in a more sophisticated way. Specifically, we’ve looked at
The use of loops to iterate over data objects
The rich set of “apply” functions
Together, these provide a range of capabilities for summarizing data and performing tasks in a repetitive manner. In later hours, we’ll extend this to cover higher-level mechanisms for processing and aggregating data, with a focus on summarizing data frames. In Hour 18, we’ll also look again at loops and “apply” functions with respect to coding efficiency and performance.
Q. How can I stop a “for” loop if a certain condition is met?
A. You can stop the for loop using the break
construct, as follows:
> for (i in 1:100) {
+ cat("\nHello") # Writing a message
+ if (runif(1) > .9) {
+ cat(" - STOP!!")
+ break # 10% chance of stopping each time
+ }
+ }
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello - STOP!!
Q. How do I stop the process if I get stuck in an infinite “while” loop?
A. You can use the Esc key (in interactive mode) to stop the process.
Q. How could I apply a function over multiple lists at the same time?
A. The mapply
function is a multivariate version of sapply
, which allows us to apply functions over multiple lists at the same time. For example, let’s apply the rpois
function over elements 1:5
(for the number of values to sample) and 5:1
(for the lambda values to use):
> mapply(rpois, n = 1:5, lambda = 5:1)
[[1]]
[1] 2
[[2]]
[1] 7 3
[[3]]
[1] 4 1 1
[[4]]
[1] 1 0 2 4
[[5]]
[1] 3 0 1 0 2
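Like sapply, mapply simplifies its result when it can. The following sketch uses a small anonymous function, chosen only for illustration, that returns one value per call:

```r
# Each call returns one value, so the result simplifies to a vector
mapply(function(x, y) x + y, 1:3, 4:6)

# SIMPLIFY = FALSE keeps the result as a list, as lapply would
mapply(function(x, y) x + y, 1:3, 4:6, SIMPLIFY = FALSE)
```

In the rpois example, each call returns a different number of values, which is why a list is returned there.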
Q. How performant is a “for” loop compared to, say, an “apply” function?
A. Generally, the R language is optimized for vectorized operations, and it is quite possible to write poorly performing code using (nested) for loops. The “apply” family of functions can add some gains in terms of both performance and code maintenance. This will be discussed further in Hour 18.
The workshop contains quiz questions and exercises to help you solidify your understanding of the material covered. Try to answer all questions before looking at the “Answers” section that follows.
1. What is the difference between a “for” and a “while” loop?
2. If you use a for loop to iterate over a vector of (character) column names, how would you use each value to reference a column in a data frame?
3. When using the apply
function, what does the MARGIN
argument control?
4. How do you pass additional arguments to a function you wish to “apply”?
5. What is the difference between sapply
and lapply
?
6. What does the split
function do, and how can you use it in conjunction with lapply
/sapply
?
7. When using tapply
, how do you specify that a summary is to be performed “by” more than one variable?
1. A “for” loop iterates over a predefined set of values. A “while” loop instead iterates until a specified condition is no longer true.
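2. Use each value inside double square brackets (for example, df[[colName]]) or as the column index in single square brackets (for example, df[, colName]); the $ operator will not work, because it does not evaluate a variable holding the column name. Separately, remember that if the condition in an if statement results in a single missing value, then an error is returned: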
> testMissing <- function(X) {
+ if (X > 0) cat("Success")
+ }
> testMissing(NA)
Error in if (X > 0) cat("Success") :
missing value where TRUE/FALSE needed
If you use the all
function with a condition that contains any missing values, the result is missing and therefore will also result in an error (since you do not know if “all” the conditions are met):
> allMissings <- rep(NA, 5) # All missing values
> someMissings <- c(NA, 1:4) # Some missing values
> all(allMissings > 0)
[1] NA
> all(someMissings > 0)
[1] NA
If we use the any
function with a condition that contains all missing values, the result is a missing value. If, however, you use the any
function with a vector where not all values are missing, some conditions may be met:
> any(allMissings > 0)
[1] NA
> any(someMissings > 0)
[1] TRUE
3. The MARGIN
argument controls the dimension over which you want to apply your function (for example, 1 for rows, 2 for columns).
4. Each “apply” function has an ellipsis argument where you list additional arguments, for example, apply(X, 1, min, na.rm = TRUE)
.
5. The lapply
function applies a function to elements of a list (or vector) and (always) returns its results in a list. The sapply
function performs exactly the same actions but, where possible, will try to simplify the output (for example, as a vector or array).
6. The split
function will take a data object (typically a vector or data frame) and break it into parts based on one or more grouping variables, storing the results as a list. When the results are “broken” into a list structure, we can use lapply
or sapply
to apply a function to each element—for example, you can calculate the mean Y by levels of X using the following:
sapply(split(Y, X), mean)
7. You can specify multiple “by” variables using a list as follows:
tapply(Y, list(X1, X2), mean)
1. Create a “for” loop that iteratively prints each element of LETTERS
on a new line.
2. Create a “for” loop that prints the mean mpg
value (from the mtcars
dataset) for each unique level of the carb
variable.
3. Look at the provided WorldPhones
matrix, which contains the total number of phones in different regions of the world between 1951 and 1961. Use the apply
function to calculate the total number of phones by year and the maximum number of phones by region.
4. Create a list containing three numeric vectors. Use lapply
or sapply
to print the median value from each element of the list.
5. Use split
together with sapply
to calculate the median value of mpg
(from the mtcars
data) by levels of carb
.
6. Use split
together with lapply
to calculate a summary (?summary
) of the iris
data by levels of Species
.