Hour 9. Loops and Summaries


What You’ll Learn in This Hour:

How to perform iterative “looping” techniques in R

How to apply functions to complex data structures

How to calculate metrics “by” one or more variables


Throughout this book you have seen how to use, and even create, simple R functions. In this hour, we are going to use simple functions and code in a more “applied” fashion. This allows us to perform tasks repeatedly over sections of our data without the need to produce verbose, repetitive code.

Repetitive Tasks

Imagine we want to perform the same task multiple times—for example, on each row of some dataset, df. We might first create a simple function, performAction, and then write a verbose R script such as this:

> performAction(df[1,])  # Perform action on first row
> performAction(df[2,])  # Perform action on second row
> performAction(df[3,])  # Perform action on third row
> performAction(df[4,])  # Perform action on fourth row
...

Writing code in this way can lead to large scripts that can be very difficult to manage; for example, if you need to change the name of the function, you need to do it in a variety of places. This code is also not overtly reusable because we’ll need to specify a call for each row in our data—if we try to apply this code to a different data structure, it may not have the same number of rows.

Instead of writing scripts in this manner, we can use a “loop.”
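As a minimal sketch of where this is heading, the repeated calls above could be collapsed into a single loop over row numbers, using the for syntax introduced later in this hour. Here, df is a small made-up data frame and performAction is a hypothetical stand-in that simply prints a row total; neither comes from a real dataset or package.

```r
# Hypothetical data and function, for illustration only
df <- data.frame(x = 1:4, y = c(10, 20, 30, 40))
performAction <- function(row) {
  cat("\n Row total =", sum(row))
}

# One call site, however many rows df has
for (i in seq_len(nrow(df))) {
  performAction(df[i, ])
}
```

Because the loop asks df how many rows it has, the same code should work unchanged on a data frame with a different number of rows.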

What Is a Loop?

A loop is a programming structure that allows us to perform the same task in a repetitive manner. R supports several types of loops; this hour covers two of them: the “for” loop and the “while” loop.

What Is a For Loop?

A “for” loop will perform the same action on each of a pre-specified set of inputs. For example, imagine we have a bag containing 100 potato chips and we have decided we’re going to eat every one. In this case, our “for” loop may be structured as follows:

For each of our 100 chips:
    Reach into the bag
    Remove a single potato chip
    Eat the potato chip

This is a simple repetitive pattern. However, we do need to pre-specify the inputs over which we’re going to iterate. For example, if we didn’t know exactly how many potato chips were in the bag, we could not use this approach.

What Is a While Loop?

By contrast, a “while” loop allows us to perform the same action in a repeated manner until a condition is met. For example, if we had a bag of potato chips and we wanted to eat the contents, we may write a “while” loop as follows:

While there are still chips left in the bag:
    Reach into the bag
    Remove a single potato chip
    Eat the potato chip

Again, this is a simple structure and will work well in our case. However, we need to be sure no one hands us a bag with an infinite number of potato chips, in which case we’ll never “leave” the loop and just keep on eating.

The for Function

The for function in R allows us to implement a “for” loop. The structure of the loop is as follows:

for (variable in set_of_values) {
  # do this
}

The variable defined will iteratively take each value of the set_of_values, and the body of the “for” loop will then be executed. Here’s an example:

> for (i in 1:5) {
+   cat("\n Hello")  # Say Hello
+ }

 Hello
 Hello
 Hello
 Hello
 Hello

In this very simple example, i is iteratively set to each value in vector 1:5 and then the body of the loop is executed—the result is to print the message “Hello” five times.


Note: Using Curly Brackets

In this example, we are using curly brackets to encapsulate the body of code. As with writing functions, we can omit these if the body of code is a single line; therefore, this example could be rewritten as follows:

> for (i in 1:5) cat("\n Hello")  # Say Hello

As a convention, and as good practice, we will use curly brackets throughout this hour.


Using the Loop Variable

In the last example, we set i to each value in vector 1:5. If we use i in the body of the loop, we can more easily see this process:

> for (i in 1:5) {
+   cat("\n i has been set to the value of", i)
+ }

 i has been set to the value of 1
 i has been set to the value of 2
 i has been set to the value of 3
 i has been set to the value of 4
 i has been set to the value of 5

Let’s look at a slightly different example, this time involving a set of character values over which to iterate:

> for (let in LETTERS[1:5]) {
+   cat("\n The Letter", let)
+ }

 The Letter A
 The Letter B
 The Letter C
 The Letter D
 The Letter E
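When the loop body needs the position of each value as well as the value itself, one common approach is to iterate over indices instead. The sketch below uses seq_along, which (unlike 1:length(x)) produces nothing to loop over when the vector is empty. The names firstLetters and emptyVec are our own.

```r
firstLetters <- LETTERS[1:5]
for (i in seq_along(firstLetters)) {
  cat("\n Letter number", i, "is", firstLetters[i])
}

# For an empty vector, seq_along gives nothing to iterate over,
# whereas 1:length(x) would give the two values 1 and 0
emptyVec <- character(0)
seq_along(emptyVec)
1:length(emptyVec)
```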

Referencing Data with Loops

For loops are often used to iterate over data sources, performing actions on groupings within that data. Let’s use the internal airquality dataset for this example, which contains air quality measurements for New York from May to September 1973:

> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

The Month column stores the month number (May = 5 to September = 9). We can generate a vector of unique month values using the unique function as follows:

> unique(airquality$Month)
[1] 5 6 7 8 9

What if we wanted to report the average Ozone value for each month? Without a loop, we might write code like this:

> # Perform summary for Month 5
> ozoneValues <- airquality$Ozone [ airquality$Month == 5 ] # Subset the data
> theMean <- round(mean(ozoneValues, na.rm = TRUE), 2)      # Calculate the mean
> cat("\n Average Ozone for month 5 =", theMean)            # Print the message

 Average Ozone for month 5 = 23.62
>
> # Perform summary for Month 6
> ozoneValues <- airquality$Ozone [ airquality$Month == 6 ] # Subset the data
> theMean <- round(mean(ozoneValues, na.rm = TRUE), 2)      # Calculate the mean
> cat("\n Average Ozone for month 6 =", theMean)            # Print the message

 Average Ozone for month 6 = 29.44
>
> # Perform summary for Month 7
> ozoneValues <- airquality$Ozone [ airquality$Month == 7 ] # Subset the data
> theMean <- round(mean(ozoneValues, na.rm = TRUE), 2)      # Calculate the mean
> cat("\n Average Ozone for month 7 =", theMean)            # Print the message

 Average Ozone for month 7 = 59.12

Note that the only varying aspect between these sections of code is the Month value itself. Using a for loop, we could iterate over each (unique) month value, calculating summaries specific to that month, as follows:

> for (M in unique(airquality$Month)) {
+   ozoneValues <- airquality$Ozone [ airquality$Month == M ] # Subset the data
+   theMean <- round(mean(ozoneValues, na.rm = TRUE), 2)      # Calculate and round the mean
+   cat("\n Average Ozone for month", M, "=", theMean)        # Print the message
+ }

 Average Ozone for month 5 = 23.62
 Average Ozone for month 6 = 29.44
 Average Ozone for month 7 = 59.12
 Average Ozone for month 8 = 59.96
 Average Ozone for month 9 = 31.45

In this example, we are iterating over the unique values of Month. We use the iterator variable M to subset the data, saving the result each time as ozoneValues. We then calculate the mean based on this vector and report the result.
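The loop above only prints its results. If we want to reuse the values later, one option is to store them as we go. The following is a sketch of that idea, assuming we are happy to keep the means in a named numeric vector; the names months and monthlyMeans are our own.

```r
months <- unique(airquality$Month)
monthlyMeans <- numeric(length(months))   # Pre-allocate the result vector
names(monthlyMeans) <- months             # Label each slot with its month

for (i in seq_along(months)) {
  ozoneValues <- airquality$Ozone[airquality$Month == months[i]]  # Subset the data
  monthlyMeans[i] <- round(mean(ozoneValues, na.rm = TRUE), 2)    # Store the mean
}
monthlyMeans
```

Creating the result vector at its final length before the loop starts avoids growing it one element at a time.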

Nested Loops

It is possible to perform “nested” loop operations, where we iterate over more than one set of values. For example, let’s again loop through sections of the airquality dataset, but this time report the average values of the Ozone, Wind, and Solar.R columns. We could extend the last loop as follows:

> for (M in unique(airquality$Month)) {
+
+   cat("\n\n Month =", M, "\n =========")              # Write Month Number
+   subData <- airquality [ airquality$Month == M, ]          # Subset the data
+
+   theMean <- round(mean(subData$Ozone, na.rm = TRUE), 2)    # Calculate the mean
+   cat("\n   Average Ozone =\t", theMean)                    # Print the message
+
+   theMean <- round(mean(subData$Wind, na.rm = TRUE), 2)     # Calculate the mean
+   cat("\n   Average Wind =\t", theMean)                     # Print the message
+
+   theMean <- round(mean(subData$Solar.R, na.rm = TRUE), 2)  # Calculate the mean
+   cat("\n   Average Solar.R =\t", theMean)                  # Print the message
+
+ }


 Month = 5
 =========
   Average Ozone =        23.62
   Average Wind =           11.62
   Average Solar.R =      181.3

 Month = 6
 =========
   Average Ozone =        29.44
   Average Wind =           10.27
   Average Solar.R =      190.17

 Month = 7
 =========
   Average Ozone =         59.12
   Average Wind =             8.94
   Average Solar.R =      216.48

 Month = 8
 =========
   Average Ozone =          59.96
   Average Wind =              8.79
   Average Solar.R =        171.86

 Month = 9
 =========
   Average Ozone =        31.45
   Average Wind =           10.18
   Average Solar.R =      167.43


Tip: Tab Characters

Note the use of "\t" in the preceding example. This allows us to insert a “tab” character when printing text in this way. For this example, it left-aligns the numeric mean values produced. If we wanted to (more correctly) right-align these numeric values, we could additionally call the format function to convert the numeric values to a nicely formatted character output.
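As a rough sketch of the right-alignment idea, we could pad both the label and the value to fixed widths with format. The widths chosen here (20 and 8) and the variable names label and value are arbitrary choices for illustration.

```r
for (column in c("Ozone", "Wind", "Solar.R")) {
  theMean <- mean(airquality[[column]], na.rm = TRUE)
  label <- format(paste("Average", column, "="), width = 20)   # Text is padded on the right
  value <- format(round(theMean, 2), nsmall = 2, width = 8)    # Numbers are padded on the left
  cat("\n", label, value)
}
```

Character values are left-justified by default and numeric values are right-justified, so the numbers line up on their final digit.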


We could instead iterate over values of Month and then iterate over the columns within Month using a nested loop, as follows:

> for (M in unique(airquality$Month)) {
+
+   cat("\n\n Month =", M, "\n =========")             # Write Month Number
+   subData <- airquality [ airquality$Month == M, ]         # Subset the data
+
+   for (column in c("Ozone", "Wind", "Solar.R")) {          # Iterate over columns
+     theMean <- round(mean(subData[[column]], na.rm = TRUE), 2)   # Calculate the mean
+     cat("\n   Average", column, "=\t", theMean)            # Print the message
+   }
+
+ }


 Month = 5
 =========
   Average Ozone =         23.62
   Average Wind =   11.62
   Average Solar.R =       181.3

 Month = 6
 =========
   Average Ozone =         29.44
   Average Wind =   10.27
   Average Solar.R =      190.17

 Month = 7
 =========
   Average Ozone =          59.12
   Average Wind =   8.94
   Average Solar.R =       216.48

 Month = 8
 =========
   Average Ozone =          59.96
   Average Wind =   8.79
   Average Solar.R =       171.86

 Month = 9
 =========
   Average Ozone =          31.45
   Average Wind =   10.18
   Average Solar.R =       167.43


Note: Referencing Columns

Note that we used the double square brackets notation here as opposed to the $ syntax in the more verbose example. This is because we can’t parameterize values used by $, as shown in this example:

> airquality$Wind[1:5]    # The Wind column
[1]  7.4  8.0 12.6 11.5 14.3
> airquality$"Wind"[1:5]  # Also works
[1]  7.4  8.0 12.6 11.5 14.3
> whichColumn <- "Wind"   # set value of whichColumn
> airquality$whichColumn  # Reference using whichColumn
NULL

We must therefore use a double square bracket notation (or alternatively the [ , whichColumn] notation) that was introduced in Hour 4, “Multi-Mode Data Structures”.
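For comparison with the failed $ attempt, here is a short sketch of the two alternatives just mentioned, reusing the whichColumn variable:

```r
whichColumn <- "Wind"             # Column name held in a variable
airquality[[whichColumn]][1:5]    # Double square brackets
airquality[1:5, whichColumn]      # Row, column notation
```

Both calls should return the first five Wind values, matching the output of airquality$Wind[1:5].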



Note: Loop Performance

Later, in Hour 18, “Code Efficiency,” we will look again at loops and discuss performance and efficiency gains.


Looping through data frames in this way is generally not recommended. As we will see shortly, and again in Hour 12, “Efficient Data Handling in R,” there are many simpler, faster ways to loop through columns or rows in a data frame. However, the concept of a for loop is a much more widely applicable programming concept that can help clean up repetitive, unmaintainable code.

The while Function

The while function in R allows us to implement a “while” loop. The structure of the “while” loop is as follows:

while (condition) {
  # do this
}

The result is that the loop will iterate constantly until the condition is no longer TRUE. Of course, if the condition is always TRUE, the loop will never stop iterating, so we need to exercise caution.

Let’s look at a simple example:

> index <- 1              # Set value of index to 1
> while(index < 6) {
+   cat("\n Hello")       # Write a message
+   index <- index + 1    # Update the value of index
+ }

 Hello
 Hello
 Hello
 Hello
 Hello

Here, we initially set the value of index to 1. Then, we iteratively write a simple message and increment index. The loop continues to iterate until the condition (index < 6) is no longer true.

We can see this more clearly by improving the message produced:

> index <- 1                                            # Set value of index to 1
> while(index < 6) {
+   cat("\n Setting the value of index from", index)    # Write a message
+   index <- index + 1                                  # Update the value of index
+   cat(" to", index)                                   # Write a message
+ }

 Setting the value of index from 1 to 2
 Setting the value of index from 2 to 3
 Setting the value of index from 3 to 4
 Setting the value of index from 4 to 5
 Setting the value of index from 5 to 6
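Because a “while” loop runs until its condition fails, one defensive habit is to add a second condition that caps the number of iterations. The following is only a sketch of that idea; the names value, steps, and maxSteps, and the limit of 1000, are our own choices.

```r
value <- 100
steps <- 0
maxSteps <- 1000                         # Safety limit on the number of iterations

while (value > 1 && steps < maxSteps) {
  value <- value / 2                     # Halve the value each time
  steps <- steps + 1                     # Count the iterations
}
cat("\n Stopped after", steps, "steps with value", value)
```

If a mistake in the body meant that value never fell below 1, the loop would still stop once steps reached maxSteps.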

The “apply” Family of Functions

The majority of functions in R are relatively simple and designed to work with single-mode structures. Consider, for example, the median function, which can be used to calculate the median of a numeric data object (typically a vector). Let’s have a look at the arguments of the function and a simple example:

> args(median)
function (x, na.rm = FALSE)
NULL
> median( airquality$Wind )  # Median of Wind column
[1] 9.7

We can see that median has two arguments (x and na.rm), which can be used to specify the values for which the median is to be calculated, and a logical value specifying whether missing values should be removed before calculating the median.

What if we wanted to apply this function in a more sophisticated way? Here are some examples:

The median of rows or columns of a matrix

The median of each element of a list

The median of some variable for each level of one or more grouping variables (for example, median sales by age group)

As you have seen earlier in this hour, the loop structure provides a way to iteratively call a function (for example, on subsections of a data object). Although we could apply a function using loops, much of our code would be needed just to reference the subsections of the data we need given the values over which we’re iterating (as you saw previously).

Instead, R provides a set of functions (the “apply” family of functions) that offer a more natural structure for applying simple functions to data structures in a more sophisticated way.

The Set of “apply” Functions

In R, many functions could be considered part of the “apply” family of functions. Let’s start by looking at the set of functions in R of the form “xapply,” where x is an optional letter, using the apropos function:

> apropos("^[a-z]?apply$") # Find all objects ending in "apply"
[1] "apply"  "eapply" "lapply" "mapply" "rapply"
[6] "sapply" "tapply" "vapply"


Note: Other Functions in the “apply” Family

We could conceivably include functions such as by and aggregate in the “apply” family given their aims and usage. We’ll cover aggregate in Hour 11, “Data Manipulation and Transformation,” but will not cover by in this book given the numerous better ways of performing the tasks by enables.



Tip: Regular Expressions

As seen in the apropos call, the regular expression capabilities of R are very useful for looking for patterns in vectors of characters.


The call to apropos returns eight functions, which are listed in Table 9.1.

TABLE 9.1 Set of “apply” Functions

For now, let’s focus on the first four functions listed in Table 9.1 (apply, lapply, sapply and tapply).

The apply Function

The apply function allows us to apply a function over dimensions of a data object. Acceptable inputs to apply include any object that has a “dimension”—for example, matrices, data frames, and arrays. The arguments to the apply function are as follows:

> args(apply)
function (X, MARGIN, FUN, ...)
NULL

Table 9.2 details the arguments of the apply function.

TABLE 9.2 Arguments to the apply Function

The “Margin”

The second argument, the “Margin,” specifies the “dimension number” over which to apply the function, as described in Table 9.3.

TABLE 9.3 Margin Values

We typically specify the margin as a single integer value or vector of integer values.


Note: Named Dimensions

If your structure has dimension names assigned, a character vector can be provided instead of the (more commonly used) vector of integers.


A Simple apply Example

The apply function is best described with a simple example. First, let’s create a structure that has dimensions:

> myMat <- matrix(rpois(20, 3), nrow = 4)  # Create a simple matrix
> myMat                                    # Print myMat
     [,1] [,2] [,3] [,4] [,5]
[1,]    5    6    4    2    2
[2,]    1    7    3    1    6
[3,]    2    3    0    3    4
[4,]    2    2    4    3    4
> dim(myMat)                               # Dimensions of myMat
[1] 4 5

Now let’s use our first call to apply. In this example, we’ll calculate the maximum of each column (dimension 2) of our matrix:

> apply(myMat, 2, max)   # Column Maxima
[1] 5 7 4 3 6

The result is a vector that holds the maximum of each column (for example, we see that the maximum of the values in the second column is 7).


Note: The Use of Random Numbers

In this and the following sections, we use functions such as rpois to generate random samples. Because these are random draws, your results will not necessarily match those shown here if you run the same code.
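If you want your own random draws to be repeatable from one run to the next, one option is to fix the seed of the random number generator with set.seed before sampling. In this sketch the seed value 123 and the object names are arbitrary, and the matrix produced will not be the one printed in this hour.

```r
set.seed(123)                                  # Any fixed integer will do
firstDraw <- matrix(rpois(20, 3), nrow = 4)

set.seed(123)                                  # Reset to the same seed
secondDraw <- matrix(rpois(20, 3), nrow = 4)

identical(firstDraw, secondDraw)               # Same seed, same matrix
```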


The apply function operates by “breaking apart” the structure based on the margin(s) provided and then applying the function to each “piece” of the partitioned structure. In this example, the matrix is split into separate columns with the max function applied to each column, as illustrated in Figure 9.1.

FIGURE 9.1 A visual demonstration of the apply function calculating column maxima

Now let’s look at another simple example—this time we’ll calculate the minimum of each row (dimension 1) of our matrix:

> apply(myMat, 1, min)   # Row Minima
[1] 2 1 0 2

Again, the result is a vector, this time containing the minimum of each row of the matrix (so the minimum value in row 3 is 0). This time, the apply function “breaks apart” the structure by rows and applies the min function to each “piece” of the structure, as illustrated in Figure 9.2.

FIGURE 9.2 A visual demonstration of the apply function calculating row minima
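The earlier note on named dimensions can be sketched in the same way. For a character margin to work, the dimnames of the structure must themselves be named. The matrix namedMat and the dimension names group and measure below are made up for illustration.

```r
namedMat <- matrix(1:6, nrow = 2,
                   dimnames = list(group = c("a", "b"),
                                   measure = c("x", "y", "z")))
namedMat

apply(namedMat, "measure", sum)   # Same as apply(namedMat, 2, sum)
apply(namedMat, "group", max)     # Same as apply(namedMat, 1, max)
```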

Using Multiple Margins

In these simple examples, we specified a single margin in each call (1 for rows or 2 for columns). We can, instead, use multiple margins, as shown here:

> myMat
     [,1] [,2] [,3] [,4] [,5]
[1,]    5    6    4    2    2
[2,]    1    7    3    1    6
[3,]    2    3    0    3    4
[4,]    2    2    4    3    4

> apply(myMat, c(1, 2), median)   # Median by row AND column
     [,1] [,2] [,3] [,4] [,5]
[1,]    5    6    4    2    2
[2,]    1    7    3    1    6
[3,]    2    3    0    3    4
[4,]    2    2    4    3    4

In this example, we’ve calculated the median value by row and column by specifying two values for the margin (1 and 2). This calculates the median of each cell of the matrix (that is, the median of “5” is “5”) and thus returns exactly the same matrix that we started with. This process is visualized in Figure 9.3.

FIGURE 9.3 A visual demonstration of the apply function performing cell calculations

Although this is not of any practical use, it does further illustrate the way the apply function works.

Using apply with Higher Dimension Structures

Although using multiple margins may not be useful for two-dimensional structures (that is, matrices or data frames), when we deal with structures with a higher number of dimensions it can be useful. To illustrate this, let’s create a three-dimensional array:

> myArray <- array( rpois(18, 3), dim = c(3, 3, 2)) # Create array
> myArray                                           # Print myArray
, , 1

     [,1] [,2] [,3]
[1,]    2    2    4
[2,]    4    3    1
[3,]    4    1    1

, , 2

     [,1] [,2] [,3]
[1,]    0    6    3
[2,]    4    3    1
[3,]    1    5    1

> dim(myArray)                                      # Dimensions of myArray
[1] 3 3 2

Now, there are three dimensions over which we could apply our functions. Let’s try to apply a function over dimension 3 of the array:

> apply(myArray, 3, min)
[1] 1 0

Here, the array is first broken apart based on dimension 3, resulting in two two-dimensional (3 × 3) structures. The min function is then applied to each of the two structures, as illustrated in Figure 9.4.

FIGURE 9.4 The apply function operating over the third dimension of an array

Instead, we could provide multiple margins. For example, let’s apply the max function, this time over dimensions 1 and 2:

> apply(myArray, c(1, 2), max)
     [,1] [,2] [,3]
[1,]    2    6    4
[2,]    4    3    1
[3,]    4    5    1

This time the structure is “collapsed” over the third dimension, producing a matrix of outputs. This process is illustrated in Figure 9.5.

FIGURE 9.5 The apply function operating over the first and second dimensions of an array
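To make the “breaking apart” idea concrete, the sketch below compares apply over dimension 3 with an explicit for loop over the slices. It uses a deterministic array, smallArray, as a stand-in for the random myArray so that the results are repeatable; the names smallArray and sliceMin are our own.

```r
smallArray <- array(1:18, dim = c(3, 3, 2))   # Deterministic stand-in for myArray

# Using apply over dimension 3 ...
apply(smallArray, 3, min)

# ... gives the same answer as looping over the two 3 x 3 slices
sliceMin <- numeric(dim(smallArray)[3])
for (k in seq_len(dim(smallArray)[3])) {
  sliceMin[k] <- min(smallArray[, , k])       # Minimum of slice k
}
sliceMin
```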

Passing Extra Arguments to the “applied” Function

Let’s return to our matrix example, but this time insert a missing value:

> myMat[2, 2] <- NA   # Add a missing value in cell 2, 2
> myMat               # Print the matrix
     [,1] [,2] [,3] [,4] [,5]
[1,]    5    6    4    2    2
[2,]    1   NA    3    1    6
[3,]    2    3    0    3    4
[4,]    2    2    4    3    4

Now, let’s once again apply a function. For example, let’s calculate the maximum of each column (dimension 2) of the matrix:

> apply(myMat, 2, max)  # Maximum of each column
[1]  5 NA  4  3  6

This time, our output contains a missing value. The reason for this is that when the second column is passed into the max function, the missing value causes the max function to return an NA value. This is illustrated in Figure 9.6.

FIGURE 9.6 The use of apply with missing values

We can also see this behavior directly by calculating the maximum of the second column:

> max(myMat[,2])  # Maximum of 2nd column
[1] NA

As you saw earlier, functions such as max have an na.rm argument, which allows us to specify that missing values should be removed before performing the calculation:

> max(myMat[,2], na.rm = TRUE)  # Maximum of 2nd column
[1] 6

If we want to call a function but also pass additional arguments, we can take advantage of the ellipsis argument to apply, as follows:

> args(apply)                          # Ellipsis is 4th argument
function (X, MARGIN, FUN, ...)
NULL
> apply(myMat, 2, max, na.rm = TRUE)   # Maximum of each column
[1] 5 6 4 3 6

As you can see, the max function is now called with the argument na.rm set to TRUE, so the maximum of the (nonmissing) values of column 2 is now reported.

We can pass as many additional arguments as we need. For example, let’s calculate the quantiles of a slightly larger matrix using the quantile function:

> biggerMat <- matrix( rpois(300, 3), ncol = 3)   # Create a 100 x 3 matrix

> head(biggerMat)                                 # First few rows
     [,1] [,2] [,3]
[1,]    4    2    3
[2,]    5    3    5
[3,]    4    7    1
[4,]    5    3    3
[5,]    3    3    4
[6,]    1    5    4

> apply(biggerMat, 2, quantile)                   # Column quantiles
     [,1] [,2] [,3]
0%      0    0    0
25%     2    2    2
50%     3    3    3
75%     4    4    4
100%    8    8    8

Now, let’s artificially add a number of missing values. Because quantile stops with an error by default when its input contains missing values, we need to pass the extra na.rm argument to quantile:

> biggerMat [ sample( 1:300, 50) ] <- NA          # Randomly add some missings

> head(biggerMat)                                 # First few rows
     [,1] [,2] [,3]
[1,]    4    2   NA
[2,]    5    3   NA
[3,]    4    7    1
[4,]    5    3    3
[5,]   NA   NA    4
[6,]    1   NA    4

> apply(biggerMat, 2, quantile, na.rm = TRUE)     # Column quantiles
     [,1] [,2] [,3]
0%      0    0    0
25%     2    2    1
50%     3    3    3
75%     4    4    4
100%    8    8    8

The quantile function has an argument, probs, that allows us to specify that a different set of quantiles is returned. Let’s additionally pass the probs argument to specify some new quantiles:

> apply(biggerMat, 2, quantile,
+   probs = c(0, .05, .5, .95, 1), na.rm = TRUE)     # Column quantiles
     [,1] [,2] [,3]
0%      0 0.00    0
5%      0 1.00    1
50%     3 3.00    3
95%     6 6.15    6
100%    8 8.00    8

Using apply with Our Own Functions

So far in this hour, we have used simple functions to illustrate the use of the apply function (for example, row minima, column maxima). There are in fact several utility functions designed for this very purpose, for example rowMeans, colMeans, rowSums, and colSums. However, we can also create our own functions and “apply” those over dimensions instead.
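As a quick sketch of how those utility functions relate to apply, the following compares them on a small deterministic matrix (m is a made-up example, not one of the matrices used elsewhere in this hour):

```r
m <- matrix(1:20, nrow = 4)                  # A simple 4 x 5 matrix

all.equal(colSums(m), apply(m, 2, sum))      # Column sums agree
all.equal(rowMeans(m), apply(m, 1, mean))    # Row means agree
```

For these particular summaries the dedicated functions are usually the more direct choice, with apply kept for functions that have no such shortcut.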

Consider the matrix we created earlier (shown here as it was before we inserted the missing value):

> myMat
     [,1] [,2] [,3] [,4] [,5]
[1,]    5    6    4    2    2
[2,]    1    7    3    1    6
[3,]    2    3    0    3    4
[4,]    2    2    4    3    4

Let’s imagine we want to count the number of values in each column that are greater than 3. There isn’t currently a function in R that will return “the number of values greater than 3,” so let’s create one:

> above3 <- function(vec) {
+   sum(vec > 3)
+ }
> above3( c(1, 6, 5, 1, 2, 3) )  # Try out our function
[1] 2

In the same way as before, we can now “apply” this function across dimensions of our matrix. So to calculate the number of values in each column that are greater than 3, we use the following code:

> apply(myMat, 2, above3)   # Number of values > 3 in each column
[1] 1 2 2 0 3

In this example, we created the function above3 and “applied” it to our structure. This is fine if we want to reuse above3 elsewhere. However, if this is only something we want to do once, we can define the function directly in the apply call (so it is never created as an R object in our session). To achieve this, we replace the function object with the definition as follows:

> apply(myMat, 2, function(vec) {
+   sum(vec > 3)
+ })
[1] 1 2 2 0 3


Tip: One-Line Function Definitions

As before, we can omit the {} (curly brackets) if our function can be defined on a single line. As such, the preceding code could be rewritten as follows:

> apply(myMat, 2, function(vec) sum(vec > 3))
[1] 1 2 2 0 3

As a convention, we will use the curly brackets consistently throughout this hour.


Passing Extra Arguments to Our Functions

As shown earlier, if we want to pass additional arguments, we can list them after the function name in the call to apply. We can do the same for the functions we write. For example, let’s update our function with a second argument to control the threshold value for counting:

> aboveN <- function(vec, N) {
+   sum(vec > N)
+ }
> someValues <- c(1, 6, 5, 1, 2, 3)
> aboveN( someValues, N = 3 )        # Number > 3
[1] 2
> aboveN( someValues, N = 5 )        # Number > 5
[1] 1

If we “apply” this function to columns of our matrix, we need to additionally pass the N argument:

> myMat                             # Print the matrix
     [,1] [,2] [,3] [,4] [,5]
[1,]    5    6    4    2    2
[2,]    1    7    3    1    6
[3,]    2    3    0    3    4
[4,]    2    2    4    3    4
> apply(myMat, 2, aboveN, N = 3)    # Number > 3
[1] 1 2 2 0 3
> apply(myMat, 2, aboveN, N = 4)    # Number > 4
[1] 1 2 0 0 1

If, instead, we want to define the function directly in the apply call, we would need to list the additional arguments after the definition itself:

> apply(myMat, 2, function(vec, N) {
+   sum(vec > N)
+ }, N = 3)
[1] 1 2 2 0 3

Applying to Data Frames

Throughout this hour, we have used single-mode structures (matrices and arrays) as sample inputs to the apply function. However, because we can use any structure that has a dimension, we could also use apply with data frames. As an example, let’s “apply” the median function to columns of the airquality data frame:

> head(airquality)                              # First few rows
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

> apply(airquality, 2, median, na.rm = TRUE)    # Median of each column
  Ozone Solar.R    Wind    Temp   Month     Day
   31.5   205.0     9.7    79.0     7.0    16.0

This command returns the median of each column (although, perhaps the “median Month” and “median Day” are not that useful). Now let’s consider a second example, this time using the iris data frame:

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

> apply(iris, 2, median, na.rm = TRUE)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
          NA           NA           NA           NA           NA
Warning messages:
1: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
  argument is not numeric or logical: returning NA
2: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
  argument is not numeric or logical: returning NA
3: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
  argument is not numeric or logical: returning NA
4: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
  argument is not numeric or logical: returning NA
5: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
  argument is not numeric or logical: returning NA

This time, the output returns missing values along with a number of warning messages—but why is this?

When we apply functions over dimensions of single-mode structures (for example, matrices and arrays), we know the “mode” of data being passed to our function is the same each time it is called (that is, if we have a numeric matrix, we know that each column will necessarily be numeric).

By comparison, a data frame is a multi-mode structure, so each column may (or may not) be of the same mode. When we call apply, R first converts the data frame to a matrix, which is a single-mode structure. At this point, all the data is coerced to a single mode, which may or may not be a suitable input to the function.

With the airquality example, the apply function first structures the data into a single-mode (numeric) object and then applies the median function to each (numeric) column. With the iris data frame, the Species column is not a numeric column, so when the data is structured into a single-mode object, the resulting data is no longer numeric. We can see this in the following call, where we query the class of each column of the data:

> apply(iris, 2, class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
 "character"  "character"  "character"  "character"  "character"

So, when R then attempts to apply the median function to each column, the missing values and warning messages are produced.

So, in summary, we can use apply with data frames, but we have to take care that the data over which we’re “applying” can be adequately combined into a single mode. For example, if we wanted to calculate the median of all numeric columns of iris, we could use this approach:

> # Apply median function over the first 4 columns of iris
> apply(iris[,-5], 2, median, na.rm = TRUE)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
        5.80         3.00         4.35         1.30
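Dropping column 5 by position works for iris, but it relies on us knowing where the non-numeric column sits. One possible alternative, sketched here, is to identify the numeric columns programmatically using sapply (one of the functions listed in Table 9.1); the name isNumericCol is our own.

```r
isNumericCol <- sapply(iris, is.numeric)               # TRUE for each numeric column
isNumericCol

apply(iris[, isNumericCol], 2, median, na.rm = TRUE)   # Medians of numeric columns only
```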

The lapply Function

The lapply function applies a function to each element of a list and always returns a list structure as its output. For example, let’s create a list of numeric vectors and calculate the median of each element. First, we’ll create the list:

> myList <- list(P1 = rpois(10, 1), P3 = rpois(10, 3), P5 = rpois(10, 5))
> myList
$P1
 [1] 1 2 2 2 1 0 0 1 1 4

$P3
 [1] 0 1 4 0 2 3 2 2 1 6

$P5
 [1] 5 4 9 6 6 4 6 5 3 5

To use the lapply function, we simply pass the list and the function to apply (there is no “margin” here because the data is already “split” into list elements):

> lapply(myList, median)
$P1
[1] 1

$P3
[1] 2

$P5
[1] 5

The split Function

In the preceding example, the lapply call itself was actually a lot simpler (and more concise) than the code used to create the sample list. In a slight departure, let’s quickly look at a simple function that creates lists (which we could then use as examples in lapply). This function is called split.

The split function divides a data structure into separate parts based on one or more grouping variables. The output from a split is a list. As a first example, let’s split the Wind column from airquality based on levels of Month. We can achieve that by calling split with the Wind column as the first input and the “grouping” column (Month) as the second argument. Note that the output is a list:

> spWind <- split(airquality$Wind, airquality$Month)
> spWind
$`5`
 [1]  7.4  8.0 12.6 11.5 14.3 14.9  8.6 13.8 20.1  8.6  6.9  9.7  9.2
[14] 10.9 13.2 11.5 12.0 18.4 11.5  9.7  9.7 16.6  9.7 12.0 16.6 14.9
[27]  8.0 12.0 14.9  5.7  7.4

$`6`
 [1]  8.6  9.7 16.1  9.2  8.6 14.3  9.7  6.9 13.8 11.5 10.9  9.2  8.0
[14] 13.8 11.5 14.9 20.7  9.2 11.5 10.3  6.3  1.7  4.6  6.3  8.0  8.0
[27] 10.3 11.5 14.9  8.0

$`7`
 [1]  4.1  9.2  9.2 10.9  4.6 10.9  5.1  6.3  5.7  7.4  8.6 14.3 14.9
[14] 14.9 14.3  6.9 10.3  6.3  5.1 11.5  6.9  9.7 11.5  8.6  8.0  8.6
[27] 12.0  7.4  7.4  7.4  9.2

Given that this structure is a list, it is a suitable input to the lapply function. Let’s calculate the median value of each element of spWind:

> lapply(spWind, median)
$`5`
[1] 11.5

$`6`
[1] 9.7

$`7`
[1] 8.6

$`8`
[1] 8.6

$`9`
[1] 10.3

This result is, therefore, the median Wind value for each level of Month, or the “median Wind by Month.”


Note: Nested Calls to lapply and split

In the preceding example, we separated the split and lapply calls for clarity. We could, of course, combine them into a single call, as follows:

> lapply(split(airquality$Wind, airquality$Month), median)

Or

> with(airquality, lapply(split(Wind, Month), median))


Splitting Data Frames

In the preceding example, we split a vector based on levels specified in another vector. The split function can also be used to divide data frames. For example, let’s split our airquality data based on Month:

> spAir <- split(airquality, airquality$Month)  # Split the data

> length(spAir)                                 # Length of list
[1] 5
> names(spAir)                                  # Element names
[1] "5" "6" "7" "8" "9"

> head(spAir[[1]])                              # First element
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

As you can see, this creates a list of length 5 where each element contains a data frame containing data for only one month. Now let’s use lapply to apply a function to each data frame stored in this list. We need to apply a function that will perform an operation on a data frame, so let’s return the first three rows in each element of the list using head:

> lapply(spAir, head, n = 3)
$`5`
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3

$`6`
   Ozone Solar.R Wind Temp Month Day
32    NA     286  8.6   78     6   1
33    NA     287  9.7   74     6   2
34    NA     242 16.1   67     6   3

$`7`
   Ozone Solar.R Wind Temp Month Day
62   135     269  4.1   84     7   1
63    49     248  9.2   85     7   2
64    32     236  9.2   81     7   3

$`8`
   Ozone Solar.R Wind Temp Month Day
93    39      83  6.9   81     8   1
94     9      24 13.8   81     8   2
95    16      77  7.4   82     8   3

$`9`
    Ozone Solar.R Wind Temp Month Day
124    96     167  6.9   91     9   1
125    78     197  5.1   92     9   2
126    73     183  2.8   93     9   3

Perhaps instead we could lapply our own function to each data frame. For example, let’s create a function that calculates column medians for the Ozone, Solar.R, Wind, and Temp variables:

> lapply(spAir, function(df) {
+   apply(df[,1:4], 2, median, na.rm = TRUE)
+ })
$`5`
  Ozone Solar.R    Wind    Temp
   18.0   194.0    11.5    66.0

$`6`
  Ozone Solar.R    Wind    Temp
   23.0   188.5     9.7    78.0

$`7`
  Ozone Solar.R    Wind    Temp
   60.0   253.0     8.6    84.0

$`8`
  Ozone Solar.R    Wind    Temp
   52.0   197.5     8.6    82.0

$`9`
  Ozone Solar.R    Wind    Temp
   23.0   192.0    10.3    76.0

Here, each element of spAir is passed into the function we defined as input: df. Then, for each df, we calculate the column medians of the first four columns.
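
To make the mechanics concrete, the lapply call above can be compared with an explicit “for” loop. This is only a sketch: the names byApply, byLoop, and month are introduced here for illustration, and the loop is one of several ways the same result could be built.

```r
spAir <- split(airquality, airquality$Month)

# Version 1: lapply with an anonymous function, as in the text
byApply <- lapply(spAir, function(df) {
  apply(df[, 1:4], 2, median, na.rm = TRUE)
})

# Version 2: a "for" loop that fills a pre-allocated, named list
byLoop <- vector("list", length(spAir))
names(byLoop) <- names(spAir)
for (month in names(spAir)) {
  df <- spAir[[month]]   # the data frame for one month
  byLoop[[month]] <- apply(df[, 1:4], 2, median, na.rm = TRUE)
}

# Both versions hold the same values under the same names
identical(byApply, byLoop)
```

The lapply version avoids the bookkeeping (creating the list, naming it, and indexing into it) that the loop has to do by hand.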


Note: Splitting on Multiple Variables

You’ve seen that the split function can be used to divide data structures (such as vectors or data frames) into elements of a list based on values of another vector. We can split by more than one variable by passing a list of factors:

> split(airquality$Wind, list(airquality$Month, cut(airquality$Temp, 3)))
$`5.(56,69.7]`
 [1]  7.4 11.5 14.3 14.9  8.6 13.8 20.1  8.6  9.7  9.2 10.9 13.2 11.5 12.0 18.4 11.5  9.7
[18]  9.7  9.7 12.0 16.6 14.9  8.0 12.0

$`6.(56,69.7]`
[1] 16.1  9.2

$`7.(56,69.7]`
numeric(0)
...

This could then be passed to lapply to calculate summaries by more than one grouping variable.
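
As a sketch of that last step, the code below passes the two-variable split to sapply. It also uses the drop argument of split, which is not discussed in this hour, to discard combinations that do not occur in the data. The names tempGroup, allGroups, and usedGroups are assumptions made for this example.

```r
tempGroup <- cut(airquality$Temp, 3)
groups <- list(airquality$Month, tempGroup)

# Default: every Month/Temp combination gets an element, even if it is empty
allGroups <- split(airquality$Wind, groups)

# drop = TRUE removes combinations that do not occur in the data
usedGroups <- split(airquality$Wind, groups, drop = TRUE)

length(allGroups)    # 5 months x 3 temperature bands
length(usedGroups)   # only the combinations that actually occur

# median() of an empty vector is NA, so empty groups appear as NA here
sapply(allGroups, median)
sapply(usedGroups, median)
```

Whether to keep or drop the empty groups depends on whether the absence of data is itself something we want to report.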


Using lapply with Vectors

At the start of this section, we said that the lapply function will apply a function to each element of a list. However, if we instead pass a vector to the lapply function, it will convert it to a list using the as.list function as follows:

> as.list(1:5)
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4

[[5]]
[1] 5

That means we can use lapply to apply a function to each element of a vector. Let’s consider a simple example, where we apply the rnorm function to values 1 to 5:

> lapply(1:5, rnorm)
[[1]]
[1] 0.8168998

[[2]]
[1] -0.8863575 -0.3315776

[[3]]
[1] 1.1207127 0.2987237 0.7796219

[[4]]
[1]  1.4557851 -0.6443284 -1.5531374 -1.5977095

[[5]]
[1]  1.8050975 -0.4816474  0.6203798  0.6121235 -0.1623110

This is equivalent to the following:

> list(
+   rnorm(1),
+   rnorm(2),
+   rnorm(3),
+   rnorm(4),
+   rnorm(5)
+ )
[[1]]
[1] 0.8118732

[[2]]
[1] 2.196834 2.049190

[[3]]
[1] 1.6324456 0.2542712 0.4911883

[[4]]
[1] -0.32408658 -1.66205024  1.76773385  0.02580105

[[5]]
[1]  1.1285108 -2.3803581 -1.0602656  0.9371405  0.8544517

Let’s add a second argument to rnorm. For example, let’s specify a mean for the Normal distribution:

> lapply(1:5, rnorm, mean = 10)
[[1]]
[1] 11.46073

[[2]]
[1]  8.586901 10.567403

[[3]]
[1] 10.583188  8.693201  9.459614

[[4]]
[1] 11.947693 10.053590 10.351663  9.329023

[[5]]
[1] 10.277954 10.691171 10.823795 12.145065  7.653056

The Order of “apply” Inputs

When the lapply function (like all “apply” functions) passes the data to the function, the data is passed as the first input and is not named. So, the last example is equivalent to this:

> list(
+   rnorm(1, mean = 10),
+   rnorm(2, mean = 10),
+   rnorm(3, mean = 10),
+   rnorm(4, mean = 10),
+   rnorm(5, mean = 10)
+ )
[[1]]
[1] 10.14959

[[2]]
[1]  8.657469 10.553303

[[3]]
[1] 11.589963  9.413120  8.167623

[[4]]
[1] 10.888139 11.593488 10.516855  8.704328

[[5]]
[1] 10.054616  9.215351  8.950647 12.330512 11.402705

Let’s quickly remind ourselves of the arguments of rnorm:

> args(rnorm)
function (n, mean = 0, sd = 1)
NULL

The first argument to rnorm, the number of values to sample, is called n. Although the lapply function is not “naming” the first input, the order-based method for specifying arguments in a function means that it is this “n” input that accepts each of the values, 1 to 5. What if we, instead, specify the first argument (n) as an extra parameter?

> lapply(1:5, rnorm, n = 5)
[[1]]
[1] 1.9426009 1.8262583 0.1884595 1.4762483 2.0212584

[[2]]
[1] 2.645383 3.043144 1.695631 4.477111 2.971221

[[3]]
[1] 4.867099 3.672042 2.692047 3.536524 3.824870

[[4]]
[1] 3.036099 3.144917 5.886947 3.608181 3.019367

[[5]]
[1] 5.687332 4.494956 7.157720 4.400202 4.305453

This produces a slightly different output, where each element of the list is a sample of five values from a Normal distribution. Here, the lapply call is equivalent to the following:

> list(
+   rnorm(1, n = 5),
+   rnorm(2, n = 5),
+   rnorm(3, n = 5),
+   rnorm(4, n = 5),
+   rnorm(5, n = 5)
+ )
[[1]]
[1]  1.2239254 -0.1562233  1.4224185 -0.3247553  1.1410843

[[2]]
[1] 1.463952 1.688394 3.556110 1.551967 2.321124

[[3]]
[1] 1.769828 1.675941 4.261242 4.319232 2.919246

[[4]]
[1] 3.494910 3.947846 4.628861 6.180002 3.930983

[[5]]
[1] 6.544864 6.321452 5.322152 6.530955 4.578760

In this case, we are explicitly naming the “n” input and setting it to 5, which explains why five samples are being returned in each list element. Therefore, the values we pass to the function (1 to 5) are instead used as the second input: the mean of the distribution from which to sample. In other words, this code returns the following:

Image Five samples from a Normal distribution with mean 1

Image Five samples from a Normal distribution with mean 2

Image Five samples from a Normal distribution with mean 3

Image Five samples from a Normal distribution with mean 4

Image Five samples from a Normal distribution with mean 5

As a natural extension, if we specify the n and mean inputs, then each value of 1 to 5 will move to the third argument (the standard deviation).
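
Because rnorm returns random values, it can be hard to see exactly where each input ends up. The sketch below swaps in a hypothetical helper, showArgs, which has the same argument names and order as rnorm but simply reports the value each argument received. It is an illustration of argument matching only, not part of R.

```r
# Hypothetical helper with the same argument names and order as rnorm().
# Instead of sampling, it reports which argument received which value.
showArgs <- function(n, mean = 0, sd = 1) {
  c(n = n, mean = mean, sd = sd)
}

# Nothing named: each value fills the first argument, n
unnamed <- lapply(1:3, showArgs)

# n named: each value moves to the next free argument, mean
nFixed <- lapply(1:3, showArgs, n = 5)

# n and mean named: each value moves to the third argument, sd
bothFixed <- lapply(1:3, showArgs, n = 5, mean = 10)

# Look at the call made for the value 2 in each case
unnamed[[2]]
nFixed[[2]]
bothFixed[[2]]
```

In each case the value 2 lands in the first argument that has not already been claimed by name.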

Using lapply with Data Frames

As you saw in Hour 4, data frames are structured as lists of vectors. Therefore, we can use lapply to apply functions to each column of a data frame as follows:

> lapply(airquality, median, na.rm = TRUE)
$Ozone
[1] 31.5

$Solar.R
[1] 205

$Wind
[1] 9.7

$Temp
[1] 79

$Month
[1] 7

$Day
[1] 16

This is a similar process to using apply to apply functions over columns of a data frame. The two primary differences are as follows:

Image The lapply function always returns a list.

Image When using apply, the structures are first put into a single-mode structure before processing, whereas the lapply function does not attempt to combine columns before processing.

The last point here can be illustrated by the following example, where we look at the class of each column in our data frame:

> apply(airquality, 2, class)
    Ozone   Solar.R      Wind      Temp     Month       Day
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
> lapply(airquality, class)
$Ozone
[1] "integer"

$Solar.R
[1] "integer"

$Wind
[1] "numeric"

$Temp
[1] "integer"

$Month
[1] "integer"

$Day
[1] "integer"

Note that, by the time the class function is applied in our first example, the apply function has already structured the data into a single-mode structure (so all data is forced to be of the same mode). With lapply, this coercion is not done, so we see instances of “numeric” (the “Wind” column) and “integer” column classes reported.
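
Because lapply sees each column in its original class, it can be used to pick out suitable columns before apply is called. The following is a small sketch using the iris data; the names isNum and med are introduced here for illustration.

```r
# lapply tests each column in its original class; unlist gives a named vector
isNum <- unlist(lapply(iris, is.numeric))
isNum

# Only the flagged (numeric) columns are handed to apply
med <- apply(iris[, isNum], 2, median, na.rm = TRUE)
med
```

This avoids hard-coding column positions such as -5, which would silently break if the column order changed.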

The sapply Function

The sapply function is a simple wrapper for the lapply function. In fact, the call to lapply can be clearly seen on the second line of the sapply function body:

> sapply
function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
{
    FUN <- match.fun(FUN)
    answer <- lapply(X = X, FUN = FUN, ...)
    if (USE.NAMES && is.character(X) && is.null(names(answer))) names(answer) <- X
    if (!identical(simplify, FALSE) && length(answer))
        simplify2array(answer, higher = (simplify == "array"))
    else answer
}

Therefore, as with lapply, the sapply function allows us to apply functions to elements of a list (or vector). The primary difference is that, whereas lapply always returns a list, sapply will (by default) attempt to simplify the return object using the simplify2array function.

To illustrate this, let’s look back at an earlier example where we use lapply and split to calculate the median values of Wind by Month:

> lapply(split(airquality$Wind, airquality$Month), median)
$`5`
[1] 11.5

$`6`
[1] 9.7

$`7`
[1] 8.6

$`8`
[1] 8.6

$`9`
[1] 10.3

If we replace the lapply function with the sapply function, we get a simpler output (in this case, a named vector):

> sapply(split(airquality$Wind, airquality$Month), median)
   5    6    7    8    9
11.5  9.7  8.6  8.6 10.3

For another example, let’s use sapply to see the class of each column of the iris data frame:

> sapply(iris, class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"

Returns from sapply

The return values from sapply can sometimes be rather unpredictable. That is because sapply will attempt to simplify the return structure (which may result in a nicely formatted structure) but is often not able to simplify the return (in which case it stays as a list). Table 9.4 summarizes the return values, which depend on the number of values returned from the “applied” function.

TABLE 9.4 Return Values from sapply

Some examples showing the various return objects are provided here:

> myList <- list(P1 = rpois(5, 1), P3 = rpois(5, 3), P5 = rpois(5, 5))
>
> # Function that (always) returns a single value > vector output
> sapply(myList, median)
P1 P3 P5
 1  3  4

> # Function that (always) returns 2 values > matrix output
> sapply(myList, range)
     P1 P3 P5
[1,]  0  1  3
[2,]  3  4  6

> # Function that (always) returns 5 values > matrix output
> sapply(myList, quantile)
     P1 P3 P5
0%    0  1  3
25%   0  3  4
50%   1  3  4
75%   2  3  5
100%  3  4  6

> # Function that can return a variable number of values > list output
> sapply(myList, function(X) X [ X > 2 ])
$P1
[1] 3

$P3
[1] 3 3 3 4

$P5
[1] 3 5 4 4 6

> # Function that can return a variable number of values
> # BUT it happens that the return values are of the same
> # length in this instance > simplification occurs
> sapply(myList, function(X) min(X):max(X))
     P1 P3 P5
[1,]  0  1  3
[2,]  1  2  4
[3,]  2  3  5
[4,]  3  4  6
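
Because myList is built from random draws, the outputs above will differ from run to run. The sketch below repeats the main cases on a fixed list (fixedList, a name introduced here) so that the return structures can be checked reproducibly. It also shows the simplify argument, visible in the sapply function body, being used to switch simplification off.

```r
# A fixed list, so the results are reproducible
fixedList <- list(A = c(1, 2, 3), B = c(4, 5, 6), C = c(7, 8, 9))

oneValue  <- sapply(fixedList, median)                   # one value each
twoValues <- sapply(fixedList, range)                    # two values each
varying   <- sapply(fixedList, function(X) X[X > 2])     # 1, 3, and 3 values
keptList  <- sapply(fixedList, range, simplify = FALSE)  # simplification off

is.matrix(twoValues)  # same-length results are bound into a matrix
is.list(varying)      # differing lengths cannot be simplified
is.list(keptList)     # simplify = FALSE keeps the list structure
```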

Why Not Just Stick with sapply?

At this point, you may be wondering why we’d ever need to use lapply given that sapply returns a “simpler” output.

The key reason for using lapply instead of sapply is that you always know a list will be returned, whereas the returns from sapply can be unpredictable, particularly when the function applied can return a variable number of values (as seen previously). When we write code, we need to be sure of the structure returned so we can write code to deal with that structure—for example, imagine writing a script where you expect the return output from an sapply call to be a list, but then it is unexpectedly simplified to an array (as seen in the last example).

More generally, there may be times when you explicitly don’t want to try and simplify the output. Consider a situation where we have a list containing two matrices:

> matList <- list(
+   P3 = matrix( rpois(8, 3), nrow = 2),
+   P5 = matrix( rpois(8, 5), nrow = 2)
+ )
> matList
$P3
     [,1] [,2] [,3] [,4]
[1,]    8    1    1    4
[2,]    4    2    8    2

$P5
     [,1] [,2] [,3] [,4]
[1,]    5    4    3    2
[2,]    1    7    7    1

Now let’s use our lapply and sapply functions to extract the first row of each matrix:

> lapply(matList, head, 1)
$P3
     [,1] [,2] [,3] [,4]
[1,]    8    1    1    4

$P5
     [,1] [,2] [,3] [,4]
[1,]    5    4    3    2

> sapply(matList, head, 1)
     P3 P5
[1,]  8  5
[2,]  1  4
[3,]  1  3
[4,]  4  2

As you can see, the lapply function has returned a list, whereas the sapply function has simplified the output by combining the results into a single (matrix) structure. If these two matrices were measurements on two different systems, we may want to ensure the results are analyzed separately, so combining them into a single structure is not desirable.
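
If we want a simplified result but also a guarantee about its structure, base R offers a third option that is not covered in this hour: vapply. It behaves like sapply but requires us to declare the shape of each return value through the FUN.VALUE argument. The sketch below is one possible use; the names windByMonth, medians, and outcome are introduced for illustration.

```r
windByMonth <- split(airquality$Wind, airquality$Month)

# FUN.VALUE declares the expected shape of each result: one numeric value
medians <- vapply(windByMonth, median, FUN.VALUE = numeric(1))
medians

# If the function returns something else (range returns two values),
# vapply stops with an error rather than changing the output structure
outcome <- tryCatch(
  vapply(windByMonth, range, FUN.VALUE = numeric(1)),
  error = function(e) "error"
)
outcome
```

The trade-off is a slightly longer call in exchange for a failure that happens at the point of the mistake.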

The tapply Function

The tapply function allows us to apply a function to elements of a vector, grouped by levels of one or more other variables. The primary arguments to tapply are described in Table 9.5.

TABLE 9.5 The Primary Arguments of tapply

Let’s look at a simple example of tapply used to calculate the median Wind by Month using the airquality data:

> tapply(airquality$Wind, airquality$Month, median)
   5    6    7    8    9
11.5  9.7  8.6  8.6 10.3

As you can see, in this case tapply returns a named vector of values, containing the median Wind values by Month.


Note: Similarity to split + sapply

This is very similar to an earlier example using sapply and split:

> sapply(split(airquality$Wind, airquality$Month), median)
   5    6    7    8    9
11.5  9.7  8.6  8.6 10.3

In fact, tapply is primarily a wrapper for a call to the split and sapply (technically, lapply with a simplify step) functions.
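
The similarity can be checked in code. This sketch compares the two results; viaTapply and viaSplit are names introduced here. The values and labels match, although the containers are not quite the same type.

```r
viaTapply <- tapply(airquality$Wind, airquality$Month, median)
viaSplit  <- sapply(split(airquality$Wind, airquality$Month), median)

all(viaTapply == viaSplit)                    # same values
identical(names(viaTapply), names(viaSplit))  # same group labels

# The containers differ slightly: tapply returns a one-dimensional array
is.array(viaTapply)
is.array(viaSplit)
```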


Multiple Grouping Variables

We can specify more than one grouping variable by which to process the data—this is achieved by providing a list of factors instead of a single factor. Let’s calculate the median Wind by Month and grouped Temp (which we’ll create using the cut function):

> tapply(airquality$Wind,
+        list(airquality$Month, cut(airquality$Temp, 3)), median)
  (56,69.7] (69.7,83.3] (83.3,97]
5     11.50         8.0        NA
6     12.65         9.7       9.2
7        NA         9.2       7.4
8        NA        10.3       7.4
9     12.05        10.3       6.0

The return from this function is a matrix with the levels of the first grouping variable (Month) set as the rows (dimension 1) and the levels of the second grouping variable (Temp) in columns (dimension 2).


Caution: Missing Values in Return Structure

In the preceding example, a number of missing values have been returned. Usually when we see a missing value, it represents a value that “exists” but one we do not know. Consider the missing value for high temperature values in Month 5 in this example. It is difficult to know whether this value is generated because

Image There were Wind values in Month 5 for high temperatures, but they contained missing values so we do not know the median value.

Image There were actually no values in Month 5 for high temperatures (that is, there is no data).

In fact, in this case, the latter is true—there were no days in Month 5 when the temperature went above 83.3 degrees Fahrenheit. So, this missing value represents a “lack” of data. However, care should be taken when interpreting the results.
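
One way to tell the two situations apart is to count the observations in each cell. The sketch below uses table for this, which is our own suggestion rather than a technique from this hour; the names tempGroup, medians, and counts are introduced for illustration.

```r
tempGroup <- cut(airquality$Temp, 3)

medians <- tapply(airquality$Wind, list(airquality$Month, tempGroup), median)
counts  <- table(airquality$Month, tempGroup)

counts                        # a zero marks a combination with no rows
sum(is.na(airquality$Wind))   # number of missing Wind values in the data

# Check whether every NA in the medians lines up with an empty cell
all(is.na(medians) == (counts == 0))
```

If a cell has a nonzero count but an NA summary, the missing value comes from the data itself rather than from an empty group.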


Let’s extend this example a little further, calculating the median Wind by Month, levels of Temp, and levels of Solar.R:

> tapply(airquality$Wind,
+        list(airquality$Month, cut(airquality$Temp, 3), cut(airquality$Solar.R, 2)),
+        median)
, , (6.67,170]

  (56,69.7] (69.7,83.3] (83.3,97]
5     12.60        10.3        NA
6      9.20         8.0        NA
7        NA         8.6     11.45
8        NA         9.7      8.60
9     13.45        10.3      7.40

, , (170,334]

  (56,69.7] (69.7,83.3] (83.3,97]
5     10.90       11.15        NA
6     16.10       12.65       9.2
7        NA        9.70       7.4
8        NA       10.90       8.0
9     12.05       10.30       4.6

This now creates a three-dimensional array of output, where each of our three grouping variables is aligned to a dimension.

Multiple Returns

In the preceding example, we illustrated tapply with the median function, which always returns a single value. If, instead, our function returns multiple values, the outputs from tapply can be unexpected and, occasionally, highly complex. Let’s start with a simple example, this time calculating quantiles of Wind values by Month:

> tapply(airquality$Wind, airquality$Month, quantile)
$`5`
   0%   25%   50%   75%  100%
 5.70  8.90 11.50 14.05 20.10

$`6`
  0%  25%  50%  75% 100%
 1.7  8.0  9.7 11.5 20.7

$`7`
  0%  25%  50%  75% 100%
 4.1  6.9  8.6 10.9 14.9

$`8`
  0%  25%  50%  75% 100%
 2.3  6.6  8.6 11.2 15.5

$`9`
    0%    25%    50%    75%   100%
 2.800  7.550 10.300 12.325 16.600

We can see that, with multiple return values, no simplification is performed and a list is returned. This is the equivalent of the following:

> lapply(split(airquality$Wind, airquality$Month), quantile)
$`5`
   0%   25%   50%   75%  100%
 5.70  8.90 11.50 14.05 20.10

$`6`
  0%  25%  50%  75% 100%
 1.7  8.0  9.7 11.5 20.7

$`7`
  0%  25%  50%  75% 100%
 4.1  6.9  8.6 10.9 14.9

$`8`
  0%  25%  50%  75% 100%
 2.3  6.6  8.6 11.2 15.5

$`9`
    0%    25%    50%    75%   100%
 2.800  7.550 10.300 12.325 16.600

Now let’s extend this example to calculate the quantiles by Month and (grouped) Temp:

> tapply(airquality$Wind,
+        list(airquality$Month, cut(airquality$Temp, 3)), quantile)
  (56,69.7] (69.7,83.3] (83.3,97]
5 Numeric,5 Numeric,5   NULL
6 Numeric,5 Numeric,5   Numeric,5
7 NULL      Numeric,5   Numeric,5
8 NULL      Numeric,5   Numeric,5
9 Numeric,5 Numeric,5   Numeric,5

The “simplification” process has now forced the outputs into a matrix, creating a “matrix of lists,” which is a particularly complex and unhelpful structure:

> X <- tapply(airquality$Wind,
+        list(airquality$Month, cut(airquality$Temp, 3)), quantile)
> class(X)
[1] "matrix"
> X
  (56,69.7] (69.7,83.3] (83.3,97]
5 Numeric,5 Numeric,5   NULL
6 Numeric,5 Numeric,5   Numeric,5
7 NULL      Numeric,5   Numeric,5
8 NULL      Numeric,5   Numeric,5
9 Numeric,5 Numeric,5   Numeric,5
> X[1,1]
[[1]]
    0%    25%    50%    75%   100%
 7.400  9.700 11.500 13.925 20.100
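
If we do end up with a “matrix of lists,” individual cells can still be reached. The sketch below assumes the same X as above and uses double square brackets with a row and column index; cellLow and cellHigh are names introduced here.

```r
X <- tapply(airquality$Wind,
            list(airquality$Month, cut(airquality$Temp, 3)), quantile)

# Single brackets keep the list wrapper around the cell
is.list(X[1, 1])

# Double brackets with a row and column index return the stored object
cellLow  <- X[[1, 1]]   # Month 5, lowest Temp band: the five quantiles
cellHigh <- X[[1, 3]]   # Month 5, highest Temp band: no data was available

cellLow
is.null(cellHigh)
```

Any code that walks over such a structure has to allow for the empty cells, which is part of what makes it awkward to work with.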

Return Values from tapply

As with sapply, the returns from tapply can sometimes be difficult to predict. Table 9.6 summarizes the return objects from tapply based on the number of return values from a function and the number of grouping variables.

TABLE 9.6 Return Values from tapply

Given that tapply may return unexpected (and/or highly complex) values, we recommend the use of lapply and split instead of tapply, unless we can guarantee the number of return values from the function (so we can rely on the outputs).


Tip: The plyr Package

The plyr package was developed and is maintained by popular R package author, Hadley Wickham. It was first released to CRAN in 2008 and is still one of the most popular R packages on CRAN, with a huge number of packages depending on plyr functionality. The plyr package offers a more consistent “apply” syntax based on the input and output structures to which we apply a function. Functions follow the form [i][o]ply, where i and o represent the input and output format respectively. For example, the function llply expects a list input and produces a list output:

> library(plyr)    # The package must be installed and loaded first
> air <- split(airquality, airquality$Month)
> llply(air, dim)

In addition to providing an alternative apply framework, plyr offers data manipulation functionality such as merging and aggregation. However, for those working with data frames, the dplyr package that you will be introduced to in Hour 12 provides a much more user-friendly approach to data manipulation and aggregation.


Summary

In this hour, we have looked at a number of ways we can apply simple functions to data structures in a more sophisticated way. Specifically, we’ve looked at

Image The use of loops to iterate over data objects

Image The rich set of “apply” functions

Together, these provide a range of capabilities for summarizing data and performing tasks in a repetitive manner. In later hours, we’ll extend this to cover higher-level mechanisms for processing and aggregating data, with a focus on summarizing data frames. In Hour 18, we’ll also look again at loops and “apply” functions with respect to coding efficiency and performance.

Q&A

Q. How can I stop a “for” loop if a certain condition is met?

A. You can stop the for loop using the break construct, as follows:

> for (i in 1:100) {
+   cat("\n Hello")         # Writing a message on a new line
+   if (runif(1) > .9) {
+     cat(" - STOP!!")
+     break  # 10% chance of stopping each time
+   }
+ }

 Hello
 Hello
 Hello
 Hello
 Hello
 Hello
 Hello
 Hello
 Hello - STOP!!

Q. How do I stop the process if I get stuck in an infinite “while” loop?

A. You can use the Esc key (in interactive mode) to stop the process.

Q. How could I apply a function over multiple lists at the same time?

A. The mapply function is a multivariate version of sapply, which allows us to apply functions over multiple lists at the same time. For example, let’s apply the rpois function over elements 1:5 (for the number of values to sample) and 5:1 (for the lambda values to use):

> mapply(rpois, n = 1:5, lambda = 5:1)
[[1]]
[1] 2

[[2]]
[1] 7 3

[[3]]
[1] 4 1 1

[[4]]
[1] 1 0 2 4

[[5]]
[1] 3 0 1 0 2
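
To see how mapply pairs up its inputs, it can help to replace the random function with one that only reports what it was given. In this sketch, describe is a hypothetical helper and pairs is a name introduced here; neither is part of R.

```r
# Hypothetical helper: describe one (n, lambda) pairing as text
describe <- function(n, lambda) {
  paste("n =", n, "lambda =", lambda)
}

# Elements are matched by position: (1, 5), (2, 4), ..., (5, 1)
pairs <- mapply(describe, n = 1:5, lambda = 5:1)
pairs
```

Because every call here returns a single value, the result is simplified to a vector, whereas the rpois example returns a list because each call produces a different number of values.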

Q. How performant is a “for” loop compared to, say, an “apply” function?

A. Generally, the R language is optimized for vectorized operations, and it is quite possible to write very underperforming code using (nested) for loops. The “apply” family of functions can add some gains in terms of both performance and code maintenance. This will be discussed further in Hour 18.
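
As a small illustration of the three styles mentioned (not a benchmark), the sketch below squares the same values with a “for” loop, with sapply, and with vectorized arithmetic. The names viaLoop, viaSapply, and viaVector are introduced here; no timings are shown or implied.

```r
x <- 1:10

# 1. "for" loop filling a pre-allocated vector
viaLoop <- numeric(length(x))
for (i in seq_along(x)) {
  viaLoop[i] <- x[i]^2
}

# 2. sapply with an anonymous function
viaSapply <- sapply(x, function(value) value^2)

# 3. Vectorized arithmetic on the whole vector at once
viaVector <- x^2

# All three approaches give the same values
all.equal(viaLoop, viaSapply)
all.equal(viaLoop, viaVector)
```

Where a vectorized form exists, as it does here, it is usually the shortest to write and the easiest to read.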

Workshop

The workshop contains quiz questions and exercises to help you solidify your understanding of the material covered. Try to answer all questions before looking at the “Answers” section that follows.

Quiz

1. What is the difference between a “for” and a “while” loop?

2. If you use a for loop to iterate over a vector of (character) column names, how would you use each value to reference a column in a data frame?

3. When using the apply function, what does the MARGIN argument control?

4. How do you pass additional arguments to a function you wish to “apply”?

5. What is the difference between sapply and lapply?

6. What does the split function do, and how can you use it in conjunction with lapply/sapply?

7. When using tapply, how do you specify that a summary is to be performed “by” more than one variable?

Answers

1. A “for” loop will iterate over a predefined set of values. A “while” loop instead iterates until a specified condition is no longer true.

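2. Use the value inside square brackets rather than with the $ operator. If the loop variable is called col, then df[[col]] (or df[, col]) returns the column whose name is stored in col, whereas df$col looks for a column that is literally named “col.”

A related point when a loop body tests a condition: if the condition results in a single missing value, then an error is returned: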

> testMissing <- function(X) {
+   if (X > 0) cat("Success")
+ }
> testMissing(NA)
Error in if (X > 0) cat("Success") :
  missing value where TRUE/FALSE needed

If you use the all function with a condition that contains missing values, and every non-missing value meets the condition, the result is missing and will therefore also result in an error when used in an if statement (since you do not know if “all” the conditions are met):

> allMissings <- rep(NA, 5)   # All missing values
> someMissings <- c(NA, 1:4)  # Some missing values
> all(allMissings > 0)
[1] NA
> all(someMissings > 0)
[1] NA

If you use the any function with a condition in which every value is missing, the result is a missing value. If, however, you use the any function with a vector where not all values are missing, some conditions may be met:

> any(allMissings > 0)
[1] NA
> any(someMissings > 0)
[1] TRUE

3. The MARGIN argument controls the dimension over which you want to apply your function (for example, 1 for rows, 2 for columns).

4. Each “apply” function has an ellipsis argument where you list additional arguments. For example, apply(X, 2, min, na.rm = TRUE) passes na.rm = TRUE on to min.

5. The lapply function applies a function to elements of a list (or vector) and (always) returns its results in a list. The sapply function performs exactly the same actions but, where possible, will try to simplify the output (for example, as a vector or array).

6. The split function will take a data object (typically a vector or data frame) and break it into parts based on one or more grouping variables, storing the results as a list. When the results are “broken” into a list structure, we can use lapply or sapply to apply a function to each element—for example, you can calculate the mean Y by levels of X using the following:

sapply(split(Y, X), mean)

7. You can specify multiple “by” variables using a list as follows:

tapply(Y, list(X1, X2), mean)

Activities

1. Create a “for” loop that iteratively prints each element of LETTERS on a new line.

2. Create a “for” loop that prints the mean mpg value (from the mtcars dataset) for each unique level of the carb variable.

3. Look at the provided WorldPhones matrix, which contains the total number of phones in different regions of the world between 1951 and 1961. Use the apply function to calculate the total number of phones by year and the maximum number of phones by region.

4. Create a list containing three numeric vectors. Use lapply or sapply to print the median value from each element of the list.

5. Use split together with sapply to calculate the median value of mpg (from the mtcars data) by levels of carb.

6. Use split together with lapply to calculate a summary (?summary) of the iris data by levels of Species.
