In the previous chapter, you learned about a number of built-in functions and various packages tailored for data manipulation. Although these packages rely on different techniques and may be built under different philosophies, they all make data filtering and aggregation much easier.
However, data processing is more than simple filtering and aggregating. Sometimes, it involves simulation and other computation-intensive tasks. Compared to high-performance programming languages such as C and C++, R is much slower due to its dynamic design and its current implementation, which prioritizes stability, ease of use, and power in statistical analysis and visualization over raw performance and language features. However, well-written R code can still be fast enough for most purposes.
In this chapter, I'll demonstrate techniques to help you write high-performance R code.
From the very beginning, R has been designed for statistical computing and data visualization, and it is widely used in both academia and industry. For most data analysis purposes, correctness is more important than performance. In other words, getting a correct result in 1 minute is better than getting an incorrect one in 20 seconds: a threefold speedup is worth nothing if the answer is wrong. Therefore, performance should not be a concern until you are sure about the correctness of your code.
Let's assume that you are 100 percent sure that your code is correct, but it runs a bit slowly. Is it necessary to optimize the code so that it runs faster? Well, it depends. Before making a decision, it is helpful to divide the time of problem solving into three parts: development time, execution time, and future maintenance time.
Suppose we have been working on a problem for an hour. Since we didn't take performance into account at the beginning, the code does not run very fast. It takes us 50 minutes to think about the problem and implement the solution, and then 1 minute to run it and produce an answer. Since the code aligns well with the problem and looks straightforward, future improvements can be easily integrated into the solution, so it takes less time to maintain.
Then, suppose another developer works on the same problem but attempts to write extremely high-performance code from the beginning. It takes her time to work out a solution to the problem, but much more time to optimize the structure of the code so that it runs faster. It may take two hours to think of and implement a high-performance solution, which then takes 0.1 seconds to run and produce an answer. Since the code is heavily optimized to squeeze the hardware, it is probably not flexible enough for future refinement, especially when the problem is updated, which costs more time in maintenance.
The second developer can happily claim that her code runs 600 times faster than ours, but the optimization may not be worth it because it costs much more human time. In many cases, human time is more expensive than computer time.
However, if the code is frequently used, say, if billions of iterations are required, a small performance improvement in each iteration can save a large amount of time. In this case, code performance really matters.
Let's take the example of a simple algorithm that produces the cumulative sum of a numeric vector, that is, each element of the output vector is the sum of all elements of the input vector up to and including the same position. The code will be examined in different contexts in the discussion that follows.
Although R provides a built-in function, cumsum(), to do this, we will implement an R version for now to help understand performance issues. The algorithm is easy to implement as follows:
x <- c(1, 2, 3, 4, 5)
y <- numeric()
sum_x <- 0
for (xi in x) {
  sum_x <- sum_x + xi
  y <- c(y, sum_x)
}
y
## [1]  1  3  6 10 15
The algorithm only uses a for loop to accumulate each element of the input vector x into sum_x. In each iteration, it appends sum_x to the output vector y. We can rewrite the algorithm as the following function:
my_cumsum1 <- function(x) {
  y <- numeric()
  sum_x <- 0
  for (xi in x) {
    sum_x <- sum_x + xi
    y <- c(y, sum_x)
  }
  y
}
An alternative implementation is to use an index to access the input vector x and to access and modify the output vector y:
my_cumsum2 <- function(x) {
  y <- numeric(length(x))
  if (length(y)) {
    y[[1]] <- x[[1]]
    for (i in seq_along(x)[-1]) {
      y[[i]] <- y[[i - 1]] + x[[i]]
    }
  }
  y
}
We know that R provides a built-in function, cumsum(), that does exactly the same thing. The two preceding implementations should yield exactly the same results as cumsum(). Here, we will generate some random numbers and check whether the three functions are consistent:
x <- rnorm(100)
all.equal(cumsum(x), my_cumsum1(x))
## [1] TRUE
all.equal(cumsum(x), my_cumsum2(x))
## [1] TRUE
In the preceding code, all.equal() checks whether all corresponding elements of two vectors are (nearly) equal. From the results, we are sure that my_cumsum1(), my_cumsum2(), and cumsum() are consistent. In the next section, we'll measure the time required by each version of cumsum.
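Note that all.equal() compares numeric vectors with a small tolerance (about 1.5e-8 by default) rather than bit-for-bit, which is exactly what we want here: summing floating-point numbers in a different order can produce results that differ in the last few bits. A minimal sketch (my example, not from the text) of why a tolerant comparison is preferable to an exact one:

```r
# 0.1 + 0.2 is not exactly 0.3 in double-precision arithmetic
x <- 0.1 + 0.2
identical(x, 0.3)          # FALSE: the bit patterns differ
all.equal(x, 0.3)          # TRUE: equal within numerical tolerance
isTRUE(all.equal(x, 0.3))  # use isTRUE() when branching on the result
```

When the values are not nearly equal, all.equal() returns a character description of the difference rather than FALSE, which is why isTRUE() is the safe way to use it in conditions.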
Although the three functions produce the same results given the same input, their performance differences can be quite obvious. To reveal these differences, we need tools to measure the execution time of code. The simplest one is system.time().
To measure the execution time of any expression, we simply wrap the code in this function. Here, we will measure how much time my_cumsum1() takes to compute over a numeric vector of 100 elements:
x <- rnorm(100)
system.time(my_cumsum1(x))
##    user  system elapsed
##       0       0       0
The timer reports three values: user, system, and elapsed. It is the user time we should pay the most attention to: it measures the CPU time charged for executing the code. For more details, run ?proc.time and see how these measures differ.
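Under the hood, system.time() is essentially a convenience wrapper around proc.time(): it takes a snapshot of the process clocks before and after evaluating the expression and returns the difference. A minimal sketch of doing this by hand (the workload here is arbitrary, chosen only for illustration):

```r
t0 <- proc.time()                   # snapshot of CPU and wall-clock time so far
invisible(sum(sqrt(seq_len(1e6))))  # an arbitrary workload to time
t1 <- proc.time()
t1 - t0                             # user, system, and elapsed, as in system.time()
```

This manual form is occasionally handy when the code to time spans several statements that are awkward to wrap in a single call.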
The results suggest that the code simply runs too fast to measure. We can try timing my_cumsum2(), and the results are mostly the same:
system.time(my_cumsum2(x))
##    user  system elapsed
##   0.000   0.000   0.001
The same thing happens with the built-in function cumsum() too:
system.time(cumsum(x))
##    user  system elapsed
##       0       0       0
The timing does not really work because the input is too small. Now, we will generate a vector of 1000 numbers and try again:
x <- rnorm(1000)
system.time(my_cumsum1(x))
##    user  system elapsed
##   0.000   0.000   0.003
system.time(my_cumsum2(x))
##    user  system elapsed
##   0.004   0.000   0.001
system.time(cumsum(x))
##    user  system elapsed
##       0       0       0
Now, we are sure that my_cumsum1() and my_cumsum2() indeed take some time to compute the results, but they show no remarkable contrast. However, cumsum() is still too fast to measure.
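When a call finishes faster than the timer's resolution, one common workaround (my suggestion, not from the text) is to time many repetitions of it and divide by the repetition count; the n_rep value below is arbitrary:

```r
# Time n_rep repetitions to get a measurable total, then average
x <- rnorm(1000)
n_rep <- 1000
timing <- system.time(for (i in seq_len(n_rep)) cumsum(x))
timing[["elapsed"]] / n_rep  # rough per-call time in seconds
```

This is crude because the loop itself adds overhead, which is one motivation for the microbenchmark package introduced later in this chapter.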
We will again use a larger input for all three functions and see whether their performance difference can be revealed:
x <- rnorm(10000)
system.time(my_cumsum1(x))
##    user  system elapsed
##   0.208   0.000   0.211
system.time(my_cumsum2(x))
##    user  system elapsed
##   0.012   0.004   0.013
system.time(cumsum(x))
##    user  system elapsed
##       0       0       0
The result is quite clear: my_cumsum1() looks more than 10 times slower than my_cumsum2(), and cumsum() is still far faster than both of our implementations.
Note that the performance difference may not be constant, especially when we provide even larger inputs as follows:
x <- rnorm(100000)
system.time(my_cumsum1(x))
##    user  system elapsed
##  25.732   0.964  26.699
system.time(my_cumsum2(x))
##    user  system elapsed
##   0.124   0.000   0.123
system.time(cumsum(x))
##    user  system elapsed
##       0       0       0
The preceding results make quite an astonishing contrast: my_cumsum1() can be 200 times slower than my_cumsum2() when the input vector contains about 100,000 elements. The cumsum() function remains consistently fast in all previous results.
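A plausible explanation for this widening gap (my reasoning, not spelled out above): y <- c(y, sum_x) allocates a new vector and copies all existing elements on every iteration, so my_cumsum1() does roughly quadratic work in total, whereas the preallocated my_cumsum2() does linear work. The effect is easy to isolate with two toy functions:

```r
grow <- function(n) {      # appends: copies the whole of y on every iteration
  y <- numeric(0)
  for (i in seq_len(n)) y <- c(y, i)
  y
}
prealloc <- function(n) {  # writes into storage allocated once up front
  y <- numeric(n)
  for (i in seq_len(n)) y[[i]] <- i
  y
}
system.time(grow(50000))      # noticeably slower as n grows
system.time(prealloc(50000))  # roughly linear in n, much faster
```

The lesson generalizes: in R, preallocate output vectors whenever their final length is known in advance.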
The system.time() function can help measure the execution time of a code chunk, but it is not very accurate. On the one hand, each measurement can yield a different value, so we should repeat the timing enough times to make a valid comparison. On the other hand, the resolution of the timer may not be high enough to capture the real performance difference in the code of interest.
A package named microbenchmark serves as a more accurate solution for comparing the performance of different expressions. To install the package, run the following code:
install.packages("microbenchmark")
When the package is ready, we will load it and call microbenchmark() to directly compare the performance of the three functions:
library(microbenchmark)
x <- rnorm(100)
microbenchmark(my_cumsum1(x), my_cumsum2(x), cumsum(x))
## Unit: nanoseconds
##           expr    min       lq      mean   median       uq
##  my_cumsum1(x)  58250  64732.5  68353.51  66396.0  71840.0
##  my_cumsum2(x) 120150 127634.5 131042.40 130739.5 133287.5
##      cumsum(x)    295    376.5    593.47    440.5    537.5
##     max neval cld
##   88228   100  b
##  152845   100   c
##    7182   100 a
Note that microbenchmark(), by default, runs each expression 100 times so that it can provide more quantiles of the execution time. Perhaps to your surprise, my_cumsum1() is a bit faster than my_cumsum2() when the input vector has 100 elements. Also, note that the times are reported in nanoseconds (1 second is 1,000,000,000 nanoseconds).
Then, we will try an input of 1000 numbers:
x <- rnorm(1000)
microbenchmark(my_cumsum1(x), my_cumsum2(x), cumsum(x))
## Unit: microseconds
##           expr      min        lq       mean    median
##  my_cumsum1(x) 1600.186 1620.5190 2238.67494 1667.5605
##  my_cumsum2(x) 1034.973 1068.4600 1145.00544 1088.4090
##      cumsum(x)    1.806    2.1505    3.43945    3.4405
##         uq      max neval cld
##  3142.4610 3750.516   100   c
##  1116.2280 2596.908   100  b
##     4.0415   11.007   100 a
Now, my_cumsum2() is a bit faster than my_cumsum1(), but both are far slower than the built-in cumsum(). Note that the unit is now microseconds.
For an input of 5000 numbers, the performance difference between my_cumsum1() and my_cumsum2() gets even greater:
x <- rnorm(5000)
microbenchmark(my_cumsum1(x), my_cumsum2(x), cumsum(x))
## Unit: microseconds
##           expr       min        lq        mean     median
##  my_cumsum1(x) 42646.201 44043.050 51715.59988 44808.9745
##  my_cumsum2(x)  5291.242  5364.568  5718.19744  5422.8950
##      cumsum(x)    10.183    11.565    14.52506    14.6765
##         uq        max neval cld
##  46153.351 135805.947   100   c
##   5794.821  10619.352   100  b
##     15.536     37.202   100 a
The same thing happens with an input of 10000 elements:
x <- rnorm(10000)
microbenchmark(my_cumsum1(x), my_cumsum2(x), cumsum(x), times = 10)
## Unit: microseconds
##           expr        min         lq        mean     median
##  my_cumsum1(x) 169609.730 170687.964 198782.7958 173248.004
##  my_cumsum2(x)  10682.121  10724.513  11278.0974  10813.395
##      cumsum(x)     20.744     25.627     26.0943     26.544
##         uq        max neval cld
##  253662.89 264469.677    10   b
##   11588.99  13487.812    10  a
##      27.64     29.163    10  a
In all previous benchmarks, the performance of cumsum() looks very stable: its execution time does not increase significantly as the length of the input increases.
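This stability reflects the fact that cumsum() runs as compiled C code with only constant interpreter overhead per call. As an aside (my addition, not part of the chapter's comparison), base R can also express the same algorithm without an explicit loop via Reduce() with accumulate = TRUE, though this still pays interpreter overhead per element and remains far slower than the built-in:

```r
# A loop-free pure-R variant: accumulate partial sums with Reduce()
my_cumsum3 <- function(x) Reduce(`+`, x, accumulate = TRUE)

x <- rnorm(100)
all.equal(cumsum(x), my_cumsum3(x))
## [1] TRUE
```

Like my_cumsum2(), this version avoids growing the result one append at a time, but it cannot compete with vectorized C code.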
To better understand the performance dynamics of the three functions, we will create the following function to visualize how they perform given inputs of different lengths:
library(data.table)
benchmark <- function(ns, times = 30) {
  results <- lapply(ns, function(n) {
    x <- rnorm(n)
    result <- microbenchmark(my_cumsum1(x), my_cumsum2(x), cumsum(x),
      times = times, unit = "ms")
    data <- setDT(summary(result))
    data[, n := n]
    data
  })
  rbindlist(results)
}
The logic of the function is straightforward: ns is a vector of all the input lengths we want to test these functions with. Note that microbenchmark() returns a data frame of all test results, and summary(microbenchmark()) returns the summary table we saw previously. We tag each summary with n, stack all the benchmark results, and use the ggplot2 package to visualize them.
First, we will run the benchmark for inputs from 100 to 3000 elements in steps of 100:
benchmarks <- benchmark(seq(100, 3000, 100))
Then, we will create a plot to show the contrast between the performance of the three functions:
library(ggplot2)
ggplot(benchmarks, aes(x = n, color = expr)) +
  ggtitle("Microbenchmark on cumsum functions") +
  geom_point(aes(y = median)) +
  geom_errorbar(aes(ymin = lq, ymax = uq))
This produces the following benchmarks of the three versions of cumsum we intend to compare:
In the preceding chart, we put together the results of all three functions. The dots indicate the medians, and the error bars show the 25th and 75th percentiles.
It is very clear that the performance of my_cumsum1() degrades increasingly fast for longer input, the performance of my_cumsum2() degrades almost linearly as the input gets longer, while cumsum(x) is extremely fast, and its performance does not seem to decay noticeably as the input gets longer.
For small input, my_cumsum1() can be faster than my_cumsum2(), as we demonstrated earlier. We can run a benchmark that focuses on small input:
benchmarks2 <- benchmark(seq(2, 600, 10), times = 50)
This time, we limit the length of the input vector from 2 to 600 in steps of 10. Since the functions will be benchmarked at almost twice as many input lengths as in the previous run, to keep the total execution time down, we reduce times from microbenchmark()'s default of 100 to 50:
ggplot(benchmarks2, aes(x = n, color = expr)) +
  ggtitle("Microbenchmark on cumsum functions over small input") +
  geom_point(aes(y = median)) +
  geom_errorbar(aes(ymin = lq, ymax = uq))
The following graphic illustrates the performance difference for smaller inputs:
From the chart, we can see that for small inputs of fewer than around 400 numbers, my_cumsum1() is faster than my_cumsum2(). The performance of my_cumsum1() decays much faster than that of my_cumsum2() as the input gains more elements.
The dynamics of the performance ranking can be better illustrated by benchmarking inputs from 10 to 800 elements:
benchmarks3 <- benchmark(seq(10, 800, 10), times = 50)
ggplot(benchmarks3, aes(x = n, color = expr)) +
  ggtitle("Microbenchmark on cumsum functions with break even") +
  geom_point(aes(y = median)) +
  geom_errorbar(aes(ymin = lq, ymax = uq))
The plot generated is shown as follows:
In conclusion, a small difference in implementation may result in a big performance gap. For small input, the gap is usually not obvious, but as the input gets larger, the performance difference can become very significant and thus should not be ignored. To compare the performance of multiple expressions, we can use microbenchmark() instead of system.time() to get more accurate and more useful results.