Chapter 13. High-Performance Computing

In the previous chapter, you learned about a number of built-in functions and various packages tailored for data manipulation. Although these packages rely on different techniques and may be built under a different philosophy, they all make data filtering and aggregating much easier.

However, data processing is more than simple filtering and aggregating. Sometimes, it involves simulation and other computation-intensive tasks. Compared with high-performance languages such as C and C++, R is much slower because its dynamic design and current implementation prioritize stability, ease of use, and power in statistical analysis and visualization over raw performance. Nevertheless, well-written R code can still be fast enough for most purposes.

In this chapter, I'll demonstrate the following techniques to help you write R code with high performance:

  • Measuring code performance
  • Profiling code to find bottlenecks
  • Using built-in functions and vectorization
  • Using multiple cores through parallel computing
  • Writing C++ with Rcpp and related packages

Understanding code performance issues

From the very beginning, R was designed for statistical computing and data visualization, and it is widely used in academia and industry. For most data analysis purposes, correctness is more important than performance. In other words, a correct result produced in 1 minute is better than an incorrect one produced in 20 seconds: a fast but wrong answer is no more useful for being fast. Therefore, performance should not be a concern before you are sure about the correctness of your code.

Let's assume that you are 100 percent sure that your code is correct but it runs a bit slowly. Is it necessary to optimize the code so that it runs faster? Well, it depends. Before making a decision, it is helpful to divide the time spent solving a problem into three parts: development time, execution time, and future maintenance time.

Suppose we have been working on a problem for an hour. Since we didn't take performance into account at the beginning, the code does not run very fast. It takes us 50 minutes to think about the problem and implement the solution. Then, it takes 1 minute to run and produce an answer. Since the code aligns well with the problem and looks straightforward, future improvements can be easily integrated into the solution, so it takes us less time to maintain.

Then, suppose another developer works on the same problem but attempts to write extremely high-performance code from the beginning. It takes time to work out a solution to the problem, but much more time to optimize the structure of the code so that it can run faster. It may take two hours to think of and implement a high-performance solution. Then, it takes 0.1 seconds to run and produce an answer. Since the code is heavily optimized to squeeze the hardware, it is probably not flexible for future refinement, especially when the problem changes, which will cost more time in maintenance.

The second developer can happily claim that her code has 600 times the performance of our code, but it may not be worth doing so because it may cost much more human time. In many cases, human time is more expensive than computer time.

However, if the code is frequently used, say, if billions of iterations are required, a small improvement of performance of each iteration can help save a large amount of time. In this case, code performance really matters.

Let's take the example of a simple algorithm that produces a vector of cumulative sums, that is, each element of the output vector is the sum of all input elements up to and including that position. The code will be examined in different contexts in the discussion that follows.

Although R provides a built-in function, cumsum(), to do this, we will implement an R version for the moment to help understand performance issues. The algorithm is easy to implement as follows:

x <- c(1, 2, 3, 4, 5) 
y <- numeric() 
sum_x <- 0 
for (xi in x) { 
  sum_x <- sum_x + xi 
  y <- c(y, sum_x) 
} 
y 
## [1]  1  3  6 10 15 

The algorithm only uses a for loop to accumulate each element of the input vector x into sum_x. In each iteration, it appends sum_x to the output vector y. We can rewrite the algorithm as the following function:

my_cumsum1 <- function(x) { 
  y <- numeric() 
  sum_x <- 0 
  for (xi in x) { 
    sum_x <- sum_x + xi 
    y <- c(y, sum_x) 
  } 
  y 
} 

An alternative implementation is to use indices to access the input vector x and to access and modify the output vector y:

my_cumsum2 <- function(x) { 
  y <- numeric(length(x)) 
  if (length(y)) { 
    y[[1]] <- x[[1]] 
    for (i in seq_along(x)[-1]) { 
      y[[i]] <- y[[i - 1]] + x[[i]] 
    } 
  } 
  y 
} 

We know that R provides a built-in function cumsum() to do exactly the same thing. The two preceding implementations should yield exactly the same results as cumsum(). Here, we will generate some random numbers and check whether they are consistent:

x <- rnorm(100) 
all.equal(cumsum(x), my_cumsum1(x)) 
## [1] TRUE 
all.equal(cumsum(x), my_cumsum2(x)) 
## [1] TRUE 

In the preceding code, all.equal() checks whether all corresponding elements of two vectors are equal, allowing for a small numeric tolerance. From the results, we can be sure that my_cumsum1(), my_cumsum2(), and cumsum() are consistent. In the next section, we'll measure the time required by each version of cumsum.
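The tolerance matters because floating-point arithmetic accumulates rounding error, so a strict equality test can fail even for mathematically identical results. A minimal illustration (not from the cumsum example itself):

```r
# all.equal() compares numbers within a small tolerance, while ==
# demands exact binary equality, which floating point rarely delivers.
x <- 0.1 + 0.2
isTRUE(all.equal(x, 0.3))  # TRUE: equal within tolerance
x == 0.3                   # FALSE: binary rounding error
```

This is why all.equal(), not ==, is the right tool for comparing numeric results.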

Measuring code performance

Although the three functions will output the same results given the same input, their performance difference can be quite obvious. To reveal the difference in performance, we need tools to measure the execution time of code. The simplest one is system.time().

To measure the execution time of any expression, we just wrap the code with the function. Here, we will measure how much time my_cumsum1() takes to compute over a numeric vector of 100 elements:

x <- rnorm(100) 
system.time(my_cumsum1(x)) 
##    user  system elapsed  
##       0       0       0 

The timer reports three values: user, system, and elapsed. It is the user time that deserves most of our attention: it measures the CPU time charged to the R process for executing the code. For more details, run ?proc.time and see the difference between these measures.
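To see how user time differs from elapsed time, we can time an expression that mostly waits rather than computes. This is an illustrative sketch, not part of the cumsum example:

```r
# Sys.sleep() waits without doing CPU work, so elapsed (wall-clock)
# time passes while user (CPU) time stays near zero.
t <- system.time(Sys.sleep(0.5))
t[["elapsed"]]    # roughly 0.5 seconds of wall-clock time
t[["user.self"]]  # close to 0, since almost no CPU work was done
```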

The results suggest that the code simply runs too fast to measure. We can try timing my_cumsum2(), and the results are mostly the same:

system.time(my_cumsum2(x)) 
##    user  system elapsed  
##   0.000   0.000   0.001 

The same thing happens with the built-in function cumsum() too:

system.time(cumsum(x)) 
##    user  system elapsed  
##       0       0       0 

The timing does not really work because the input is too small. Now, we will generate a vector of 1000 numbers and do it again:

x <- rnorm(1000) 
system.time(my_cumsum1(x)) 
##    user  system elapsed  
##   0.000   0.000   0.003 
system.time(my_cumsum2(x)) 
##    user  system elapsed  
##   0.004   0.000   0.001 
system.time(cumsum(x)) 
##    user  system elapsed  
##       0       0       0 

Now, we are sure that my_cumsum1() and my_cumsum2() indeed take some measurable time to compute the results, though the contrast between them is not yet remarkable. However, cumsum() is still too fast to measure.

We will again use a larger input for all three functions and see whether their performance difference can be revealed:

x <- rnorm(10000) 
system.time(my_cumsum1(x)) 
##    user  system elapsed  
##   0.208   0.000   0.211 
system.time(my_cumsum2(x)) 
##    user  system elapsed  
##   0.012   0.004   0.013 
system.time(cumsum(x)) 
##    user  system elapsed  
##       0       0       0 

The result is quite clear: my_cumsum1() is more than 10 times slower than my_cumsum2(), and cumsum() is still far faster than both of our implementations.

Note that the performance difference may not be constant, especially when we provide even larger inputs as follows:

x <- rnorm(100000) 
system.time(my_cumsum1(x)) 
##    user  system elapsed  
##  25.732   0.964  26.699 
system.time(my_cumsum2(x)) 
##    user  system elapsed  
##   0.124   0.000   0.123 
system.time(cumsum(x)) 
##    user  system elapsed  
##       0       0       0 

The preceding results make quite an astonishing contrast: my_cumsum1() can be 200 times slower than my_cumsum2() when the input vector has about 100,000 elements. The cumsum() function is consistently fast in all the previous results.
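The gap is explained by how the two functions build their output: y <- c(y, sum_x) in my_cumsum1() allocates a new vector and copies all existing elements on every iteration, so the total work grows quadratically with the input length, whereas my_cumsum2() writes into a preallocated vector and does only linear work. A stripped-down sketch of the two strategies (hypothetical helper functions for illustration):

```r
# Growing a vector with c() copies it on every iteration: O(n^2) total work.
grow <- function(n) {
  y <- numeric()
  for (i in seq_len(n)) y <- c(y, i)
  y
}
# Writing into a preallocated vector does O(n) work in total.
prealloc <- function(n) {
  y <- numeric(n)
  for (i in seq_len(n)) y[[i]] <- i
  y
}
identical(grow(1000), prealloc(1000))  # TRUE: same result, very different cost
```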

The system.time() function can help measure the execution time of a code chunk, but it is not very accurate. On the one hand, each run of the measurement can produce a different value, so we should repeat the timing enough times to make a valid comparison. On the other hand, the resolution of the timer may not be high enough to reveal the real difference in performance of the code of interest.
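One crude way to work around both issues without extra packages is to repeat the expression many times inside a single system.time() call and divide. This is a rough sketch; the per-call figure is only an average and still includes the loop and rnorm() overhead:

```r
# Repeat a fast expression many times so the total time is measurable,
# then divide to estimate the average per-call time.
n_rep <- 1000
total <- system.time(
  for (i in seq_len(n_rep)) cumsum(rnorm(100))
)[["elapsed"]]
per_call <- total / n_rep  # average seconds per call, including overhead
```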

A package named microbenchmark provides a more accurate way to compare the performance of different expressions. To install the package, run the following code:

install.packages("microbenchmark") 

When the package is ready, we will load the package and call microbenchmark() to directly compare the performance of the three functions:

library(microbenchmark) 
x <- rnorm(100) 
microbenchmark(my_cumsum1(x), my_cumsum2(x), cumsum(x)) 
## Unit: nanoseconds 
##           expr    min       lq      mean   median       uq 
##  my_cumsum1(x)  58250  64732.5  68353.51  66396.0  71840.0 
##  my_cumsum2(x) 120150 127634.5 131042.40 130739.5 133287.5 
##      cumsum(x)    295    376.5    593.47    440.5    537.5 
##     max neval cld 
##   88228   100  b  
##  152845   100   c 
##    7182   100 a 

Note that microbenchmark(), by default, runs each expression 100 times so that it can provide more quantiles of the execution time. Maybe to your surprise, my_cumsum1() is a bit faster than my_cumsum2() when the input vector has 100 elements. Also, note that the unit of the time numbers is nanoseconds (1 second is 1,000,000,000 nanoseconds).

Then, we will try an input of 1000 numbers:

x <- rnorm(1000) 
microbenchmark(my_cumsum1(x), my_cumsum2(x), cumsum(x)) 
## Unit: microseconds 
##           expr      min        lq       mean    median 
##  my_cumsum1(x) 1600.186 1620.5190 2238.67494 1667.5605 
##  my_cumsum2(x) 1034.973 1068.4600 1145.00544 1088.4090 
##      cumsum(x)    1.806    2.1505    3.43945    3.4405 
##         uq      max neval cld 
##  3142.4610 3750.516   100   c 
##  1116.2280 2596.908   100  b  
##     4.0415   11.007   100 a 

Now, my_cumsum2() gets a bit faster than my_cumsum1(), but both are way slower than the built-in cumsum(). Note that the unit becomes microseconds now.

For input of 5000 numbers, the performance difference between my_cumsum1() and my_cumsum2() gets even greater:

x <- rnorm(5000) 
microbenchmark(my_cumsum1(x), my_cumsum2(x), cumsum(x)) 
## Unit: microseconds 
##           expr       min        lq        mean     median 
##  my_cumsum1(x) 42646.201 44043.050 51715.59988 44808.9745 
##  my_cumsum2(x)  5291.242  5364.568  5718.19744  5422.8950 
##      cumsum(x)    10.183    11.565    14.52506    14.6765 
##         uq        max neval cld 
##  46153.351 135805.947   100   c 
##   5794.821  10619.352   100  b  
##     15.536     37.202   100 a 

The same thing happens with an input of 10000 elements:

x <- rnorm(10000) 
microbenchmark(my_cumsum1(x), my_cumsum2(x), cumsum(x), times = 10) 
## Unit: microseconds 
##           expr        min         lq        mean     median 
##  my_cumsum1(x) 169609.730 170687.964 198782.7958 173248.004 
##  my_cumsum2(x)  10682.121  10724.513  11278.0974  10813.395 
##      cumsum(x)     20.744     25.627     26.0943     26.544 
##         uq        max neval cld 
##  253662.89 264469.677    10   b 
##   11588.99  13487.812    10  a  
##      27.64     29.163    10  a 

In all previous benchmarks, the performance of cumsum() looks very stable and does not increase significantly as the length of input increases.
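The reason cumsum() stays flat in these benchmarks is that it is not written in R at all: it is a primitive whose loop runs in compiled C code inside the R interpreter. You can verify this from the console:

```r
# Printing a primitive shows .Primitive rather than R source code
cumsum
## function (x)  .Primitive("cumsum")
is.primitive(cumsum)  # TRUE
```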

To better understand the performance dynamics of the three functions, we will create the following function to measure how they perform given inputs of different lengths:

library(data.table) 
benchmark <- function(ns, times = 30) { 
  results <- lapply(ns, function(n) { 
    x <- rnorm(n) 
    result <- microbenchmark(my_cumsum1(x), my_cumsum2(x), cumsum(x), 
      times = times, unit = "ms") 
    data <- setDT(summary(result)) 
    data[, n := n] 
    data 
  }) 
  rbindlist(results) 
} 

The logic of the function is straightforward: ns is a vector of all the input lengths we want to test the functions with. Note that microbenchmark() returns a data frame of all test results, and summary() on that data frame returns the summary table we saw previously. We tag each summary with n, stack all the benchmark results with rbindlist(), and then use the ggplot2 package to visualize them.

First, we will run the benchmark for input lengths from 100 to 3000 in steps of 100:

benchmarks <- benchmark(seq(100, 3000, 100)) 

Then, we will create a plot to show contrast between the performance of the three functions:

library(ggplot2) 
ggplot(benchmarks, aes(x = n, color = expr)) + 
  ggtitle("Microbenchmark on cumsum functions") + 
  geom_point(aes(y = median)) + 
  geom_errorbar(aes(ymin = lq, ymax = uq)) 

This produces the following benchmarks of the three versions of cumsum we intend to compare:

(Figure: Microbenchmark on cumsum functions)

In the preceding chart, we put together the results of all three functions. The dots indicate the median, and the error bars show the 25th and 75th percentiles.

It is very clear that the running time of my_cumsum1() grows faster than linearly as the input gets longer, that of my_cumsum2() grows roughly linearly, while cumsum() is extremely fast and its running time does not seem to grow significantly with the input length.

For small input, my_cumsum1() can be faster than my_cumsum2(), as we demonstrated earlier. We can do a benchmarking that focuses more on small input:

benchmarks2 <- benchmark(seq(2, 600, 10), times = 50) 

This time, we limit the length of the input vector to the range from 2 to 600 in steps of 10. Since the functions are benchmarked over roughly twice as many input lengths as in the previous run, we set times = 50 to keep the total execution time down:

ggplot(benchmarks2, aes(x = n, color = expr)) + 
  ggtitle("Microbenchmark on cumsum functions over small input") + 
  geom_point(aes(y = median)) + 
  geom_errorbar(aes(ymin = lq, ymax = uq)) 

The following chart illustrates the performance difference at smaller inputs:

(Figure: Microbenchmark on cumsum functions over small input)

From the chart, we can see that for small inputs of fewer than around 400 numbers, my_cumsum1() is faster than my_cumsum2(). However, the performance of my_cumsum1() decays much faster than that of my_cumsum2() as the input gains more elements.

The dynamics of performance ranking can be better illustrated by a benchmarking of input from 10 to 800 elements:

benchmarks3 <- benchmark(seq(10, 800, 10), times = 50) 
ggplot(benchmarks3, aes(x = n, color = expr)) + 
  ggtitle("Microbenchmark on cumsum functions with break even") + 
  geom_point(aes(y = median)) + 
  geom_errorbar(aes(ymin = lq, ymax = uq)) 

The plot generated is shown as follows:

(Figure: Microbenchmark on cumsum functions with break even)

In conclusion, a small difference in implementation may result in big performance gaps. For a small input, the gap is usually not obvious, but when the input gets larger, the performance difference can be very significant and thus should not be ignored. To compare the performance of multiple expressions, we can use microbenchmark instead of system.time() to get more accurate and more useful results.
