Boosting code performance

In the previous section, we demonstrated how to use profiling tools to identify a performance bottleneck in the code. In this section, you will learn about a number of approaches to boosting code performance.

Using built-in functions

Previously, we demonstrated the performance difference between my_cumsum1(), my_cumsum2(), and the built-in function cumsum(). Although my_cumsum2() is faster than my_cumsum1(), cumsum() is much faster than both when the input vector contains many numbers. Also, its performance does not decay significantly even as the input gets longer. If we evaluate cumsum, we can see that it is a primitive function:

cumsum 
## function (x)  .Primitive("cumsum") 

A primitive function in R is implemented in C/C++/Fortran and compiled to native instructions, and is thus extremely efficient. Another example is diff(). Here, we will implement the computation of a vector's difference sequence in R:

diff_for <- function(x) { 
  n <- length(x) - 1 
  res <- numeric(n) 
  for (i in seq_len(n)) { 
    res[[i]] <- x[[i + 1]] - x[[i]] 
  } 
  res 
} 

We can verify that the implementation is correct:

diff_for(c(2, 3, 1, 5)) 
## [1]  1 -2  4 

Both diff_for() and the built-in diff() should, therefore, return the same result for the same input:

x <- rnorm(1000) 
all.equal(diff_for(x), diff(x)) 
## [1] TRUE 

However, there's a big gap in performance between the two functions.

microbenchmark(diff_for(x), diff(x)) 
## Unit: microseconds 
##         expr      min        lq       mean    median 
##  diff_for(x) 1034.028 1078.9075 1256.01075 1139.1270 
##      diff(x)   12.057   14.2535   21.72772   17.5705 
##         uq      max neval cld 
##  1372.1145 2634.128   100   b 
##    25.4525   75.850   100  a 

Built-in functions are, in most cases, way faster than equivalent R implementations. This is true not only for vector functions, but also for matrices. For example, here is a simple 3 by 4 integer matrix:

mat <- matrix(1:12, nrow = 3) 
mat 
##      [,1] [,2] [,3] [,4] 
## [1,]    1    4    7   10 
## [2,]    2    5    8   11 
## [3,]    3    6    9   12 

We can write a function to transpose the matrix:

my_transpose <- function(x) {
  stopifnot(is.matrix(x))
  res <- matrix(vector(mode(x), length(x)),
    nrow = ncol(x), ncol = nrow(x),
    dimnames = dimnames(x)[c(2, 1)])
  for (i in seq_len(ncol(x))) {
    for (j in seq_len(nrow(x))) {
      res[i, j] <- x[j, i]
    }
  }
  res
}

In the function, we first create a matrix of the same type as the input, but with the numbers of rows and columns (and their dimension names) exchanged. Then, we iterate over the columns and rows to transpose the matrix:

my_transpose(mat) 
##      [,1] [,2] [,3] 
## [1,]    1    2    3 
## [2,]    4    5    6 
## [3,]    7    8    9 
## [4,]   10   11   12 

The built-in function for matrix transposition is t(). We can easily verify that both functions return the same results:

all.equal(my_transpose(mat), t(mat)) 
## [1] TRUE 

However, they exhibit a great difference in performance:

microbenchmark(my_transpose(mat), t(mat)) 
## Unit: microseconds 
##               expr    min     lq     mean  median      uq 
##  my_transpose(mat) 22.795 24.633 29.47941 26.0865 35.5055 
##             t(mat)  1.576  1.978  2.87349  2.3375  2.7695 
##     max neval cld 
##  71.509   100   b 
##  16.171   100  a 

The performance difference gets even more significant when the input matrix is larger. Here, we will create a new matrix with 1000 rows and 25 columns. While the results are the same, the performance can be very different:

mat <- matrix(rnorm(25000), nrow = 1000) 
all.equal(my_transpose(mat), t(mat)) 
## [1] TRUE 
microbenchmark(my_transpose(mat), t(mat)) 
## Unit: microseconds 
##               expr       min         lq      mean 
##  my_transpose(mat) 21786.241 22456.3990 24466.055 
##             t(mat)    36.611    46.2045    61.047 
##      median        uq        max neval cld 
##  23821.5905 24225.142 113395.811   100   b 
##     57.7505    68.694    142.126   100   a 

Note that t() is a generic function that works with both matrices and data frames. The S3 dispatch required to find the right method for the input also has some overhead. Therefore, directly calling t.default() on a matrix is slightly faster:

microbenchmark(my_transpose(mat), t(mat), t.default(mat)) 
## Unit: microseconds 
##               expr       min         lq        mean 
##  my_transpose(mat) 21773.751 22498.6420 23673.26089 
##             t(mat)    37.853    48.8475    63.57713 
##     t.default(mat)    35.518    41.0305    52.97680 
##      median         uq       max neval cld 
##  23848.6625 24139.7675 29034.267   100   b 
##     61.3565    69.6655   140.061   100   a  
##     46.3095    54.0655   146.755   100   a 

All the previous examples show that, in most cases, it is much better to use a built-in function, if one is provided, than to reinvent the wheel in R. These functions avoid the overhead of interpreted R code and, thus, can be extremely efficient even if the input is huge.

Using vectorization

A special subset of built-in functions are arithmetic operators such as +, -, *, /, ^, and %%. These operators are not only extremely efficient but also vectorized.
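For example, applied to vectors, these operators work element by element, and a shorter operand is recycled to match the longer one:

c(1, 2, 3) + c(10, 20, 30)
## [1] 11 22 33
c(1, 2, 3, 4) ^ 2
## [1]  1  4  9 16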

Suppose we implement + in R:

add <- function(x, y) {
  stopifnot(length(x) == length(y),
    is.numeric(x), is.numeric(y))
  z <- numeric(length(x))
  for (i in seq_along(x)) {
    z[[i]] <- x[[i]] + y[[i]]
  }
  z
}

Then, we will randomly generate x and y. Both add(x, y) and x + y should return exactly the same results:

x <- rnorm(10000) 
y <- rnorm(10000) 
all.equal(add(x, y), x + y) 
## [1] TRUE 

The following benchmarking shows that the performance difference is huge:

microbenchmark(add(x, y), x + y) 
## Unit: microseconds 
##       expr      min         lq        mean     median 
##  add(x, y) 9815.495 10055.7045 11478.95003 10712.7710 
##      x + y   10.260    12.0345    17.31862    13.3995 
##         uq       max neval cld 
##  12598.366 18754.504   100   b 
##     22.208    56.969   100  a 

Now, suppose we need to calculate the sum of the reciprocals of the squares of the first n positive integers, that is, 1/1^2 + 1/2^2 + ... + 1/n^2. We can easily implement the algorithm using a for loop, as in the following function, algo1_for():

algo1_for <- function(n) { 
  res <- 0 
  for (i in seq_len(n)) { 
    res <- res + 1 / i ^ 2 
  } 
  res 
} 

The function takes an input n, iterates n times to accumulate the sum, and returns the result.

A better approach is to use vectorized computation directly, without any need for a for loop, as algo1_vec() demonstrates:

algo1_vec <- function(n) { 
  sum(1 / seq_len(n) ^ 2) 
} 

The two functions yield the same results, given an ordinary input:

algo1_for(10) 
## [1] 1.549768 
algo1_vec(10) 
## [1] 1.549768 

However, their performance is very different:

microbenchmark(algo1_for(200), algo1_vec(200)) 
## Unit: microseconds 
##            expr    min       lq      mean   median      uq 
##  algo1_for(200) 91.727 101.2285 104.26857 103.6445 105.632 
##  algo1_vec(200)  2.465   2.8015   3.51926   3.0355   3.211 
##      max neval cld 
##  206.295   100   b 
##   19.426   100  a 
microbenchmark(algo1_for(1000), algo1_vec(1000)) 
## Unit: microseconds 
##             expr     min       lq      mean  median 
##  algo1_for(1000) 376.335 498.9320 516.63954 506.859 
##  algo1_vec(1000)   8.718   9.1175   9.82515   9.426 
##        uq      max neval cld 
##  519.2420 1823.502   100   b 
##    9.8955   20.564   100  a 

Vectorization is a highly recommended way of writing R code. It not only yields high performance but also makes the code easier to understand.

Using byte-code compiler

In the previous section, we saw the power of vectorization. Sometimes, however, the problem dictates a for loop, and it is hard to vectorize the code. In this case, we may consider using the R byte-code compiler to compile the function, so that it is executed as bytecode rather than interpreted expression by expression, which may make it run faster.

First, we will load the compiler package, which is distributed along with R. We will use cmpfun() to compile a given R function. For example, we will compile diff_for() and store the compiled function as diff_cmp():

library(compiler) 
diff_cmp <- cmpfun(diff_for) 
diff_cmp 
## function(x) { 
##   n <- length(x) - 1 
##   res <- numeric(n) 
##   for (i in seq_len(n)) { 
##     res[[i]] <- x[[i + 1]] - x[[i]] 
##   } 
##   res 
## } 
## <bytecode: 0x93aec08> 

When we look at diff_cmp(), it does not look very different from diff_for(), except for an additional line showing its bytecode address.

Then, we will run the benchmarking again with diff_cmp() this time:

x <- rnorm(10000) 
microbenchmark(diff_for(x), diff_cmp(x), diff(x)) 
## Unit: microseconds 
##         expr       min         lq       mean     median 
##  diff_for(x) 10664.387 10940.0840 11684.3285 11357.9330 
##  diff_cmp(x)   732.110   740.7610   760.1985   751.0295 
##      diff(x)    80.824    91.2775   107.8473   103.8535 
##        uq       max neval cld 
##  12179.98 16606.291   100   c 
##    763.66  1015.234   100  b  
##    115.11   219.396   100 a 

It looks amazing that the compiled version, diff_cmp(), is much faster than diff_for(), even though we didn't modify anything but simply compiled it into bytecode.

Now, we will do the same thing with algo1_for():

algo1_cmp <- cmpfun(algo1_for) 
algo1_cmp 
## function(n) { 
##   res <- 0 
##   for (i in seq_len(n)) { 
##     res <- res + 1 / i ^ 2 
##   } 
##   res 
## } 
## <bytecode: 0xa87e2a8> 

Then, we will conduct the benchmarking with the compiled version included:

n <- 1000 
microbenchmark(algo1_for(n), algo1_cmp(n), algo1_vec(n)) 
## Unit: microseconds 
##          expr     min       lq      mean   median       uq 
##  algo1_for(n) 490.791 499.5295 509.46589 505.7560 517.5770 
##  algo1_cmp(n)  55.588  56.8355  58.10490  57.8270  58.7140 
##  algo1_vec(n)   8.688   9.2150   9.79685   9.4955   9.8895 
##      max neval cld 
##  567.680   100   c 
##   69.734   100  b  
##   19.765   100 a 

Again, the compiled version is more than six times faster than the original version, even though we didn't change a single line of code.

However, compilation is no magic bullet when it is applied to a fully vectorized function. Here, we will compile algo1_vec() and compare its performance with that of the original version:

algo1_vec_cmp <- cmpfun(algo1_vec) 
microbenchmark(algo1_vec(n), algo1_vec_cmp(n), times = 10000) 
## Unit: microseconds 
##              expr  min    lq      mean median    uq 
##      algo1_vec(n) 8.47 8.678 20.454858  8.812 9.008 
##  algo1_vec_cmp(n) 8.35 8.560  9.701012  8.687 8.864 
##        max neval cld 
##  96376.483 10000   a 
##   1751.431 10000   a 

Note that the compiled function shows no significant performance improvement. To know more about how the compiler works, type ?compile and read the documentation.
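It is also worth knowing that the compiler package can compile functions automatically. In recent versions of R, just-in-time (JIT) compilation is enabled by default, so ordinary functions are byte-compiled after they have been called a few times, without any explicit call to cmpfun(). The following minimal sketch, assuming a reasonably recent R version, shows how to query and restore the JIT level with enableJIT():

library(compiler) 
old_level <- enableJIT(3)  # returns the previous JIT level (3 by default on recent R) 
enableJIT(old_level)       # restore the previous setting 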

Using Intel MKL-powered R distribution

The R distribution we normally use is single threaded, that is, only one CPU thread is used to execute all R code. The good thing is that the execution model is simple and safe, but it does not take advantage of multicore computing.

Microsoft R Open (MRO, see https://mran.microsoft.com/open/) is an enhanced distribution of R. Powered by Intel Math Kernel Library (MKL, see https://software.intel.com/en-us/intel-mkl), MRO enhances the matrix algorithms by automatically taking advantage of multithreading computation. On a multicore computer, MRO can be 10-80 times faster than the official R implementation at matrix multiplication, Cholesky factorization, QR decomposition, singular value decomposition, principal component analysis, and linear discriminant analysis. For more details, visit https://mran.microsoft.com/documents/rro/multithread/ and see the benchmarking.
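To get a feel for the difference on your own machine, you can time a few BLAS-backed matrix operations under both distributions. The following is a minimal sketch; the matrix size is arbitrary, and the timings you observe depend entirely on your hardware and the linked BLAS library:

n <- 2000 
A <- matrix(rnorm(n * n), n) 
B <- matrix(rnorm(n * n), n) 
system.time(A %*% B)        # matrix multiplication 
system.time(crossprod(A))   # t(A) %*% A 
system.time(svd(B))         # singular value decomposition 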

Using parallel computing

As we mentioned in the previous section, R is single-threaded by design but still allows multiprocess parallel computing, that is, running multiple R sessions to compute in parallel. This technique is supported by the parallel package, which is also distributed along with R.

Suppose we need to do a simulation: we need to generate a random path that follows a certain random process and see whether at any point, the value goes beyond a fixed margin around the starting point.

The following code generates one realization:

set.seed(1) 
sim_data <- 100 * cumprod(1 + rnorm(500, 0, 0.006)) 
plot(sim_data, type = "s", ylim = c(85, 115), 
  main = "A simulated random path") 
abline(h = 100, lty = 2, col = "blue") 
abline(h = 100 * (1 + 0.1 * c(1, -1)), lty = 3, col = "red") 

The plot generated is shown as follows:

The preceding graph shows the path and the 10 percent margin. It is clear that between index 300 and 500, the value goes beyond the upper margin multiple times.

This is just one path. A valid simulation requires that the generator run as many times as necessary to produce statistically meaningful results. The following function parameterizes the random path generator and returns a list of summary indicators of interest. Note that signal indicates whether any point on the path goes beyond the margin:

simulate <- function(i, p = 100, n = 10000,
  r = 0, sigma = 0.0005, margin = 0.1) {
  ps <- p * cumprod(1 + rnorm(n, r, sigma))
  list(id = i,
    first = ps[[1]],
    high = max(ps),
    low = min(ps),
    last = ps[[n]],
    signal = any(ps > p * (1 + margin) | ps < p * (1 - margin)))
}

Then, we can run the generator once and see its summarized result:

simulate(1) 
## $id 
## [1] 1 
##  
## $first 
## [1] 100.0039 
##  
## $high 
## [1] 101.4578 
##  
## $low 
## [1] 94.15108 
##  
## $last 
## [1] 96.13973 
##  
## $signal 
## [1] FALSE 

To perform the simulation, we need to run the function many times. In practice, we may need to run at least millions of realizations, which may take a considerable amount of time. Here, we will measure how much time it takes to run ten thousand iterations of this simulation:

system.time(res <- lapply(1:10000, simulate)) 
##    user  system elapsed  
##   8.768   0.000   8.768 

When the simulation is finished, we can convert all results into one data table:

library(data.table) 
res_table <- rbindlist(res) 
head(res_table) 
##    id     first     high      low      last signal 
## 1:  1 100.03526 100.7157 93.80330 100.55324  FALSE 
## 2:  2 100.03014 104.7150 98.85049 101.97831  FALSE 
## 3:  3  99.99356 104.9834 95.28500  95.59243  FALSE 
## 4:  4  99.93058 103.4315 96.10691  97.22223  FALSE 
## 5:  5  99.99785 100.6041 94.12958  95.97975  FALSE 
## 6:  6 100.03235 102.1770 94.65729  96.49873  FALSE 

We can calculate the realized probability of signal == TRUE:

res_table[, sum(signal) / .N] 
## [1] 0.0881 

What if the problem gets more realistic and requires us to run millions of iterations? In this case, some researchers may turn to programming languages with much higher performance, such as C and C++, which are extremely efficient and flexible. They are great tools for implementing algorithms but require more effort to deal with the compiler, linker, and data input/output.

Note that each iteration in the preceding simulation is completely independent of the others, so the task is well suited to parallel computing.

Since different operating systems have different process and threading models, some features that are available on Linux and MacOS are not available on Windows. Thus, performing parallel computing on Windows can be a bit more verbose.

Using parallel computing on Windows

On Windows, we need to create a local cluster of multiple R sessions to run parallel computing:

library(parallel) 
cl <- makeCluster(detectCores()) 

The detectCores() function returns the number of cores your computer is equipped with. Creating a cluster of more than that number of nodes is allowed but usually does no good because your computer cannot perform more tasks than that simultaneously.
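Note that detectCores() counts logical CPUs by default, which on hyper-threaded systems can be twice the number of physical cores. If you prefer one node per physical core, you can pass logical = FALSE, although on some platforms this may return NA:

detectCores()                 # number of logical CPUs 
detectCores(logical = FALSE)  # number of physical cores (NA on some platforms) 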

Then, we can call parLapply(), the parallel version of lapply():

system.time(res <- parLapply(cl, 1:10000, simulate)) 
##    user  system elapsed  
##   0.024   0.008   3.772 

Note that the elapsed time is reduced to less than half of the original time. Now, we no longer need the cluster. We can call stopCluster() to kill the R sessions we just created:

stopCluster(cl) 

When we call parLapply(), it automatically schedules the tasks across the cluster nodes. More specifically, each cluster node runs simulate() on its own exclusive subset of 1:10000, all at the same time, so the computation is done in parallel. Finally, all results are collected, and we get a list just like the result of lapply():

length(res) 
## [1] 10000 
res_table <- rbindlist(res) 
res_table[, sum(signal) / .N] 
## [1] 0.0889 
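By default, parLapply() splits the input into roughly equal chunks up front, one per node. When individual tasks have very uneven running times, the load-balancing variant parLapplyLB(), also provided by the parallel package, can be a better choice because it hands out work as nodes become free. A minimal sketch:

cl <- makeCluster(detectCores()) 
system.time(res_lb <- parLapplyLB(cl, 1:10000, simulate)) 
stopCluster(cl) 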

The parallel code looks simple because simulate() is self-contained and does not rely on user-defined external variables or datasets. If we run a function in parallel that refers to a variable in the master session (the current session that creates the cluster), it will not find the variable:

cl <- makeCluster(detectCores()) 
n <- 1 
parLapply(cl, 1:3, function(x) x + n) 
## Error in checkForRemoteErrors(val): 3 nodes produced errors; first error: object 'n' not found 
stopCluster(cl) 

All nodes fail because each of them starts as a fresh R session with no user variables defined. To let the cluster nodes access the values they need, we have to export the variables to all nodes.

The following example demonstrates how this works. Suppose we have a data frame of numbers. We want to take random samples from the data frame:

n <- 100 
data <- data.frame(id = 1:n,  x = rnorm(n), y = rnorm(n)) 
 
take_sample <- function(n) { 
  data[sample(seq_len(nrow(data)), 
    size = n, replace = FALSE), ] 
} 

If we perform the sampling in parallel, all nodes must have access to the data frame and the function. Before exporting them, we can use clusterEvalQ() to evaluate an expression on each cluster node and inspect its state. First, we will make a cluster just as we did earlier:

cl <- makeCluster(detectCores()) 

The Sys.getpid() function returns the process ID of the current R session. Since there are four nodes in the cluster, each is an R session with a unique process ID. We can call clusterEvalQ() with Sys.getpid() and see the process ID of each node:

clusterEvalQ(cl, Sys.getpid()) 
## [[1]] 
## [1] 20714 
##  
## [[2]] 
## [1] 20723 
##  
## [[3]] 
## [1] 20732 
##  
## [[4]] 
## [1] 20741 

To see the variables in the global environment of each node, we can call ls(), just as we do in our own working environment:

clusterEvalQ(cl, ls()) 
## [[1]] 
## character(0) 
##  
## [[2]] 
## character(0) 
##  
## [[3]] 
## character(0) 
##  
## [[4]] 
## character(0) 

As we mentioned, all cluster nodes are, by default, initialized with an empty global environment. To export data and take_sample to each node, we can call clusterExport():

clusterExport(cl, c("data", "take_sample")) 
clusterEvalQ(cl, ls()) 
## [[1]] 
## [1] "data"        "take_sample" 
##  
## [[2]] 
## [1] "data"        "take_sample" 
##  
## [[3]] 
## [1] "data"        "take_sample" 
##  
## [[4]] 
## [1] "data"        "take_sample" 

We can see that all the nodes now have data and take_sample. Next, we can let each node call take_sample():

clusterEvalQ(cl, take_sample(2)) 
## [[1]] 
##    id         x           y 
## 88 88 0.6519981  1.43142886 
## 80 80 0.7985715 -0.04409101 
##  
## [[2]] 
##    id          x          y 
## 65 65 -0.4705287 -1.0859630 
## 35 35  0.6240227 -0.3634574 
##  
## [[3]] 
##    id         x          y 
## 75 75 0.3994768 -0.1489621 
## 8   8 1.4234844  1.8903637 
##  
## [[4]] 
##    id         x         y 
## 77 77 0.4458477  1.420187 
## 9   9 0.3943990 -0.196291 

Alternatively, we can use clusterCall() and <<- to create global variables in each node, while <- only creates local variables in the function:

invisible(clusterCall(cl, function() { 
  local_var <- 10 
  global_var <<- 100 
})) 
clusterEvalQ(cl, ls()) 
## [[1]] 
## [1] "data"        "global_var"  "take_sample" 
##  
## [[2]] 
## [1] "data"        "global_var"  "take_sample" 
##  
## [[3]] 
## [1] "data"        "global_var"  "take_sample" 
##  
## [[4]] 
## [1] "data"        "global_var"  "take_sample" 

Note that clusterCall() returns the value returned by each node. In the preceding code, we use invisible() to suppress the printing of these values.

Since each cluster node starts in a fresh state, it only loads the base packages. To let each node load given packages, we can also use clusterEvalQ(). The following code lets each node attach the data.table package so that parLapply() can run, on each node, a function that uses data.table functions:

clusterExport(cl, "simulate") 
invisible(clusterEvalQ(cl, { 
  library(data.table) 
})) 
res <- parLapply(cl, 1:3, function(i) { 
  res_table <- rbindlist(lapply(1:1000, simulate)) 
  res_table[, id := NULL] 
  summary(res_table) 
}) 

A list of data summaries is returned:

res 
## [[1]] 
##      first             high             low         
##  Min.   : 99.86   Min.   : 99.95   Min.   : 84.39   
##  1st Qu.: 99.97   1st Qu.:101.44   1st Qu.: 94.20   
##  Median :100.00   Median :103.32   Median : 96.60   
##  Mean   :100.00   Mean   :103.95   Mean   : 96.04   
##  3rd Qu.:100.03   3rd Qu.:105.63   3rd Qu.: 98.40   
##  Max.   :100.17   Max.   :121.00   Max.   :100.06   
##       last          signal        
##  Min.   : 84.99   Mode :logical   
##  1st Qu.: 96.53   FALSE:911       
##  Median : 99.99   TRUE :89        
##  Mean   : 99.92   NA's :0         
##  3rd Qu.:103.11                   
##  Max.   :119.66                   
##  
## [[2]] 
##      first             high             low         
##  Min.   : 99.81   Min.   : 99.86   Min.   : 83.67   
##  1st Qu.: 99.96   1st Qu.:101.48   1st Qu.: 94.32   
##  Median :100.00   Median :103.14   Median : 96.42   
##  Mean   :100.00   Mean   :103.91   Mean   : 96.05   
##  3rd Qu.:100.04   3rd Qu.:105.76   3rd Qu.: 98.48   
##  Max.   :100.16   Max.   :119.80   Max.   :100.12   
##       last          signal        
##  Min.   : 85.81   Mode :logical   
##  1st Qu.: 96.34   FALSE:914       
##  Median : 99.69   TRUE :86        
##  Mean   : 99.87   NA's :0         
##  3rd Qu.:103.31                   
##  Max.   :119.39                   
##  
## [[3]] 
##      first             high             low         
##  Min.   : 99.84   Min.   : 99.88   Min.   : 85.88   
##  1st Qu.: 99.97   1st Qu.:101.61   1st Qu.: 94.26   
##  Median :100.00   Median :103.42   Median : 96.72   
##  Mean   :100.00   Mean   :104.05   Mean   : 96.12   
##  3rd Qu.:100.03   3rd Qu.:105.89   3rd Qu.: 98.35   
##  Max.   :100.15   Max.   :117.60   Max.   :100.03   
##       last          signal        
##  Min.   : 86.05   Mode :logical   
##  1st Qu.: 96.70   FALSE:920       
##  Median :100.16   TRUE :80        
##  Mean   :100.04   NA's :0         
##  3rd Qu.:103.24                   
##  Max.   :114.80 

When we don't need the cluster any more, we will run the following code to release it:

stopCluster(cl) 

Using parallel computing on Linux and MacOS

Using parallel computing on Linux and MacOS can be much easier than on Windows. Instead of manually creating a socket-based cluster, mclapply() directly forks the current R session into multiple R sessions, with everything preserved, and schedules tasks across the child R sessions:

system.time(res <- mclapply(1:10000, simulate,  
  mc.cores = detectCores())) 
##    user  system elapsed  
##   9.732   0.060   3.415 

Also, we don't have to export the variables, because they are immediately available in each forked process:

mclapply(1:3, take_sample, mc.cores = detectCores()) 
## [[1]] 
##    id         x          y 
## 62 62 0.1679572 -0.5948647 
##  
## [[2]] 
##    id         x           y 
## 56 56 1.5678983  0.08655707 
## 39 39 0.1015022 -1.98006684 
##  
## [[3]] 
##    id           x          y 
## 98 98  0.13892696 -0.1672610 
## 4   4  0.07533799 -0.6346651 
## 76 76 -0.57345242 -0.5234832 

Also, we can create jobs to be run in parallel with more flexibility. For example, we will create a job that generates 10 random numbers:

job1 <- mcparallel(rnorm(10), "job1") 

Once the job is created, we can collect its results with mccollect(). The function will not return until the job is finished:

mccollect(job1) 
## $`20772` 
##  [1]  1.1295953 -0.6173255  1.2859549 -0.9442054  0.1482608 
##  [6]  0.4242623  0.9463755  0.6662561  0.4313663  0.6231939 

We can also programmatically create a number of jobs to run in parallel. For example, we will create 8 jobs, each of which sleeps for a random time. Then, mccollect() won't return until all the jobs have finished sleeping. Since the jobs run in parallel, the time mccollect() takes won't be too long:

jobs <- lapply(1:8, function(i) {
  mcparallel({
    t <- rbinom(1, 5, 0.6)
    Sys.sleep(t)
    t
  }, paste0("job", i))
})
system.time(res <- mccollect(jobs))
##    user  system elapsed 
##   0.012   0.040   4.852

This allows us to customize the task-scheduling mechanism.
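For example, we could do the chunking ourselves: split the task IDs into one chunk per core, launch one forked job per chunk with mcparallel(), and collect everything at the end. The following is a minimal sketch; it relies on the simulate() function defined earlier and, like all fork-based code, works on Linux and MacOS only:

ids <- 1:10000 
chunks <- split(ids, cut(ids, detectCores(), labels = FALSE)) 
jobs <- lapply(chunks, function(chunk) { 
  mcparallel(lapply(chunk, simulate)) 
}) 
res <- do.call(c, mccollect(jobs)) 
length(res)  # should be 10000, one result per task id 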

Using Rcpp

As we mentioned, parallel computing works when each iteration is independent of the others, so that the final results do not rely on the order of execution. However, not all tasks are so ideal, and in such cases the usefulness of parallel computing is limited. What if we still want the algorithm to run fast and to interact easily with R? The answer is to write the algorithm in C++ via Rcpp (http://www.rcpp.org/).

C++ code usually runs very fast because it is compiled to native instructions and is thus much closer to the hardware than a scripting language like R. Rcpp is a package that enables seamless integration between R and C++. With Rcpp, we can write C++ code that calls R functions and takes advantage of R data structures. It allows us to write high-performance code and preserve the power of data manipulation in R at the same time.

To use Rcpp, we first need to ensure that the system is prepared for compiling native code with the right toolchain. Under Windows, Rtools is needed and can be found at https://cran.r-project.org/bin/windows/Rtools/. Under Linux and MacOS, a properly installed C/C++ toolchain is required.

Once the toolchain is properly installed, run the following code to install the package:

install.packages("Rcpp") 

Then, we will create a C++ source file at code/rcpp-demo.cpp with the following code:

#include <Rcpp.h> 
using namespace Rcpp; 
 
// [[Rcpp::export]] 
NumericVector timesTwo(NumericVector x) { 
  return x * 2; 
} 

The preceding code is written in C++. If you are not familiar with C++ syntax, you can quickly pick up the basics by going through http://www.learncpp.com/. The language design and supported features are much richer and more complex than those of R. Don't expect to become an expert in a short period of time, but getting started with the basics usually allows you to write simple algorithms.

If you read the preceding code, it looks very different from typical R code. Since C++ is a strongly typed language, we need to specify the types of the function arguments and the return type of each function. A function preceded by the comment // [[Rcpp::export]] will be picked up by Rcpp, and when we source the code in RStudio or call Rcpp::sourceCpp() directly, these C++ functions will be automatically compiled and ported to our working environment in R.

The preceding C++ function simply takes a numeric vector and returns a new numeric vector with all elements of x doubled. Note that the NumericVector class is provided by Rcpp.h, included at the beginning of the source file. In fact, Rcpp.h provides C++ proxies for all commonly used R data structures. Now, we will call Rcpp::sourceCpp() to compile and load the source file:

Rcpp::sourceCpp("code/rcpp-demo.cpp") 

The function compiles the source code, links it to the necessary shared libraries, and exposes an R function to the environment. The beauty is that all of this is done automatically, which makes writing algorithms much easier for non-professional C++ developers. Now, we have an R function to call:

timesTwo 
## function (x)  
## .Primitive(".Call")(<pointer: 0x7f81735528c0>, x) 

We can see that timesTwo in R does not look like an ordinary function, but performs a native call to the C++ function. The function works with a single numeric input:

timesTwo(10) 
## [1] 20 

It also works with a multi-element numeric vector:

timesTwo(c(1, 2, 3)) 
## [1] 2 4 6 
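As an aside, Rcpp also provides evalCpp(), which compiles and evaluates a single C++ expression. It is a quick way to check that the toolchain is set up correctly:

Rcpp::evalCpp("2 + 2") 
## [1] 4 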

Now, we can use very simple C++ language constructs to reimplement the algo1_for() algorithm in C++. We will create a C++ source file at code/rcpp-algo1.cpp with the following code:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double algo1_cpp(int n) {
  double res = 0;
  for (double i = 1; i <= n; i++) {
    res += 1 / (i * i);
  }
  return res;
}

Note that in algo1_cpp, we use only plain C++ types rather than R data structures. When we source the code, Rcpp will handle all the porting for us:

Rcpp::sourceCpp("code/rcpp-algo1.cpp") 

The function works with a single numeric input:

algo1_cpp(10) 
## [1] 1.549768 

If we supply a numeric vector with more than one element, an error will occur:

algo1_cpp(c(10, 15)) 
## Error in eval(expr, envir, enclos): expecting a single value 

Now, we can do the benchmarking again. This time, we will add algo1_cpp to the list of alternative implementations. Here, we will compare the version using a for loop in R, the byte-code compiled version of that for loop, the vectorized version, and the C++ version:

n <- 1000 
microbenchmark( 
  algo1_for(n),  
  algo1_cmp(n),  
  algo1_vec(n),  
  algo1_cpp(n)) 
## Unit: microseconds 
##          expr     min       lq      mean   median       uq 
##  algo1_for(n) 493.312 507.7220 533.41701 513.8250 531.5470 
##  algo1_cmp(n)  57.262  59.1375  61.44986  60.0160  61.1190 
##  algo1_vec(n)  10.091  10.8340  11.60346  11.3045  11.7735 
##  algo1_cpp(n)   5.493   6.0765   7.13512   6.6210   7.2775 
##      max neval cld 
##  789.799   100   c 
##  105.260   100  b  
##   23.007   100 a   
##   22.131   100 a 

It is amazing that the C++ version is even faster than the vectorized version. Although the functions used by the vectorized version are primitive functions and are already very fast, they still have some overhead due to method dispatching and argument checking. Our C++ version is specialized to the task, so it can be slightly faster than the vectorized version.
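As with the earlier examples, it is worth checking that the specialized C++ version agrees with the R implementations before trusting the benchmark. The following quick check should return TRUE, as long as the C++ loop covers i = 1 through n and thus matches algo1_for():

all.equal(algo1_cpp(1000), algo1_vec(1000)) 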

Another example is a C++ implementation of diff_for(), as the following code shows:

#include <Rcpp.h> 
using namespace Rcpp; 
 
// [[Rcpp::export]] 
NumericVector diff_cpp(NumericVector x) { 
  NumericVector res(x.size() - 1); 
  for (int i = 0; i < x.size() - 1; i++) { 
    res[i] = x[i + 1] - x[i]; 
  } 
  return res; 
} 

In the preceding C++ code, diff_cpp() takes a numeric vector and returns a numeric vector. The function simply creates a new vector and iteratively calculates and stores the differences between consecutive elements of x. Then, we will source the code file:

Rcpp::sourceCpp("code/rcpp-diff.cpp") 

It is easy to verify that the function works as expected:

diff_cpp(c(1, 2, 3, 5)) 
## [1] 1 1 2 

Then, we will do the benchmarking again with five different calls: the version using a for loop in R (diff_for), the byte-code compiled version (diff_cmp), the vectorized version (diff), the vectorized version without method dispatch (diff.default), and our C++ version (diff_cpp):

x <- rnorm(1000) 
microbenchmark( 
  diff_for(x),  
  diff_cmp(x),  
  diff(x),  
  diff.default(x),  
  diff_cpp(x)) 
## Unit: microseconds 
##             expr      min        lq       mean    median 
##      diff_for(x) 1055.177 1113.8875 1297.82994 1282.9675 
##      diff_cmp(x)   75.511   78.4210   88.46485   88.2135 
##          diff(x)   12.899   14.9340   20.64854   18.3975 
##  diff.default(x)   10.750   11.6865   13.90939   12.6400 
##      diff_cpp(x)    5.314    6.4260    8.62119    7.5330 
##         uq      max neval cld 
##  1400.8250 2930.690   100   c 
##    90.3485  179.620   100  b  
##    24.2335   65.172   100 a   
##    15.3810   25.455   100 a   
##     8.9570   54.455   100 a 

It appears that the C++ version is the fastest.

In recent years, a rapidly growing number of R packages have used Rcpp to either boost performance or directly link to popular libraries that provide high-performance algorithms. For example, RcppArmadillo and RcppEigen provide high-performance linear algebra algorithms, RcppDE provides fast implementations for global optimization by differential evolution in C++, and so on.

To learn more about Rcpp and related packages, visit its official website (http://www.rcpp.org/). I also recommend the book Seamless R and C++ Integration with Rcpp by Rcpp's author, Dirk Eddelbuettel, at http://www.rcpp.org/book/.

OpenMP

As we mentioned in the section on parallel computing, an R session runs in a single thread. However, in Rcpp code, we can use multithreading to boost the performance. One multithreading technique is OpenMP (http://openmp.org), which is supported by most modern C++ compilers (see http://openmp.org/wp/openmp-compilers/).

Several articles discuss and demonstrate the use of OpenMP with Rcpp at http://gallery.rcpp.org/tags/openmp/. Here, we will provide a simple example. We will create a C++ source file with the following code at code/rcpp-diff-openmp.cpp:

// [[Rcpp::plugins(openmp)]] 
#include <omp.h> 
#include <Rcpp.h> 
using namespace Rcpp; 
 
// [[Rcpp::export]] 
NumericVector diff_cpp_omp(NumericVector x) { 
  omp_set_num_threads(3); 
  NumericVector res(x.size() - 1); 
#pragma omp parallel for  
  for (int i = 0; i < x.size() - 1; i++) { 
    res[i] = x[i + 1] - x[i]; 
  } 
  return res; 
} 

Note that Rcpp recognizes the comment in the first line and adds the necessary compiler options so that OpenMP is enabled. To use OpenMP, we need to include omp.h. Then, we can set the number of threads by calling omp_set_num_threads(n) and use #pragma omp parallel for to indicate that the following for loop should be parallelized. If the number of threads is set to 1, the code still runs normally.
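As a design note, hard-coding omp_set_num_threads(3) is convenient for a demo, but you may prefer to control the thread count from R through the standard OMP_NUM_THREADS environment variable instead. This only takes effect if the C++ code does not override it by calling omp_set_num_threads() itself, and it is worth verifying on your platform:

Sys.setenv(OMP_NUM_THREADS = 2)  # request two OpenMP threads for subsequently loaded code 
Sys.getenv("OMP_NUM_THREADS") 
## [1] "2" 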

We will source the C++ code file:

Rcpp::sourceCpp("code/rcpp-diff-openmp.cpp") 

First, let's see whether the function works properly:

diff_cpp_omp(c(1, 2, 4, 8)) 
## [1] 1 2 4 

Then, we will start benchmarking with a 1000-number input vector:

x <- rnorm(1000) 
microbenchmark( 
  diff_for(x),  
  diff_cmp(x),  
  diff(x),  
  diff.default(x),  
  diff_cpp(x), 
  diff_cpp_omp(x)) 
## Unit: microseconds 
##             expr      min        lq       mean    median 
##      diff_for(x) 1010.367 1097.9015 1275.67358 1236.7620 
##      diff_cmp(x)   75.729   78.6645   88.20651   88.9505 
##          diff(x)   12.615   16.4200   21.13281   20.5400 
##  diff.default(x)   10.555   12.1690   16.07964   14.8210 
##      diff_cpp(x)    5.640    6.4825    8.24118    7.5400 
##  diff_cpp_omp(x)    3.505    4.4390   26.76233    5.6625 
##         uq      max neval cld 
##  1393.5430 2839.485   100   c 
##    94.3970  186.660   100  b  
##    24.4260   43.893   100 a   
##    18.4635   72.940   100 a   
##     8.6365   50.533   100 a   
##    13.9585 1430.605   100 a 

Unfortunately, even with multithreading, diff_cpp_omp() is, on average, slower than the single-threaded C++ implementation. This is because multithreading has some overhead: if the input is small, the time taken to set up multiple threads may account for a significant part of the whole computing time. However, if the input is large enough, the advantage of multithreading will exceed its cost. Here, we will use 100000 numbers as the input vector:

x <- rnorm(100000) 
microbenchmark( 
  diff_for(x),  
  diff_cmp(x),  
  diff(x),  
  diff.default(x),  
  diff_cpp(x), 
  diff_cpp_omp(x)) 
## Unit: microseconds 
##             expr        min          lq        mean 
##      diff_for(x) 112216.936 114617.4975 121631.8135 
##      diff_cmp(x)   7355.241   7440.7105   8800.0184 
##          diff(x)    863.672    897.2060   1595.9434 
##  diff.default(x)    844.186    877.4030   3451.6377 
##      diff_cpp(x)    418.207    429.3125    560.3064 
##  diff_cpp_omp(x)    125.572    149.9855    237.5871 
##      median          uq        max neval cld 
##  115284.377 116165.3140 214787.857   100   c 
##    7537.405   8439.9260 102712.582   100  b  
##    1029.642   2195.5620   8020.990   100 a   
##     931.306   2365.6920  99832.513   100 a   
##     436.638    552.5110   2165.091   100 a   
##     166.834    190.7765   1983.299   100 a 

The cost of creating multiple threads is small relative to the performance boost of using them. As a result, the version powered by OpenMP is even faster than the simple C++ version.

In fact, the feature set of OpenMP is much richer than we have demonstrated. For more details, read the official documentation. For more examples, I recommend Guide into OpenMP: Easy multithreading programming for C++ by Joel Yliluoma at http://bisqwit.iki.fi/story/howto/openmp/.

RcppParallel

Another approach to taking advantage of multi-threading with Rcpp is RcppParallel (http://rcppcore.github.io/RcppParallel/). This package includes Intel TBB (https://www.threadingbuildingblocks.org/) and TinyThread (http://tinythreadpp.bitsnbites.eu/). It provides thread-safe vector and matrix wrapper data structures as well as high-level parallel functions.

To perform multi-threading parallel computing with RcppParallel, we need to implement a Worker to handle how a slice of input is transformed to the output. Then, RcppParallel will take care of the rest of the work such as multithreading task scheduling.

Here's a short demo. We will create a C++ source file with the following code at code/rcpp-parallel.cpp. Note that we need to declare to Rcpp that the code depends on RcppParallel and uses C++11, because we use a lambda function.

Here, we will implement a Worker called Transformer that transforms each element x of a matrix to 1 / (1 + x ^ 2). Then, in par_transform, we will create an instance of Transformer and call parallelFor with it so that it automatically takes advantage of multithreading:

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppParallel)]]
#include <Rcpp.h>
#include <RcppParallel.h>

using namespace Rcpp;
using namespace RcppParallel;

struct Transformer : public Worker {
  const RMatrix<double> input;
  RMatrix<double> output;
  Transformer(const NumericMatrix input, NumericMatrix output)
    : input(input), output(output) {}
  void operator()(std::size_t begin, std::size_t end) {
    std::transform(input.begin() + begin, input.begin() + end,
      output.begin() + begin, [](double x) {
        return 1 / (1 + x * x);
      });
  }
};

// [[Rcpp::export]]
NumericMatrix par_transform(NumericMatrix x) {
  NumericMatrix output(x.nrow(), x.ncol());
  Transformer transformer(x, output);
  parallelFor(0, x.length(), transformer);
  return output;
}

We can easily verify that the function works with a small matrix:

mat <- matrix(1:12, nrow = 3) 
mat 
##      [,1] [,2] [,3] [,4] 
## [1,]    1    4    7   10 
## [2,]    2    5    8   11 
## [3,]    3    6    9   12 
par_transform(mat) 
##      [,1]       [,2]       [,3]        [,4] 
## [1,]  0.5 0.05882353 0.02000000 0.009900990 
## [2,]  0.2 0.03846154 0.01538462 0.008196721 
## [3,]  0.1 0.02702703 0.01219512 0.006896552 
all.equal(par_transform(mat), 1 / (1 + mat ^ 2)) 
## [1] TRUE 

It produces exactly the same results as the vectorized R expression. Now, we can take a look at its performance when the input matrix is very large:

mat <- matrix(rnorm(1000 * 2000), nrow = 1000) 
microbenchmark(1 / (1 + mat ^ 2), par_transform(mat)) 
## Unit: milliseconds 
##                expr      min        lq     mean    median 
##       1/(1 + mat ^ 2) 14.50142 15.588700 19.78580 15.768088 
##  par_transform(mat)  7.73545  8.654449 13.88619  9.277798 
##        uq      max neval cld 
##  18.79235 127.1912   100   b 
##  11.65137 110.6236   100  a 

It appears that the multithreaded version is almost twice as fast as the vectorized version.
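If you need to control how many threads RcppParallel uses, the package provides setThreadOptions() and defaultNumThreads(); the following is a minimal sketch, assuming the RcppParallel package is installed and attached:

library(RcppParallel) 
defaultNumThreads()               # number of threads used by default 
setThreadOptions(numThreads = 2)  # limit parallelFor() to two worker threads 
microbenchmark(par_transform(mat), times = 10) 
setThreadOptions(numThreads = defaultNumThreads())  # restore the default 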

RcppParallel is more powerful than we have demonstrated. For more detailed introduction and examples, visit http://rcppcore.github.io/RcppParallel.
