In the previous section, we demonstrated how to use profiling tools to identify a performance bottleneck in the code. In this section, you will learn about a number of approaches to boosting code performance.
Previously, we demonstrated the performance difference between my_cumsum1()
, my_cumsum2()
and the built-in function cumsum()
. Although my_cumsum2()
is faster than my_cumsum1()
, when the input vector contains many numbers, cumsum()
is much faster than them. Also, its performance does not decay significantly even as the input gets longer. If we evaluate cumsum
, we can see that it is a primitive function:
cumsum ## function (x) .Primitive("cumsum")
A primitive function in R is implemented in C/C++/Fortran, compiled to native instructions, and thus, is extremely efficient. Another example is diff()
. Here, we will implement computing vector difference sequence in R:
diff_for <- function(x) { n <- length(x) - 1 res <- numeric(n) for (i in seq_len(n)) { res[[i]] <- x[[i + 1]] - x[[i]] } res }
We can verify that the implementation is correct:
diff_for(c(2, 3, 1, 5)) ## [1] 1 -2 4
Therefore, both diff_for()
and built-in diff()
must return the same result for the same input:
x <- rnorm(1000) all.equal(diff_for(x), diff(x)) ## [1] TRUE
However, there's a big gap in performance between the two functions.
microbenchmark(diff_for(x), diff(x)) ## Unit: microseconds ## expr min lq mean median ## diff_for(x) 1034.028 1078.9075 1256.01075 1139.1270 ## diff(x) 12.057 14.2535 21.72772 17.5705 ## uq max neval cld ## 1372.1145 2634.128 100 b ## 25.4525 75.850 100 a
Built-in functions are, in most cases, way faster than equivalent R implementations. This is true not only for vector functions, but also for matrices. For example, here is a simple 3 by 4 integer matrix:
mat <- matrix(1:12, nrow = 3) mat ## [,1] [,2] [,3] [,4] ## [1,] 1 4 7 10 ## [2,] 2 5 8 11 ## [3,] 3 6 9 12
We can write a function to transpose the matrix:
my_transpose <- function(x) { stopifnot(is.matrix(x)) res <- matrix(vector(mode(x), length(x)), nrow = ncol(x), ncol = nrow(x), dimnames = dimnames(x)[c(2, 1)]) for (i in seq_len(ncol(x))) { for (j in seq_len(nrow(x))) { res[i, j] <- x[j, i] } } res }
In the function, we will first create a matrix of the same type as the input, but with the number and names of rows and columns exchanged, respectively. Then, we will iterate over columns and rows to transpose the matrix:
my_transpose(mat) ## [,1] [,2] [,3] ## [1,] 1 2 3 ## [2,] 4 5 6 ## [3,] 7 8 9 ## [4,] 10 11 12
The built-in function of matrix transpose is t()
. We can easily verify that both functions return the same results:
all.equal(my_transpose(mat), t(mat)) ## [1] TRUE
However, they may exhibit great difference in performance:
microbenchmark(my_transpose(mat), t(mat)) ## Unit: microseconds ## expr min lq mean median uq ## my_transpose(mat) 22.795 24.633 29.47941 26.0865 35.5055 ## t(mat) 1.576 1.978 2.87349 2.3375 2.7695 ## max neval cld ## 71.509 100 b ## 16.171 100 a
The performance difference gets even more significant when the input matrix is larger. Here, we will create a new matrix with 1000
rows and 25
columns. While the results are the same, the performance can be very different:
mat <- matrix(rnorm(25000), nrow = 1000) all.equal(my_transpose(mat), t(mat)) ## [1] TRUE microbenchmark(my_transpose(mat), t(mat)) ## Unit: microseconds ## expr min lq mean ## my_transpose(mat) 21786.241 22456.3990 24466.055 ## t(mat) 36.611 46.2045 61.047 ## median uq max neval cld ## 23821.5905 24225.142 113395.811 100 b ## 57.7505 68.694 142.126 100 a
Note that t()
is a generic function that works with both matrix and data frame. S3 dispatching to find the right method for the input, also has some overhead. Therefore, directly calling t.default()
on a matrix is slightly faster:
microbenchmark(my_transpose(mat), t(mat), t.default(mat)) ## Unit: microseconds ## expr min lq mean ## my_transpose(mat) 21773.751 22498.6420 23673.26089 ## t(mat) 37.853 48.8475 63.57713 ## t.default(mat) 35.518 41.0305 52.97680 ## median uq max neval cld ## 23848.6625 24139.7675 29034.267 100 b ## 61.3565 69.6655 140.061 100 a ## 46.3095 54.0655 146.755 100 a
All previous examples show that, in most cases, it is much better to use built-in functions if provided than reinventing the wheel in R. These functions get rid of the overhead of R code and, thus, can be extremely efficient even if the input is huge.
A special subset of built-in functions are arithmetic operators such as +
, -
, *
, /
, ^
, and %%
. These operators are not only extremely efficient but also vectorized.
Suppose we implement +
in R:
add <- function(x, y) { stopifnot(length(x) == length(y), is.numeric(x), is.numeric(y)) z <- numeric(length(x)) for (i in seq_along(x)) { z[[i]] <- x[[i]] + y[[i]] } z }
Then, we would randomly generate x
and y
. The add(x, y)
, and x + y
arguments should return exactly the same results:
x <- rnorm(10000) y <- rnorm(10000) all.equal(add(x, y), x + y) ## [1] TRUE
The following benchmarking shows that the performance difference is huge:
microbenchmark(add(x, y), x + y) ## Unit: microseconds ## expr min lq mean median ## add(x, y) 9815.495 10055.7045 11478.95003 10712.7710 ## x + y 10.260 12.0345 17.31862 13.3995 ## uq max neval cld ## 12598.366 18754.504 100 b ## 22.208 56.969 100 a
Now, suppose we need to calculate the sum of the reciprocal of first n
positive integers squared. We can easily implement the algorithm using a for
loop as the following function algo1_for
:
algo1_for <- function(n) { res <- 0 for (i in seq_len(n)) { res <- res + 1 /i ^ 2 } res }
The function takes an input n
, iterates n
times to accumulate as supposed, and returns the result.
A better approach is to use vectorized calculation directly without any necessity of a for
loop, just like how algo1_vec()
is implemented:
algo1_vec <- function(n) { sum(1 / seq_len(n) ^ 2) }
The two functions yield the same results, given an ordinary input:
algo1_for(10) ## [1] 1.549768 algo1_vec(10) ## [1] 1.549768
However, their performance is very different:
microbenchmark(algo1_for(200), algo1_vec(200)) ## Unit: microseconds ## expr min lq mean median uq ## algo1_for(200) 91.727 101.2285 104.26857 103.6445 105.632 ## algo1_vec(200) 2.465 2.8015 3.51926 3.0355 3.211 ## max neval cld ## 206.295 100 b ## 19.426 100 a microbenchmark(algo1_for(1000), algo1_vec(1000)) ## Unit: microseconds ## expr min lq mean median ## algo1_for(1000) 376.335 498.9320 516.63954 506.859 ## algo1_vec(1000) 8.718 9.1175 9.82515 9.426 ## uq max neval cld ## 519.2420 1823.502 100 b ## 9.8955 20.564 100 a
Vectorization is a highly recommended way of writing R code. It is not only of high performance but, also makes the code easier to understand.
In the previous section, we saw the power of vectorization. Sometimes, however, the problem dictates a for loop, and it is hard to vectorize the code. In this case, we may consider using R byte-code compiler to compile the function so that the function no longer needs parsing and may run faster.
First, we will load the compiler package, which is distributed along with R. We will use cmpfun()
to compile a given R function. For example, we will compile diff_for()
and store the compiled function as diff_cmp()
:
library(compiler) diff_cmp <- cmpfun(diff_for) diff_cmp ## function(x) { ## n <- length(x) - 1 ## res <- numeric(n) ## for (i in seq_len(n)) { ## res[[i]] <- x[[i + 1]] - x[[i]] ## } ## res ## } ## <bytecode: 0x93aec08>
When we look at diff_cmp()
, it does not look very different from diff_for()
, but it has an additional tag of the bytecode
address.
Then, we will run the benchmarking again with diff_cmp()
this time:
x <- rnorm(10000) microbenchmark(diff_for(x), diff_cmp(x), diff(x)) ## Unit: microseconds ## expr min lq mean median ## diff_for(x) 10664.387 10940.0840 11684.3285 11357.9330 ## diff_cmp(x) 732.110 740.7610 760.1985 751.0295 ## diff(x) 80.824 91.2775 107.8473 103.8535 ## uq max neval cld ## 12179.98 16606.291 100 c ## 763.66 1015.234 100 b ## 115.11 219.396 100 a
It looks amazing that the compiled version, diff_cmp()
, is much faster than diff_for()
even though we didn't modify anything but compiled it into bytecode.
Now, we will do the same thing with algo1_for()
:
algo1_cmp <- cmpfun(algo1_for) algo1_cmp ## function(n) { ## res <- 0 ## for (i in seq_len(n)) { ## res <- res + 1 / i ^ 2 ## } ## res ## } ## <bytecode: 0xa87e2a8>
Then, we will conduct the benchmarking with the compiled version included:
n <- 1000 microbenchmark(algo1_for(n), algo1_cmp(n), algo1_vec(n)) ## Unit: microseconds ## expr min lq mean median uq ## algo1_for(n) 490.791 499.5295 509.46589 505.7560 517.5770 ## algo1_cmp(n) 55.588 56.8355 58.10490 57.8270 58.7140 ## algo1_vec(n) 8.688 9.2150 9.79685 9.4955 9.8895 ## max neval cld ## 567.680 100 c ## 69.734 100 b ## 19.765 100 a
Again, the compiled version becomes more than six times faster than the original version, even if we didn't change a bit of code.
However, compiling is no magic if it is used to compile a fully vectorized function. Here, we will compile algo1_vec()
and compare its performance with the original version:
algo1_vec_cmp <- cmpfun(algo1_vec) microbenchmark(algo1_vec(n), algo1_vec_cmp(n), times = 10000) ## Unit: microseconds ## expr min lq mean median uq ## algo1_vec(n) 8.47 8.678 20.454858 8.812 9.008 ## algo1_vec_cmp(n) 8.35 8.560 9.701012 8.687 8.864 ## max neval cld ## 96376.483 10000 a ## 1751.431 10000 a
Note that the compiled function shows no significant performance improvement. To know more about how the compiler works, type ?compile
and read the documentation.
The R distribution we normally use is single threaded, that is, only one CPU thread is used to execute all R code. The good thing is that the execution model is simple and safe, but it does not take advantage of multicore computing.
Microsoft R Open (MRO, see https://mran.microsoft.com/open/) is an enhanced distribution of R. Powered by Intel Math Kernel Library (MKL, see https://software.intel.com/en-us/intel-mkl), MRO enhances the matrix algorithms by automatically taking advantage of multithreading computation. On a multicore computer, MRO can be 10-80 times faster than the official R implementation at matrix multiplication, Cholesky factorization, QR decomposition, singular value decomposition, principal component analysis, and linear discriminant analysis. For more details, visit https://mran.microsoft.com/documents/rro/multithread/ and see the benchmarking.
As we mentioned in the previous section, R is single threaded in design but still allows multiprocessing parallel computing, that is, running multiple R sessions to compute. This technique is supported by a parallel library, which is also distributed along with R.
Suppose we need to do a simulation: we need to generate a random path that follows a certain random process and see whether at any point, the value goes beyond a fixed margin around the starting point.
The following code generates one realization:
set.seed(1) sim_data <- 100 * cumprod(1 + rnorm(500, 0, 0.006)) plot(sim_data, type = "s", ylim = c(85, 115), main = "A simulated random path") abline(h = 100, lty = 2, col = "blue") abline(h = 100 * (1 + 0.1 * c(1, -1)), lty = 3, col = "red")
The plot generated is shown as follows:
The preceding graph shows the path and 10 percent margin. It is clear that between index 300 and 500, the value goes beyond the upper margin multiple times.
This is just one path. A valid simulation requires that the generator run as many times as necessary to produce statistically meaningful results. The following function parameterizes the random path generator and returns a list of summary indicators of interest. Note that signal
indicates whether any point on the path goes beyond the margin:
simulate <- function(i, p = 100, n = 10000, r = 0, sigma = 0.0005, margin = 0.1) { ps <- p * cumprod(1 + rnorm(n, r, sigma)) list(id = i, first = ps[[1]], high = max(ps), low = min(ps), last = ps[[n]], signal = any(ps > p * (1 + margin) | ps < p * (1 - margin))) }
Then, we can run the generator for one time and see its summarized result:
simulate(1) ## $id ## [1] 1 ## ## $first ## [1] 100.0039 ## ## $high ## [1] 101.4578 ## ## $low ## [1] 94.15108 ## ## $last ## [1] 96.13973 ## ## $signal ## [1] FALSE
To perform the simulation, we need to run the function many times. In practice, we may need to run at least millions of realizations, which may take us a considerable amount of time. Here, we will measure how much time it costs to run ten thousand iterations of this simulation:
system.time(res <- lapply(1:10000, simulate)) ## user system elapsed ## 8.768 0.000 8.768
When the simulation is finished, we can convert all results into one data table:
library(data.table) res_table <- rbindlist(res) head(res_table) ## id first high low last signal ## 1: 1 100.03526 100.7157 93.80330 100.55324 FALSE ## 2: 2 100.03014 104.7150 98.85049 101.97831 FALSE ## 3: 3 99.99356 104.9834 95.28500 95.59243 FALSE ## 4: 4 99.93058 103.4315 96.10691 97.22223 FALSE ## 5: 5 99.99785 100.6041 94.12958 95.97975 FALSE ## 6: 6 100.03235 102.1770 94.65729 96.49873 FALSE
We can calculate the realized probability of signal == TRUE
:
res_table[, sum(signal) /.N] ## [1] 0.0881
What if the problem gets more practical and requires us to run millions of times? In this case, some researchers may turn to programming languages implemented with much higher performance such as C and C++, which are extremely efficient and flexible. They are great tools in implementing algorithms but require more effort to deal with the compiler, linker, and data input/output.
Note that each iteration in the preceding simulation is completely independent of each other, so it is better accomplished by parallel computing.
Since different operating systems have different implementations of process and threading model, some features that are available for Linux and MacOS are not available for Windows. Thus, performing parallel computing on Windows can be a bit more verbose.
On Windows, we need to create a local cluster of multiple R sessions to run parallel computing:
library(parallel) cl <- makeCluster(detectCores())
The detectCores()
function returns the number of cores your computer is equipped with. Creating a cluster of more than that number of nodes is allowed but usually does no good because your computer cannot perform more tasks than that simultaneously.
Then, we can call parLapply()
, the parallel version of lapply()
:
system.time(res <- parLapply(cl, 1:10000, simulate)) ## user system elapsed ## 0.024 0.008 3.772
Note that the time consumed is reduced to more than half of the original time. Now, we no longer need the cluster. We can call stopCluster()
to kill the R sessions just created:
stopCluster(cl)
When we call parLapply()
, it automatically schedules the task for each cluster node. More specifically, all cluster nodes run simulate()
with one of 1:10000
exclusively at the same time so that the computation is done in parallel. Finally, all results are collected so that we get a list just like the results from lapply()
:
length(res) ## [1] 10000 res_table <- rbindlist(res) res_table[, sum(signal) /.N] ## [1] 0.0889
The parallel code looks simple because simulate()
is self-contained and does not rely on user-defined external variables or datasets. If we run a function in parallel that refers to a variable in the master session (the current session that creates the cluster), it will not find the variable:
cl <- makeCluster(detectCores()) n <- 1 parLapply(cl, 1:3, function(x) x + n) ## Error in checkForRemoteErrors(val): 3 nodes produced errors; first error: object 'n' not found stopCluster(cl)
All nodes fail because each of them starts as a fresh R session with no user variables defined. To let the cluster nodes get the value of the variable they need, we have to export them to all nodes.
The following example demonstrates how this works. Suppose we have a data frame of numbers. We want to take random samples from the data frame:
n <- 100 data <- data.frame(id = 1:n, x = rnorm(n), y = rnorm(n)) take_sample <- function(n) { data[sample(seq_len(nrow(data)), size = n, replace = FALSE), ] }
If we perform the sampling in parallel, all nodes must share the data frame and the function. To do this, we can use clusterEvalQ()
to evaluate an expression on each cluster node. First, we will make a cluster just as we did earlier:
cl <- makeCluster(detectCores())
The Sys.getpid()
function returns the process ID of the current R session. Since there are four nodes in the cluster, each is an R session with a unique process ID. We can call clusterEvalQ()
with Sys.getpid()
and see the process ID of each node:
clusterEvalQ(cl, Sys.getpid()) ## [[1]] ## [1] 20714 ## ## [[2]] ## [1] 20723 ## ## [[3]] ## [1] 20732 ## ## [[4]] ## [1] 20741
To see the variables in the global environment of each node, we can call ls()
, just like we call in our own working environment:
clusterEvalQ(cl, ls()) ## [[1]] ## character(0) ## ## [[2]] ## character(0) ## ## [[3]] ## character(0) ## ## [[4]] ## character(0)
As we mentioned, all cluster nodes are, by default, initialized with an empty global environment. To export data
and take_sample
to each node, we can call clusterExport()
:
clusterExport(cl, c("data", "take_sample")) clusterEvalQ(cl, ls()) ## [[1]] ## [1] "data" "take_sample" ## ## [[2]] ## [1] "data" "take_sample" ## ## [[3]] ## [1] "data" "take_sample" ## ## [[4]] ## [1] "data" "take_sample"
Now, we can see that all nodes have data
and take_sample
. Now, we can let each node call take_sample()
:
clusterEvalQ(cl, take_sample(2)) ## [[1]] ## id x y ## 88 88 0.6519981 1.43142886 ## 80 80 0.7985715 -0.04409101 ## ## [[2]] ## id x y ## 65 65 -0.4705287 -1.0859630 ## 35 35 0.6240227 -0.3634574 ## ## [[3]] ## id x y ## 75 75 0.3994768 -0.1489621 ## 8 8 1.4234844 1.8903637 ## ## [[4]] ## id x y ## 77 77 0.4458477 1.420187 ## 9 9 0.3943990 -0.196291
Alternatively, we can use clusterCall()
and <<-
to create global variables in each node, while <-
only creates local variables in the function:
invisible(clusterCall(cl, function() { local_var <- 10 global_var <<- 100 })) clusterEvalQ(cl, ls()) ## [[1]] ## [1] "data" "global_var" "take_sample" ## ## [[2]] ## [1] "data" "global_var" "take_sample" ## ## [[3]] ## [1] "data" "global_var" "take_sample" ## ## [[4]] ## [1] "data" "global_var" "take_sample"
Note that clusterCall()
returns the returned value from each node. In the preceding code, we will use invisible()
to suppress the values they return.
Since each cluster node is started in a fresh state, they only load basic packages. To let each node load the given packages, we can also use clusterEvalQ()
. The following code lets each node attach the data.table
package so that parLapply()
can run a function in which data.table
functions are used on each node:
clusterExport(cl, "simulate") invisible(clusterEvalQ(cl, { library(data.table) })) res <- parLapply(cl, 1:3, function(i) { res_table <- rbindlist(lapply(1:1000, simulate)) res_table[, id := NULL] summary(res_table) })
A list of data summary is returned:
res ## [[1]] ## first high low ## Min. : 99.86 Min. : 99.95 Min. : 84.39 ## 1st Qu.: 99.97 1st Qu.:101.44 1st Qu.: 94.20 ## Median :100.00 Median :103.32 Median : 96.60 ## Mean :100.00 Mean :103.95 Mean : 96.04 ## 3rd Qu.:100.03 3rd Qu.:105.63 3rd Qu.: 98.40 ## Max. :100.17 Max. :121.00 Max. :100.06 ## last signal ## Min. : 84.99 Mode :logical ## 1st Qu.: 96.53 FALSE:911 ## Median : 99.99 TRUE :89 ## Mean : 99.92 NA's :0 ## 3rd Qu.:103.11 ## Max. :119.66 ## ## [[2]] ## first high low ## Min. : 99.81 Min. : 99.86 Min. : 83.67 ## 1st Qu.: 99.96 1st Qu.:101.48 1st Qu.: 94.32 ## Median :100.00 Median :103.14 Median : 96.42 ## Mean :100.00 Mean :103.91 Mean : 96.05 ## 3rd Qu.:100.04 3rd Qu.:105.76 3rd Qu.: 98.48 ## Max. :100.16 Max. :119.80 Max. :100.12 ## last signal ## Min. : 85.81 Mode :logical ## 1st Qu.: 96.34 FALSE:914 ## Median : 99.69 TRUE :86 ## Mean : 99.87 NA's :0 ## 3rd Qu.:103.31 ## Max. :119.39 ## ## [[3]] ## first high low ## Min. : 99.84 Min. : 99.88 Min. : 85.88 ## 1st Qu.: 99.97 1st Qu.:101.61 1st Qu.: 94.26 ## Median :100.00 Median :103.42 Median : 96.72 ## Mean :100.00 Mean :104.05 Mean : 96.12 ## 3rd Qu.:100.03 3rd Qu.:105.89 3rd Qu.: 98.35 ## Max. :100.15 Max. :117.60 Max. :100.03 ## last signal ## Min. : 86.05 Mode :logical ## 1st Qu.: 96.70 FALSE:920 ## Median :100.16 TRUE :80 ## Mean :100.04 NA's :0 ## 3rd Qu.:103.24 ## Max. :114.80
When we don't need the cluster any more, we will run the following code to release it:
stopCluster(cl)
Using parallel computing on Linux and MacOS can be much easier than on Windows. Without having to manually create a socket-based cluster, mclapply()
directly forks the current R session into multiple R sessions, with everything preserved to continue running in parallel and schedule tasks for each child R session:
system.time(res <- mclapply(1:10000, simulate, mc.cores = detectCores())) ## user system elapsed ## 9.732 0.060 3.415
Therefore, we don't have to export the variables because they are immediately available in each fork process:
mclapply(1:3, take_sample, mc.cores = detectCores()) ## [[1]] ## id x y ## 62 62 0.1679572 -0.5948647 ## ## [[2]] ## id x y ## 56 56 1.5678983 0.08655707 ## 39 39 0.1015022 -1.98006684 ## ## [[3]] ## id x y ## 98 98 0.13892696 -0.1672610 ## 4 4 0.07533799 -0.6346651 ## 76 76 -0.57345242 -0.5234832
Also, we can create jobs to be done in parallel with much flexibility. For example, we will create a job that generates 10
random numbers:
job1 <- mcparallel(rnorm(10), "job1")
As long as the job is created, we can choose to collect the results from the job with mccollect()
. Then, the function will not return until the job is finished:
mccollect(job1) ## $`20772` ## [1] 1.1295953 -0.6173255 1.2859549 -0.9442054 0.1482608 ## [6] 0.4242623 0.9463755 0.6662561 0.4313663 0.6231939
We can also programmatically create a number of jobs to run in parallel. For example, we create 8
jobs, and each sleeps for a random time. Then, mccollect()
won't return until all jobs are finished sleeping. Since the jobs are run in parallel, the time mccollect()
takes won't be too long:
jobs <- lapply(1:8, function(i) { mcparallel({ t <- rbinom(1, 5, 0.6) Sys.sleep(t) t }, paste0("job", i)) }) system.time(res <- mccollect(jobs)) ## user system elapsed ## 0.012 0.040 4.852
This allows us to customize the task-scheduling mechanism.
As we mentioned, parallel computing works when each iteration is independent so that the final results do not rely on the order of execution. However, not all tasks are so ideal like this. Therefore, the use of parallel computing may be undermined. What if we really want the algorithm to run fast and easily interact with R? The answer is by writing the algorithm in C++ via Rcpp (http://www.rcpp.org/).
C++ code usually runs very fast, because it is compiled to native instructions and is thus much closer to hardware level than a scripting language like R. Rcpp is a package that enables us to write C++ code with seamless R and C++ integration. With Rcpp, we can write C++ code in which we can call R functions and take advantage of R data structures. It allows us to write high-performance code and preserve the power of data manipulation in R at the same time.
To use Rcpp, we first need to ensure that the system is prepared for computing native code with the right toolchain. Under Windows, Rtools is needed and can be found at https://cran.r-project.org/bin/windows/Rtools/. Under Linux and MacOS, a properly installed C/C++ toolchain is required.
Once the toolchain is properly installed, run the following code to install the package:
install.packages("Rcpp")
Then, we will create a C++ source file at code/rcpp-demo.cpp
with the following code:
#include <Rcpp.h> usingnamespace Rcpp; // [[Rcpp::export]] NumericVector timesTwo(NumericVector x) { return x * 2; }
The preceding code is written in C++. If you are not familiar with C++ syntax, you can quickly pick up the simplest part by going through http://www.learncpp.com/. The language design and supported features are much richer and more complex than R. Don't expect to be an expert in a short period of time, but getting started with the basics usually allows you to write simple algorithms.
If you read the preceding code, it looks very different from typical R code. Since C++ is a strong-typed language, we need to specify the types of function arguments and the return type of functions. A function that is commented with [[Rcpp::export]]
will be captured by Rcpp, and when we source the code in RStudio or use Rcpp::sourceCpp
directly, these C++ functions will be automatically compiled and ported to our working environment in R.
The preceding C++ function simply takes a numeric vector and returns a new numeric vector with all x
elements doubled. Note that the NumericVector
class is provided by Rcpp.h
included at the beginning of the source file. In fact, Rcpp.h
provides the C++ proxy of all commonly used R data structures. Now, we will call Rcpp::sourceCpp()
to compile and load the source file:
Rcpp::sourceCpp("code/rcpp-demo.cpp")
The function compiles the source code, links it to necessary shared libraries, and exposes an R function to the environment. The beauty is that all of these are done automatically, which makes it much easier to write algorithms for non-professional C++ developers. Now, we have an R function to call it:
timesTwo ## function (x) ## .Primitive(".Call")(<pointer: 0x7f81735528c0>, x)
We can see that timeTwo
in R does not look like an ordinary function, but performs a native call to the C++ function. The function works with single numeric input:
timesTwo(10) ## [1] 20
It also works with a multi-element numeric vector:
timesTwo(c(1, 2, 3)) ## [1] 2 4 6
Now, we can use very simple C++ language constructs to reimplement the algo1_for
algorithm in C++. Now, we will create a C++ source file at code/rcpp-algo1.cpp
with the following code:
#include <Rcpp.h> using namespace Rcpp; // [[Rcpp::export]] double algo1_cpp(int n) { double res = 0; for (double i = 1; i < n; i++) { res += 1 / (i * i); } return res; }
Note that we don't use any R but C++ data structures in algo1_cpp
. When we source the code, Rcpp will handle all the porting for us:
Rcpp::sourceCpp("code/rcpp-algo1.cpp")
The function works with a single numeric input:
algo1_cpp(10) ## [1] 1.539768
If we supply a numeric vector, an error will occur:
algo1_cpp(c(10, 15)) ## Error in eval(expr, envir, enclos): expecting a single value
Now, we can do the benchmarking again. This time, we will add algo1_cpp
to the list of alternative implementations. Here, we will compare the version using a for
loop in R, the byte-code compiled version using for loop in R, the vectorized version, and the C++ version:
n <- 1000 microbenchmark( algo1_for(n), algo1_cmp(n), algo1_vec(n), algo1_cpp(n)) ## Unit: microseconds ## expr min lq mean median uq ## algo1_for(n) 493.312 507.7220 533.41701 513.8250 531.5470 ## algo1_cmp(n) 57.262 59.1375 61.44986 60.0160 61.1190 ## algo1_vec(n) 10.091 10.8340 11.60346 11.3045 11.7735 ## algo1_cpp(n) 5.493 6.0765 7.13512 6.6210 7.2775 ## max neval cld ## 789.799 100 c ## 105.260 100 b ## 23.007 100 a ## 22.131 100 a
It is amazing that the C++ version is even faster than the vectorized version. Although the functions used by the vectorized version are primitive functions and are already very fast, they still have some overhead due to method dispatching and argument checking. Our C++ version is specialized to the task, so it can be slightly faster than the vectorized version.
Another example is the C++ implementation of diff_for()
as the following code shows:
#include <Rcpp.h> usingnamespace Rcpp; // [[Rcpp::export]] NumericVector diff_cpp(NumericVector x) { NumericVector res(x.size() - 1); for (int i = 0; i < x.size() - 1; i++) { res[i] = x[i + 1] - x[i]; } return res; }
In the preceding C++ code, diff_cpp()
takes a numeric vector and returns a numeric vector. The function simply creates a new vector, and calculates and stores the differences between the consecutive two elements in x
iteratively. Then, we will source the code file:
Rcpp::sourceCpp("code/rcpp-diff.cpp")
It is easy to verify whether the function works as supposed:
diff_cpp(c(1, 2, 3, 5)) ## [1] 1 1 2
Then, we will do the benchmarking again with five different calls: the version using a for loop in R (diff_for
), the byte-code compiled version (diff_cmp
), the vectorized version (diff
), the vectorized version without method dispatch (diff.default
), and our C++ version (diff_cpp
):
x <- rnorm(1000) microbenchmark( diff_for(x), diff_cmp(x), diff(x), diff.default(x), diff_cpp(x)) ## Unit: microseconds ## expr min lq mean median ## diff_for(x) 1055.177 1113.8875 1297.82994 1282.9675 ## diff_cmp(x) 75.511 78.4210 88.46485 88.2135 ## diff(x) 12.899 14.9340 20.64854 18.3975 ## diff.default(x) 10.750 11.6865 13.90939 12.6400 ## diff_cpp(x) 5.314 6.4260 8.62119 7.5330 ## uq max neval cld ## 1400.8250 2930.690 100 c ## 90.3485 179.620 100 b ## 24.2335 65.172 100 a ## 15.3810 25.455 100 a ## 8.9570 54.455 100 a
It appears that the C++ version is the fastest.
In recent years, a rapidly growing number of R packages have used Rcpp to either boost performance or directly link to popular libraries that provide high-performance algorithms. For example, RcppArmadillo and RcppEigen provide high-performance linear algebra algorithms, RcppDE provides fast implementations for global optimization by differential evolution in C++, and so on.
To know more about Rcpp and related packages, visit its official website (http://www.rcpp.org/). I also recommend the bookSeamless R and C++ Integration with Rcpp by Rcpp's author Dirk Eddelbuettel at http://www.rcpp.org/book/.
As we mentioned in the section on parallel computing, an R session runs in a single thread. However, in Rcpp code, we can use multithreading to boost the performance. One multithreading technique is OpenMP (http://openmp.org), which is supported by most modern C++ compilers (see http://openmp.org/wp/openmp-compilers/).
Several articles discuss and demonstrate the use of OpenMP with Rcpp at http://gallery.rcpp.org/tags/openmp/. Here, we will provide a simple example. We will create a C++ source file with the following code at code/rcpp-diff-openmp.cpp
:
// [[Rcpp::plugins(openmp)]] #include <omp.h> #include <Rcpp.h> usingnamespace Rcpp; // [[Rcpp::export]] NumericVector diff_cpp_omp(NumericVector x) { omp_set_num_threads(3); NumericVector res(x.size() - 1); #pragma omp parallel for for (int i = 0; i < x.size() - 1; i++) { res[i] = x[i + 1] - x[i]; } return res; }
Note that Rcpp will recognize the comment in the first line and add necessary options to the compiler so that OpenMP is enabled. To use OpenMP, we need to include omp.h
. Then, we can set the number of threads by calling omp_set_num_threads(n)
and use #pragma omp parallel for
to indicate that the following for loop should be parallelized. If the number of threads is set to 1
, then the code also runs normally.
We will source the C++ code file:
Rcpp::sourceCpp("code/rcpp-diff-openmp.cpp")
First, let's see whether the function works properly:
diff_cpp_omp(c(1, 2, 4, 8)) ## [1] 1 2 4
Then, we will start benchmarking with a 1000-number input vector:
x <- rnorm(1000) microbenchmark( diff_for(x), diff_cmp(x), diff(x), diff.default(x), diff_cpp(x), diff_cpp_omp(x)) ## Unit: microseconds ## expr min lq mean median ## diff_for(x) 1010.367 1097.9015 1275.67358 1236.7620 ## diff_cmp(x) 75.729 78.6645 88.20651 88.9505 ## diff(x) 12.615 16.4200 21.13281 20.5400 ## diff.default(x) 10.555 12.1690 16.07964 14.8210 ## diff_cpp(x) 5.640 6.4825 8.24118 7.5400 ## diff_cpp_omp(x) 3.505 4.4390 26.76233 5.6625 ## uq max neval cld ## 1393.5430 2839.485 100 c ## 94.3970 186.660 100 b ## 24.4260 43.893 100 a ## 18.4635 72.940 100 a ## 8.6365 50.533 100 a ## 13.9585 1430.605 100 a
Unfortunately, even with multi-threading, diff_cpp_omp()
is slower than its single-threaded C++ implementation. This is because using multithreading has some overhead. If the input is small, the time to initialize multiple threads may take a significant part of the whole computing time. However, if the input is large enough, the advantage of multi-threading will exceed its cost. Here, we will use 100000
numbers as the input vector:
x <- rnorm(100000) microbenchmark( diff_for(x), diff_cmp(x), diff(x), diff.default(x), diff_cpp(x), diff_cpp_omp(x)) ## Unit: microseconds ## expr min lq mean ## diff_for(x) 112216.936 114617.4975 121631.8135 ## diff_cmp(x) 7355.241 7440.7105 8800.0184 ## diff(x) 863.672 897.2060 1595.9434 ## diff.default(x) 844.186 877.4030 3451.6377 ## diff_cpp(x) 418.207 429.3125 560.3064 ## diff_cpp_omp(x) 125.572 149.9855 237.5871 ## median uq max neval cld ## 115284.377 116165.3140 214787.857 100 c ## 7537.405 8439.9260 102712.582 100 b ## 1029.642 2195.5620 8020.990 100 a ## 931.306 2365.6920 99832.513 100 a ## 436.638 552.5110 2165.091 100 a ## 166.834 190.7765 1983.299 100 a
The cost of creating multiple threads is small relative to the performance boost of using them. As a result, the version powered by OpenMP is even faster than the simple C++ version.
In fact, the feature set of OpenMP is much richer than we have demonstrated. For more details, read the official documentation. For more examples, I recommend Guide into OpenMP: Easy multithreading programming for C++ by Joel Yliluoma at http://bisqwit.iki.fi/story/howto/openmp/.
Another approach to taking advantage of multi-threading with Rcpp is RcppParallel (http://rcppcore.github.io/RcppParallel/). This package includes Intel TBB (https://www.threadingbuildingblocks.org/) and TinyThread (http://tinythreadpp.bitsnbites.eu/). It provides thread-safe vector and matrix wrapper data structures as well as high-level parallel functions.
To perform multi-threading parallel computing with RcppParallel, we need to implement a Worker
to handle how a slice of input is transformed to the output. Then, RcppParallel will take care of the rest of the work such as multithreading task scheduling.
Here's a short demo. We will create a C++ source file with the following code at code/rcpp-parallel.cpp
. Note that we need to declare to Rcpp that it depends on RcppParallel and uses C++ 11 for using lambda function.
Here, we will implement a Worker
called Transformer
that transforms each element x
of a matrix to 1 / (1 + x ^ 2)
. Then, in par_transform
, we will create an instance of Transformer
and call parallelFor
with it so that it automatically takes advantage of multithreading:
// [[Rcpp::plugins(cpp11)]] // [[Rcpp::depends(RcppParallel)]] #include <Rcpp.h> #include <RcppParallel.h> using namespace Rcpp; using namespace RcppParallel; struct Transformer : public Worker { const RMatrix<double> input; RMatrix<double> output; Transformer(const NumericMatrix input, NumericMatrix output) : input(input), output(output) {} void operator()(std::size_t begin, std::size_t end) { std::transform(input.begin() + begin, input.begin() + end, output.begin() + begin, [](double x) { return 1 / (1 + x * x); }); } }; // [[Rcpp::export]] NumericMatrix par_transform (NumericMatrix x) { NumericMatrix output(x.nrow(), x.ncol()); Transformer transformer(x, output); parallelFor(0, x.length(), transformer); return output; }
We can easily verify that the function works with a small matrix:
mat <- matrix(1:12, nrow = 3) mat ## [,1] [,2] [,3] [,4] ## [1,] 1 4 7 10 ## [2,] 2 5 8 11 ## [3,] 3 6 9 12 par_transform(mat) ## [,1] [,2] [,3] [,4] ## [1,] 0.5 0.05882353 0.02000000 0.009900990 ## [2,] 0.2 0.03846154 0.01538462 0.008196721 ## [3,] 0.1 0.02702703 0.01219512 0.006896552 all.equal(par_transform(mat), 1 /(1 + mat ^ 2)) ## [1] TRUE
It produces exactly the same results as the vectorized R expression. Now, we can take a look at its performance when the input matrix is very large:
mat <- matrix(rnorm(1000 * 2000), nrow = 1000) microbenchmark(1 /(1 + mat ^ 2), par_transform(mat)) ## Unit: milliseconds ## expr min lq mean median ## 1/(1 + mat ^ 2) 14.50142 15.588700 19.78580 15.768088 ## par_transform(mat) 7.73545 8.654449 13.88619 9.277798 ## uq max neval cld ## 18.79235 127.1912 100 b ## 11.65137 110.6236 100 a
It appears that the multi-threading version is almost 1x faster than the vectorized version.
RcppParallel is more powerful than we have demonstrated. For more detailed introduction and examples, visit http://rcppcore.github.io/RcppParallel.