Profiling code

In the previous section, you learned how to use microbenchmark() to benchmark expressions. This is useful when we have several alternative solutions to a problem and want to see which performs better, or when we optimize an expression and want to verify that its performance actually improves over the original code.

However, when we feel that the code is slow, it is usually not easy to locate the expression that contributes most to slowing down the entire program. Such an expression is called a "performance bottleneck." To improve code performance, it is best to resolve the bottleneck first.

Fortunately, R provides profiling tools to help us find the bottleneck, that is, the code that runs most slowly, which should be the top focus for improving code performance.

Profiling code with Rprof

R provides a built-in function, Rprof(), for code profiling. Once profiling starts, a sampling procedure runs alongside all subsequent code until profiling is stopped. By default, the sampler records which function R is executing every 20 milliseconds. In this way, if a function is very slow, most of the samples are likely to fall within that function call, showing that most of the execution time is spent there.

The sampling approach may not produce very accurate results, but it serves our purpose in most cases. In the following example, we will use Rprof() to profile the code in which we call my_cumsum1() and try to find which part slows down the code.

The way of using Rprof() is very simple: call Rprof() to start profiling, run the code you want to profile, call Rprof(NULL) to stop profiling, and finally call summaryRprof() to see the profiling summary:

x <- rnorm(1000) 
tmp <- tempfile(fileext = ".out") 
Rprof(tmp) 
for (i in 1:1000) { 
  my_cumsum1(x) 
} 
Rprof(NULL) 
summaryRprof(tmp) 
## $by.self 
##              self.time self.pct total.time total.pct 
## "c"               2.42    82.88       2.42     82.88 
## "my_cumsum1"      0.46    15.75       2.92    100.00 
## "+"               0.04     1.37       0.04      1.37 
## $by.total 
##              total.time total.pct self.time self.pct 
## "my_cumsum1"       2.92    100.00      0.46    15.75 
## "c"                2.42     82.88      2.42    82.88 
## "+"                0.04      1.37      0.04     1.37 
##  
## $sample.interval 
## [1] 0.02 
##  
## $sampling.time 
## [1] 2.92 

Note that we used tempfile() to create a temporary file to store the profiling data. If we don't supply a file to Rprof(), it will write to Rprof.out in the current working directory, and summaryRprof() reads that file by default as well.
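If you prefer to rely on the defaults, the same workflow works without an explicit file. The following is a minimal sketch of that usage, not taken from the example above; the interval argument, shown here only for illustration, adjusts the default 20-millisecond sampling step:

Rprof(interval = 0.01)            # write samples to "Rprof.out", sample every 10 ms
for (i in 1:1000) my_cumsum1(x)
Rprof(NULL)                       # stop profiling
summaryRprof()                    # reads "Rprof.out" by default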

The profiling results summarize the profiling data in a readable format: $by.self sorts the timings by self.time, while $by.total sorts by total.time. More specifically, the self.time of a function is the time spent executing code in that function itself, excluding the functions it calls, whereas the total.time of a function is its total execution time, including the time spent in the functions it calls.

To figure out which part slows down the function, we should pay more attention to self.time because it isolates the time each function spends doing its own work.

The preceding profiling results show that c takes up a major part of the execution time, that is, y <- c(y, sum_x) contributes most to slowing down the function.
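If the distinction between self.time and total.time is still unclear, the following sketch may help. The functions outer_fn() and slow_work() are hypothetical and introduced here only for illustration: outer_fn() merely delegates to slow_work(), so we would expect outer_fn() to show a large total.time but almost no self.time, while slow_work() accounts for most of the self.time:

slow_work <- function(n) {
  s <- 0
  for (i in seq_len(n)) s <- s + sqrt(i)   # the actual work happens here
  s
}
outer_fn <- function(n) slow_work(n)       # only delegates to slow_work()

tmp <- tempfile(fileext = ".out")
Rprof(tmp)
for (i in 1:200) outer_fn(1e5)
Rprof(NULL)
summaryRprof(tmp)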

We can do the same thing with my_cumsum2(). The profiling results suggest that most time is spent in my_cumsum2() itself, but that is normal because it is the only thing we do in the code. No particular function called inside my_cumsum2() takes up a major part of the execution time:

tmp <- tempfile(fileext = ".out") 
Rprof(tmp) 
for (i in 1:1000) { 
  my_cumsum2(x) 
} 
Rprof(NULL) 
summaryRprof(tmp) 
## $by.self 
##              self.time self.pct total.time total.pct 
## "my_cumsum2"      1.42    97.26       1.46    100.00 
## "-"               0.04     2.74       0.04      2.74 
##  
## $by.total 
##              total.time total.pct self.time self.pct 
## "my_cumsum2"       1.46    100.00      1.42    97.26 
## "-"                0.04      2.74      0.04     2.74 
##  
## $sample.interval 
## [1] 0.02 
##  
## $sampling.time 
## [1] 1.46 

In practical situations, the code we want to profile is usually much more complicated and may involve many different functions. A profiling summary like the preceding one can be less helpful if it only shows the timing of each function it tracks. Fortunately, Rprof() supports line profiling, that is, it can tell us the timing of each line of code when we specify line.profiling = TRUE and load the code with source(..., keep.source = TRUE).

We will create a script file at code/my_cumsum1.R with the following code:

my_cumsum1 <- function(x) {
  y <- numeric()
  sum_x <- 0
  for (xi in x) {
    sum_x <- sum_x + xi
    y <- c(y, sum_x)
  }
  y
}

x <- rnorm(1000)

for (i in 1:1000) {
  my_cumsum1(x)
}

Then, we will profile this script file with Rprof() and source():

tmp <- tempfile(fileext = ".out") 
Rprof(tmp, line.profiling = TRUE) 
source("code/my_cumsum1.R", keep.source = TRUE) 
Rprof(NULL) 
summaryRprof(tmp, lines = "show") 
## $by.self 
##                self.time self.pct total.time total.pct 
## my_cumsum1.R#6      2.38    88.15       2.38     88.15 
## my_cumsum1.R#5      0.26     9.63       0.26      9.63 
## my_cumsum1.R#4      0.06     2.22       0.06      2.22 
##  
## $by.total 
##                 total.time total.pct self.time self.pct 
## my_cumsum1.R#14       2.70    100.00      0.00     0.00 
## my_cumsum1.R#6        2.38     88.15      2.38    88.15 
## my_cumsum1.R#5        0.26      9.63      0.26     9.63 
## my_cumsum1.R#4        0.06      2.22      0.06     2.22 
##  
## $by.line 
##                 self.time self.pct total.time total.pct 
## my_cumsum1.R#4       0.06     2.22       0.06      2.22 
## my_cumsum1.R#5       0.26     9.63       0.26      9.63 
## my_cumsum1.R#6       2.38    88.15       2.38     88.15 
## my_cumsum1.R#14      0.00     0.00       2.70    100.00 
##  
## $sample.interval 
## [1] 0.02 
##  
## $sampling.time 
## [1] 2.7 

This time, the summary no longer shows function names but line numbers in the script file. We can easily locate the lines that cost the most time by looking at the top rows of $by.self. Here, my_cumsum1.R#6 refers to line 6 of the script, y <- c(y, sum_x), which is consistent with the previous profiling results.

Profiling code with profvis

The Rprof() function provides useful information to help us find which part of the code is too slow so that we can improve the implementation. RStudio also released an enhanced profiling tool, profvis (https://rstudio.github.io/profvis/), which provides interactive visualization for profiling R code.

It is an R package and has been integrated into RStudio. To install the package, run the following code:

install.packages("profvis") 

Once the package is installed, we can use profvis() to profile an expression and visualize the results:

library(profvis)
profvis({
  my_cumsum1 <- function(x) {
    y <- numeric()
    sum_x <- 0
    for (xi in x) {
      sum_x <- sum_x + xi
      y <- c(y, sum_x)
    }
    y
  }
  
  x <- rnorm(1000)
  
  for (i in 1:1000) {
    my_cumsum1(x)
  }
})

When the profiling is finished, a new tab will appear with an interactive user interface:

(Figure: interactive profvis profiling view for my_cumsum1())

The upper pane shows the code along with memory usage and timing, whereas the lower pane shows the timeline of function calls as well as when garbage collection occurs. We can click to select a certain line of code and see the timeline of its execution. Compared with the results produced by summaryRprof(), this interactive visualization provides much richer information about how the code is executed over time. In this way, we can easily identify slow code as well as patterns that may cause problems.
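Since profvis() returns an htmlwidget, the profile can also be stored in a variable and saved as a standalone HTML file for sharing. The following is a minimal sketch of that workflow (assuming my_cumsum1() is already defined; the file name is arbitrary):

library(profvis)
library(htmlwidgets)
p <- profvis({
  x <- rnorm(1000)
  for (i in 1:1000) my_cumsum1(x)
})
p                                          # print to open the interactive viewer
saveWidget(p, "my_cumsum1-profile.html")   # save the report as a standalone HTML file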

We can do exactly the same thing with my_cumsum2():

profvis({
  my_cumsum2 <- function(x) {
    y <- numeric(length(x))
    y[[1]] <- x[[1]]
    for (i in 2:length(x)) {
      y[[i]] <- y[[i-1]] + x[[i]]
    }
    y
  }
  
  x <- rnorm(1000)
  
  for (i in 1:1000) {
    my_cumsum2(x)
  }
})

This time, profiling produces the following statistics:

(Figure: interactive profvis profiling view for my_cumsum2())

We can easily identify which part takes the most time and decide whether that is acceptable. In any code, some part always takes the most time, but that alone does not mean it is too slow. If the code serves our purpose and the performance is acceptable, there may be no need to optimize it at the risk of turning correct code into an incorrect version.

Understanding why code can be slow

In the previous sections, you learned about tools for timing and profiling code. To solve the same problem, one function can be blazing fast while another can be ridiculously slow. It is therefore helpful to understand what can make code slow.

First, R is a dynamic programming language. By design, it provides highly flexible data structures and code-execution mechanisms. As a result, it is hard for the interpreter to know in advance how to deal with the next function call until it is actually called. This is not the case for statically typed programming languages such as C and C++, where many things are determined at compile time rather than at runtime, so the program knows a lot ahead of time and can be optimized intensively. In other words, R trades performance for flexibility, but well-written R code can still exhibit acceptable, if not good, performance.

The top reason why R code can be slow is that it may intensively create, allocate, or copy data structures. This is exactly why my_cumsum1() and my_cumsum2() show such a great difference in performance as the input gets longer. The my_cumsum1() function keeps growing the vector, which means that in each iteration the vector is copied to a new address before a new element is appended. As a result, the more iterations we have, the more elements it has to copy, and the slower the code gets.

This can be made explicit by the following benchmark: in the grow_by_index case we initialize an empty list, while in the preallocated case we initialize a list of n NULL values so that all positions are allocated up front. In both cases, we modify the ith element of the list in each iteration; the difference is that the first list has to grow in each iteration, whereas the second list is already fully allocated:

n <- 10000 
microbenchmark(grow_by_index = { 
  x <- list() 
  for (i in 1:n) x[[i]] <- i 
}, preallocated = { 
  x <- vector("list", n) 
  for (i in 1:n) x[[i]] <- i 
}, times = 20) 
## Unit: milliseconds
##           expr        min         lq       mean     median         uq        max neval cld
##  grow_by_index 258.584783 261.639465 299.781601 263.896162 351.887538 375.447134    20   b
##   preallocated   7.151352   7.222043   7.372342   7.257661   7.382103   8.612665    20  a

The results are clear: intensively growing a list can significantly slow down the code, while modifying a pre-allocated list within its range is fast. The same logic also applies to atomic vectors and matrices, as illustrated by the sketch below. Growing a data structure in R is generally slow because it triggers reallocation, that is, copying the original data structure to a new memory address, which is very expensive, especially when the data is large.
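The following sketch, analogous to the list benchmark above and assuming microbenchmark is already loaded, compares growing a numeric vector with c() against writing into a pre-allocated vector:

n <- 10000
microbenchmark(grow_vector = {
  x <- numeric()
  for (i in 1:n) x <- c(x, i)        # reallocates and copies in every iteration
}, preallocated_vector = {
  x <- numeric(n)
  for (i in 1:n) x[[i]] <- i         # writes into already-allocated memory
}, times = 20)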

However, accurate pre-allocation is not always feasible because it requires knowing the total number of elements before the iteration. Sometimes, we repeatedly store results without knowing in advance how many there will be. In this case, it is still a good idea to pre-allocate a list or vector of a reasonable length. When the iteration is over, if the number of iterations did not reach the pre-allocated length, we can take a subset of the list or vector. In this way, we avoid intensive reallocation of data structures, as shown in the sketch below.
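The following is a minimal sketch of this idea; collect_results() and its stopping condition are hypothetical and only illustrate the pattern of pre-allocating a guessed capacity, growing in larger chunks only when the guess is exceeded, and trimming the result at the end:

collect_results <- function(capacity = 1000) {
  results <- numeric(capacity)        # pre-allocate a reasonable length
  n <- 0
  repeat {
    value <- rnorm(1)                 # stand-in for a computed result
    if (abs(value) > 2.5) break       # stopping condition unknown in advance
    n <- n + 1
    if (n > length(results)) {
      # grow in one large chunk instead of once per iteration
      results <- c(results, numeric(length(results)))
    }
    results[[n]] <- value
  }
  results[seq_len(n)]                 # keep only the filled part
}
length(collect_results())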
