Implementing parallel computation in R

Avoiding explicit loops in R is a good general principle (if you are not convinced, take a look at this slightly old but still great post by Revolution Analytics at http://blog.revolutionanalytics.com/2010/11/loops-in-r.html).

The main reason this kind of statement should be avoided is that R tends to execute loops slowly and, therefore, inefficiently.

Nevertheless, sometimes these loops are the only way to apply a given function or operation to your set of data. In these cases, and whenever you are interested in improving your code's efficiency, implementing parallel computation can give an important boost to your code.

The basic idea behind parallel computation is quite easy and described in the following points:

  • Take the full job; for instance, you need to calculate the square root of one thousand numbers stored in a vector
  • Split it into n smaller chunks
  • Send each chunk to one of the n workers that you previously created on your CPU cores
  • Wait for the workers to do their job and send back their results
  • Combine the results into one object (continuing with the example, one vector storing the one thousand calculated square roots)
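The split–apply–combine idea above can be sketched with the base parallel package (which ships with R), without any add-on backend. The two-chunk split here is just an illustration of the mechanics:

```r
library(parallel)

x <- 1:1000                                    # the full job: square roots of 1000 numbers
cl <- makeCluster(2)                           # two workers

# split the vector into 2 contiguous chunks
chunks <- split(x, cut(seq_along(x), 2, labels = FALSE))

# each worker computes the square roots of its own chunk
partial <- parLapply(cl, chunks, sqrt)

# combine the partial results back into one vector
result <- unlist(partial, use.names = FALSE)

stopCluster(cl)                                # release the workers
```

This is exactly what foreach() %dopar% does for you behind the scenes, as shown in the following steps.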

Getting ready

On a Unix system, parallel computation can be implemented by leveraging the doParallel package, while on a Windows OS, you will rather use the doSNOW package:

install.packages("doParallel")
install.packages("doSNOW")
library(doParallel)
library(doSNOW)

How to do it...

  1. The first step is to create workers (the maximum number is equal to the number of cores available on your device or group of devices):
    cl = makeCluster(2)
  2. Then, register the parallel session:
    registerDoParallel(cl) #unix OS
    registerDoSNOW(cl) #windows OS
  3. After this, initialize the object that is to be worked on:
    vector <- 1:100000
  4. Apply your computation to the object with parallel computation:
    result <- foreach(i = 1:length(vector), .combine= rbind) %dopar% {
      return(vector[i]/sqrt(vector[i])^3)
    }
  5. Then, terminate the parallel session:
    stopCluster(cl)
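Putting steps 1 to 5 together, a complete, self-contained script looks like this (assuming the doParallel package is installed; the vector is shortened here so the example runs quickly):

```r
library(doParallel)               # also loads foreach and parallel

cl <- makeCluster(2)              # step 1: create two workers
registerDoParallel(cl)            # step 2: register the parallel backend

vec <- 1:1000                     # step 3: the object to be worked on

# step 4: same computation as in the recipe, combined into a vector
res <- foreach(i = seq_along(vec), .combine = c) %dopar% {
  vec[i] / sqrt(vec[i])^3
}

stopCluster(cl)                   # step 5: terminate the parallel session
```

Note that vec[i] / sqrt(vec[i])^3 simplifies to 1 / sqrt(vec[i]), so the result is simply the reciprocal square roots of the input.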

How it works...

In step one, we actually initialize two workers that are going to receive our job chunks and work on them.

You may think: that's nice, but how many workers can I initialize? In principle, as many as you like. In practice, however, initializing more workers than the number of cores on your PC (or cluster of PCs) will not result in any improvement in efficiency.

The reason for this physical limit is that a core is the smallest unit that can perform an operation on your PC.

Therefore, the maximum number of workers will be equal to the number of available cores. To detect this number, you can leverage the detectCores() function from the parallel package (which ships with base R, so no installation is needed):

parallel::detectCores()
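A common practice (an assumption on my part, not something the recipe prescribes) is to size the cluster from detectCores() while leaving one core free for the operating system and the main R session:

```r
library(parallel)

n_cores <- detectCores()                 # number of cores the system reports
cl <- makeCluster(max(1, n_cores - 1))   # leave one core free for the OS

# ... parallel work goes here ...

stopCluster(cl)
```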

While step one simply creates a cluster object, with no further effect on your machine, running registerDoParallel has the physical effect of initializing a parallel session. After performing step two, your workers will be waiting for their jobs.

In step four, we apply the computation to the object in parallel. The foreach() %dopar% {} construct is the actual piece of code that takes your complete job, separates it into smaller chunks, and sends them to the workers you created.

The resulting objects received from the workers will be combined in a cumulative entity, following the value of the .combine argument:

  • c: This will result in chunks being combined into a vector, which would also work in our example
  • rbind and cbind: These will produce a matrix output
  • +: This will add a numeric output into one resulting number
  • *: This will multiply numeric outputs into one resulting number
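The four .combine options above can be compared side by side on a tiny job (assuming the doParallel package is installed):

```r
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

v <- foreach(i = 1:4, .combine = c)     %dopar% i^2         # vector: 1 4 9 16
m <- foreach(i = 1:4, .combine = rbind) %dopar% c(i, i^2)   # 4x2 matrix
s <- foreach(i = 1:4, .combine = "+")   %dopar% i^2         # sum: 30
p <- foreach(i = 1:4, .combine = "*")   %dopar% i           # product: 24

stopCluster(cl)
```

If you omit .combine entirely, foreach returns a list of the chunk results, which you can then combine yourself.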

There's more...

Given the increasing amount of data that is available and the consequent increasing average size of datasets we work with, parallel computation is quite a hot topic.

CRAN provides a specific task view where you can learn about all the available tools and best practices for implementing this technique in R. You can find this task view at https://cran.r-project.org/web/views/HighPerformanceComputing.html.
