Avoiding loops in R is a generally good principle (if you are not sure about that, take a look at this a-bit-old but still great post by Revolution Analytics at http://blog.revolutionanalytics.com/2010/11/loops-in-r.html).
The main reason why these kinds of statements should be avoided is that R tends to execute loops slowly and, therefore, inefficiently.
Nevertheless, sometimes a loop really is the only way to apply a given function or operation to your set of data. In these cases, and whenever you are interested in improving your code's efficiency, implementing parallel computation can give an important boost to your code.
The basic idea behind parallel computation is quite simple and can be described in the following steps:

1. Create a cluster of workers.
2. Register the parallel backend, so that the workers are ready to receive jobs.
3. Define the data to be worked on.
4. Split the job into chunks, send them to the workers, and combine the results.
5. Stop the cluster.
On a Unix system, parallel computation can be implemented by leveraging the doParallel package, while on a Windows OS, you will rather use the doSNOW package:
install.packages("doParallel")
install.packages("doSNOW")
library(doParallel)
library(doSNOW)
cl <- makeCluster(2)
registerDoParallel(cl) # unix OS
registerDoSNOW(cl)     # windows OS
vector <- 1:100000
result <- foreach(i = seq_along(vector), .combine = rbind) %dopar% {
  vector[i] / sqrt(vector[i])^3
}
stopCluster(cl)
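To convince yourself that the parallel loop is doing the same work as an ordinary sequential one, the example above can be compared with a plain sapply() call. This is a minimal sketch, assuming the doParallel backend from the Unix example and using a smaller vector (1,000 elements) just to keep it quick:

```r
library(doParallel)  # also loads foreach and parallel

vector <- 1:1000

# sequential version: apply the function one element at a time
seq_result <- sapply(vector, function(x) x / sqrt(x)^3)

# parallel version, mirroring the example above
cl <- makeCluster(2)
registerDoParallel(cl)
par_result <- foreach(i = seq_along(vector), .combine = rbind) %dopar% {
  vector[i] / sqrt(vector[i])^3
}
stopCluster(cl)

# both approaches produce the same numbers
all.equal(as.numeric(par_result), seq_result)
```

Note that for a job this small, the sequential version will usually be faster: splitting the job and shipping chunks to the workers has a fixed overhead, which only pays off on larger computations.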
In step one, we actually initialize two workers that will receive our job chunks and work on them.
You may wonder: that's nice, but how many workers can I initialize? As many as I want?
Unfortunately, no: initializing more workers than the number of cores available on your PC (or cluster of PCs) will not result in any further improvement in efficiency.
The reason for this physical limit is that the core is the smallest unit that can perform an operation in your PC.
Therefore, the maximum number of workers will be equal to the number of available cores. To detect this number, you can leverage the detectCores() function from the parallel package:
library(parallel)  # ships with base R, no installation needed
detectCores()
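A common pattern, sketched below, is to size the cluster from detectCores() rather than hard-coding a number. The "- 1" (leaving one core free for the operating system) is a convention, not a requirement:

```r
library(parallel)
library(doParallel)

# size the cluster from the machine, leaving one core free for the OS
n_cores <- max(1, detectCores() - 1)
cl <- makeCluster(n_cores)
registerDoParallel(cl)

getDoParWorkers()  # number of workers actually registered

stopCluster(cl)
```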
If step one is a simple declaration of an object, with no effect on your machine, running registerDoParallel() will have the physical effect of initializing a parallel session. After performing step two, your workers will be waiting for their jobs.
In step four, we apply our computation to the object using parallel computation. The foreach() %dopar% {} statement is the actual piece of code that takes your complete job, separates it into smaller chunks, and sends them to the workers you created.
The resulting objects received from the workers will be combined into a cumulative entity, according to the value of the .combine argument:

- c: This will result in the chunks being combined into a vector (it could also have been applied in our example)
- rbind and cbind: These will produce a matrix output
- +: This will add the numeric outputs into one resulting number
- *: This will multiply the numeric outputs into one resulting number

Given the increasing amount of data that is available, and the consequent increase in the average size of the datasets we work with, parallel computation is quite a hot topic.
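As a small illustration of how .combine changes the shape of the result, the sketch below runs the same toy job (squaring the numbers 1 to 4, a made-up example) with three different combiners:

```r
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# combine with c: chunks are concatenated into a vector
squares_vec <- foreach(i = 1:4, .combine = c) %dopar% i^2

# combine with rbind: chunks are stacked into a matrix
squares_mat <- foreach(i = 1:4, .combine = rbind) %dopar% i^2

# combine with +: chunks are added into a single number
squares_sum <- foreach(i = 1:4, .combine = `+`) %dopar% i^2

stopCluster(cl)
```

Here squares_vec holds the four squares as a vector, squares_mat holds them as a 4-by-1 matrix, and squares_sum holds their sum as a single number.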
R provides a specific task view where you can learn about all the available tools and best practices for implementing this technique in R. You can find this task view at https://cran.r-project.org/web/views/HighPerformanceComputing.html.