Other R packages for large-scale machine learning

Apart from RHadoop and SparkR, there are several other native R packages specifically built for large-scale machine learning. Here, we give a brief overview of them. Interested readers should refer to CRAN Task View: High-Performance and Parallel Computing with R (reference 10 in the References section of the chapter).

Though R itself is single-threaded, there exist several packages for parallel computation in R. Some of the well-known packages are Rmpi (an R interface to the popular Message Passing Interface), multicore, snow (for building R clusters), and foreach. Since R 2.14.0, a package called parallel has shipped with base R. We will discuss some of its features here.

The parallel R package

The parallel package is built on top of the multicore and snow packages. It is useful for running the same computation on multiple datasets, as in K-fold cross-validation. It can parallelize over multiple CPUs/cores on a single machine or across several machines. For parallelizing across a cluster of machines, it can invoke MPI (Message Passing Interface) through the Rmpi package.

We will illustrate the use of the parallel package with the simple example of computing the squares of the numbers in the list 1:100000. This example will not run in parallel on Windows, since the mclapply function relies on process forking (inherited from the multicore package), which Windows does not support. It can be tested on any Linux or OS X platform.

The sequential way of performing this operation is to use the lapply function as follows:

>nsquare <- function(n){return(n^2)} #n^2 returns a double; n*n overflows for integers above 46340
>range <- c(1:100000)
>system.time(lapply(range,nsquare))

Using the mclapply function of the parallel package, the same computation can be spread over several cores, which can reduce the elapsed time:

>library(parallel) #included in R core packages, no separate installation required
>numCores<-detectCores( )  #to find the number of cores in the machine
>system.time(mclapply(range,nsquare,mc.cores=numCores))
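
Where forking is unavailable (for example, on Windows), a socket cluster on the local machine is one alternative. The following is a minimal sketch rather than a tuned setup: the choice of two workers and the variable names values and squares are assumptions made for illustration.

```r
library(parallel)  # ships with base R

nsquare <- function(n) n^2   # n^2 returns a double, so no integer overflow
values  <- 1:100000

# Start two local worker processes (an arbitrary number for this sketch).
# The default cluster type is a socket ("PSOCK") cluster.
cl <- makeCluster(2)

# parLapply splits `values` across the workers and returns a list
# whose elements are in the same order as the input
squares <- parLapply(cl, values, nsquare)

stopCluster(cl)  # release the worker processes

head(unlist(squares))
```

Starting the workers has a fixed cost, so for a function as cheap as squaring, the parallel version is not guaranteed to beat lapply. The benefit grows with the cost of each individual task.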

If the dataset is so large that it needs a cluster of computers, we can use the parLapply function to run the program over a cluster. This needs the Rmpi package:

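>install.packages("Rmpi") #one time; MPI clusters also need the snow package
>library(Rmpi)
>numNodes<-4 #number of worker nodes
>cl<-makeCluster(numNodes,type="MPI")
>system.time(parLapply(cl,range,nsquare))
>stopCluster(cl); mpi.exit( )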
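
The K-fold cross-validation use case mentioned at the start of this section can be sketched with mclapply as follows. The built-in mtcars dataset, the linear model mpg ~ wt, the number of folds, and the helper name cv_fold are all assumptions chosen only to keep the example self-contained.

```r
library(parallel)

set.seed(42)   # reproducible fold assignment
k <- 5         # number of folds (arbitrary for this sketch)
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# Fit on k-1 folds and return the mean squared error on the held-out fold
cv_fold <- function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt, data = train)
  mean((test$mpg - predict(fit, newdata = test))^2)
}

# mclapply forks the R process, so fall back to a single core on Windows
cores <- if (.Platform$OS.type == "windows") 1L else 2L
fold_mse <- unlist(mclapply(1:k, cv_fold, mc.cores = cores))

mean(fold_mse)  # cross-validated estimate of the prediction error
```

Each fold is an independent task, which is why this kind of computation parallelizes with almost no change to the sequential code.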

The foreach R package

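The foreach package provides a looping construct for R that can be executed in parallel across multiple cores or clusters. It has two important operators: %do% for executing tasks sequentially and %dopar% for executing tasks in parallel. The %dopar% operator needs a registered parallel backend, such as the one provided by the doParallel package; without one, it runs sequentially and issues a warning.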

For example, the squaring function we discussed in the previous section can be implemented in a single line with the foreach package:

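>install.packages("foreach") #one time
>install.packages("doParallel") #one time
>library(foreach)
>library(doParallel)
>registerDoParallel(cores=2) #register a parallel backend for %dopar%
>system.time(foreach(i=1:100000)   %do%  i^2) #for executing sequentially
>system.time(foreach(i=1:100000)   %dopar%  i^2) #for executing in parallel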
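
By default, foreach returns its results as a list. The .combine argument changes how the per-iteration results are merged. The sketch below assumes that the foreach and doParallel packages are installed; the two-worker cluster and the variable name squares are illustrative choices.

```r
library(foreach)
library(doParallel)

# Register two worker processes (an arbitrary number for this sketch)
cl <- makeCluster(2)
registerDoParallel(cl)

# .combine = c concatenates the per-iteration results into one vector
squares <- foreach(i = 1:1000, .combine = c) %dopar% i^2

stopCluster(cl)
registerDoSEQ()  # switch back to the sequential backend

squares[1:5]
```

Creating the cluster explicitly and stopping it afterwards makes it clear when the worker processes exist, which is convenient in longer scripts.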

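We will also show an example of quicksort using the foreach function:

>qsort<- function(x) {
  n <- length(x)
  if (n == 0) {
    x  #nothing to sort
  } else {
    p <- sample(n,1)  #index of a randomly chosen pivot
    smaller <- foreach(y=x[-p],.combine=c) %:% when(y <= x[p]) %do% y
    larger  <- foreach(y=x[-p],.combine=c) %:% when(y >  x[p]) %do% y
    c(qsort(smaller), x[p], qsort(larger))  #sort both parts recursively
  }
}
>qsort(runif(12))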

These packages are still under active development, and they have not yet been widely used for Bayesian modeling. However, they are easy to apply to Bayesian inference tasks such as Monte Carlo simulations, where many independent random draws can be generated in parallel.
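
As a minimal sketch of a parallel Monte Carlo computation, the following code draws samples from a Beta posterior on two local workers and estimates posterior summaries from the pooled draws. The data (7 successes in 20 trials), the uniform Beta(1, 1) prior, the seed, and the helper name draw are hypothetical and chosen only for illustration.

```r
library(parallel)

# Hypothetical data: 7 successes in 20 Bernoulli trials.
# With a uniform Beta(1, 1) prior, the posterior is Beta(1 + 7, 1 + 13).
successes <- 7
trials    <- 20

cl <- makeCluster(2)                  # two local workers (arbitrary)
clusterSetRNGStream(cl, iseed = 123)  # independent, reproducible random streams

# Each of the 4 tasks draws 25000 samples from the posterior
draw <- function(task, n, a, b) rbeta(n, a, b)
chunks <- parLapply(cl, 1:4, draw, n = 25000,
                    a = 1 + successes, b = 1 + trials - successes)
stopCluster(cl)

theta <- unlist(chunks)
mean(theta)        # Monte Carlo estimate of the posterior mean (exact value: 8/22)
mean(theta > 0.5)  # estimate of P(theta > 0.5 | data)
```

Calling clusterSetRNGStream gives each worker its own random number stream, so the workers do not produce overlapping or identical draws.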

