RHadoop is a collection of open source packages with which an R user can manage and analyze data stored in the Hadoop Distributed File System (HDFS). In the background, RHadoop translates the R code into MapReduce jobs and runs them on the Hadoop cluster that holds the data.
The various packages in RHadoop and their uses are as follows:
Unlike the other packages discussed in this book, the packages associated with RHadoop are not available from CRAN. They can be downloaded from the GitHub repository at https://github.com/RevolutionAnalytics and are installed from the local drive.
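As a rough sketch of what installing from the local drive can look like, the following uses the standard install.packages() function with a local source archive. The file path is a made-up placeholder, not an actual release name; substitute the archive that was downloaded from GitHub.

```r
# Hypothetical path: replace it with the archive actually downloaded from GitHub.
pkg.file <- "~/Downloads/rmr2.tar.gz"

# repos = NULL tells install.packages() to use the local file instead of CRAN;
# type = "source" is needed because the file is a source archive.
if (file.exists(pkg.file)) {
  install.packages(pkg.file, repos = NULL, type = "source")
} else {
  message("Archive not found at ", pkg.file)
}
```

With repos = NULL, R does not fetch dependencies automatically, so any CRAN packages that the RHadoop package depends on need to be installed beforehand.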
Here is a sample MapReduce code written using the rmr2 package to count the number of words in a corpus (reference 3 in the References section of this chapter):
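Load the rmr2 library:
>library(rmr2)
>LOCAL <- T #to execute rmr2 locally
>if(LOCAL) rmr.options(backend = "local") #switch from the Hadoop backend to the local one
>#map function
>map.wc <- function(k,lines){
    words.list <- strsplit(lines,'\\s+') #split each line on whitespace
    words <- unlist(words.list)
    return(keyval(words,1)) #emit one (word, 1) pair per word
}
>#reduce function
>reduce.wc<-function(word,counts){
    return(keyval(word,sum(counts))) #total count for each word
}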
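To make the roles of the two functions concrete, the map, shuffle, and reduce phases can be imitated in base R without Hadoop or rmr2. This is only an illustration of the idea, not how rmr2 executes a job; the three input lines and the variable names (lines, pairs, grouped, counts) are invented for the example.

```r
# Base-R imitation of the map, shuffle, and reduce phases (no Hadoop or rmr2 needed).
lines <- c("the quick brown fox", "the lazy dog", "the fox")

# Map: split each line on whitespace and emit one (word, 1) pair per word.
words <- unlist(strsplit(lines, "\\s+"))
pairs <- data.frame(key = words, val = 1, stringsAsFactors = FALSE)

# Shuffle: group the emitted values by key.
grouped <- split(pairs$val, pairs$key)

# Reduce: sum the values collected for each key.
counts <- vapply(grouped, sum, numeric(1))

print(counts)
```

Here "the" ends up with a count of 3 and "fox" with a count of 2; in a real job the grouping step is carried out by Hadoop between the map and reduce calls.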
The word count function combines the map and reduce functions and is run on hdfs.data, a file stored in the HDFS containing the input text:
>#word count function
>wordcount<-function(input,output=NULL){
    mapreduce(input = input,output = output,input.format = "text",
        map = map.wc,reduce = reduce.wc,combine = T)
}
>out<-wordcount(hdfs.data,hdfs.out)
The output is then retrieved from the HDFS and its first rows are printed:
>results<-from.dfs(out)
>results.df<-as.data.frame(results,stringsAsFactors=F)
>colnames(results.df)<-c('word','count')
>head(results.df)
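The rows of the result come back in no particular order, so it can help to sort them before inspecting them. The sketch below assumes that the object returned by from.dfs() behaves like a list with key and val components; the stand-in list and its values are made up for illustration, and top.words is a hypothetical name.

```r
# Stand-in for the list returned by from.dfs(); the values are invented for illustration.
results <- list(key = c("fox", "the", "dog"), val = c(2, 3, 1))

results.df <- as.data.frame(results, stringsAsFactors = FALSE)
colnames(results.df) <- c("word", "count")

# Order by descending count so that head() shows the most frequent words first.
top.words <- results.df[order(-results.df$count), ]
head(top.words)
```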