Learning the different ways to write Hadoop MapReduce in R

We know that Hadoop Big Data processing with MapReduce is a big deal for statisticians, web analysts, and product managers who used to use R for their analyses, because migrating those analyses to MapReduce on Hadoop requires additional programming knowledge. We also know that R is a tool whose popularity is consistently increasing, and many packages/libraries are being developed to integrate with it. So, to develop a MapReduce algorithm or program that combines the logic of R with the computation power of Hadoop, we require middleware that bridges R and Hadoop. RHadoop, RHIPE, and Hadoop streaming are the middleware options that help develop and execute Hadoop MapReduce from within R. In this last section, we will talk about RHadoop and RHIPE and introduce Hadoop streaming; from the later chapters onward, we will develop MapReduce purely with these packages.

Learning RHadoop

RHadoop is a great open source software framework of R for performing data analytics with the Hadoop platform via R functions. RHadoop has been developed by Revolution Analytics, the leading commercial provider of software and services based on the open source R project for statistical computing. The RHadoop project consists of three different R packages: rhdfs, rmr, and rhbase. All these packages are implemented and tested on the Cloudera Hadoop distributions CDH3 and CDH4 with R 2.15.0, as well as with Revolution Analytics' R distributions 4.3, 5.0, and 6.0.

These three different R packages have been designed around Hadoop's two main features, HDFS and MapReduce:

  • rhdfs: This is an R package that provides Hadoop HDFS access to R. All distributed files can be managed with R functions.
  • rmr: This is an R package that provides Hadoop MapReduce interfaces to R. With the help of this package, the Mapper and Reducer can easily be developed (a short sketch follows this list).
  • rhbase: This is an R package for handling data in the HBase distributed database through R.
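
To give a flavor of how rhdfs and rmr fit together, the following is a minimal sketch of an rmr-style job that squares the integers 1 to 10. It assumes a working Hadoop installation with the HADOOP_CMD and HADOOP_STREAMING environment variables set, and that the rhdfs and rmr2 packages (rmr2 being the current name of the rmr package) are installed; paths and package versions will vary with your cluster.

```r
# Minimal sketch: square the integers 1..10 as a MapReduce job
# (assumes a configured Hadoop cluster and the RHadoop packages).
library(rhdfs)
library(rmr2)

hdfs.init()                 # connect rhdfs to the cluster's HDFS

small.ints <- to.dfs(1:10)  # push a small R vector into HDFS

# The map function receives key/value pairs; here the key is ignored
# and each value is emitted together with its square.
result <- mapreduce(
  input = small.ints,
  map   = function(k, v) keyval(v, v^2)
)

from.dfs(result)            # pull the key/value pairs back into R
```

Once hdfs.init() has been called, rhdfs helper functions such as hdfs.ls() and hdfs.put() can also be used to browse and copy files on HDFS directly from the R prompt.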

Learning RHIPE

R and Hadoop Integrated Programming Environment (RHIPE) is a free and open source project. RHIPE is widely used for performing Big Data analysis via Divide and Recombine (D&R) analysis, in which huge data is divided into subsets, the subsets are processed in parallel on a distributed network to produce intermediate output, and finally all of this intermediate output is recombined into a result set. RHIPE is designed to carry out D&R analysis on complex Big Data in R on the Hadoop platform. RHIPE was developed by Saptarshi Joy Guha (Data Analyst at Mozilla Corporation) and her team as part of her PhD thesis in the Purdue Statistics Department.
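
To give a feel for the D&R workflow, the following is a minimal sketch of a RHIPE map-only job that squares a set of numbers. It assumes RHIPE is installed against a running Hadoop cluster; the function names (rhinit, rhwrite, rhwatch, rhcollect, and rhread) come from the RHIPE API, but their exact arguments vary between RHIPE versions, and the HDFS paths are purely illustrative.

```r
# Minimal RHIPE sketch (assumes RHIPE is installed and Hadoop is running).
library(Rhipe)
rhinit()                                # initialize RHIPE's connection to Hadoop

# Divide: write input key/value pairs (here, the numbers 1..10) to HDFS.
input.data <- lapply(1:10, function(i) list(i, i))
rhwrite(input.data, "/tmp/rhipe_input") # illustrative HDFS path

# The map expression operates on map.keys/map.values and emits
# key/value pairs with rhcollect().
map <- expression({
  lapply(seq_along(map.values), function(r) {
    rhcollect(map.keys[[r]], map.values[[r]]^2)
  })
})

# Run the job, then recombine by reading the results back into R.
job <- rhwatch(map = map,
               input = "/tmp/rhipe_input",
               output = "/tmp/rhipe_output",
               readback = FALSE)
result <- rhread("/tmp/rhipe_output")
```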

Learning Hadoop streaming

Hadoop streaming is a utility that comes with the Hadoop distribution. This utility allows you to create and run MapReduce jobs with any executable or script as the Mapper and/or the Reducer; such scripts can be written in R, Python, Ruby, Bash, Perl, and so on. In this book, we will use the R language with a bash script.
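
As a quick illustration of how an R script plugs into Hadoop streaming, here is a minimal word-count Mapper written in R. It simply reads lines from standard input and writes tab-separated key/value pairs to standard output, which is all that Hadoop streaming requires of a Mapper; the file names and the streaming JAR path in the comment are assumptions that depend on your Hadoop installation.

```r
#!/usr/bin/env Rscript
# Minimal Hadoop streaming Mapper in R: emit "word<TAB>1" for each word on stdin.
# A typical (installation-dependent) invocation looks roughly like:
#   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#     -input /data/input -output /data/output \
#     -mapper wordcount_mapper.R -reducer wordcount_reducer.R \
#     -file wordcount_mapper.R -file wordcount_reducer.R

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  words <- unlist(strsplit(tolower(line), "[^a-z0-9]+"))
  words <- words[nchar(words) > 0]
  for (w in words) {
    cat(w, "\t1\n", sep = "")   # key<TAB>value, one pair per line
  }
}
close(con)
```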

Also, there is an R package named HadoopStreaming that has been developed for performing data analysis on Hadoop clusters with the help of R scripts; it acts as an interface from R to Hadoop streaming. Additionally, it allows MapReduce tasks to be run without Hadoop (for example, for local testing). This package was developed by David Rosenberg, Chief Scientist at SenseNetworks, who has expertise in machine learning and statistical modeling.
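
For completeness, the sketch below shows how the HadoopStreaming package's hsTableReader helper could be used to write a matching Reducer for the word-count Mapper above. The argument names used here (cols, chunkSize, FUN, ignoreKey, singleKey, keyCol, sep) follow the package's documented interface, but you should verify them against ?hsTableReader on your system; treat this as an assumption-laden sketch rather than a definitive recipe.

```r
#!/usr/bin/env Rscript
# Sketch of a streaming Reducer built with the HadoopStreaming package:
# sum the counts emitted by the word-count Mapper above.
library(HadoopStreaming)

sumCounts <- function(d) {
  # d holds the (word, count) rows that share one key (word)
  cat(d$word[1], "\t", sum(d$count), "\n", sep = "")
}

con <- file("stdin", open = "r")
hsTableReader(con,
              cols = list(word = "", count = 0),  # column names and types
              chunkSize = -1,
              FUN = sumCounts,
              ignoreKey = FALSE,
              singleKey = TRUE,
              keyCol = "word",
              sep = "\t")
close(con)
```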
