Chapter 1. Getting Ready to Use R and Hadoop

This first chapter covers the following R and Hadoop basics:

  • R installation, features, and data modeling
  • Hadoop installation, features, and components

In the preface, we introduced you to R and Hadoop. This chapter focuses on getting you up and running with these two technologies. R was long used mainly for statistical analysis, but its growing number of functions and packages has made it popular in several other fields, such as machine learning, visualization, and data operations. However, analytics with R runs into problems with large data. To analyze a dataset, R loads it into memory, and if the dataset is too large, the operation fails with an error such as "cannot allocate vector of size x". R therefore cannot hold Big Data in a single machine's memory, and Hadoop can be chosen to store and process such data instead. Not all algorithms work across Hadoop, and the algorithms that do are, in general, not R algorithms. Even so, the processing power of R can be vastly magnified by combining it with the power of a Hadoop cluster. Hadoop is a very popular framework that provides such parallel processing capabilities, so we can run R algorithms or analysis processing over Hadoop clusters to get the work done.
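To make the memory limit concrete, here is a minimal back-of-the-envelope sketch in shell. It assumes every cell of a numeric table is stored as an 8-byte double (which is how R stores numeric vectors) and ignores object overhead and temporary copies, so the real requirement is usually higher. The function names and the example figures (500 million rows, 10 columns, 16 GiB of RAM) are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash
# Rough estimate of the RAM needed to hold a numeric table in memory.
# Assumption: each cell is an 8-byte double; overhead and copies are ignored.

# estimate_bytes ROWS COLS -> prints the approximate size in bytes
estimate_bytes() {
  local rows=$1 cols=$2
  echo $(( rows * cols * 8 ))
}

# fits_in_memory ROWS COLS RAM_BYTES -> prints "fits" or "too large"
fits_in_memory() {
  local need
  need=$(estimate_bytes "$1" "$2")
  if [ "$need" -le "$3" ]; then
    echo "fits"
  else
    echo "too large"
  fi
}

# Hypothetical example: 500 million rows x 10 columns on a 16 GiB machine.
ram_bytes=$(( 16 * 1024 * 1024 * 1024 ))
echo "needed: $(estimate_bytes 500000000 10) bytes"
echo "result: $(fits_in_memory 500000000 10 "$ram_bytes")"
```

When an estimate like this exceeds the available memory, a single R session is likely to hit the allocation error quoted above, which is the situation where distributing the data across a cluster becomes attractive.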

If we think about a combined RHadoop system, R takes care of data analysis operations, including preliminary functions such as data loading, exploration, analysis, and visualization, while Hadoop takes care of parallel data storage and supplies the computation power to process distributed data.

Prior to the advent of affordable Big Data technologies, analysis used to be run on limited datasets on a single machine. Advanced machine learning algorithms are very effective when applied to large datasets, and this is possible only with large clusters where data can be stored and processed with distributed data storage systems. In the next section, we will see how R and Hadoop can be installed on different operating systems and the possible ways to link R and Hadoop.

Installing R

You can download the appropriate version by visiting the official R website.

Here are the steps for three operating systems: Windows, Linux, and Mac OS. Download the latest version of R, as it includes the latest patches and fixes for past bugs.

For Windows, follow the given steps:

  1. Navigate to www.r-project.org.
  2. Click on the CRAN section, select a CRAN mirror, and select Windows as your OS. (For actual Hadoop work, stick to Linux; Hadoop is almost always used in a Linux environment.)
  3. Download the latest R version from the mirror.
  4. Execute the downloaded .exe to install R.

For Linux-Ubuntu, follow the given steps:

  1. Navigate to www.r-project.org.
  2. Click on the CRAN section, select a CRAN mirror, and select your OS.
  3. In the /etc/apt/sources.list file, add the CRAN <mirror> entry.
  4. Download and update the package lists from the repositories using the sudo apt-get update command.
  5. Install the R system using the sudo apt-get install r-base command.
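The Ubuntu steps above can be collected into a small script. The following is only a sketch: the run helper, the DRY_RUN variable, and the install_r_ubuntu function are hypothetical names introduced here, and step 3 (adding the CRAN <mirror> entry) is left as a manual step because the exact entry depends on the mirror and Ubuntu release you choose. By default the script only prints the commands it would execute.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the Ubuntu installation steps.
# DRY_RUN=1 (the default) prints each command instead of executing it.

DRY_RUN="${DRY_RUN:-1}"

# run CMD... -> print the command in dry-run mode, execute it otherwise
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

install_r_ubuntu() {
  # Step 3 is manual: add the CRAN <mirror> entry to /etc/apt/sources.list.
  run sudo apt-get update            # step 4: refresh the package lists
  run sudo apt-get install r-base    # step 5: install the R system
}

install_r_ubuntu
```

Setting DRY_RUN=0 before running the script would execute the two apt-get commands for real, which requires sudo rights on the machine.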

For Linux-RHEL/CentOS, follow the given steps:

  1. Navigate to www.r-project.org.
  2. Click on CRAN, select a CRAN mirror, and select Red Hat OS.
  3. Download the R-*core-*.rpm file.
  4. Install the .rpm package using the rpm -ivh R-*core-*.rpm command.
  5. Install the R system using the sudo yum install R command.

For Mac, follow the given steps:

  1. Navigate to www.r-project.org.
  2. Click on CRAN, select a CRAN mirror, and select your OS.
  3. Download the following files: R-*.pkg, gfortran-*.dmg, and tcltk-*.dmg.
  4. Install the R-*.pkg file.
  5. Then, install the gfortran-*.dmg and tcltk-*.dmg files.

After installing the base R package, it is advisable to install RStudio, which is a powerful and intuitive Integrated Development Environment (IDE) for R.
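Before moving on to RStudio, it can help to confirm from a terminal that the base installation is reachable. The sketch below assumes a Unix-like shell (Linux or Mac OS) and that the installer placed the R and Rscript executables on your PATH; check_tool is a hypothetical helper name used only for this illustration.

```shell
#!/usr/bin/env bash
# Check whether the R and Rscript executables are reachable on PATH.

# check_tool NAME -> prints "NAME: found at LOCATION" or "NAME: not found"
check_tool() {
  local location
  if location=$(command -v "$1" 2>/dev/null); then
    echo "$1: found at $location"
  else
    echo "$1: not found"
  fi
}

check_tool R
check_tool Rscript

# If R is present, print the first line of its version banner.
if command -v R >/dev/null 2>&1; then
  R --version | head -n 1
fi
```

If either tool is reported as not found, the usual causes are an incomplete installation or a terminal session that was opened before the installer updated the PATH.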

Tip

We can use the Revolution Analytics distribution of R as a modern data analytics tool for statistical computing and predictive analytics. It is available in free as well as premium versions, and Hadoop integration is also available for performing Big Data analytics.
