This first chapter covers several topics on R and Hadoop basics.
In the preface, we introduced you to R and Hadoop. This chapter will focus on getting you up and running with these two technologies. Until now, R has been used mainly for statistical analysis, but thanks to its growing number of functions and packages, it has become popular in several other fields, such as machine learning, visualization, and data operations. However, analytics with R runs into problems with large data. To analyze a dataset, R loads it into memory, and if the dataset is too large, the load fails with errors such as "cannot allocate vector of size x". R alone therefore cannot hold Big Data in a single machine's memory, and Hadoop can be chosen to store and process it instead. Not all algorithms work across Hadoop, and the algorithms that do are, in general, not R algorithms. Even so, the processing power of R can be vastly magnified by combining it with the power of a Hadoop cluster. Hadoop is a very popular framework that provides such parallel processing capabilities, so we can run R algorithms or analysis over Hadoop clusters to get the work done.
If we think about a combined RHadoop system, R will take care of data analysis operations with the preliminary functions, such as data loading, exploration, analysis, and visualization, and Hadoop will take care of parallel data storage as well as computation power against distributed data.
Prior to the advent of affordable Big Data technologies, analysis used to be run on limited datasets on a single machine. Advanced machine learning algorithms are very effective when applied to large datasets, and this is possible only with large clusters where data can be stored and processed with distributed data storage systems. In the next section, we will see how R and Hadoop can be installed on different operating systems and the possible ways to link R and Hadoop.
You can download the appropriate version by visiting the official R website.
Here are the steps for three different operating systems: Windows, Linux, and Mac OS. Download the latest version of R, as it will include all the latest patches and fixes for past bugs.
For Windows, follow the given steps:
1. Run the downloaded .exe to install R.
For Linux-Ubuntu, follow the given steps:
1. In the /etc/apt/sources.list file, add the CRAN <mirror> entry.
2. Update the package lists using the sudo apt-get update command.
3. Install R using the sudo apt-get install r-base command.
For Linux-RHEL/CentOS, follow the given steps:
1. Download the R-*core-*.rpm file.
2. Install the .rpm package using the rpm -ivh R-*core-*.rpm command.
3. Install R using sudo yum install R.
For Mac, follow the given steps:
1. Download the following files: pkg, gfortran-*.dmg, and tcltk-*.dmg.
2. Install the R-*.pkg file.
3. Install the gfortran-*.dmg and tcltk-*.dmg files.
After installing the base R package, it is advisable to install RStudio, which is a powerful and intuitive Integrated Development Environment (IDE) for R.
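The Linux package-manager steps above can be collected into a small script. The following is a minimal sketch, not a definitive installer. The function names r_install_commands and detect_package_manager are hypothetical. The script only prints the commands rather than running them, and for Ubuntu it assumes the CRAN <mirror> entry has already been added to /etc/apt/sources.list.

```shell
#!/bin/sh
# Hypothetical helper: print the R install commands from the steps above
# for a given package manager ("apt" for Ubuntu, "yum" for RHEL/CentOS).
# Printing instead of executing keeps the sketch safe to try without root.
r_install_commands() {
  case "$1" in
    apt)
      echo "sudo apt-get update"
      echo "sudo apt-get install r-base"
      ;;
    yum)
      echo "sudo yum install R"
      ;;
    *)
      echo "unsupported package manager: $1" >&2
      return 1
      ;;
  esac
}

# Hypothetical helper: report which of the two package managers is present.
detect_package_manager() {
  if command -v apt-get >/dev/null 2>&1; then
    echo "apt"
  elif command -v yum >/dev/null 2>&1; then
    echo "yum"
  else
    echo "unknown"
  fi
}

# Show the commands for this machine; ignore the error on other systems.
r_install_commands "$(detect_package_manager)" || true
```

After reviewing the printed commands, run them by hand. Once the installation finishes, running R --version should report the installed version.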