By now, we presume that you are aware of R: what it is, how to install it, what its key features are, and why you may want to use it. Now we need to look at the limitations of R (this also serves as a better introduction to Hadoop). Before processing data, R needs to load it into random access memory (RAM), so the data needs to be smaller than the available machine memory. Data that is larger than the machine memory we consider as Big Data (only in our case, as there are many other definitions of Big Data).
To avoid this Big Data issue, we could scale up the hardware configuration; however, that is only a temporary solution. A better solution is a Hadoop cluster, which can store such data and perform parallel computation across a large cluster of computers. Hadoop is the most popular such solution. It is an open source Java framework and a top-level project of the Apache Software Foundation. Hadoop is inspired by the Google File System and MapReduce, and is designed mainly for operating on Big Data by distributed processing.
Hadoop mainly supports Linux operating systems. To run it on Windows, we need to use VMware to host Ubuntu within the Windows OS. There are many ways to use and install Hadoop, but here we will consider the way that supports R best. Before we combine R and Hadoop, let us understand what Hadoop is.
Machine learning covers all the data modeling techniques that can be explored via the web link http://en.wikipedia.org/wiki/Machine_learning.
The blog post on Hadoop installation by Michael Noll can be found at http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/.
Hadoop can be used in three different modes: the standalone mode, the pseudo-distributed mode, and the fully distributed mode.
In the standalone mode, running the script
~/Hadoop-directory/bin/hadoop
will execute a Hadoop operation as a single Java process. This mode is recommended for testing purposes. It is the default mode, and you don't need to configure anything else. All daemons, such as NameNode, DataNode, JobTracker, and TaskTracker, run in a single Java process.

Hadoop can be installed in several ways; we will consider the way that integrates best with R. We will choose the Ubuntu OS as it is easy to install and access.
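Before looking at the installation, here is a hedged illustration of the standalone mode described above; the examples JAR name matches the Hadoop 1.0.3 distribution used in the next section, and the input/ and output/ directories are hypothetical:

# Run the bundled WordCount example as a single local Java process
$ cd ~/Hadoop-directory
$ bin/hadoop jar hadoop-examples-1.0.3.jar wordcount input/ output/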
To install Hadoop over Ubuntu OS in the pseudo-distributed mode, we need to meet the following prerequisites:
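As a hedged sketch of typical prerequisites, the commands below assume Sun Java 6 and a dedicated hduser user in a hadoop group, both taken from the paths and ownership used later in this section; the exact commands may differ on your system:

# Install Java (Sun Java 6, matching the JAVA_HOME used later)
$ sudo apt-get install sun-java6-jdk
# Create a dedicated Hadoop system user and group (hduser:hadoop, as used later)
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
# Configure passwordless SSH to localhost for the hduser account
$ su - hduser
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys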
Follow the given steps to install Hadoop:
# Locate to the Hadoop installation directory
$ cd /usr/local
# Extract the tar file of the Hadoop distribution
$ sudo tar xzf hadoop-1.0.3.tar.gz
# Move the Hadoop resources to the hadoop folder
$ sudo mv hadoop-1.0.3 hadoop
# Make user hduser from group hadoop the owner of the hadoop directory
$ sudo chown -R hduser:hadoop hadoop
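If the hadoop-1.0.3.tar.gz file is not already present in /usr/local, it can be downloaded first; the mirror URL below is an assumption, and any Apache archive mirror carrying Hadoop 1.0.3 will work:

# Download the Hadoop 1.0.3 distribution into the installation directory
$ cd /usr/local
$ sudo wget http://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/hadoop-1.0.3.tar.gz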
Add the $JAVA_HOME and $HADOOP_HOME variables to the .bashrc file of the Hadoop system user; the updated .bashrc file looks as follows:

# Setting the environment variables for running Java and Hadoop commands
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Aliases for Hadoop commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# Defining the function for decompressing the MapReduce job output with the lzop command
lzohead () {
  hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Adding the HADOOP_HOME variable to PATH
export PATH=$PATH:$HADOOP_HOME/bin
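After reloading the .bashrc file, the new aliases can be checked with something like the following; the HDFS paths are illustrative:

# Reload the environment for the hduser session
$ source ~/.bashrc
# List the HDFS root using the fs alias defined above
$ fs -ls /
# List it again using the shorter hls alias
$ hls /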
Edit the Hadoop configuration files, which follow the conf/*-site.xml format. Finally, the three files will look as follows:
conf/core-site.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default filesystem. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
conf/mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
conf/hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
After editing these configuration files, we need to set up the distributed filesystem across the Hadoop cluster or node.
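For example, with the standard Hadoop 1.x scripts this amounts to formatting the filesystem via the NameNode and starting the daemons; the commands below are a sketch run as the hduser user against the installation directory set up above:

# Format the HDFS filesystem via the NameNode (only once, before the first start)
$ /usr/local/hadoop/bin/hadoop namenode -format
# Start all the Hadoop daemons on the single node
$ /usr/local/hadoop/bin/start-all.sh
# Verify that the daemons (NameNode, DataNode, JobTracker, TaskTracker) are running
$ jps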
We learned how to install Hadoop on a single node cluster. Now we will see how to install Hadoop on a multinode cluster (the fully distributed mode).
For this, we need several nodes, each configured with a single node Hadoop cluster as described in the last section.
After getting the single node Hadoop cluster installed, we need to perform the following steps:
We need to assign IP addresses to the nodes over the network, for example, 192.168.0.1 to the master machine and 192.168.0.2 to the slave machine. Then, update the /etc/hosts file on both the nodes so that its entries look like 192.168.0.1 master and 192.168.0.2 slave. You can then perform the Secure Shell (SSH) setup similar to what we did for the single node cluster setup (a sketch follows). For more details, visit http://www.michael-noll.com.
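A hedged sketch of that SSH setup, assuming the master/slave hostnames above and the hduser account (adjust paths to your environment):

# On the master node, as hduser: copy the public SSH key to the slave
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave
# Verify passwordless logins from the master to itself and to the slave
$ ssh master
$ ssh slave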
conf/*-site.xml: We must change all these configuration files on all of the nodes.
conf/core-site.xml and conf/mapred-site.xml: In the single node setup, we updated these files. Now, we just need to replace localhost with master in the value tag, as shown in the snippet after this list.
conf/hdfs-site.xml: In the single node setup, we set the value of dfs.replication to 1. Now we need to update it to 2.
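For example, after these edits the relevant properties would look roughly as follows on every node (port numbers are unchanged from the single node setup):

<!-- conf/core-site.xml: localhost replaced by master -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>

<!-- conf/mapred-site.xml: localhost replaced by master -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>

<!-- conf/hdfs-site.xml: replication raised from 1 to 2 -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>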
Format the HDFS filesystem via the NameNode by running the following command on the master node:
bin/hadoop namenode -format
Now, we have completed all the steps to install the multinode Hadoop cluster. To start and stop the Hadoop cluster, we need to follow these steps:
Start the HDFS daemons by running the following command on the master:
hduser@master:/usr/local/hadoop$ bin/start-dfs.sh
Start the MapReduce daemons:
hduser@master:/usr/local/hadoop$ bin/start-mapred.sh
Alternatively, start all the daemons with a single command:
hduser@master:/usr/local/hadoop$ bin/start-all.sh
To stop all the daemons, run:
hduser@master:/usr/local/hadoop$ bin/stop-all.sh
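To check that the daemons came up as expected, the jps tool from the JDK can be used on each node; the daemon placement below assumes the master runs the NameNode and JobTracker, as in this setup:

# On the master: expect NameNode, SecondaryNameNode, and JobTracker
# (plus DataNode and TaskTracker if the master also acts as a worker)
hduser@master:/usr/local/hadoop$ jps
# On the slave: expect DataNode and TaskTracker
hduser@slave:/usr/local/hadoop$ jps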
These installation steps were inspired by and reproduce the blog posts (http://www.michael-noll.com) of Michael Noll, a researcher and software engineer based in Switzerland. He works as a technical lead for a large-scale computing infrastructure on the Apache Hadoop stack at VeriSign.
Now the Hadoop cluster has been set up on your machines. To install the same Hadoop cluster on a single node or multiple nodes with extended Hadoop components, try the Cloudera tool.
Cloudera Hadoop (CDH) is Cloudera's open source distribution that targets enterprise-class deployments of Hadoop technology. Cloudera is also a sponsor of the Apache Software Foundation. CDH is available in two versions: CDH3 and CDH4. To install one of these, you must have Ubuntu 10.04 LTS or 12.04 LTS (you can also try CentOS, Debian, and Red Hat systems). Cloudera Manager makes this installation easier if you are installing Hadoop on a cluster of computers, as it provides GUI-based installation of Hadoop and its components over the whole cluster. This tool is very much recommended for large clusters.
We need to meet the following prerequisites:
The installation steps are as follows:
Download and run the cloudera-manager-installer.bin file from the download section of the Cloudera website. After that, store it on the cluster so that all the nodes can access it. Give the user execution permission on cloudera-manager-installer.bin and run the following command to start the installation:
$ sudo ./cloudera-manager-installer.bin
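For example, granting the execution permission might look like this, assuming the installer was saved in the current working directory:

# Make the downloaded installer executable for the current user
$ chmod u+x cloudera-manager-installer.bin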
Open the Cloudera Manager admin console by typing http://localhost:7180 in your address bar. You can also use any of the other supported browsers. Then, log in with admin for both the username and password; later on, you can change them as per your choice.
To avoid these installation steps, use preconfigured Hadoop instances with Amazon Elastic MapReduce (EMR).
If you want to use Hadoop on Windows, try the HDP tool by Hortonworks. It is a 100 percent open source, enterprise-grade distribution of Hadoop. You can download the HDP tool at http://hortonworks.com/download/.