3.3 CONFIGURING A HADOOP CLUSTER
A Hadoop cluster consists of master and slave machines (Linux boxes). The main configuration files of a Hadoop cluster are hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml. The Hadoop package has a defined file structure, and these files live under the path $HADOOP_HOME/etc/hadoop, where $HADOOP_HOME is the Hadoop software installation path, for example, /usr/local/hadoop.
A Hadoop cluster can be configured in three modes, as explained below.
• Standalone Mode: This is the default mode in which Hadoop is configured. It is mainly used for debugging and testing purposes, and it does not use HDFS; jobs run as a single Java process against the local file system.
• Pseudo-Distributed Mode (single-node cluster): In this mode, you need to configure all four of the main xml files mentioned above. All Hadoop daemons (Java processes) run on the same node, so a single Linux machine acts both as master and as slave (see the example jps listing after this list).
• Fully Distributed Mode (multi-node cluster): This type of cluster is used in industrial applications, mainly for the different layers of development, testing and production. Separate Linux boxes are allotted as master and slave nodes. In addition, failover needs to be configured here for the NameNode and ResourceManager to achieve high availability.
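As a point of reference, once the daemons have been started (step 6 of the procedure that follows), a jps listing on a pseudo-distributed node shows all of them running on the single machine; the process IDs below are illustrative only.
$ jps
2401 NameNode
2533 DataNode
2706 SecondaryNameNode
2874 ResourceManager
2990 NodeManager
3101 Jps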
In an industrial setting, different types of Hadoop clusters are used; they are explained as follows.
• Sandbox cluster: It is more like a playground area where research and experimentation can be carried out on different service configurations and on the resource management (CPU, JVM memory, cache memory, etc.) of different jobs. Naturally, this type of cluster has a fairly low level of resources in terms of physical disk size and memory.
• Development cluster: Development activities for multiple applications are carried out in this cluster. The cluster size depends entirely on the number of users and applications.
• User acceptance testing (UAT) cluster: This is a cluster for testing the application before it
is deployed in the production environment.
• Production cluster: This is the cluster used as the production environment. Obviously, it is highly resource-intensive.
• DR (Disaster Recovery) cluster: The Disaster Recovery cluster is primarily used for data
archiving.
Basically, the user executes all commands, scripts or applications from another Linux machine called the Edge Node or Gateway Node. This node connects to the specific Hadoop cluster and runs the user's commands inside the Hadoop engine. Only the user's data is staged on the Edge Node; it does not run any Hadoop services. Users access a shared mount point location on the Edge Node and run Hadoop commands or jobs from there.
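For illustration, a session on the edge node might look like the following; the data file and jar names are hypothetical, while the hdfs dfs and hadoop jar client commands themselves are standard.
$ hdfs dfs -put /data/shared/sales.csv /user/hduser/input/
$ hadoop jar sales-report.jar com.example.SalesReport /user/hduser/input /user/hduser/output
$ hdfs dfs -cat /user/hduser/output/part-r-00000 | head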
Steps to configure a single-node Hadoop cluster in Pseudo-Distributed Mode (summarized from the accompanying flowchart):
1. Download and install a Linux OS, either on a separate partition or in Oracle VirtualBox. Create a user hduser and a group hadoop.
2. Download the Hadoop package from the Apache site, untar it and move it into the local file system, i.e., mv hadoop-2.7.3 /usr/local/hadoop.
3. Install Oracle Java and configure the different Hadoop daemons through their respective xml files and ports.
4. Configure and install OpenSSH and set up password-less SSH.
5. Configure 'core-site.xml', 'hdfs-site.xml', 'mapred-site.xml', 'yarn-site.xml' and 'hadoop-env.sh' inside '/usr/local/hadoop/etc/hadoop'. Set JAVA_HOME and HADOOP_HOME in the .bashrc file and format the NameNode.
6. Start HDFS and YARN (NN, SNN, DN, RM and NM). If any daemon fails to start, revisit the configuration and repeat this step.
These steps are detailed below.
1. You need a fresh Linux install, for example, Ubuntu, either on real hardware or in a hypervisor (virtual machine) such as Oracle VirtualBox. Create a Linux user and group, for example, a user hduser in a group hadoop. Names of this type are a de facto industry standard.
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
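Before moving on, you can confirm that the user and group exist with the standard id command (the numeric ids shown are illustrative).
$ id hduser
uid=1001(hduser) gid=1001(hadoop) groups=1001(hadoop)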
2. First download the latest version of the Hadoop software, for example, hadoop-2.7.3.tar.gz.
In a Linux system, a tar file is roughly what a zip file is in Windows. Therefore, you need to untar it with the command below.
$ tar -xzvf hadoop-2.7.3.tar.gz
Then move the untarred directory, i.e., hadoop-2.7.3, into a Linux system path as follows (run the mv command from the directory where you untarred the package, so that the package ends up as /usr/local/hadoop).
$ sudo mv hadoop-2.7.3 /usr/local/hadoop
List the hadoop folder now as shown below.
$ cd /usr/local/hadoop
$ ls -ltr
Let’s examine the important directories in the Hadoop package.
• etc — contains the configuration xml files used to set up the Hadoop environment and configure a cluster.
• bin — contains the executable commands used to start/stop the Hadoop daemons.
• share — contains all the jars and libraries required to develop or execute any MapReduce job.
3. Prerequisite: Now set up Hadoop as a Pseudo-Distributed cluster. At least 30% free space on the hard disk is required to run Hadoop properly.
Before configuring Hadoop, Oracle Java and SSH (secure shell) must be installed.
To install Oracle Java:
$ sudo apt update; sudo apt install oracle-java8-installer
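To confirm the installation and to find the Java home directory that will be needed later (with this installer it is typically /usr/lib/jvm/java-8-oracle), you can run the following.
$ java -version
$ ls /usr/lib/jvm/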
4. We can check SSH by using the following command.
$ ssh localhost
The master node (NameNode) communicates with the slave nodes (DataNodes) very frequently over the SSH protocol. In Pseudo-Distributed mode, only a single node exists (your own machine, i.e., localhost), and the master and slave roles run on that same machine. Since this communication is very frequent, ssh should be password-less, with authentication done using a public key.
The above command may not work if ssh is not installed on the machine. Use the command mentioned below to install ssh.
$ sudo apt-get install openssh-server
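To make ssh to localhost password-less, as described above, generate a key pair for hduser and authorize it; this is the standard OpenSSH procedure (an empty passphrase is assumed here purely for convenience on a single-node test cluster).
$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
After this, ssh localhost should log you in without prompting for a password.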
To disable password authentication, i.e., so that the master node communicates with the slave node without a password, edit the SSH server configuration with the following command.
$ sudo nano /etc/ssh/sshd_config
Edit the following line, setting it to no as shown below.
PasswordAuthentication no
Now restart ssh to apply the settings.
$ sudo /etc/init.d/ssh restart
Update the .bashrc file with all the Hadoop component paths as shown below.
$ vi .bashrc
Add the following lines at the very end of the file. It is a system-generated environment file, so it is good practice to append all the exports at the end, as shown below.
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export CLASSPATH=$HADOOP_HOME/etc/hadoop
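It is also common (though not shown in the listing above) to put the Hadoop executables on the PATH and then reload the file, assuming the same HADOOP_HOME as above.
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
$ source ~/.bashrc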
5. Now update the main Hadoop configuration files to configure all the Hadoop daemons. The path of these files is $HADOOP_HOME/etc/hadoop.
hadoop-env.sh
This is the main environment setup file in Hadoop. Here, you have to set your JAVA_HOME path as follows.
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
core-site.xml
You have to update this file with the NameNode address (fs.defaultFS) as follows.
$ nano core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
Now we will configure HDFS using the file hdfs-site.xml.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>
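The NameNode and DataNode directories referenced above must exist and be writable by hduser before HDFS is started; a minimal sketch, assuming the /hadoopinfra paths from the snippet, is shown below.
$ sudo mkdir -p /hadoopinfra/hdfs/namenode /hadoopinfra/hdfs/datanode
$ sudo chown -R hduser:hadoop /hadoopinfra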