3.3 CONFIGURING A HADOOP CLUSTER
A Hadoop cluster consists of master and slave machines (Linux boxes). The main configuration files of a Hadoop cluster are ‘hadoop-env.sh’, ‘core-site.xml’, ‘hdfs-site.xml’, ‘mapred-site.xml’ and ‘yarn-site.xml’. The Hadoop package has a defined file structure, and these files reside in the path ‘$HADOOP_HOME/etc/hadoop’. Here, $HADOOP_HOME is the Hadoop software package path, such as ‘/usr/local/hadoop’.
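Each of these XML files holds a list of <property> entries. As an illustration, a minimal ‘core-site.xml’ that points clients and daemons at the NameNode might look as follows (the host ‘localhost’ and port 9000 are placeholder values, not fixed requirements):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <!-- default filesystem URI; used to locate the NameNode -->
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>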
A Hadoop cluster can be configured in one of three modes, as explained below.
• Standalone Mode: This is the default mode of a Hadoop installation. It is mainly used for debugging and testing purposes, and it does not support HDFS operations; Hadoop runs against the local filesystem instead.
• Pseudo-Distributed Mode (single-node cluster): In this mode, you need to configure all four of the main XML files mentioned above. All Hadoop daemons (Java processes) run on the same node; a single Linux machine acts both as master and as slave. A minimal configuration sketch is given after this list.
• Fully Distributed Mode (multi-node cluster): This type of cluster is used in industrial applications, mainly for the different layers of development, testing and production. Separate Linux boxes are allotted as master and slaves. In addition, failover needs to be configured here for the NameNode and ResourceManager to achieve high availability.
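As a minimal sketch of a pseudo-distributed setup, assume ‘core-site.xml’ is configured as shown earlier and that passphrase-less SSH to localhost is in place. Since there is only one DataNode, the replication factor in ‘hdfs-site.xml’ is reduced to 1:

    <!-- $HADOOP_HOME/etc/hadoop/hdfs-site.xml -->
    <configuration>
      <property>
        <!-- only one DataNode, so no block replication -->
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

The cluster can then be initialized and started with the standard scripts shipped in ‘$HADOOP_HOME/sbin’:

    # format the NameNode metadata directory (run once, before first start)
    hdfs namenode -format
    # start the HDFS daemons (NameNode, DataNode, SecondaryNameNode)
    start-dfs.sh
    # start the YARN daemons (ResourceManager, NodeManager)
    start-yarn.sh
    # list the running Java processes to verify the daemons are up
    jps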
In an industrial application, several different types of Hadoop clusters are used; they are explained as follows.
• Sandbox cluster: This is more of a playground area where research can be carried out, testing different service configurations and the resource management (CPU, JVM memory, cache memory, etc.) of different jobs. Naturally, this type of cluster has a rather low level of resources in terms of physical disk size and memory.
• Development cluster: Development activities on multiple applications are carried out in this cluster. The cluster size depends entirely upon the number of users and applications.
• User acceptance testing (UAT) cluster: This is a cluster for testing the application before it
is deployed in the production environment.
• Production cluster: This is the cluster used as the production environment. Obviously, it is highly resource-intensive.
• DR (Disaster Recovery) cluster: The Disaster Recovery cluster is primarily used for data
archiving.
Basically, the user executes all commands, scripts or applications from a separate Linux machine called the Edge Node or Gateway Node. This node connects to the specific Hadoop cluster and runs the user’s commands inside the Hadoop engine. Only the user’s data is stored on the Edge Node; it does not run any Hadoop daemons. Users simply access the shared mount-point location on the Edge Node and execute Hadoop commands or jobs from there, as the example below shows.
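For example, a typical session on the Edge Node might copy a file from the shared mount point into HDFS and then launch a MapReduce job against it (the user name and paths below are illustrative):

    # create a home area in HDFS and load the input file
    hdfs dfs -mkdir -p /user/alice/input
    hdfs dfs -put /shared/data/sample.txt /user/alice/input/
    # run the stock word-count example that ships with Hadoop
    yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        wordcount /user/alice/input /user/alice/output
    # inspect the result
    hdfs dfs -cat /user/alice/output/part-r-00000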