3.3 CONFIGURING A HADOOP CLUSTER
A Hadoop cluster consists of master and slave machines (Linux boxes). The main configuration files of a Hadoop cluster are ‘hadoop-env.sh’, ‘core-site.xml’, ‘hdfs-site.xml’, ‘mapred-site.xml’ and ‘yarn-site.xml’. The Hadoop package has a defined file structure, and these files reside in the path ‘$HADOOP_HOME/etc/hadoop’. Here, $HADOOP_HOME is the Hadoop software package path, such as ‘/usr/local/hadoop’.
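Each of these XML files holds a list of <property> entries. As an illustration, a minimal ‘core-site.xml’ that points clients and daemons at the NameNode might look as follows (the host ‘localhost’ and port 9000 are placeholder values, not fixed requirements):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <!-- default filesystem URI; used to locate the NameNode -->
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>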
A Hadoop cluster can be configured in one of three modes, as explained below.
• Standalone Mode: This is the default mode of a Hadoop installation. It is mainly used for debugging and testing purposes, and it does not support HDFS operations; Hadoop runs against the local filesystem instead.
• Pseudo-Distributed Mode (single-node cluster): In this mode, you need to configure all four of the main XML files mentioned above. All Hadoop daemons (Java processes) run on the same node; a single Linux machine acts both as master and as slave. A minimal configuration sketch is given after this list.
• Fully Distributed Mode (multi-node cluster): This type of cluster is used in industrial applications, mainly for the different layers of development, testing and production. Separate Linux boxes are allotted as master and slaves. In addition, failover needs to be configured here for the NameNode and ResourceManager to achieve high availability.
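As a minimal sketch of a pseudo-distributed setup, assume ‘core-site.xml’ is configured as shown earlier and that passphrase-less SSH to localhost is in place. Since there is only one DataNode, the replication factor in ‘hdfs-site.xml’ is reduced to 1:

    <!-- $HADOOP_HOME/etc/hadoop/hdfs-site.xml -->
    <configuration>
      <property>
        <!-- only one DataNode, so no block replication -->
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

The cluster can then be initialized and started with the standard scripts shipped in ‘$HADOOP_HOME/sbin’:

    # format the NameNode metadata directory (run once, before first start)
    hdfs namenode -format
    # start the HDFS daemons (NameNode, DataNode, SecondaryNameNode)
    start-dfs.sh
    # start the YARN daemons (ResourceManager, NodeManager)
    start-yarn.sh
    # list the running Java processes to verify the daemons are up
    jps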
In an industrial application, several different types of Hadoop clusters are used; they are explained as follows.
• Sandbox cluster: This is more of a playground area where research can be carried out, testing different service configurations and the resource management (CPU, JVM memory, cache memory, etc.) of different jobs. Naturally, this type of cluster has a rather low level of resources in terms of physical disk size and memory.
• Development cluster: Development activities on multiple applications are carried out in this cluster. The cluster size depends entirely upon the number of users and applications.
• User acceptance testing (UAT) cluster: This is a cluster for testing the application before it
is deployed in the production environment.
• Production cluster: This is the cluster used as the production environment. Obviously, it is highly resource-intensive.
• DR (Disaster Recovery) cluster: The Disaster Recovery cluster is primarily used for data
archiving.
Basically, the user executes all commands, scripts or applications from a separate Linux machine called the Edge Node or Gateway Node. This node connects to the specific Hadoop cluster and runs the user’s commands inside the Hadoop engine. Only the user’s data is stored on the Edge Node; it does not run any Hadoop daemons. Users simply access the shared mount-point location on the Edge Node and execute Hadoop commands or jobs from there, as the example below shows.
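For example, a typical session on the Edge Node might copy a file from the shared mount point into HDFS and then launch a MapReduce job against it (the user name and paths below are illustrative):

    # create a home area in HDFS and load the input file
    hdfs dfs -mkdir -p /user/alice/input
    hdfs dfs -put /shared/data/sample.txt /user/alice/input/
    # run the stock word-count example that ships with Hadoop
    yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        wordcount /user/alice/input /user/alice/output
    # inspect the result
    hdfs dfs -cat /user/alice/output/part-r-00000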