Tuning Hadoop configurations for cluster deployments

Getting ready

If the Hadoop cluster is already running, shut it down by executing the bin/stop-dfs.sh and bin/stop-mapred.sh commands from the HADOOP_HOME directory.

How to do it...

We can control Hadoop configurations through the following three configuration files:

  • conf/core-site.xml: This contains the configurations common to the whole Hadoop distribution
  • conf/hdfs-site.xml: This contains configurations for HDFS
  • conf/mapred-site.xml: This contains configurations for MapReduce

Each configuration file has name-value pairs expressed in XML format, and they define the workings of different aspects of Hadoop. The following code snippet shows an example of a property in a configuration file. Here, the <configuration> tag is the top-level XML container, and the <property> tags that define individual properties go as its child elements.

<configuration>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>20</value>
  </property>
  ...
</configuration>
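
Such properties can also be overridden for a single job from the command line, without editing the configuration files. The following is a minimal sketch, assuming the job uses ToolRunner (as the bundled examples do) and that the examples JAR name and the input/output paths match your setup:

bin/hadoop jar hadoop-examples-*.jar wordcount \
  -D mapred.reduce.parallel.copies=20 \
  input_dir output_dir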

The following instructions show how to change the directory to which Hadoop writes its logs and how to configure the maximum number of map and reduce tasks per TaskTracker:

  1. Create a directory to store the logfiles. For example, /root/hadoop_logs.
  2. Uncomment the line that includes HADOOP_LOG_DIR in HADOOP_HOME/conf/hadoop-env.sh and point it to the new directory (see the sketch after these steps).
  3. Add the following lines to the HADOOP_HOME/conf/mapred-site.xml file:
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>
  4. Restart the Hadoop cluster by running the bin/stop-mapred.sh and bin/start-mapred.sh commands from the HADOOP_HOME directory.
  5. You can verify the number of processes created using OS process-monitoring tools. On Linux, run the watch "ps -ef | grep hadoop" command. On Windows, use the Task Manager; on macOS, use the Activity Monitor.
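
The following is a rough sketch of steps 1 and 2, assuming the /root/hadoop_logs directory suggested earlier:

# Step 1: create the directory for the logfiles
mkdir -p /root/hadoop_logs

# Step 2: in HADOOP_HOME/conf/hadoop-env.sh, uncomment and edit this line
export HADOOP_LOG_DIR=/root/hadoop_logs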

How it works...

HADOOP_LOG_DIR redefines the location to which Hadoop writes its logs. The mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties define the maximum number of map and reduce tasks that can run within a single TaskTracker at a given moment.

These and other server-side parameters are defined in the HADOOP_HOME/conf/*-site.xml files. Hadoop reads these configuration files at startup, so changes take effect only after a restart.
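
You can also confirm that the TaskTrackers picked up the new limits through the JobTracker web UI, whose Cluster Summary lists the total map and reduce task capacity of the cluster. A minimal sketch, assuming the JobTracker runs on the local machine with its default port:

# The Cluster Summary table on this page includes Map Task Capacity and
# Reduce Task Capacity, which reflect the per-TaskTracker maximums.
curl http://localhost:50030/jobtracker.jsp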

There's more...

There are many similar configuration properties defined in Hadoop. You can see some of them in the following tables.

The configuration properties for conf/core-site.xml are listed in the following table:

Name                | Default value | Description
--------------------|---------------|------------------------------------------------------------
fs.inmemory.size.mb | 100           | This is the amount of memory, in MB, allocated to the in-memory filesystem used to merge map outputs at the reducers.
io.sort.factor      | 100           | This is the maximum number of streams merged while sorting files.
io.file.buffer.size | 131072        | This is the size of the read/write buffer used by sequence files.
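
Any of these can be set using the same name-value format shown earlier. For example, the following sketch sets the read/write buffer size in conf/core-site.xml to the value listed in the table:

<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
</property>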

The configuration properties for conf/mapred-site.xml are listed in the following table:

Name                          | Default value | Description
------------------------------|---------------|--------------------------------------------------
mapred.reduce.parallel.copies | 5             | This is the maximum number of parallel copies the reduce step executes to fetch outputs from the map tasks.
mapred.map.child.java.opts    | -Xmx200M      | This is for passing Java options into the map JVM.
mapred.reduce.child.java.opts | -Xmx200M      | This is for passing Java options into the reduce JVM.
io.sort.mb                    | 200           | This is the memory limit, in MB, to use while sorting data.
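
For example, to give each map task JVM a larger heap, a property such as the following could be added to conf/mapred-site.xml (the 512 MB figure is only illustrative):

<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx512M</value>
</property>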

The configuration properties for conf/hdfs-site.xml are listed in the following table:

Name                       | Default value | Description
---------------------------|---------------|--------------------------------------------------
dfs.block.size             | 67108864      | This is the HDFS block size in bytes (64 MB by default).
dfs.namenode.handler.count | 40            | This is the number of server threads that handle RPC calls in the NameNode.
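
For example, the following sketch in conf/hdfs-site.xml doubles the block size to 128 MB (134217728 bytes; an illustrative value that suits workloads with large files):

<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>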
