Tuning Hadoop configurations for cluster deployments

Getting ready

If the Hadoop cluster is already running, shut it down by executing the bin/stop-dfs.sh and bin/stop-mapred.sh commands from the HADOOP_HOME directory.

How to do it...

We can control Hadoop configurations through the following three configuration files:

  • conf/core-site.xml: This contains the configurations common to the whole Hadoop distribution
  • conf/hdfs-site.xml: This contains configurations for HDFS
  • conf/mapred-site.xml: This contains configurations for MapReduce

Each configuration file has name-value pairs expressed in XML format, and they define the workings of different aspects of Hadoop. The following code snippet shows an example of a property in a configuration file. Here, the <configuration> tag is the top-level XML container, and the <property> tags that define individual properties go as its child elements.

<configuration>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>20</value>
  </property>
  ...
</configuration>
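
Such properties can also be overridden for a single job from the command line, without editing the configuration files. The following is a minimal sketch, assuming the job uses ToolRunner (as the bundled examples do) and that the examples JAR name and the input/output paths match your setup:

bin/hadoop jar hadoop-examples-*.jar wordcount \
  -D mapred.reduce.parallel.copies=20 \
  input_dir output_dir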

The following instructions show how to change the directory to which Hadoop writes its logs and how to configure the maximum number of map and reduce tasks per TaskTracker:

  1. Create a directory to store the logfiles. For example, /root/hadoop_logs.
  2. Uncomment the line that includes HADOOP_LOG_DIR in HADOOP_HOME/conf/hadoop-env.sh and point it to the new directory (see the sketch after these steps).
  3. Add the following lines to the HADOOP_HOME/conf/mapred-site.xml file:
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>
  4. Restart the Hadoop cluster by running the bin/stop-mapred.sh and bin/start-mapred.sh commands from the HADOOP_HOME directory.
  5. You can verify the number of processes created using OS process-monitoring tools. On Linux, run the watch "ps -ef | grep hadoop" command. On Windows, use the Task Manager; on macOS, use the Activity Monitor.
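
The following is a rough sketch of steps 1 and 2, assuming the /root/hadoop_logs directory suggested earlier:

# Step 1: create the directory for the logfiles
mkdir -p /root/hadoop_logs

# Step 2: in HADOOP_HOME/conf/hadoop-env.sh, uncomment and edit this line
export HADOOP_LOG_DIR=/root/hadoop_logs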

How it works...

HADOOP_LOG_DIR redefines the location to which Hadoop writes its logs. The mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties define the maximum number of map and reduce tasks that can run within a single TaskTracker at a given moment.

These and other server-side parameters are defined in the HADOOP_HOME/conf/*-site.xml files. Hadoop reads these configuration files at startup, so changes take effect only after a restart.
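
You can also confirm that the TaskTrackers picked up the new limits through the JobTracker web UI, whose Cluster Summary lists the total map and reduce task capacity of the cluster. A minimal sketch, assuming the JobTracker runs on the local machine with its default port:

# The Cluster Summary table on this page includes Map Task Capacity and
# Reduce Task Capacity, which reflect the per-TaskTracker maximums.
curl http://localhost:50030/jobtracker.jsp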

There's more...

There are many similar configuration properties defined in Hadoop. You can see some of them in the following tables.

The configuration properties for conf/core-site.xml are listed in the following table:

Name                | Default value | Description
--------------------|---------------|------------------------------------------------------------
fs.inmemory.size.mb | 100           | This is the amount of memory, in MB, allocated to the in-memory filesystem used to merge map outputs at the reducers.
io.sort.factor      | 100           | This is the maximum number of streams merged while sorting files.
io.file.buffer.size | 131072        | This is the size of the read/write buffer used by sequence files.
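
Any of these can be set using the same name-value format shown earlier. For example, the following sketch sets the read/write buffer size in conf/core-site.xml to the value listed in the table:

<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
</property>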

The configuration properties for conf/mapred-site.xml are listed in the following table:

Name                          | Default value | Description
------------------------------|---------------|--------------------------------------------------
mapred.reduce.parallel.copies | 5             | This is the maximum number of parallel copies the reduce step executes to fetch outputs from the map tasks.
mapred.map.child.java.opts    | -Xmx200M      | This is for passing Java options into the map JVM.
mapred.reduce.child.java.opts | -Xmx200M      | This is for passing Java options into the reduce JVM.
io.sort.mb                    | 200           | This is the memory limit, in MB, to use while sorting data.
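
For example, to give each map task JVM a larger heap, a property such as the following could be added to conf/mapred-site.xml (the 512 MB figure is only illustrative):

<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx512M</value>
</property>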

The configuration properties for conf/hdfs-site.xml are listed in the following table:

Name                       | Default value | Description
---------------------------|---------------|--------------------------------------------------
dfs.block.size             | 67108864      | This is the HDFS block size in bytes (64 MB by default).
dfs.namenode.handler.count | 40            | This is the number of server threads that handle RPC calls in the NameNode.
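
For example, the following sketch in conf/hdfs-site.xml doubles the block size to 128 MB (134217728 bytes; an illustrative value that suits workloads with large files):

<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>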
