The Hadoop framework is very flexible and can be tuned using a number of configuration parameters. In this recipe, we will discuss the function and purpose of different configuration parameters you can set for a MapReduce job.
Ensure that you have a MapReduce job whose job class extends the Hadoop Configured class and implements the Hadoop Tool interface, such as any MapReduce application we have written so far in this book.
Follow these steps to customize MapReduce job parameters:

1. Ensure that your job class extends the Configured class and implements the Tool interface.
2. Use the ToolRunner.run() static method to run your MapReduce job, as shown in the following example:

        public static void main(String[] args) throws Exception {
            int exitCode = ToolRunner.run(new MyMapReduceJob(), args);
            System.exit(exitCode);
        }
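Under the hood, ToolRunner delegates parsing of the generic arguments to Hadoop's GenericOptionsParser before your job's run() method is called. The following self-contained sketch is not the real parser (the class and method names here are ours, for illustration only); it simply mimics how -D key=value pairs are picked out of the argument list and collected as job properties:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified illustration of how generic -D arguments are collected
// into key/value job properties. Not Hadoop's actual parser.
public class GenericArgsSketch {

    public static Map<String, String> parseProperties(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                // Form: -D key=value (flag and pair as separate tokens)
                addPair(props, args[++i]);
            } else if (args[i].startsWith("-D")) {
                // Form: -Dkey=value (flag and pair as one token)
                addPair(props, args[i].substring(2));
            }
            // Anything else is left for the application itself.
        }
        return props;
    }

    private static void addPair(Map<String, String> props, String pair) {
        String[] kv = pair.split("=", 2);
        props.put(kv[0], kv.length > 1 ? kv[1] : "");
    }

    public static void main(String[] args) {
        Map<String, String> p = parseProperties(
                new String[] {"-Dmapred.reduce.tasks=5", "input", "output"});
        System.out.println(p);
    }
}
```

With the arguments -Dmapred.reduce.tasks=5 input output, the sketch extracts the single property mapred.reduce.tasks=5, leaving input and output to be handled by the job itself; this mirrors the behavior you see when passing -D flags to a Tool-based job.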
Property name | Possible values | Description
---|---|---
mapred.reduce.tasks | Integers (0 - N) | Sets the number of reducers to launch.
mapred.child.java.opts | JVM key-value pairs | These parameters are given as arguments to every task JVM. For example, to set the maximum heap size for all tasks to 1 GB, you would set this property to '-Xmx1g'.
mapred.map.child.java.opts | JVM key-value pairs | These parameters are given as arguments to every map task JVM.
mapred.reduce.child.java.opts | JVM key-value pairs | These parameters are given as arguments to every reduce task JVM.
mapred.map.tasks.speculative.execution | Boolean (true/false) | Tells the Hadoop framework to speculatively launch the exact same map task on different nodes in the cluster if a task is progressing slowly compared to the other tasks in the job. This property was discussed in Chapter 1, Hadoop Distributed File System – Importing and Exporting Data.
mapred.reduce.tasks.speculative.execution | Boolean (true/false) | Tells the Hadoop framework to speculatively launch the exact same reduce task on different nodes in the cluster if a task is progressing slowly compared to the other tasks in the job.
mapred.job.reuse.jvm.num.tasks | Integer (-1, 1 – N) | The number of tasks each task JVM can run. A value of 1 indicates one JVM will be started per task; a value of -1 indicates a single JVM can run an unlimited number of tasks. Setting this parameter might help increase the performance of small jobs because JVMs will be re-used for multiple tasks (as opposed to starting a JVM for each and every task).
mapred.compress.map.output; mapred.output.compression.type; mapred.map.output.compression.codec | Boolean (true/false); String (NONE, RECORD, or BLOCK); String (name of compression codec class) | These three parameters are used to compress the output of map tasks.
mapred.output.compress; mapred.output.compression.type; mapred.output.compression.codec | Boolean (true/false); String (NONE, RECORD, or BLOCK); String (name of compression codec class) | These three parameters are used to compress the output of a MapReduce job.
$ cd /path/to/hadoop
$ bin/hadoop jar MyJar.jar com.packt.MyJobClass -Dmapred.reduce.tasks=5
When a job class extends the Hadoop Configured class and implements the Hadoop Tool interface, the ToolRunner class will automatically handle the following generic Hadoop arguments:
Argument/Flag | Purpose
---|---
-conf | Takes a path to a parameter configuration file.
-D property=value | Used to specify Hadoop key/value properties, which will be added to the job configuration.
-fs host:port | Used to specify the host and port of the NameNode.
-jt host:port | Used to specify the host and port of the JobTracker.
In the case of this recipe, the ToolRunner class will automatically place all of the parameters specified with the -D flag into the job configuration XML file.
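The parameter configuration file accepted by the -conf flag uses Hadoop's standard configuration XML format. A minimal sketch (the property value here is illustrative) that sets the same reducer count as the -D example above:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Each <property> element supplies one key/value pair
       that is merged into the job configuration. -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>5</value>
  </property>
</configuration>
```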