Reusing Java VMs to improve the performance

In its default configuration, Hadoop starts a new JVM for each map or reduce task. However, running multiple tasks from the same JVM can sometimes significantly speed up the execution. This recipe explains how to control this behavior.

How to do it...

  1. Run the WordCount sample by passing the following option as an argument:
    >bin/hadoop jar hadoop-examples-1.0.0.jar wordcount –Dmapred.job.reuse.jvm.num.tasks=-1 /data/input1 /data/output1
    
  2. Monitor the number of processes created by Hadoop (through ps –ef|grephadoop command in Unix or task manager in Windows). Hadoop starts only a single JVM per task slot and then reuses it for an unlimited number of tasks in the job.

    However, passing arguments through the –D option only works if the job implements the org.apache.hadoop.util.Tools interface. Otherwise, you should set the option through the JobConf.setNumTasksToExecutePerJvm(-1) method.

How it works...

By setting the job configuration property through mapred.job.reuse.jvm.num.tasks, we can control the number of tasks for the JVM run by Hadoop. When the value is set to -1, Hadoop runs the tasks in the same JVM.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset