Using the Hadoop Tool interface

Hadoop jobs are often executed from the command line, so each job has to read, parse, and process command-line arguments. To save each developer from rewriting this code, Hadoop provides the org.apache.hadoop.util.Tool interface.
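
The interface itself is minimal: it extends Configurable and declares a single run() method that receives whatever arguments remain after the generic Hadoop options have been consumed. Its contract, reproduced here for reference, looks like this:

    package org.apache.hadoop.util;

    import org.apache.hadoop.conf.Configurable;

    // The Tool contract: a configurable entry point whose run() method
    // receives only the application-specific arguments, after ToolRunner
    // has stripped out the generic Hadoop options.
    public interface Tool extends Configurable
    {
      int run(String[] args) throws Exception;
    }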

How to do it...

  1. In the source code for this chapter, the src/chapter3/WordcountWithTools.java class extends the WordCount example with support for the Tool interface:
    public class WordcountWithTools extends
       Configured implements Tool
    {
      public int run(String[] args) throws Exception
      {
        if (args.length < 2)
        {
          System.out.println("chapter3.WordcountWithTools <inDir> <outDir>");
          ToolRunner.printGenericCommandUsage(System.out);
          System.out.println("");
          return -1;
        }
        // getConf() returns the Configuration that ToolRunner has already
        // populated with any generic options from the command line.
        Job job = new Job(getConf(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Propagate the job's success or failure to the caller.
        return job.waitForCompletion(true) ? 0 : 1;
      }

      public static void main(String[] args)
        throws Exception
      {
        int res = ToolRunner.run(
           new Configuration(), new WordcountWithTools(), args);
        System.exit(res);
      }
    }
  2. Set up an input folder in HDFS with /data/input/README.txt if it doesn't already exist. This can be done through the following commands (do not pre-create /data/output; the job creates it and will fail if it already exists):
    bin/hadoop fs -mkdir /data/input
    bin/hadoop fs -put README.txt /data/input
    
  3. Run the WordCount job without any arguments, and it will print the usage message listing the available options:
    bin/hadoop jar hadoop-cookbook-chapter3.jar chapter3.WordcountWithTools
    chapter3.WordcountWithTools <inDir> <outDir>
    Generic options supported are
    -conf <configuration file>     specify an application configuration file
    -D <property=value>            use value for given property
    -fs <local|namenode:port>      specify a namenode
    -jt <local|jobtracker:port>    specify a job tracker
    -files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
    -libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
    -archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.
    
    The general command line syntax is
    bin/hadoop command [genericOptions] [commandOptions]
    
  4. Run the WordCount sample with the mapred.job.reuse.jvm.num.tasks option to limit the number of JVMs created by the job, as we learned in an earlier recipe (a sketch following these steps shows how such a property can be read back inside run()):
    bin/hadoop jar hadoop-cookbook-chapter3.jar \
    chapter3.WordcountWithTools \
    -D mapred.job.reuse.jvm.num.tasks=1 /data/input /data/output
    
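As promised in step 4, the following is a minimal sketch showing that properties passed through the generic -D option are already present in the Configuration by the time run() executes. The EchoProperty class name is just an illustration, not part of the cookbook sources:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Hypothetical Tool that echoes a property set via -D, demonstrating
    // that ToolRunner has parsed the generic options into the
    // Configuration before run() is invoked.
    public class EchoProperty extends Configured implements Tool
    {
      public int run(String[] args) throws Exception
      {
        // Falls back to 1, the mapred-default value, when -D did not
        // override the property.
        System.out.println("mapred.job.reuse.jvm.num.tasks = "
            + getConf().getInt("mapred.job.reuse.jvm.num.tasks", 1));
        return 0;
      }

      public static void main(String[] args) throws Exception
      {
        System.exit(ToolRunner.run(
            new Configuration(), new EchoProperty(), args));
      }
    }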

How it works...

When a job class implements the Tool interface and is launched through ToolRunner, Hadoop intercepts the command-line arguments, parses the generic options, and sets them in the job's Configuration object. Therefore, the job supports the standard generic options.
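
Internally, ToolRunner delegates this parsing to org.apache.hadoop.util.GenericOptionsParser, which can also be used directly when the full Tool machinery is not needed. A minimal sketch (the class name is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.GenericOptionsParser;

    // Illustrative stand-alone use of GenericOptionsParser: it loads the
    // generic options (-D, -conf, -fs, -jt, -files, -libjars, -archives)
    // into the Configuration and returns the leftover arguments.
    public class ParseGenericOptions
    {
      public static void main(String[] args) throws Exception
      {
        Configuration conf = new Configuration();
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        String[] remaining = parser.getRemainingArgs();
        System.out.println(remaining.length
            + " application-specific argument(s) remain after parsing");
      }
    }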
