Often, Hadoop jobs are executed from the command line. Therefore, each Hadoop job has to support reading, parsing, and processing command-line arguments. To avoid every developer having to rewrite this code, Hadoop provides the org.apache.hadoop.util.Tool interface.
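The Tool contract itself is small: it extends Configurable and declares a single run() method. The sketch below reproduces that shape using simplified stand-in types (the real Configuration and Configurable live in org.apache.hadoop.conf, and EchoTool is a hypothetical implementation added here for illustration), so the contract can be compiled and exercised without a Hadoop installation:

```java
// Simplified stand-ins for the Hadoop types, used here only so the Tool
// contract can be compiled without a Hadoop dependency. In the real API,
// Configuration and Configurable live in org.apache.hadoop.conf.
class Configuration {}

interface Configurable {
    void setConf(Configuration conf);
    Configuration getConf();
}

// The shape of org.apache.hadoop.util.Tool: it extends Configurable and
// declares one run() method, which receives the command-line arguments
// remaining after ToolRunner has consumed the generic options.
interface Tool extends Configurable {
    int run(String[] args) throws Exception;
}

// A minimal Tool implementation mirroring the argument check in the recipe:
// return -1 (usage error) unless both input and output paths are given.
class EchoTool implements Tool {
    private Configuration conf;
    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
    public int run(String[] args) {
        return args.length < 2 ? -1 : 0;
    }
}

public class ToolShape {
    public static void main(String[] args) throws Exception {
        Tool tool = new EchoTool();
        tool.setConf(new Configuration());
        System.out.println(tool.run(new String[] {"/data/input", "/data/output"}));  // prints 0
    }
}
```

ToolRunner.run() calls setConf() on the Tool before invoking run(), which is why the WordcountWithTools example below can simply call getConf() to obtain a Configuration that already reflects the generic options.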
The src/chapter3/WordcountWithTools.java class extends the WordCount example with support for the Tool interface:

public class WordcountWithTools extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("chapter3.WordCountWithTools WordCount <inDir> <outDir>");
            ToolRunner.printGenericCommandUsage(System.out);
            System.out.println("");
            return -1;
        }

        Job job = new Job(getConf(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(
            new Configuration(), new WordcountWithTools(), args);
        System.exit(res);
    }
}
Set up an input folder in HDFS with /data/input/README.txt if it doesn't already exist. It can be done through the following commands:

bin/hadoop fs -mkdir /data/output
bin/hadoop fs -mkdir /data/input
bin/hadoop fs -put README.txt /data/input
bin/hadoop jar hadoop-cookbook-chapter3.jar chapter3.WordcountWithTools

chapter3.WordCountWithTools WordCount <inDir> <outDir>
Generic options supported are
-conf <configuration file>                    specify an application configuration file
-D <property=value>                           use value for given property
-fs <local|namenode:port>                     specify a namenode
-jt <local|jobtracker:port>                   specify a job tracker
-files <comma separated list of files>        specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>       specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>  specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
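Behind the scenes, ToolRunner hands the argument array to a generic-options parser, which strips out options such as -D and stores them in the job's Configuration before run() sees the remaining arguments. The following plain-Java sketch mimics only the -D handling (MiniOptionsParser is a hypothetical, much-simplified stand-in for Hadoop's real parser, which also handles -conf, -fs, -jt, -files, -libjars, and -archives), to make the separation between generic options and command options easy to see:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A simplified sketch of how a generic-options parser might split
// "-D property=value" pairs from the remaining command-line arguments.
// This is an illustration only, not Hadoop's actual implementation.
public class MiniOptionsParser {
    final Map<String, String> properties = new LinkedHashMap<>();
    final List<String> remainingArgs = new ArrayList<>();

    MiniOptionsParser(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                // "-D property=value" becomes a configuration property
                String[] pair = args[++i].split("=", 2);
                properties.put(pair[0], pair.length > 1 ? pair[1] : "");
            } else {
                // everything else is passed through to run()
                remainingArgs.add(args[i]);
            }
        }
    }

    public static void main(String[] args) {
        MiniOptionsParser p = new MiniOptionsParser(new String[] {
            "-D", "mapred.job.reuse.jvm.num.tasks=1",
            "/data/input", "/data/output"});
        System.out.println(p.properties);     // {mapred.job.reuse.jvm.num.tasks=1}
        System.out.println(p.remainingArgs);  // [/data/input, /data/output]
    }
}
```

This is why the WordcountWithTools run() method can read args[0] and args[1] directly as the input and output directories even when generic options precede them on the command line.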
We can use the mapred.job.reuse.jvm.num.tasks option to limit the number of JVMs created by the job, as we learned in an earlier recipe:

bin/hadoop jar hadoop-cookbook-chapter3.jar chapter3.WordcountWithTools -D mapred.job.reuse.jvm.num.tasks=1 /data/input /data/output