A Hadoop job may consist of many map tasks and reduce tasks, which makes debugging it a complicated process. It is good practice to first test a Hadoop job using unit tests, running it against a subset of the data. However, sometimes it is necessary to debug a Hadoop job in a distributed mode. To support such cases, Hadoop provides a mechanism called debug scripts. This recipe explains how to use debug scripts.
Start the Hadoop cluster. Refer to the Setting Hadoop in a distributed cluster environment recipe from Chapter 1, Getting Hadoop Up and Running in a Cluster.
A debug script is a shell script that Hadoop executes whenever a task encounters an error. The script has access to the $script, $stdout, $stderr, $syslog, and $jobconf properties as environment variables populated by Hadoop. You can find a sample script in resources/chapter3/debugscript. We can use debug scripts to copy all the logfiles to a single location, e-mail them to a single e-mail account, or perform some analysis.
LOG_FILE=HADOOP_HOME/error.log
echo "Run the script" >> $LOG_FILE
echo $script >> $LOG_FILE
echo $stdout >> $LOG_FILE
echo $stderr >> $LOG_FILE
echo $syslog >> $LOG_FILE
echo $jobconf >> $LOG_FILE
Replace HADOOP_HOME in the script to point to your HADOOP_HOME directory. The source file src/chapter3/WordcountWithDebugScript.java extends the WordCount sample to use debug scripts. The following listing shows the code.
The following code uploads the job scripts to HDFS and configures the job to use these scripts. Also, it sets up the distributed cache.
private static final String scriptFileLocation =
    "resources/chapter3/debugscript";

public static void setupFailedTaskScript(JobConf conf)
    throws Exception {
  // create a directory on HDFS where we'll upload the fail scripts
  FileSystem fs = FileSystem.get(conf);
  Path debugDir = new Path("/debug");

  // who knows what's already in this directory; let's just clear it
  if (fs.exists(debugDir)) {
    fs.delete(debugDir, true);
  }

  // ...and then make sure it exists again
  fs.mkdirs(debugDir);

  // upload the local scripts into HDFS
  fs.copyFromLocalFile(new Path(scriptFileLocation),
      new Path("/debug/fail-script"));

  conf.setMapDebugScript("./fail-script");
  conf.setReduceDebugScript("./fail-script");
  DistributedCache.createSymlink(conf);

  URI fsUri = fs.getUri();
  String mapUriStr = fsUri.toString()
      + "/debug/fail-script#fail-script";
  URI mapUri = new URI(mapUriStr);
  DistributedCache.addCacheFile(mapUri, conf);
}
The following code runs the Hadoop job as we described in Chapter 1, Getting Hadoop Up and Running in a Cluster. The only difference is that here, we have called the preceding method to configure failed task scripts.
public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf();
  setupFailedTaskScript(conf);

  Job job = new Job(conf, "word count");
  job.setJarByClass(FaultyWordCount.class);
  job.setMapperClass(FaultyWordCount.TokenizerMapper.class);
  job.setReducerClass(FaultyWordCount.IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.waitForCompletion(true);
}
Copy build/lib/hadoop-cookbook-chapter3.jar to HADOOP_HOME and run the following command:

>bin/hadoop jar hadoop-cookbook-chapter3.jar chapter3.WordcountWithDebugScript /data/input /data/output1
The job will run the FaultyWordCount task, which will always fail. Hadoop will then execute the debug script, and you can find the results of the debug script in HADOOP_HOME.
We configured the debug script through conf.setMapDebugScript("./fail-script"). However, the input value is not a file location, but the command that needs to be run on the machine when an error occurs. If you have a specific file that is already present on all machines and that you want to run when an error occurs, you can simply pass that path to the conf.setMapDebugScript() method.
However, Hadoop runs the mappers on multiple nodes, and often on a machine different from the machine running the job's client. Therefore, for the debug script to work, we need to get the script to all the nodes that run the mappers.
We do this using the distributed cache. As described in the Using Distributed cache to distribute resources recipe in Chapter 4, Developing Complex Hadoop MapReduce Applications, users can add files that are in the HDFS filesystem to the distributed cache, and Hadoop automatically copies those files to each node that runs map tasks. However, the distributed cache copies the files to the mapred.local.dir of the MapReduce setup, while the job runs from a different location. Therefore, we link the cache directory to the working directory by creating a symlink with the DistributedCache.createSymlink(conf) command.
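The effect of this symlinking can be sketched in plain shell. The paths below are purely illustrative stand-ins (the real mapred.local.dir and task working directory are chosen by Hadoop at runtime); the sketch only mimics what createSymlink achieves for the #fail-script URI fragment:

```shell
# Illustrative only: mimic how a cached file becomes reachable from the
# task's working directory. Both paths here are made-up stand-ins.
mkdir -p /tmp/mapred-local /tmp/task-workdir
printf '#!/bin/sh\necho debug-script-ran\n' > /tmp/mapred-local/fail-script
chmod +x /tmp/mapred-local/fail-script
# Hadoop creates a symlink named after the URI fragment (#fail-script)
ln -sf /tmp/mapred-local/fail-script /tmp/task-workdir/fail-script
(cd /tmp/task-workdir && ./fail-script)   # prints: debug-script-ran
```

Because the symlink sits in the task's working directory, the relative command ./fail-script resolves correctly on every node.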
Hadoop then copies the script files to each mapper node and symlinks them to the working directory of the job. When an error occurs, Hadoop runs the ./fail-script command, which executes the script file that was copied to the node through the distributed cache. The debug script then carries out whatever tasks you have programmed for error situations.
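As mentioned earlier, a debug script can also consolidate a failed task's logs in one place. The following is an illustrative sketch, not the book's sample: COLLECT_DIR and the task-$$ directory naming are assumptions made for this example, while $stdout, $stderr, $syslog, $script, and $jobconf are the environment variables Hadoop populates:

```shell
#!/bin/bash
# Sketch only: gather this failed task's logs under one directory.
# COLLECT_DIR and the task-$$ naming are made up for this example.
COLLECT_DIR=${COLLECT_DIR:-/tmp/failed-task-logs}
TASK_DIR="$COLLECT_DIR/task-$$"
mkdir -p "$TASK_DIR"
# $stdout, $stderr, and $syslog hold file paths populated by Hadoop
for f in "$stdout" "$stderr" "$syslog"; do
  if [ -n "$f" ] && [ -f "$f" ]; then
    cp "$f" "$TASK_DIR/"
  fi
done
# record which script ran and where the job configuration lives
echo "script=$script jobconf=$jobconf" > "$TASK_DIR/info.txt"
```

From there, the collected directory could be copied to a shared host or e-mailed, as suggested at the start of this recipe.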