Debug scripts – analyzing task failures

A Hadoop job may consist of many map tasks and reduce tasks, which makes debugging a job a complicated process. It is therefore good practice to first test a Hadoop job using unit tests, running it against a subset of the data.

However, sometimes it is necessary to debug a Hadoop job in distributed mode. To support such cases, Hadoop provides a mechanism called debug scripts. This recipe explains how to use debug scripts.

Getting ready

Start the Hadoop cluster. Refer to the Setting Hadoop in a distributed cluster environment recipe from Chapter 1, Getting Hadoop Up and Running in a Cluster.

How to do it...

A debug script is a shell script that Hadoop executes whenever a task encounters an error. The script has access to the $script, $stdout, $stderr, $syslog, and $jobconf values, which Hadoop populates before invoking it. You can find a sample script in resources/chapter3/debugscript; it is shown in the following listing. We can use debug scripts to copy all the logfiles to a single location, e-mail them to a single e-mail account, or perform some analysis.

# Appends the error information passed by Hadoop to a single log file.
# Edit HADOOP_HOME to point to your Hadoop installation.
LOG_FILE=HADOOP_HOME/error.log

echo "Run the script" >> $LOG_FILE
echo $script >> $LOG_FILE
echo $stdout >> $LOG_FILE
echo $stderr >> $LOG_FILE
echo $syslog >> $LOG_FILE
echo $jobconf >> $LOG_FILE
  1. Write your own debug script, using the above example as a starting point. In the example, edit HADOOP_HOME to point to your HADOOP_HOME directory.

    src/chapter3/WordcountWithDebugScript.java extends the WordCount sample to use debug scripts. The following listing shows the code.

    The following code uploads the job scripts to HDFS and configures the job to use these scripts. Also, it sets up the distributed cache.

    private static final String scriptFileLocation =
        "resources/chapter3/debugscript";

    public static void setupFailedTaskScript(JobConf conf)
        throws Exception {

      // Create a directory on HDFS where we'll upload the fail scripts
      FileSystem fs = FileSystem.get(conf);
      Path debugDir = new Path("/debug");

      // Who knows what's already in this directory; let's just clear it
      if (fs.exists(debugDir)) {
        fs.delete(debugDir, true);
      }

      // ...and then make sure it exists again
      fs.mkdirs(debugDir);

      // Upload the local script into HDFS
      fs.copyFromLocalFile(new Path(scriptFileLocation),
          new Path("/debug/fail-script"));

      // Run the cached script whenever a map or reduce task fails
      conf.setMapDebugScript("./fail-script");
      conf.setReduceDebugScript("./fail-script");

      // Symlink the cached file into the task's working directory
      DistributedCache.createSymlink(conf);

      URI fsUri = fs.getUri();
      String mapUriStr = fsUri.toString() + "/debug/fail-script#fail-script";
      URI mapUri = new URI(mapUriStr);
      DistributedCache.addCacheFile(mapUri, conf);
    }

    The following code runs the Hadoop job as we described in Chapter 1, Getting Hadoop Up and Running in a Cluster. The only difference is that here we call the preceding method to configure the failed-task script.

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf();
      setupFailedTaskScript(conf);
      Job job = new Job(conf, "word count");
      job.setJarByClass(FaultyWordCount.class);
      job.setMapperClass(FaultyWordCount.TokenizerMapper.class);
      job.setReducerClass(FaultyWordCount.IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.waitForCompletion(true);
    }
  2. Compile the code base by running Ant from the home directory of the source code. Copy build/lib/hadoop-cookbook-chapter3.jar to HADOOP_HOME.
  3. Then run the job with the following command:
    >bin/hadoop jar hadoop-cookbook-chapter3.jar chapter3.WordcountWithDebugScript /data/input /data/output1
    

    The job runs the FaultyWordCount task, which always fails. Hadoop then executes the debug script, and you can find the debug script's output in the error.log file under HADOOP_HOME.
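    The FaultyWordCount class is included with the cookbook's source code. As an illustration only, a mapper that always fails can simply throw an exception from its map() method, which fails the task attempt and triggers the configured debug script. The following is a hypothetical sketch, not the actual FaultyWordCount listing:

    // Hypothetical sketch (not the actual FaultyWordCount code that ships
    // with the cookbook): a WordCount-style job whose mapper always fails,
    // so every task attempt triggers the configured debug script.
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class FaultyWordCount {
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          // Failing the task attempt causes Hadoop to run ./fail-script
          // on the node that executed this attempt.
          throw new IOException("Intentional failure to trigger the debug script");
        }
      }
      // The actual sample also defines an IntSumReducer; omitted here.
    }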

How it works...

We configured the debug script through conf.setMapDebugScript("./fail-script"). However, the value passed in is not a file location but the command that should be run on the machine when an error occurs. If you have a specific script that is already present on all the machines and that you want to run when an error occurs, you can simply pass its path to the conf.setMapDebugScript() method.
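For example, if the same script is already installed at an identical path on every node, you can point the debug command directly at that path and skip the distributed cache step. The path below is hypothetical:

    // Hypothetical path; the script must already exist on every node.
    conf.setMapDebugScript("/opt/hadoop-debug/fail-script");
    conf.setReduceDebugScript("/opt/hadoop-debug/fail-script");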

However, Hadoop runs the mappers on multiple nodes, and often on a machine different from the one running the job's client. Therefore, for the debug script to work, we need to get the script to all the nodes that run the mappers.

We do this using the distributed cache. As described in the Using Distributed cache to distribute resources recipe in Chapter 4, Developing Complex Hadoop MapReduce Applications, users can add files that are in the HDFS filesystem to the distributed cache, and Hadoop automatically copies those files to each node that runs map tasks. However, the distributed cache copies the files to the mapred.local.dir of the MapReduce setup, while the task runs from a different working directory. Therefore, we link the cached file into the working directory by creating a symlink using the DistributedCache.createSymlink(conf) command.
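The following minimal sketch pulls these pieces together; the NameNode URI is illustrative. The fragment after # becomes the name of the symlink created in the task's working directory, which is why the debug command ./fail-script resolves to the cached file:

    DistributedCache.createSymlink(conf);
    // "#fail-script" names the symlink in the task's working directory.
    // The NameNode URI below is illustrative only.
    DistributedCache.addCacheFile(
        new URI("hdfs://namenode:9000/debug/fail-script#fail-script"), conf);
    conf.setMapDebugScript("./fail-script");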

Hadoop then copies the script file to each mapper node and symlinks it into the working directory of the job. When an error occurs, Hadoop runs the ./fail-script command, which executes the script file that was copied to the node through the distributed cache. The debug script then carries out whatever actions you have programmed for when an error occurs.
