When processing a large amount of data, a small number of map tasks may fail, yet the final results can still be meaningful without the output of those failed tasks. This can happen for a number of reasons, such as:

- A bug in the map task code
- A bug in a third-party library used by the task
- Erroneous or corrupt records in a small portion of the input data
In the first case, it is best to debug the job, find the cause of the failures, and fix it. However, in the second and third cases, such errors may be unavoidable. It is possible to tell Hadoop that the job should succeed even if a small percentage of map tasks fail.
This can be done in two ways: by skipping bad records, or by limiting the percentage of task failures that are acceptable.
This recipe explains how to configure this behavior.
Start the Hadoop setup. Refer to the Setting Hadoop in a distributed cluster environment recipe from Chapter 1, Getting Hadoop Up and Running in a Cluster.
Run the WordCount sample by passing the following options:
>bin/hadoop jar hadoop-examples-1.0.0.jar wordcount -Dmapred.skip.map.max.skip.records=1 -Dmapred.skip.reduce.max.skip.groups=1 /data/input1 /data/output1
However, this only works if the job implements the org.apache.hadoop.util.Tool interface. Otherwise, you should set these properties through JobConf.set(name, value).
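As a sketch, a driver built on the Tool interface (so that command-line -D options are honored) might look like the following; the class name SkippingWordCount is illustrative, and the mapper/reducer wiring is elided:

```java
// Sketch of a driver using the old org.apache.hadoop.mapred API.
// Implementing Tool lets ToolRunner parse -D options into the configuration.
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SkippingWordCount extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D options parsed by ToolRunner
        JobConf conf = new JobConf(getConf(), SkippingWordCount.class);
        // Equivalent to passing the options on the command line:
        conf.set("mapred.skip.map.max.skip.records", "1");
        conf.set("mapred.skip.reduce.max.skip.groups", "1");
        // ... set the mapper, reducer, and input/output paths here ...
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new SkippingWordCount(), args));
    }
}
```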
Hadoop does not support skipping bad records by default. We can turn on bad record skipping by setting the following parameters to positive values:
- mapred.skip.map.max.skip.records: This sets the number of records to skip near a bad record, including the bad record
- mapred.skip.reduce.max.skip.groups: This sets the number of acceptable skip groups surrounding a bad group

You can also limit the percentage of failures in map or reduce tasks by setting the JobConf.setMaxMapTaskFailuresPercent(percent) and JobConf.setMaxReduceTaskFailuresPercent(percent) options.
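For instance, a configuration sketch using the old JobConf API; the class name WordCountJob and the percentage values are illustrative:

```java
// Sketch: tolerate a fraction of permanently failed tasks
// without failing the whole job.
JobConf conf = new JobConf(WordCountJob.class); // placeholder driver class
// Let the job succeed even if up to 20% of map tasks fail
conf.setMaxMapTaskFailuresPercent(20);
// Likewise, tolerate up to 10% of failed reduce tasks
conf.setMaxReduceTaskFailuresPercent(10);
```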
Also, Hadoop retries a task if it fails. You can control the maximum number of attempts per map task through JobConf.setMaxMapAttempts(5).
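A minimal sketch, again using the old JobConf API (the value 5 is illustrative, and WordCountJob is a placeholder class name):

```java
// Sketch: retry each failed map task up to 5 times before giving up on it
JobConf conf = new JobConf(WordCountJob.class); // placeholder driver class
conf.setMaxMapAttempts(5);
// The reduce-side counterpart also exists:
conf.setMaxReduceAttempts(5);
```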