Choosing a suitable Hadoop InputFormat for your input data format

Hadoop supports processing of many different formats and types of data through InputFormat. The InputFormat of a Hadoop MapReduce computation generates the key-value pair inputs for the mappers by parsing the input data. InputFormat also performs the splitting of the input data into logical partitions, essentially determining the number of Map tasks of a MapReduce computation and indirectly deciding the execution location of the Map tasks. Hadoop generates a map task for each logical data partition and invokes the respective mappers with the key-value pairs of the logical splits as the input.

How to do it...

The following steps show you how to use the FileInputFormat-based KeyValueTextInputFormat as the InputFormat for a Hadoop MapReduce computation.

  1. In this example, we specify KeyValueTextInputFormat as the InputFormat for a Hadoop MapReduce computation, using the Job object as follows:
    Configuration conf = new Configuration();
    Job job = new Job(conf, "log-analysis");
    ……
    job.setInputFormatClass(KeyValueTextInputFormat.class);
  2. Set the input paths to the job.
    FileInputFormat.setInputPaths(job, new Path(inputPath));
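
Putting the two steps together, a complete driver for this recipe could look like the following sketch. The LogFieldMapper, LogFieldReducer, and LogAnalysisDriver class names and the outputPath variable are placeholders for your own classes and locations, and the usual org.apache.hadoop.io, org.apache.hadoop.mapreduce, and org.apache.hadoop.mapreduce.lib.* imports are assumed.

Configuration conf = new Configuration();
Job job = new Job(conf, "log-analysis");
job.setJarByClass(LogAnalysisDriver.class); // hypothetical driver class

// KeyValueTextInputFormat produces one (Text, Text) pair per input line
job.setInputFormatClass(KeyValueTextInputFormat.class);
FileInputFormat.setInputPaths(job, new Path(inputPath));

// Placeholder mapper and reducer; the mapper must declare Text for both
// its input key and input value types
job.setMapperClass(LogFieldMapper.class);
job.setReducerClass(LogFieldReducer.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileOutputFormat.setOutputPath(job, new Path(outputPath));

System.exit(job.waitForCompletion(true) ? 0 : 1);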

How it works...

KeyValueTextInputFormat is an input format for plain text files, which generates a key-value record for each line of the input text files. Each line of the input data is broken into a key (text) and value (text) pair using a delimiter character. The default delimiter is the tab character. If a line does not contain the delimiter, the whole line will be treated as the key and the value will be empty. We can specify a custom delimiter by setting a property in the job's configuration object as follows, where we use the comma character as the delimiter between the key and value.

conf.set("key.value.separator.in.input.line", ",");

KeyValueTextInputFormat is based on FileInputFormat, which is the base class for the file-based InputFormats. Hence, we specify the input path to the MapReduce computation using the setInputPaths() method of the FileInputFormat class. We have to perform this step when using any InputFormat that is based on the FileInputFormat class.

FileInputFormat.setInputPaths(job, new Path(inputPath));

We can provide multiple HDFS input paths to a MapReduce computation by supplying a comma-separated list of paths. We can also use the addInputPath() static method of the FileInputFormat class to add additional input paths to a computation. The signatures of these methods are as follows:

public static void setInputPaths(Job job, Path... inputPaths)
public static void addInputPath(Job job, Path path)
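
For example, the following snippet (using hypothetical HDFS locations) sets two input directories through a comma-separated list and then appends a third path:

// A comma-separated list of input paths (hypothetical HDFS locations)
FileInputFormat.setInputPaths(job, "/data/input1,/data/input2");
// Append one more input path to the same computation
FileInputFormat.addInputPath(job, new Path("/data/input3"));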

There's more...

Make sure that your mapper's input data types match the data types generated by the InputFormat used by the MapReduce computation.
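
For example, a mapper used with KeyValueTextInputFormat must declare Text as both its input key and input value types. The following is a minimal sketch; the WebLogMapper name and its body are illustrative, and the usual org.apache.hadoop.io and org.apache.hadoop.mapreduce imports are assumed.

// Input types (Text, Text) must match the records produced by KeyValueTextInputFormat
public static class WebLogMapper extends Mapper<Text, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);

    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // key holds the text before the delimiter, value the text after it
        context.write(key, one);
    }
}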

The following are some of the InputFormat implementations that Hadoop provides to support several common data formats:

  • TextInputFormat: This is used for plain text files. TextInputFormat generates a key-value record for each line of the input text files. For each line, the key (LongWritable) is the byte offset of the line in the file and the value (Text) is the line of text. TextInputFormat is the default InputFormat of Hadoop.
  • NLineInputFormat: This is used for plain text files. NLineInputFormat splits the input files into logical splits with a fixed number of lines. We can use NLineInputFormat when we want our map tasks to receive a fixed number of lines as the input. Key (LongWritable) and value (Text) records are generated for each line in the split, similar to TextInputFormat. By default, NLineInputFormat creates one logical split (and hence one Map task) per line. The number of lines per split (or key-value records per Map task) can be specified as follows:
    NLineInputFormat.setNumLinesPerSplit(job, 50);
  • SequenceFileInputFormat: This is used for Hadoop Sequence file input data. Hadoop Sequence files store the data as binary key-value pairs and support data compression. SequenceFileInputFormat is useful when using the result of a previous MapReduce computation in Sequence file format as the input of another MapReduce computation.
    • SequenceFileAsBinaryInputFormat: This is a subclass of SequenceFileInputFormat that presents the key (BytesWritable) and the value (BytesWritable) pairs in raw binary format.
    • SequenceFileAsTextInputFormat: This is a subclass of SequenceFileInputFormat that presents the key (Text) and the value (Text) pairs as strings.
  • DBInputFormat: This supports reading the input data for a MapReduce computation from a SQL table. DBInputFormat uses the record number as the key (LongWritable) and the query result record as the value (DBWritable).
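
As a rough sketch of how DBInputFormat is configured (the JDBC driver, connection URL, credentials, the weblogs table, its columns, and the WebLogRecord DBWritable implementation below are all hypothetical), the job can be pointed at the database as follows:

// Select DBInputFormat and configure the database connection
job.setInputFormatClass(DBInputFormat.class);
DBConfiguration.configureDB(job.getConfiguration(),
    "com.mysql.jdbc.Driver", "jdbc:mysql://dbhost/logdb", "dbuser", "dbpass");
// WebLogRecord is a hypothetical DBWritable implementation mapping the selected columns;
// "weblogs" is a hypothetical table, followed by the condition, ordering, and field names
DBInputFormat.setInput(job, WebLogRecord.class, "weblogs",
    null, "request_time", "host", "request_time", "url");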

Using multiple input data types and multiple mapper implementations in a single MapReduce application

We can use the MultipleInputs feature of Hadoop to run a MapReduce job with multiple input paths, while specifying a different InputFormat and (optionally) a mapper for each path. Hadoop routes the outputs of the different mappers to the instances of the single reducer implementation of the MapReduce computation. Multiple inputs with different InputFormat implementations are useful when we want to process multiple data sets that have the same meaning but are stored in different input formats (for example, a comma-delimited data set and a tab-delimited data set).

We can use the following addInputPath static method of the MultipleInputs class to add the input paths and their respective input formats to a MapReduce computation:

public static void addInputPath(Job job, Path path, Class<? extends InputFormat> inputFormatClass)

The following is an example usage of the preceding method.

MultipleInputs.addInputPath(job, path1, CSVInputFormat.class);
MultipleInputs.addInputPath(job, path2, TabInputFormat.class);

The multiple inputs feature, with both different mappers and different InputFormat implementations per input path, is useful when performing a reduce-side join of two or more data sets. The following overload of the addInputPath method lets us specify a mapper together with the input format:

public static void addInputPath(Job job, Path path,
     Class<? extends InputFormat> inputFormatClass,
     Class<? extends Mapper> mapperClass)

The following is an example of using multiple inputs with different input formats and different mapper implementations.

MultipleInputs.addInputPath(job, accessLogPath, TextInputFormat.class, AccessLogMapper.class);
MultipleInputs.addInputPath(job, userDataPath, TextInputFormat.class, UserDataMapper.class);
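
A sketch of how these calls fit into a reduce-side join driver follows. The UserJoinReducer and JoinDriver class names and the outputPath variable are hypothetical; both mappers are assumed to emit the shared join key (for example, a user ID) as their output key so that matching records meet in the same reduce call.

Configuration conf = new Configuration();
Job job = new Job(conf, "reduce-side-join");
job.setJarByClass(JoinDriver.class); // hypothetical driver class

// Each input path gets its own InputFormat and Mapper implementation
MultipleInputs.addInputPath(job, accessLogPath, TextInputFormat.class, AccessLogMapper.class);
MultipleInputs.addInputPath(job, userDataPath, TextInputFormat.class, UserDataMapper.class);

// A single reducer implementation receives the outputs of both mappers and joins them
job.setReducerClass(UserJoinReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileOutputFormat.setOutputPath(job, new Path(outputPath));

System.exit(job.waitForCompletion(true) ? 0 : 1);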

See also

  • Adding support for new input data formats – implementing a custom InputFormat