Hadoop supports the processing of many different formats and types of data through InputFormat. The InputFormat of a Hadoop MapReduce computation generates the key-value pair inputs for the mappers by parsing the input data. InputFormat also splits the input data into logical partitions, essentially determining the number of Map tasks of a MapReduce computation and indirectly deciding the execution location of the Map tasks. Hadoop generates a Map task for each logical data partition and invokes the respective mappers with the key-value pairs of the logical splits as the input.
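To make the relationship between logical splits and Map tasks concrete, the following is a minimal plain-Java sketch (not the Hadoop API; class and method names are illustrative) of the split-size arithmetic that FileInputFormat-style input formats use. It simplifies the real logic, which also allows a slightly oversized final split:

```java
// Illustrative sketch (not the Hadoop API): how FileInputFormat-style
// logical splitting determines the number of Map tasks for one file.
public class SplitCalculator {

    // Simplified form of the documented rule:
    // splitSize = max(minSize, min(maxSize, blockSize))
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // One Map task is generated per logical split.
    public static long numberOfSplits(long fileSize, long splitSize) {
        if (fileSize == 0) {
            return 0;
        }
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // assuming a 128 MB HDFS block size
        long splitSize = computeSplitSize(blockSize, 1, Long.MAX_VALUE);
        long fileSize  = 300L * 1024 * 1024;  // a hypothetical 300 MB input file
        System.out.println(numberOfSplits(fileSize, splitSize)); // 3 splits -> 3 Map tasks
    }
}
```

With the default settings, the split size equals the HDFS block size, which is why Map tasks can usually be scheduled on the nodes holding the corresponding data blocks.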
The following steps show you how to use the FileInputFormat-based KeyValueTextInputFormat as the InputFormat for a Hadoop MapReduce computation.
Specify KeyValueTextInputFormat as the InputFormat for a Hadoop MapReduce computation using the Job object as follows:
Configuration conf = new Configuration();
Job job = new Job(conf, "log-analysis");
……
job.setInputFormatClass(KeyValueTextInputFormat.class);
FileInputFormat.setInputPaths(job, new Path(inputPath));
KeyValueTextInputFormat is an input format for plain text files, which generates a key-value record for each line of the input text files. Each line of the input data is broken into a key (Text) and value (Text) pair using a delimiter character. The default delimiter is the tab character. If a line does not contain the delimiter, the whole line will be treated as the key and the value will be empty. We can specify a custom delimiter by setting a property in the job's configuration object as follows, where we use the comma character as the delimiter between the key and value.
conf.set("key.value.separator.in.input.line", ",");
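The following plain-Java sketch (not the Hadoop class itself; the class name is illustrative) shows the line-splitting behavior described above: the line is split at the first occurrence of the delimiter, and a line without the delimiter yields the whole line as the key with an empty value.

```java
// Illustrative sketch of how a KeyValueTextInputFormat-style record
// reader splits one input line into a (key, value) pair at the first
// occurrence of the delimiter character.
public class KeyValueLineSplitter {

    // Returns {key, value}. If the delimiter is absent, the whole line
    // becomes the key and the value is empty.
    public static String[] split(String line, char delimiter) {
        int pos = line.indexOf(delimiter);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        // Default delimiter is the tab character.
        String[] kv = split("2024-01-01\t200 OK", '\t');
        System.out.println(kv[0] + " -> " + kv[1]);

        // With the comma delimiter configured as above:
        kv = split("2024-01-01,200 OK", ',');
        System.out.println(kv[0] + " -> " + kv[1]);
    }
}
```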
KeyValueTextInputFormat is based on FileInputFormat, which is the base class for the file-based InputFormats. Hence, we specify the input path to the MapReduce computation using the setInputPaths() method of the FileInputFormat class. We have to perform this step when using any InputFormat that is based on the FileInputFormat class.
FileInputFormat.setInputPaths(job, new Path(inputPath));
We can provide multiple HDFS input paths to a MapReduce computation by providing a comma-separated list of paths. You can also use the addInputPath() static method of the FileInputFormat class to add additional input paths to a computation.
public static void setInputPaths(Job job, Path... inputPaths)
public static void addInputPath(Job job, Path path)
Make sure that your mapper input data types match the data types generated by the InputFormat used by the MapReduce computation.
The following are some of the InputFormat implementations that Hadoop provides to support several common data formats.
TextInputFormat: This is used for plain text files. TextInputFormat generates a key-value record for each line of the input text files. For each line, the key (LongWritable) is the byte offset of the line in the file and the value (Text) is the line of text. TextInputFormat is the default InputFormat of Hadoop.
NLineInputFormat: This is used for plain text files. NLineInputFormat splits the input files into logical splits with a fixed number of lines. We can use NLineInputFormat when we want our Map tasks to receive a fixed number of lines as the input. The key (LongWritable) and value (Text) records are generated for each line in the split, similar to TextInputFormat. By default, NLineInputFormat creates a logical split (and a Map task) per line. The number of lines per split (or key-value records per Map task) can be specified as follows:
NLineInputFormat.setNumLinesPerSplit(job, 50);
SequenceFileInputFormat: This is used for Hadoop Sequence file input data. Hadoop Sequence files store the data as binary key-value pairs and support data compression. SequenceFileInputFormat is useful when using the result of a previous MapReduce computation in Sequence file format as the input of another MapReduce computation.
SequenceFileAsBinaryInputFormat: This is a subclass of SequenceFileInputFormat that presents the key (BytesWritable) and the value (BytesWritable) pairs in raw binary format.
SequenceFileAsTextInputFormat: This is a subclass of SequenceFileInputFormat that presents the key (Text) and the value (Text) pairs as strings.
DBInputFormat: This supports reading the input data for a MapReduce computation from a SQL table. DBInputFormat uses the record number as the key (LongWritable) and the query result record as the value (DBWritable).
We can use the MultipleInputs feature of Hadoop to run a MapReduce job with multiple input paths, while specifying a different InputFormat and (optionally) a mapper for each path. Hadoop routes the outputs of the different mappers to the instances of the single reducer implementation of the MapReduce computation. Multiple inputs with different InputFormat implementations are useful when we want to process multiple data sets that have the same meaning but are stored in different input formats (for example, a comma-delimited data set and a tab-delimited data set).
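The following plain-Java sketch (no Hadoop dependency; all names are hypothetical) illustrates why this works: format-specific "mappers" parse a comma-delimited record and a tab-delimited record into one common key-value shape, so a single reducer implementation can consume both.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;

// Illustrative sketch of the MultipleInputs idea: two data sets with the
// same meaning but different delimiters are parsed by format-specific
// "mappers" into one common (key, value) shape, as Hadoop would route the
// outputs of both mappers to the same reducer.
public class MultiFormatParsing {

    // Mapper-style parser for the comma-delimited data set.
    public static Entry<String, String> parseCsv(String line) {
        int pos = line.indexOf(',');
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    // Mapper-style parser for the tab-delimited data set.
    public static Entry<String, String> parseTsv(String line) {
        int pos = line.indexOf('\t');
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        // Both records normalize to the same key-value pair.
        System.out.println(parseCsv("user1,42"));
        System.out.println(parseTsv("user1\t42"));
    }
}
```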
We can use the following addInputPath static method of the MultipleInputs class to add the input paths and the respective input formats to the MapReduce computation.
public static void addInputPath(Job job, Path path, Class<? extends InputFormat> inputFormatClass)
The following is an example usage of the preceding method.
MultipleInputs.addInputPath(job, path1, CSVInputFormat.class);
MultipleInputs.addInputPath(job, path2, TabInputFormat.class);
The multiple inputs feature with both different mappers and different InputFormat implementations is useful when performing a reduce-side join of two or more data sets.
public static void addInputPath(Job job, Path path, Class<? extends InputFormat> inputFormatClass, Class<? extends Mapper> mapperClass)
The following is an example of using multiple inputs with different input formats and different mapper implementations.
MultipleInputs.addInputPath(job, accessLogPath, TextInputFormat.class, AccessLogMapper.class);
MultipleInputs.addInputPath(job, userDataPath, TextInputFormat.class, UserDataMapper.class);
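The reduce-side join pattern behind this setup can be sketched in plain Java (no Hadoop dependency; all names are hypothetical): each "mapper" tags its records with the source data set, records are grouped by key as the shuffle would do, and a "reducer" joins the tagged values for each key.

```java
import java.util.*;

// Illustrative plain-Java sketch of a reduce-side join: tag each record
// with its source, group by key (simulating the shuffle), then join the
// sources in the "reducer" step.
public class ReduceSideJoinSketch {

    public static Map<String, String> join(Map<String, String> accessLog,
                                           Map<String, String> userData) {
        // Shuffle-like grouping: key -> tagged values from both sources.
        Map<String, List<String>> grouped = new TreeMap<>();
        accessLog.forEach((k, v) ->
                grouped.computeIfAbsent(k, x -> new ArrayList<>()).add("LOG:" + v));
        userData.forEach((k, v) ->
                grouped.computeIfAbsent(k, x -> new ArrayList<>()).add("USER:" + v));

        // Reducer: emit a joined record only when both sources are present.
        Map<String, String> joined = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            String log = null, user = null;
            for (String v : e.getValue()) {
                if (v.startsWith("LOG:"))  log  = v.substring(4);
                if (v.startsWith("USER:")) user = v.substring(5);
            }
            if (log != null && user != null) {
                joined.put(e.getKey(), user + "|" + log);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> logs  = Map.of("u1", "GET /index", "u2", "GET /about");
        Map<String, String> users = Map.of("u1", "Alice");
        System.out.println(join(logs, users)); // only u1 appears in both sources
    }
}
```

In a real Hadoop job, AccessLogMapper and UserDataMapper would perform the tagging, and the framework's shuffle and sort would do the grouping before the reducer runs.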