MapReduce

MapReduce is a programming model used by Apache Hadoop. Hadoop MapReduce provides a framework that can store, process, and mine huge volumes of data across parallel, multi-node clusters, forming a scalable, reliable, and fault-tolerant distributed system built from inexpensive hardware. In MapReduce, data analysis and processing are split into two phases, the Map phase and the Reduce phase, as represented in the following figure:

In the preceding diagram, the word count process is handled by MapReduce in multiple phases. The input set of words is first split across three nodes as (K1, V1) pairs. Each split is passed to a corresponding mapping node, which emits a List(K2, V2) of (word, 1) pairs. The mapping nodes then trigger the shuffling process, in which the pairs are grouped by word so that each shuffling node receives a word together with its list of values as K2, List(V2). From the shuffling nodes, the reducing phase is invoked, where the count is calculated for each word. Finally, the result is collected as a list of (K3, V3) pairs.
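To make this flow concrete before turning to the Hadoop API, the following is a minimal, single-JVM sketch in plain Java that simulates the three phases on a small in-memory sample (the sample words are an assumption, chosen to match the output at the end of this section; no Hadoop classes are involved):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the Map -> Shuffle -> Reduce data flow;
// requires Java 9+ for Map.entry.
public class WordCountPhasesSketch {
    public static void main(String[] args) {
        // Input splits, (K1, V1): three comma-separated lines of words.
        String[] splits = { "Deer, Bear, River", "Car, Car, River", "Deer, Car, Bear" };

        // Map phase: each split emits a List(K2, V2) of (word, 1) pairs.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String split : splits) {
            for (String word : split.split(",")) {
                mapped.add(Map.entry(word.trim().toUpperCase(), 1));
            }
        }

        // Shuffle phase: group the values by key, producing K2, List(V2).
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce phase: sum each key's values to produce the final (K3, V3) list.
        for (Map.Entry<String, List<Integer>> entry : shuffled.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            System.out.println(entry.getKey() + " " + sum);
        }
    }
}

Running this prints BEAR 2, CAR 3, DEER 2, and RIVER 2, one pair per line, which is exactly the shape of result the distributed version below produces.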

Let's review how to implement the preceding Hadoop MapReduce example as a program.

Note: Make sure you have Hadoop and the JDK installed on your system to practice this example.
  1. In Eclipse, create a core Java project named HadoopMapReduceDemo.
  2. Create a hadoop package under the source directory.
  3. Add the following libraries to the project's build path:
    hadoop-core.jar, commons-cli.jar, and commons-logging.jar.
  4. Alternatively, the Hadoop libraries can be added to the project through Maven entries from https://mvnrepository.com/artifact/org.apache.hadoop (see the dependency sketch after the program listing).
  5. Write the following MapReduceWordCount Java program:

 

package hadoop;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class MapReduceWordCount {

    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        // Separate any generic Hadoop options from the input/output paths.
        String[] argFiles =
            new GenericOptionsParser(config, args).getRemainingArgs();
        Path inputFilePath = new Path(argFiles[0]);
        Path outputFilePath = new Path(argFiles[1]);

        // Configure the job: driver class, mapper, reducer, and output types.
        Job mapReduceJob = new Job(config, "wordcount");
        mapReduceJob.setJarByClass(MapReduceWordCount.class);
        mapReduceJob.setMapperClass(MapperForWordCount.class);
        mapReduceJob.setReducerClass(ReducerForWordCount.class);
        mapReduceJob.setOutputKeyClass(Text.class);
        mapReduceJob.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(mapReduceJob, inputFilePath);
        FileOutputFormat.setOutputPath(mapReduceJob, outputFilePath);
        System.exit(mapReduceJob.waitForCompletion(true) ? 0 : 1);
    }

    // Map phase: emit (word, 1) for every comma-separated word in a line.
    public static class MapperForWordCount
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text wordText, Context context)
                throws IOException, InterruptedException {
            String wordsAsString = wordText.toString();
            String[] wordsAsArray = wordsAsString.split(",");
            for (String word : wordsAsArray) {
                Text hadoopText = new Text(word.toUpperCase().trim());
                IntWritable count = new IntWritable(1);
                context.write(hadoopText, count);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class ReducerForWordCount
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text word, Iterable<IntWritable> values,
                Context context)
                throws IOException, InterruptedException {
            int countOfWords = 0;
            for (IntWritable value : values) {
                countOfWords += value.get();
            }
            context.write(word, new IntWritable(countOfWords));
        }
    }
}
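As mentioned in step 4, instead of adding the JAR files by hand, the libraries can be pulled in through Maven. A minimal sketch of the dependency entry, assuming the legacy single-JAR hadoop-core packaging this example uses (the version shown is an assumption; check the repository link above, and note that newer Hadoop releases split this artifact into hadoop-common and hadoop-mapreduce-client-core):

<!-- Assumed version; verify against mvnrepository.com -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
</dependency>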
  6. The preceding program contains three classes: the driver component (the main class, MapReduceWordCount), the Mapper component (MapperForWordCount), which extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>, and the Reducer component (ReducerForWordCount), which extends Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.
  7. Right-click on the HadoopMapReduceDemo project and export it as a JAR file named HadoopMapReduceDemo.jar (Export | Java | JAR file).
  8. Generate a text file with the words to be counted, and name it wordCountInput, as follows:

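A sample wordCountInput consistent with the final output shown later (the exact words are an assumption; note that the mapper splits each line on commas, so the words must be comma-separated):

      Deer, Bear, River
      Car, Car, River
      Deer, Car, Bear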
  9. Move this file to the Hadoop file system with the following terminal command:
      [practice@localhost ~]$ hadoop fs -put wordCountInput wordCountInput
  10. To execute the Hadoop MapReduce program, run the following command:
      [practice@localhost ~]$ hadoop jar HadoopMapReduceDemo.jar hadoop.MapReduceWordCount wordCountInput wordCountOutput
  11. View the word count output from the output file with the following command:
        [practice@localhost ~]$ hadoop fs -cat wordCountOutput/part-r-00000

BEAR 2
CAR 3
DEER 2
RIVER 2
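
To copy the result out of HDFS for local inspection instead, the standard fs shell -get command can be used (the local filename here is arbitrary):

      [practice@localhost ~]$ hadoop fs -get wordCountOutput/part-r-00000 wordCountResult.txt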