Single mapper reducer job

Single mapper reducer jobs are used in aggregation use cases. When we want to aggregate by key, such as computing a count or an average per key, this pattern is used:

Scenario

Computing the total or average temperature of cities

Map (Key, Value)

Key: city

Value: temperature readings

Reduce

Group by city, and take the average temperature for each city

 

Now let's look at a complete example of a single mapper reducer job. For this, we will simply output the city ID and the average temperature from the temperature.csv file seen earlier.

The following is the code:

package io.somethinglikethis;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;


public class SingleMapperReducer
{
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "City Temperature Job");
        job.setJarByClass(SingleMapperReducer.class);
        job.setMapperClass(TemperatureMapper.class);
        job.setReducerClass(TemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    /*
    Input format:
    Date,Id,Temperature
    2018-01-01,1,21
    2018-01-01,2,22
    */
    public static class TemperatureMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] tokens = value.toString().split(",");
            String id = tokens[1].trim();
            String temperature = tokens[2].trim();
            // Skip the CSV header row
            if (!"Temperature".equals(temperature)) {
                context.write(new Text(id), new IntWritable(Integer.parseInt(temperature)));
            }
        }
    }

    public static class TemperatureReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            int n = 0;
            for (IntWritable val : values) {
                sum += val.get();
                n++;
            }
            // Integer division: the average is truncated to a whole degree
            result.set(sum / n);
            context.write(key, result);
        }
    }
}
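Before submitting the job to the cluster, the mapper's CSV parsing and header-skip logic can be checked locally with plain Java. The following helper class and its parse method are illustrative only, not part of the Hadoop API; they simply mirror what TemperatureMapper does with each input line:

```java
public class MapperLogicCheck {

    // Mirrors TemperatureMapper: split the CSV line, skip the header row,
    // and return "id<TAB>temperature" for data rows (null for the header).
    static String parse(String line) {
        String[] tokens = line.split(",");
        String id = tokens[1].trim();
        String temperature = tokens[2].trim();
        if ("Temperature".equals(temperature)) {
            return null; // header row: Date,Id,Temperature
        }
        return id + "\t" + Integer.parseInt(temperature);
    }

    public static void main(String[] args) {
        System.out.println(parse("Date,Id,Temperature")); // null (header skipped)
        System.out.println(parse("2018-01-01,1,21"));     // 1 followed by a tab and 21
    }
}
```

Running this against the two sample rows from the code comment confirms that the header is dropped and a data row is emitted as a (city ID, temperature) pair.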

Now, run this command:

hadoop jar target/uber-mapreduce-1.0-SNAPSHOT.jar io.somethinglikethis.SingleMapperReducer /user/normal/temperatures.csv /user/normal/output/SingleMapperReducer

The job will run and you should see counter output similar to the following:

Map-Reduce Framework
Map input records=28
Map output records=27
Map output bytes=162
Map output materialized bytes=222
Input split bytes=115
Combine input records=0
Combine output records=0
Reduce input groups=6
Reduce shuffle bytes=222
Reduce input records=27
Reduce output records=6
Spilled Records=54
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=12
Total committed heap usage (bytes)=1080557568

This shows that 27 records were output by the mapper and six records were output by the reducer. You can verify this using the HDFS browser: open http://localhost:9870 and navigate to the output directory under /user/normal/output, as shown in the following screenshot:

Figure: screenshot showing how to check output in the output directory

Now, find the SingleMapperReducer folder, go into that directory, and drill down as in the SingleMapper section; then, using the head/tail option shown in the preceding screenshot, you can view the contents of the file, as shown in the following screenshot:

This shows the output of the SingleMapperReducer job: one row per city ID, with the average temperature for that city ID.

You can also use the command line to view the contents of the output:

hdfs dfs -cat /user/normal/output/SingleMapperReducer/part-r-00000

The output file contents are as shown in the following code:

1 22
2 23
3 23
4 23
5 22
6 22
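Note that these averages are whole numbers because the reducer uses integer division: sum / n truncates any fractional part. A minimal standalone check of that arithmetic (the readings below are hypothetical, chosen to show the truncation):

```java
import java.util.Arrays;
import java.util.List;

public class AverageCheck {

    // Mirrors TemperatureReducer's loop: sum the values, count them,
    // then integer-divide. For example, 21 + 24 = 45, and 45 / 2 truncates to 22.
    static int average(List<Integer> temps) {
        int sum = 0;
        int n = 0;
        for (int t : temps) {
            sum += t;
            n++;
        }
        return sum / n;
    }

    public static void main(String[] args) {
        System.out.println(average(Arrays.asList(21, 24))); // 22, not 22.5
    }
}
```

If fractional averages matter for your use case, the reducer would need to emit a DoubleWritable and divide in floating point instead.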

This concludes the SingleMapperReducer job execution and the output is as expected.
