82 | Big Data Simplied
of reducers, and all results in one partition go to the corresponding reducer. Thus, all results
from partition 1 go to the first reducer. Also note that any particular key is sent to exactly one
reducer; the same key is never sent to multiple reducers.
Once a particular key has been assigned to a particular reducer, things proceed exactly as they
did in the case of a single reducer. The pairs are sorted by key so that all values for the same key
occur together in a group, and it is this input that is passed on to the reduce phase. The reducer
then combines all the values that share a key, producing one output for each key.
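The routing of a key to a partition is worth making concrete. As a minimal sketch (not the book's code), the following shows the logic Hadoop's default HashPartitioner applies: hash the key, clear the sign bit, and take the remainder modulo the number of reducers.

import org.apache.hadoop.io.Text;

// Minimal sketch of the default hash-partitioning logic: every record
// with the same key maps to the same partition, and hence to the same
// reducer, because the computation depends only on the key.
public class PartitionDemo {

    static int partitionFor(Text key, int numReduceTasks) {
        // Mask off the sign bit so the result of % is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With two reducers, each key lands deterministically in partition 0 or 1.
        System.out.println(partitionFor(new Text("A"), 2));
        System.out.println(partitionFor(new Text("B"), 2));
    }
}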
FIGURE 4.16 Final reducer output after partition, shuffle and sort (the three input splits feed three mappers; each mapper's output is divided into Partition 1 and Partition 2, and the two partitions are routed to the two reducers)
The developer is completely abstracted from these behind-the-scenes operations. These operations
have specific names. The first one we saw was partitioning; moving the map output to the
reducers and then sorting it are called shuffle and sort, respectively.
As the data flows from the raw data set to the final result, it goes through a map phase,
partitioning, shuffling, sorting and, finally, the reduce phase.
Hadoop allows us to write our own partition function and customize how we partition the
data. If the default hash partitioner does not work for a specific requirement, we can choose
another. However, we must remember that the default hash partitioner is pretty good, so we
should have a compelling reason before replacing it with a custom partition function.
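To make this concrete, here is a hedged sketch of what a custom partitioner can look like (the class name and the routing rule are illustrative, not from this chapter): keys beginning with the letters A through M go to the first reducer, and all other keys go to the second.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative custom partitioner: route keys by their first letter
// instead of by hash. Key and value types are chosen for the example.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks < 2) {
            return 0; // with a single reducer there is nothing to decide
        }
        char first = Character.toUpperCase(key.toString().charAt(0));
        return (first <= 'M') ? 0 : 1;
    }
}

// Registered on the job with:
//   job.setPartitionerClass(AlphabetPartitioner.class);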
As for sorting, the default sort order in Hadoop depends on the type of the key. If it is a text
key, the keys are sorted lexicographically. If it is a numeric key, the sorting is in ascending
order.
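If the default order is unsuitable, Hadoop also lets us supply our own sort comparator. As a sketch, assuming IntWritable keys (the class name is hypothetical), the following reverses the default ascending order:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Illustrative sort comparator that flips the natural (ascending)
// ordering of IntWritable keys into descending order.
public class DescendingIntComparator extends WritableComparator {

    protected DescendingIntComparator() {
        super(IntWritable.class, true); // true => instantiate keys for comparison
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -1 * a.compareTo(b); // negate to reverse the order
    }
}

// Wired into the job with:
//   job.setSortComparatorClass(DescendingIntComparator.class);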
4.4 OPTIMIZE THE MAP PHASE USING A COMBINER
There are other optimizations that we can perform when we run a MapReduce job.
FIGURE 4.17 Dataset and expected outcome for combiner discussion

Date          Time     Rainfall (mm)
05/06/2018    00:00    400
05/06/2018    01:00    420
05/06/2018    02:00    390
06/06/2018    00:00    410
06/06/2018    01:00    380
06/06/2018    02:00    430

Expected outcome:

Date          Maximum Rainfall (mm)
05/06/2018    420
06/06/2018    430
Let us consider a simple example: calculating the maximum amount of rainfall in a day. Let us
say we have a set of rainfall readings for every hour of each particular day. The objective is
to run a MapReduce job to find the maximum rainfall in a day.
FIGURE 4.18 Standard MapReduce for rainfall example (each of the three mappers emits {date, rainfall} pairs such as {'05/06/2018', 400}, and a single reducer applies a max function to the values for each date)
The original data records do not sit statically on a single machine. All the recorded rainfall
readings are split into three sets on different machines. A mapper process runs on each of these
machines and produces its own output. The output of the map phase is simply a date followed
by the corresponding rainfall reading; the hourly information is of no interest to us. Then, in
the reduce phase, we run a max function on the rainfall values for each particular date.
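A minimal sketch of this job in Java, assuming the input is tab-separated lines of date, time and rainfall (the class and field names are illustrative, not the book's code):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the rainfall job, assuming lines like "05/06/2018<TAB>00:00<TAB>400".
public class MaxRainfall {

    public static class RainfallMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            // fields[0] = date, fields[1] = time (ignored), fields[2] = rainfall
            context.write(new Text(fields[0]),
                          new IntWritable(Integer.parseInt(fields[2])));
        }
    }

    public static class MaxReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text date, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable v : values) {
                max = Math.max(max, v.get()); // keep the largest reading per date
            }
            context.write(date, new IntWritable(max));
        }
    }
}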
Now let us see if we can optimize this process by moving more of the processing away from
the reducer and into the map phase. As we have seen in the above example, every mapper
generates a set of key-value pairs. We can instead have each mapper's output contain the
maximum rainfall for a particular date. In this way, some of the combining logic that we
typically perform in the reduce phase is carried out on the individual nodes that run the
mapper processes. This combiner, which works in parallel on the multiple nodes that run the
mapper processes, pre-aggregates the map output locally, so far less data needs to be shuffled
across the network to the reducer.
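In Hadoop, a combiner is attached to the job as an additional class. Since max is both associative and commutative, the reducer sketched earlier can be reused unchanged as the combiner; the driver below is a hypothetical sketch that wires it in, assuming the MaxRainfall classes above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver that adds a combiner to the rainfall job.
public class MaxRainfallDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max rainfall");
        job.setJarByClass(MaxRainfallDriver.class);
        job.setMapperClass(MaxRainfall.RainfallMapper.class);
        // Because max is associative and commutative, the reducer class
        // can safely double as the combiner on each mapper node.
        job.setCombinerClass(MaxRainfall.MaxReducer.class);
        job.setReducerClass(MaxRainfall.MaxReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}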