82 | Big Data Simplified
of reducers, and all results in one partition go to the corresponding reducer. Thus, all results
from partition 1 go to the first reducer. Note also that a particular key is sent to exactly one
reducer; the same key is never sent to multiple reducers.
Once it is decided which reducer a particular key goes to, everything proceeds exactly as in
the case of a single reducer. The pairs are sorted by key so that all values for the same key
occur together in a group, and this grouped input is passed on to the reduce phase. The reducer
combines all values that share a key, producing one output for each key.
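The steps just described can be sketched in a few lines of plain Python. This is only an illustrative simulation, not Hadoop code: the function names, the use of `zlib.crc32` as a deterministic stand-in for a key hash, the sample map outputs and the choice of summing as the reduce operation are all assumptions made for the example.

```python
from collections import defaultdict
import zlib

def partition_for(key, num_reducers):
    # Deterministic stand-in for a hash partitioner: the same key always
    # maps to the same partition index (0-based here).
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_shuffle_and_reduce(map_outputs, num_reducers):
    # Partition and shuffle: route every (key, value) pair to exactly one reducer.
    partitions = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            partitions[partition_for(key, num_reducers)].append((key, value))

    results = []
    for pairs in partitions:
        # Sort: order by key so that all values for a key sit together.
        pairs.sort(key=lambda kv: kv[0])
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        # Reduce: one output per key (here, the sum of its values).
        results.append({key: sum(values) for key, values in grouped.items()})
    return results

# Hypothetical map outputs from three splits, emitted as (key, 1) pairs.
map_outputs = [
    [("A", 1), ("B", 1), ("A", 1)],
    [("C", 1), ("B", 1), ("A", 1), ("D", 1), ("C", 1)],
    [("D", 1), ("C", 1), ("A", 1), ("D", 1), ("C", 1)],
]
reducer_outputs = run_shuffle_and_reduce(map_outputs, num_reducers=2)
print(reducer_outputs)
```

Whichever reducer a key lands on, it lands on only that one, so the per-reducer outputs can simply be concatenated to obtain the final result.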
FIGURE 4.16 Final reducer output after partition, shuffle and sort
[Figure: Split 1, Split 2 and Split 3 each feed a mapper (M); each mapper's output is divided into Partition 1 and Partition 2, and the partitions are shuffled and sorted into two reducers (R).]
These behind-the-scenes operations are completely abstracted away from the developer. These
operations have specific names. The first one we saw was partitioning; moving the map
output to the reducers and then sorting it are called shuffle and sort, respectively.
The data flow from the raw data set to the final result therefore passes through a map phase,
partitioning, shuffling, sorting and, finally, the reduce phase.
Hadoop allows us to write our own partition function and customize how the data is
partitioned. If the default hash partitioner does not suit a specific requirement, we can supply
another. However, the default hash partitioner works well in most cases, so we should have a
compelling reason before replacing it with a custom partition function.
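To make the idea of a replaceable partition function concrete, here is a minimal sketch in plain Python. It does not use the Hadoop API: a partitioner is modelled simply as a function from a key and a reducer count to a partition index. `default_hash_partition` imitates the general shape of a hash partitioner (hash the key, then take it modulo the number of reducers), using `zlib.crc32` as an assumed stand-in hash, and `first_letter_partition` is a hypothetical custom rule invented for illustration.

```python
import zlib

def default_hash_partition(key, num_reducers):
    # Stand-in for a hash partitioner: hash the key, then take the result
    # modulo the number of reducers.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def first_letter_partition(key, num_reducers):
    # Hypothetical custom rule: keys whose first character (upper-cased)
    # sorts at or before "M" go to partition 0, all others to the last one.
    return 0 if key[:1].upper() <= "M" else num_reducers - 1

def assign(keys, partitioner, num_reducers):
    # Record which partition the given function picks for each key.
    return {key: partitioner(key, num_reducers) for key in keys}

keys = ["apple", "mango", "pear", "zebra"]
custom = assign(keys, first_letter_partition, num_reducers=2)
default = assign(keys, default_hash_partition, num_reducers=2)
print(custom)
print(default)
```

A rule like the custom one can skew the load badly if most keys fall in one range, which is one reason to keep the hash-based default unless there is a clear need.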
As for sorting, the default sort order in Hadoop depends on the type of the key. If the key
is text, the keys are sorted lexicographically. If the key is a number, the keys are sorted in
ascending numerical order.
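The difference between the two orderings is easiest to see when the same values are used once as text keys and once as number keys. The sketch below uses Python's built-in `sorted` purely as an illustration of lexicographic versus numerical ordering; it is not Hadoop's own key comparison code, and the sample keys are made up.

```python
text_keys = ["10", "9", "2", "100"]
number_keys = [10, 9, 2, 100]

# Text keys compare character by character, so "10" sorts before "2".
sorted_text = sorted(text_keys)
# Number keys compare by value, giving ascending numerical order.
sorted_numbers = sorted(number_keys)

print(sorted_text)     # ['10', '100', '2', '9']
print(sorted_numbers)  # [2, 9, 10, 100]
```

If numeric identifiers are emitted as text keys, the reducer will therefore see them in the first order, not the second.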
4.4 OPTIMIZE THE MAP PHASE USING A COMBINER
There are other optimizations that we can perform when we run a MapReduce job.