82 | Big Data Simplied
of reducers, and all results in one partition go to the corresponding reducer. Thus, all results
from partition 1 go to the first reducer. Also note that any particular key is sent to exactly one
reducer; the same key is never sent to multiple reducers.
Once a particular key has been assigned to a particular reducer, things proceed exactly as they
did in the case of a single reducer. The pairs are sorted by key so that all values for the same key
occur together in a group, and it is this input that is passed on to the reduce phase. The reducer
then combines all the values that share a key, producing one output for each key.
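The routing of a key to a partition is worth making concrete. As a minimal sketch (not the book's code), the following shows the logic Hadoop's default HashPartitioner applies: hash the key, clear the sign bit, and take the remainder modulo the number of reducers.

import org.apache.hadoop.io.Text;

// Minimal sketch of the default hash-partitioning logic: every record
// with the same key maps to the same partition, and hence to the same
// reducer, because the computation depends only on the key.
public class PartitionDemo {

    static int partitionFor(Text key, int numReduceTasks) {
        // Mask off the sign bit so the result of % is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With two reducers, each key lands deterministically in partition 0 or 1.
        System.out.println(partitionFor(new Text("A"), 2));
        System.out.println(partitionFor(new Text("B"), 2));
    }
}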
FIGURE 4.16 Final reducer output after partition, shuffle and sort (the three input splits feed three mappers; each mapper's output is divided into Partition 1 and Partition 2, and the two partitions are routed to the two reducers)
The developer is completely abstracted from these behind-the-scenes operations. These operations
have specific names. The first one we saw was partitioning; moving the map output to the
reducers and then sorting it are called shuffle and sort, respectively.
As the data flows from the raw data set to the final result, it goes through a map phase,
partitioning, shuffling, sorting and, finally, the reduce phase.
Hadoop allows us to write our own partition function and customize how we partition the
data. If the default hash partitioner does not work for a specific requirement, we can choose
another. However, we must remember that the default hash partitioner is pretty good, so we
should have a compelling reason before replacing it with a custom partition function.
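To make this concrete, here is a hedged sketch of what a custom partitioner can look like (the class name and the routing rule are illustrative, not from this chapter): keys beginning with the letters A through M go to the first reducer, and all other keys go to the second.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative custom partitioner: route keys by their first letter
// instead of by hash. Key and value types are chosen for the example.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks < 2) {
            return 0; // with a single reducer there is nothing to decide
        }
        char first = Character.toUpperCase(key.toString().charAt(0));
        return (first <= 'M') ? 0 : 1;
    }
}

// Registered on the job with:
//   job.setPartitionerClass(AlphabetPartitioner.class);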
As for sorting, the default sort order in Hadoop depends on the type of the key. If it is a text
key, the keys are sorted lexicographically. If it is a numeric key, the sorting is in ascending
order.
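If the default order is unsuitable, Hadoop also lets us supply our own sort comparator. As a sketch, assuming IntWritable keys (the class name is hypothetical), the following reverses the default ascending order:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Illustrative sort comparator that flips the natural (ascending)
// ordering of IntWritable keys into descending order.
public class DescendingIntComparator extends WritableComparator {

    protected DescendingIntComparator() {
        super(IntWritable.class, true); // true => instantiate keys for comparison
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -1 * a.compareTo(b); // negate to reverse the order
    }
}

// Wired into the job with:
//   job.setSortComparatorClass(DescendingIntComparator.class);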
4.4 OPTIMIZE THE MAP PHASE USING A COMBINER
There are other optimizations that we can perform when we run a MapReduce job.
FIGURE 4.17 Dataset and expected outcome for combiner discussion

Date          Time     Rainfall (mm)
05/06/2018    00:00    400
05/06/2018    01:00    420
05/06/2018    02:00    390
06/06/2018    00:00    410
06/06/2018    01:00    380
06/06/2018    02:00    430

Expected outcome:

Date          Maximum Rainfall (mm)
05/06/2018    420
06/06/2018    430
Let us consider a simple example: calculating the maximum amount of rainfall in a day. Let us
say we have a set of rainfall readings for every hour of each particular day. The objective is
to run a MapReduce job to find the maximum rainfall in a day.
FIGURE 4.18 Standard MapReduce for rainfall example (each of the three mappers emits {date, rainfall} pairs such as {'05/06/2018', 400}, and a single reducer applies a max function to the values for each date)
The original data records do not sit statically on a single machine. All the recorded rainfall
readings are split into three sets on different machines. A mapper process runs on each of these
machines and produces its own output. The output of the map phase is simply a date followed
by the corresponding rainfall reading; the hourly information is of no interest to us. Then, in
the reduce phase, we run a max function on the rainfall values for each particular date.
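A minimal sketch of this job in Java, assuming the input is tab-separated lines of date, time and rainfall (the class and field names are illustrative, not the book's code):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the rainfall job, assuming lines like "05/06/2018<TAB>00:00<TAB>400".
public class MaxRainfall {

    public static class RainfallMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            // fields[0] = date, fields[1] = time (ignored), fields[2] = rainfall
            context.write(new Text(fields[0]),
                          new IntWritable(Integer.parseInt(fields[2])));
        }
    }

    public static class MaxReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text date, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable v : values) {
                max = Math.max(max, v.get()); // keep the largest reading per date
            }
            context.write(date, new IntWritable(max));
        }
    }
}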
Now let us see if we can optimize this process by moving more of the processing away from
the reducer and into the map phase. As we have seen in the above example, every mapper
generates a set of key-value pairs. We can instead have each mapper's output contain the
maximum rainfall for a particular date. In this way, some of the combining logic that we
typically perform in the reduce phase is carried out on the individual nodes that run the
mapper processes. This combiner, which works in parallel on the multiple nodes that run the
mapper processes, pre-aggregates the map output locally, so far less data needs to be shuffled
across the network to the reducer.
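In Hadoop, a combiner is attached to the job as an additional class. Since max is both associative and commutative, the reducer sketched earlier can be reused unchanged as the combiner; the driver below is a hypothetical sketch that wires it in, assuming the MaxRainfall classes above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver that adds a combiner to the rainfall job.
public class MaxRainfallDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max rainfall");
        job.setJarByClass(MaxRainfallDriver.class);
        job.setMapperClass(MaxRainfall.RainfallMapper.class);
        // Because max is associative and commutative, the reducer class
        // can safely double as the combiner on each mapper node.
        job.setCombinerClass(MaxRainfall.MaxReducer.class);
        job.setReducerClass(MaxRainfall.MaxReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}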