4.3.2 Using Multiple Reducers


80 | Big Data Simplified

On the cluster running the Hadoop and MapReduce setup, we can consider that these key value pairs are sent to the machine on which the reducer process runs.

FIGURE 4.12 Sorting of key value pairs (Splits 1, 2 and 3)

Now, these key value pairs are sorted so that all values with the same key are grouped together before they are fed to the reducer. The sorting implies that all profile views for a particular member, say John, are available in a single group.
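The grouping effect of this sort can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the framework's own code: the member names other than John, and the value 1 emitted for each profile view, are hypothetical sample data.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical map output: one (member, 1) pair per profile view.
# The names and values are illustrative sample data only.
map_output = [("John", 1), ("Mary", 1), ("John", 1),
              ("Alan", 1), ("Mary", 1), ("John", 1)]

# Sort by key so that pairs with the same key become adjacent.
sorted_pairs = sorted(map_output, key=itemgetter(0))

# Group adjacent pairs by key; each group holds all values for one key.
grouped = {key: [value for _, value in pairs]
           for key, pairs in groupby(sorted_pairs, key=itemgetter(0))}
```

After the sort, all three of John's pairs sit next to each other, so they can be handed to the reducer as one group.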

FIGURE 4.13 Reducer function applied to sorted results (Splits 1, 2 and 3)

The reducer is the code that we have written to sum up the values associated with the same key.
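A summing reducer of this kind might look like the following minimal sketch. The function name reduce_sum and the sample input are assumptions made for illustration; the input is shown already sorted and grouped, as the reducer would receive it.

```python
def reduce_sum(key, values):
    """Reducer sketch: return the key together with the sum of its values."""
    return key, sum(values)

# Hypothetical sorted and grouped input (sample data only).
grouped_input = {"A": [1, 1, 1, 1, 1], "B": [1, 1, 1], "C": [1, 1]}

# Apply the reducer once per key.
totals = dict(reduce_sum(key, values) for key, values in grouped_input.items())
print(totals)  # {'A': 5, 'B': 3, 'C': 2}
```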

As we can see, there is a lot going on here beyond the map and reduce logic for which we write code. All these processes are handled completely behind the scenes by the MapReduce framework: collecting these key value pairs together, transferring them across the network to the cluster node on which the reduce job runs, and sorting them so that the values associated with the same key appear together.

4.3.2 Using Multiple Reducers

Let us now consider that we want two reducers running on two different nodes. There are now two partitions to which the keys can be sent. In this scenario, we have to figure out which key is sent to which reducer, and this process is called assigning partitions.

M04 Big Data Simplified XXXX 01.indd 80 5/10/2019 9:58:21 AM

Introducing MapReduce | 81

FIGURE 4.14 Keys assigned to specific partitions (Splits 1, 2 and 3; Partitions 1 and 2)

Therefore, after the map phase, the MapReduce framework assigns each key to a certain partition. This is something that can be controlled by the developer, who can also decide the amount of parallelism needed by running more reducers. For example, we can decide that key A goes to partition 1, key B to partition 2, key C to partition 1 and key D to partition 2. Thus, each key is assigned to a partition.
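The assignment just described can be written down as an explicit lookup. The sketch below uses the same mapping as the text (A and C to partition 1, B and D to partition 2); the names KEY_TO_PARTITION and assign_partition, and the sample map output, are hypothetical and chosen only for illustration.

```python
# Assignment from the example in the text.
KEY_TO_PARTITION = {"A": 1, "B": 2, "C": 1, "D": 2}

def assign_partition(key):
    """Return the partition number that the given key is sent to."""
    return KEY_TO_PARTITION[key]

# Hypothetical map output (sample data only).
map_output = [("A", 1), ("B", 1), ("A", 1), ("C", 1), ("D", 1), ("C", 1)]

# Route every pair to the partition chosen for its key.
partitions = {1: [], 2: []}
for key, value in map_output:
    partitions[assign_partition(key)].append((key, value))
```

Because the lookup depends only on the key, every pair with key A ends up in partition 1, whichever split produced it.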

FIGURE 4.15 Partition function determines where each key goes (Splits 1, 2 and 3; Partitions 1 and 2)

Internally, the framework has a partition function that it runs to determine where each key goes. The partition function has just one job: to look at the key and determine which partition or node it belongs to. The manner in which we partition the keys determines the efficiency of the MapReduce operation. The partitioning should not be skewed such that one reducer receives a large number of keys while the other reducer receives far fewer.
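One common way to build such a partition function is to hash the key and take the remainder modulo the number of partitions. The sketch below assumes this approach; zlib.crc32 is a stand-in hash chosen because it is deterministic across runs, not the framework's own function, and the helper partition_load is a hypothetical aid for spotting skew. Partitions are numbered from 0 here.

```python
import zlib
from collections import Counter

def partition_for(key, num_partitions):
    """Map a key to a partition index in the range 0..num_partitions-1.

    zlib.crc32 is an assumed stand-in hash, used only for illustration.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_load(keys, num_partitions):
    """Count how many map output records land in each partition."""
    load = Counter({p: 0 for p in range(num_partitions)})
    for key in keys:
        load[partition_for(key, num_partitions)] += 1
    return dict(load)

# Hypothetical stream of map output keys (sample data only).
keys = ["A", "B", "A", "C", "B", "A", "D", "C"]
load = partition_load(keys, 2)
print(load)
```

If one count in the printed result is far larger than the other, the partitioning is skewed for this data and one reducer would do most of the work.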

So, the cluster manager will distribute the keys, which are the outputs of the map phase, to the right partitions. Notice that A keys are always in partition 1 and B keys always belong to partition 2. The same is true for the other codes. The number of partitions is equal to the number
