Shuffle hash join

The Shuffle hash join is the most basic type of join and is derived from the joins in MapReduce. Suppose we want to join the review data and the tip data for every user. A Shuffle hash join goes through the following steps:

1. Map through the review DataFrame using (user_id, business_id) as the key.
2. Map through the tip DataFrame using (user_id, business_id) as the key.
3. Shuffle the review data by (user_id, business_id).
4. Shuffle the tip data by (user_id, business_id).
5. Join the two datasets in the reduce phase. Rows with the same key will be on the same machine and sorted.
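
The steps above can be sketched in plain Python, without Spark, to make the data movement visible. This is only an illustration under stated assumptions: the function names `shuffle` and `shuffle_hash_join`, the `stars` and `text` fields, the sample rows, and the choice of an inner join are all hypothetical and are not taken from the book or from Spark's implementation.

```python
from collections import defaultdict


def shuffle(rows, num_partitions):
    """Key each row by (user_id, business_id) and route it to a partition.

    Rows with the same key always land in the same partition, which stands
    in for "the same machine" in a real cluster.
    """
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        key = (row["user_id"], row["business_id"])
        partitions[hash(key) % num_partitions].append((key, row))
    return partitions


def shuffle_hash_join(reviews, tips, num_partitions=4):
    """Inner-join reviews and tips on (user_id, business_id) (assumed join type)."""
    review_parts = shuffle(reviews, num_partitions)
    tip_parts = shuffle(tips, num_partitions)

    joined = []
    # Each pair of co-located partitions is joined independently.
    for review_part, tip_part in zip(review_parts, tip_parts):
        # Build a hash table from one side of the partition...
        table = defaultdict(list)
        for key, tip in tip_part:
            table[key].append(tip)
        # ...and probe it with the other side.
        for key, review in review_part:
            for tip in table.get(key, []):
                joined.append((key, review, tip))
    return joined


# Hypothetical sample rows; only user_id and business_id come from the text.
reviews = [
    {"user_id": "u1", "business_id": "b1", "stars": 5},
    {"user_id": "u2", "business_id": "b1", "stars": 3},
    {"user_id": "u1", "business_id": "b2", "stars": 4},
]
tips = [
    {"user_id": "u1", "business_id": "b1", "text": "Try the soup"},
    {"user_id": "u3", "business_id": "b1", "text": "Cash only"},
]

result = shuffle_hash_join(reviews, tips)
```

Only the key ("u1", "b1") appears in both inputs, so `result` holds a single joined row. The order of rows in a larger result depends on how keys hash to partitions, so it should not be relied on.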

As in MapReduce, the Shuffle hash join works best when the data is not skewed and is evenly distributed across the keys.
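
To see why skew matters, the following sketch counts how many rows each partition would receive after the shuffle. It is a rough illustration, not a measurement of Spark: the helper `partition_sizes`, the partition count, and the synthetic key lists are assumptions made up for this example.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Return the number of rows routed to each partition by key hash."""
    sizes = Counter(hash(key) % num_partitions for key in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]


# 1,000 rows spread over 1,000 distinct (user_id, business_id) keys.
even_keys = [(f"u{i}", "b1") for i in range(1000)]

# 1,000 rows where a single hot key accounts for 900 of them.
skewed_keys = [("u0", "b1")] * 900 + [(f"u{i}", "b1") for i in range(1, 101)]

even_sizes = partition_sizes(even_keys, 4)
skewed_sizes = partition_sizes(skewed_keys, 4)
```

In the skewed case, all 900 rows for the hot key go to one partition, so that partition's reduce task does most of the work while the others sit nearly idle. With distinct keys the rows spread out roughly evenly, although the exact counts vary between runs because Python randomizes string hashes.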
