Shuffle hash join

The Shuffle hash join is the most basic type of join and is derived from the joins in MapReduce. Let's say we would like to join the review data and tip data for every user. A Shuffle hash join will go through the following steps:

  1. Map through the review DataFrame using user_id, business_id as a key.
  2. Map through the tip DataFrame using user_id, business_id as a key. 
  3. Shuffle review data by user_id, business_id.
  4. Shuffle tip data by user_id, business_id
  5. Join both the datasets using the reduce phase. Data with the same keys will be on the same machine and sorted. 

As in MapReduce, the Shuffle hash join works best when data is not skewed and evenly distributed among the keys. 

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset