Optimizing joins

This topic was covered briefly in the discussion of Spark SQL, but it is worth revisiting here because joins are a major source of optimization challenges.

There are primarily three types of joins in Spark (a code sketch of each follows the list):

  • Shuffle hash join (historically the default; recent Spark versions prefer a sort-merge join for large equi-joins):
    • The classic map-reduce style join
    • Shuffle both datasets based on the join key
    • During the reduce phase, join the records that share the same key
  • Broadcast hash join:
    • Used when one dataset is small enough to fit in memory
  • Cartesian join:
    • Used when every row of one table is joined with every row of the other table
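
To make the list concrete, here is a minimal Scala sketch of all three, assuming two hypothetical DataFrames, orders and customers, that share a customerId column (the names and paths are illustrative, not from the text):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("JoinTypes").getOrCreate()

    val orders    = spark.read.parquet("/data/orders")     // hypothetical path
    val customers = spark.read.parquet("/data/customers")  // hypothetical path

    // Shuffle join: both sides are repartitioned across the network by the join key.
    val shuffled = orders.join(customers, "customerId")

    // Broadcast hash join: the small side is shipped whole to every executor.
    val broadcasted = orders.join(broadcast(customers), "customerId")

    // Cartesian join: every row of one side is paired with every row of the other.
    val cartesian = orders.crossJoin(customers)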

The easiest optimization is this: if one of the datasets is small enough to fit in memory, broadcast it to every compute node (a broadcast join). This case is very common, because data constantly needs to be combined with small side data such as a lookup dictionary.
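
A brief sketch of both ways to get a broadcast join, reusing the SparkSession from the sketch above; the events (large) and countryCodes (small) DataFrames are hypothetical:

    import org.apache.spark.sql.functions.broadcast

    // Spark broadcasts automatically when the estimated size of one side is
    // below this threshold, in bytes; set it to -1 to disable auto-broadcast.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

    // Or request the broadcast explicitly with a hint, regardless of the threshold.
    val enriched = events.join(broadcast(countryCodes), "countryCode")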

In most cases, joins are slow because too much data is shuffled over the network, so the most effective general remedy is to reduce the volume of data that reaches the join.
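
Continuing the hypothetical orders/customers example, one common way to cut shuffle volume is to filter rows and prune columns before the join, then verify the physical plan with explain(); the status and amount columns here are assumptions for illustration:

    import org.apache.spark.sql.functions.col

    // Shuffle volume grows with the data that reaches the join,
    // so drop unneeded rows and columns first.
    val slimOrders = orders
      .filter(col("status") === "COMPLETED")   // hypothetical predicate
      .select("customerId", "amount")          // keep only the columns you need

    val joined = slimOrders.join(customers, "customerId")
    joined.explain()   // each Exchange operator in the plan is a network shuffle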
