Shuffle join

Join between two big datasets involves shuffle join where partitions of both left and right datasets are spread across the executors. Shuffles are expensive and it's important to analyze the logic to make sure the distribution of partitions and shuffles is done optimally. The following is an illustration of how shuffle join works internally:

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset