In traditional databases, a join combines a transaction table with a lookup table to produce a more complete view. For example, if you have a table of online transactions keyed by customer ID and another table mapping customer IDs to cities, you can join the two to report on transactions by city.
Transactions table: The following table has three columns: the CustomerID, the item purchased, and the price the customer paid for it:
CustomerID | Purchased item | Price paid |
1 | Headphone | 25.00 |
2 | Watch | 100.00 |
3 | Keyboard | 20.00 |
1 | Mouse | 10.00 |
4 | Cable | 10.00 |
3 | Headphone | 30.00 |
Customer Info table: The following table has two columns, the CustomerID and the City the customer lives in:
CustomerID | City |
1 | Boston |
2 | New York |
3 | Philadelphia |
4 | Boston |
Joining the transaction table with the customer info table will generate a view as follows:
CustomerID | Purchased item | Price paid | City |
1 | Headphone | 25.00 | Boston |
2 | Watch | 100.00 | New York |
3 | Keyboard | 20.00 | Philadelphia |
1 | Mouse | 10.00 | Boston |
4 | Cable | 10.00 | Boston |
3 | Headphone | 30.00 | Philadelphia |
Now, we can use this joined view to generate a report of Total sale price by City:
City | #Items | Total sale price |
Boston | 3 | 45.00 |
Philadelphia | 2 | 50.00 |
New York | 1 | 100.00 |
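The whole pipeline above, joining the two tables and then aggregating by city, can be sketched in plain Scala collections (a conceptual sketch only, not the Spark API; the data mirrors the example tables):

```scala
// Sample data mirroring the Transactions and Customer Info tables above.
val transactions = Seq(
  (1, "Headphone", 25.00), (2, "Watch", 100.00), (3, "Keyboard", 20.00),
  (1, "Mouse", 10.00), (4, "Cable", 10.00), (3, "Headphone", 30.00)
)
val customers = Map(1 -> "Boston", 2 -> "New York", 3 -> "Philadelphia", 4 -> "Boston")

// Inner join on CustomerID: keep only transactions whose customer is known.
val joined = transactions.flatMap { case (id, item, price) =>
  customers.get(id).map(city => (id, item, price, city))
}

// Aggregate: item count and total sale price per city.
val report = joined
  .groupBy { case (_, _, _, city) => city }
  .map { case (city, rows) => (city, rows.size, rows.map(_._3).sum) }

report.foreach(println)
```

Running this reproduces the report table: three items totaling 45.00 for Boston, two totaling 50.00 for Philadelphia, and one totaling 100.00 for New York.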
Joins are an important feature of Spark SQL, as they enable you to bring two datasets together, as shown previously. Spark, of course, is not only meant to generate reports; it is used to process data at petabyte scale, handling real-time streaming, machine learning, and plain analytics use cases. To accomplish these goals, Spark provides the necessary API functions.
A typical join between two datasets is performed using one or more keys from the left and right datasets: a join condition over those keys is evaluated as a Boolean expression. If the expression evaluates to true for a pair of rows, the rows are joined; otherwise, the joined DataFrame does not contain the corresponding row.
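That rule can be sketched in plain Scala (collections only, not the Spark API; the rows here are hypothetical):

```scala
// Hypothetical left rows (CustomerID, item) and right rows (CustomerID, city).
val left  = Seq((1, "Headphone"), (2, "Watch"), (5, "Monitor"))
val right = Seq((1, "Boston"), (2, "New York"), (3, "Philadelphia"))

// The join condition over the keys, evaluated as a Boolean expression.
def cond(lKey: Int, rKey: Int): Boolean = lKey == rKey

// A pair of rows appears in the result only when the condition is true.
val matched = for {
  (lKey, item) <- left
  (rKey, city) <- right
  if cond(lKey, rKey)
} yield (lKey, item, city)
// CustomerID 5 has no matching right row, so it produces no output row.
```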
The join API has six different signatures:
join(right: Dataset[_]): DataFrame
Condition-less inner join
join(right: Dataset[_], usingColumn: String): DataFrame
Inner join on a single column
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
Inner join on multiple columns
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
Join on multiple columns with a join type (inner, outer, ...)
join(right: Dataset[_], joinExprs: Column): DataFrame
Inner join using a join expression
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
Join using a join expression and a join type (inner, outer, ...)
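The effect of the joinType argument can be illustrated with a plain-Scala sketch (collections only, not the Spark API): an inner join drops unmatched left rows, while a left outer join keeps them with an empty right side.

```scala
val leftRows = Seq((1, "Headphone"), (5, "Monitor"))
val cityById = Map(1 -> "Boston")

// joinType = "inner": left rows without a match are dropped.
val inner = leftRows.flatMap { case (id, item) =>
  cityById.get(id).map(city => (id, item, city))
}

// joinType = "left_outer": every left row is kept; the right side is
// None when there is no match (Spark would emit null instead).
val leftOuter = leftRows.map { case (id, item) => (id, item, cityById.get(id)) }
```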
We will use one of these APIs to demonstrate how joins work; however, you can choose any of the others depending on your use case.
Note that joins will be covered in detail in the next few sections.