In traditional databases, a join combines a transaction table with a lookup table to produce a more complete view. For example, if you have a table of online transactions keyed by customer ID and another table mapping customer IDs to cities, you can join the two to report on transactions by city.
Transactions table: The following table has three columns: the CustomerID, the item purchased, and the price the customer paid for it:
CustomerID | Purchased item | Price paid |
1 | Headphone | 25.00 |
2 | Watch | 100.00 |
3 | Keyboard | 20.00 |
1 | Mouse | 10.00 |
4 | Cable | 10.00 |
3 | Headphone | 30.00 |
Customer Info table: The following table has two columns, the CustomerID and the City the customer lives in:
CustomerID | City |
1 | Boston |
2 | New York |
3 | Philadelphia |
4 | Boston |
Joining the transaction table with the customer info table will generate a view as follows:
CustomerID | Purchased item | Price paid | City |
1 | Headphone | 25.00 | Boston |
2 | Watch | 100.00 | New York |
3 | Keyboard | 20.00 | Philadelphia |
1 | Mouse | 10.00 | Boston |
4 | Cable | 10.00 | Boston |
3 | Headphone | 30.00 | Philadelphia |
Now, we can use this joined view to generate a report of Total sale price by City:
City | #Items | Total sale price |
Boston | 3 | 45.00 |
Philadelphia | 2 | 50.00 |
New York | 1 | 100.00 |
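The whole pipeline above, joining the two tables and then aggregating by city, can be sketched in plain Scala collections (a conceptual sketch only, not the Spark API; the data mirrors the example tables):

```scala
// Sample data mirroring the Transactions and Customer Info tables above.
val transactions = Seq(
  (1, "Headphone", 25.00), (2, "Watch", 100.00), (3, "Keyboard", 20.00),
  (1, "Mouse", 10.00), (4, "Cable", 10.00), (3, "Headphone", 30.00)
)
val customers = Map(1 -> "Boston", 2 -> "New York", 3 -> "Philadelphia", 4 -> "Boston")

// Inner join on CustomerID: keep only transactions whose customer is known.
val joined = transactions.flatMap { case (id, item, price) =>
  customers.get(id).map(city => (id, item, price, city))
}

// Aggregate: item count and total sale price per city.
val report = joined
  .groupBy { case (_, _, _, city) => city }
  .map { case (city, rows) => (city, rows.size, rows.map(_._3).sum) }

report.foreach(println)
```

Running this reproduces the report table: three items totaling 45.00 for Boston, two totaling 50.00 for Philadelphia, and one totaling 100.00 for New York.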
Joins are an important feature of Spark SQL, as they enable you to bring two datasets together, as shown previously. Spark, of course, is not only meant to generate reports; it is used to process data at petabyte scale, handling real-time streaming, machine learning, and plain analytics use cases. To accomplish these goals, Spark provides the necessary API functions.
A typical join between two datasets is performed using one or more keys from the left and right datasets: a join condition over those keys is evaluated as a Boolean expression. If the expression evaluates to true for a pair of rows, the rows are joined; otherwise, the joined DataFrame does not contain the corresponding row.
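That rule can be sketched in plain Scala (collections only, not the Spark API; the rows here are hypothetical):

```scala
// Hypothetical left rows (CustomerID, item) and right rows (CustomerID, city).
val left  = Seq((1, "Headphone"), (2, "Watch"), (5, "Monitor"))
val right = Seq((1, "Boston"), (2, "New York"), (3, "Philadelphia"))

// The join condition over the keys, evaluated as a Boolean expression.
def cond(lKey: Int, rKey: Int): Boolean = lKey == rKey

// A pair of rows appears in the result only when the condition is true.
val matched = for {
  (lKey, item) <- left
  (rKey, city) <- right
  if cond(lKey, rKey)
} yield (lKey, item, city)
// CustomerID 5 has no matching right row, so it produces no output row.
```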
The join API has six different signatures:
join(right: Dataset[_]): DataFrame
Condition-less inner join
join(right: Dataset[_], usingColumn: String): DataFrame
Inner join on a single column
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
Inner join on multiple columns
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
Join on multiple columns with a join type (inner, outer, ...)
join(right: Dataset[_], joinExprs: Column): DataFrame
Inner join using a join expression
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
Join using a join expression and a join type (inner, outer, ...)
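The effect of the joinType argument can be illustrated with a plain-Scala sketch (collections only, not the Spark API): an inner join drops unmatched left rows, while a left outer join keeps them with an empty right side.

```scala
val leftRows = Seq((1, "Headphone"), (5, "Monitor"))
val cityById = Map(1 -> "Boston")

// joinType = "inner": left rows without a match are dropped.
val inner = leftRows.flatMap { case (id, item) =>
  cityById.get(id).map(city => (id, item, city))
}

// joinType = "left_outer": every left row is kept; the right side is
// None when there is no match (Spark would emit null instead).
val leftOuter = leftRows.map { case (id, item) => (id, item, cityById.get(id)) }
```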
We will use one of these APIs to demonstrate how joins work; however, you can choose any of the others depending on your use case.
Note that joins will be covered in detail in the next few sections.