Creating DataFrames

Typically, you will create DataFrames by importing data using SparkSession (or calling spark in the PySpark shell).

Tip

In Spark 1.x versions, you typically had to use sqlContext.
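
For comparison, here is a minimal sketch of the old and new entry points (the swimmers.json path is purely illustrative):

df = sqlContext.read.json("swimmers.json")  # Spark 1.x entry point
df = spark.read.json("swimmers.json")       # Spark 2.x: SparkSession subsumes sqlContext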

In future chapters, we will discuss how to import data from your local file system, the Hadoop Distributed File System (HDFS), or other cloud storage systems (for example, S3 or WASB). For this chapter, we will focus on generating your own DataFrame data directly within Spark or utilizing the data sources already available within Databricks Community Edition.

Note

For instructions on how to sign up for the Community Edition of Databricks, see the bonus chapter, Free Spark Cloud Offering.

First, instead of accessing the file system, we will create a DataFrame by generating the data ourselves. In this case, we'll first create the stringJSONRDD RDD and then convert it into a DataFrame. This code snippet creates an RDD composed of swimmers (their ID, name, age, and eye color) in JSON format.

Generating our own JSON data

Below, we will initially generate the stringJSONRDD RDD:

stringJSONRDD = sc.parallelize((
    """{
        "id": "123",
        "name": "Katie",
        "age": 19,
        "eyeColor": "brown"
    }""",
    """{
        "id": "234",
        "name": "Michael",
        "age": 22,
        "eyeColor": "green"
    }""",
    """{
        "id": "345",
        "name": "Simone",
        "age": 23,
        "eyeColor": "blue"
    }"""
))

Now that we have created the RDD, we will convert this into a DataFrame by using the SparkSession read.json method (that is, spark.read.json(...)). We will also create a temporary table by using the .createOrReplaceTempView method.

Note

In Spark 1.x, this method was .registerTempTable, which is deprecated as of Spark 2.x.
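
For reference, a minimal side-by-side sketch of the deprecated call and its Spark 2.x replacement, using the swimmersJSON DataFrame created below:

swimmersJSON.registerTempTable("swimmersJSON")        # Spark 1.x (deprecated)
swimmersJSON.createOrReplaceTempView("swimmersJSON")  # Spark 2.x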

Creating a DataFrame

Here is the code to create a DataFrame:

swimmersJSON = spark.read.json(stringJSONRDD)

Creating a temporary table

Here is the code for creating a temporary table:

swimmersJSON.createOrReplaceTempView("swimmersJSON")

As noted in the previous chapters, many RDD operations are transformations, which are not executed until an action is invoked. For example, in the preceding code snippet, sc.parallelize is a transformation; it is executed only when we convert the RDD to a DataFrame using spark.read.json. Notice in the screenshot of this code snippet notebook (near the bottom left) that the Spark job does not run until the second cell, which contains the spark.read.json operation.
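
You can observe this laziness directly: in the minimal sketch below (with an illustrative one-record RDD), the first line returns immediately without scheduling a job, while the second triggers one because Spark must scan the data to infer the JSON schema:

testRDD = sc.parallelize(['{"id": "999", "name": "Test", "age": 1, "eyeColor": "gray"}'])  # transformation: no job yet
testDF = spark.read.json(testRDD)  # a Spark job runs here (schema inference)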

Tip

These are screenshots from Databricks Community Edition, but all the code samples and Spark UI screenshots can be executed/viewed in any flavor of Apache Spark 2.x.

To further emphasize the point, the right pane of the following figure presents the DAG of execution.

Note

A great resource to better understand the Spark UI DAG visualization is the blog post Understanding Your Apache Spark Application Through Visualization at http://bit.ly/2cSemkv.

In the following screenshot, you can see that the Spark job's parallelize operation comes from the first cell, which generates the RDD stringJSONRDD, while the map and mapPartitions operations are the operations required to create the DataFrame:

Spark UI of the DAG visualization of the spark.read.json(stringJSONRDD) job.

In the following screenshot, you can see that the stages for the parallelize operation come from the first cell, which generates the RDD stringJSONRDD, while the map and mapPartitions operations are the operations required to create the DataFrame:

Spark UI of the DAG visualization of the stages within the spark.read.json(stringJSONRDD) job.

It is important to note that parallelize, map, and mapPartitions are all RDD transformations. Wrapped within the DataFrame operation spark.read.json (in this case) are not only the RDD transformations, but also the action that converts the RDD into a DataFrame. This is an important callout: even though you are executing DataFrame operations, to debug them you will need to make sense of RDD operations within the Spark UI.
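
If you prefer to inspect this from the driver rather than the Spark UI, a minimal sketch is to look at the DataFrame's underlying RDD lineage and the plan produced by the Catalyst Optimizer:

print(swimmersJSON.rdd.toDebugString().decode("utf-8"))  # RDD lineage behind the DataFrame
swimmersJSON.explain()  # prints the physical plan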

Note that creating the temporary table is a DataFrame transformation; it is not executed until a DataFrame action is run (for example, the SQL query to be executed in the following section).
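
For example, registering the view above schedules no work on the cluster; an action such as the following (a preview of the SQL query in the next section) is what triggers execution:

spark.sql("SELECT * FROM swimmersJSON").collect()  # action: the job runs here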

Note

DataFrame transformations and actions are similar to RDD transformations and actions in that there is a set of lazy operations (transformations). But, in comparison to RDDs, DataFrame operations are not as lazy, primarily due to the Catalyst Optimizer. For more information, please refer to Holden Karau and Rachel Warren's book High Performance Spark, http://highperformancespark.com/.
