Typically, you will create DataFrames by importing data using SparkSession (or calling spark in the PySpark shell).
In future chapters, we will discuss how to import data from your local file system, Hadoop Distributed File System (HDFS), or other cloud storage systems (for example, S3 or WASB). For this chapter, we will focus on generating your own DataFrame data directly within Spark or utilizing the data sources already available within Databricks Community Edition.
First, instead of accessing the file system, we will create a DataFrame by generating the data. In this case, we'll first create the stringJSONRDD RDD and then convert it into a DataFrame. This code snippet creates an RDD comprised of swimmers (their ID, name, age, and eye color) in JSON format.
Below, we will generate the stringJSONRDD RDD:
stringJSONRDD = sc.parallelize((
    """{
        "id": "123",
        "name": "Katie",
        "age": 19,
        "eyeColor": "brown"
    }""",
    """{
        "id": "234",
        "name": "Michael",
        "age": 22,
        "eyeColor": "green"
    }""",
    """{
        "id": "345",
        "name": "Simone",
        "age": 23,
        "eyeColor": "blue"
    }"""
))
Now that we have created the RDD, we will convert it into a DataFrame by using the SparkSession read.json method (that is, spark.read.json(...)). We will also create a temporary table by using the .createOrReplaceTempView method.
Here is the code for creating a temporary table:
swimmersJSON.createOrReplaceTempView("swimmersJSON")
As noted in the previous chapters, many RDD operations are transformations, which are not executed until an action is executed. For example, in the preceding code snippets, sc.parallelize is a transformation that is executed only when converting from an RDD to a DataFrame by using spark.read.json. Notice that, in the screenshot of this code snippet's notebook (near the bottom left), the Spark job is not executed until the second cell, which contains the spark.read.json operation.
To further emphasize the point, in the right pane of the following figure, we present the DAG graph of execution.
A great resource to better understand the Spark UI DAG visualization is the blog post Understanding Your Apache Spark Application Through Visualization at http://bit.ly/2cSemkv.
In the following screenshot, you can see that the Spark job's parallelize operation is from the first cell generating the RDD stringJSONRDD, while the map and mapPartitions operations are the operations required to create the DataFrame:
In the following screenshot, you can see that the stages for the parallelize operation are from the first cell generating the RDD stringJSONRDD, while the map and mapPartitions operations are the operations required to create the DataFrame:
It is important to note that parallelize, map, and mapPartitions are all RDD transformations. Wrapped within the DataFrame operation, spark.read.json (in this case), are not only the RDD transformations but also the action that converts the RDD into a DataFrame. This is an important callout, because even though you are executing DataFrame operations, to debug them you will need to remember that you will be making sense of RDD operations within the Spark UI.
Note that creating the temporary table is a DataFrame transformation and is not executed until a DataFrame action is executed (for example, in the SQL query to be executed in the following section).
DataFrame transformations and actions are similar to RDD transformations and actions in that there is a set of operations that are lazy (transformations). But, in comparison to RDDs, DataFrame operations are not as lazy, primarily due to the Catalyst Optimizer. For more information, please refer to Holden Karau and Rachel Warren's book High Performance Spark, http://highperformancespark.com/.