To showcase the types of queries you can do with DataFrames, let's look at the use case of on-time flight performance. We will analyze the Airline On-Time Performance and Causes of Flight Delays: On-Time Data (http://bit.ly/2ccJPPM), and join this with the airports dataset, obtained from the Open Flights Airport, airline, and route data (http://bit.ly/2ccK5hw), to better understand the variables associated with flight delays.
For this section, we will be using Databricks Community Edition (a free offering of the Databricks product), which you can get at https://databricks.com/try-databricks. We will be using visualizations and pre-loaded datasets within Databricks to make it easier for you to focus on writing the code and analyzing the results.
If you would prefer to run this on your own environment, you can find the datasets available in our GitHub repository for this book at https://github.com/drabastomek/learningPySpark.
We will first process the source airports and flight performance datasets by specifying their file path location and importing them using SparkSession:
# Set File Paths
flightPerfFilePath = "/databricks-datasets/flights/departuredelays.csv"
airportsFilePath = "/databricks-datasets/flights/airport-codes-na.txt"

# Obtain Airports dataset
airports = spark.read.csv(airportsFilePath, header='true',
    inferSchema='true', sep='\t')
airports.createOrReplaceTempView("airports")

# Obtain Departure Delays dataset
flightPerf = spark.read.csv(flightPerfFilePath, header='true')
flightPerf.createOrReplaceTempView("FlightPerformance")

# Cache the Departure Delays dataset
flightPerf.cache()
Note that we're importing the data using the CSV reader (com.databricks.spark.csv), which works with any specified delimiter (note that the airports data is tab-delimited, while the flight performance data is comma-delimited). Finally, we cache the flight dataset so subsequent queries against it will be faster.
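Keep in mind that cache() is lazy; the data is only materialized the first time an action runs against it. As a quick sanity check (and to warm the cache), you can trigger an action and inspect the inferred schemas. The following is a minimal sketch using standard DataFrame methods:

# Trigger an action so the cached dataset is actually materialized
flightPerf.count()

# Confirm the tab-delimited airports file parsed as expected
airports.printSchema()
flightPerf.printSchema()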
One of the more common tasks with DataFrames/SQL is to join two different datasets; it is often one of the more demanding operations (from a performance perspective). With DataFrames, a lot of the performance optimizations for these joins are included by default:
# Query Sum of Flight Delays by City and Origin Code
# (for Washington State)
spark.sql("""
    select a.City, f.origin, sum(f.delay) as Delays
    from FlightPerformance f
    join airports a
        on a.IATA = f.origin
    where a.State = 'WA'
    group by a.City, f.origin
    order by sum(f.delay) desc"""
).show()
In our scenario, we are querying the total delays by city and origin code for the state of Washington. This will require joining the flight performance data with the airports data by International Air Transport Association (IATA) code. The output of the query is as follows:
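As an aside, the same join and aggregation can also be expressed with the DataFrame API rather than SQL. The following is a rough sketch of an equivalent query, assuming the flightPerf and airports DataFrames defined earlier:

from pyspark.sql import functions as F

# Equivalent join/aggregation using the DataFrame API
(flightPerf
    .join(airports, flightPerf.origin == airports.IATA)
    .where(airports.State == 'WA')
    .groupBy(airports.City, flightPerf.origin)
    .agg(F.sum(flightPerf.delay).alias('Delays'))
    .orderBy(F.desc('Delays'))
    .show())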
Using notebooks (such as Databricks, IPython, Jupyter, and Apache Zeppelin), you can more easily execute and visualize your queries. In the following examples, we will be using the Databricks notebook. Within our Python notebook, we can use the %sql magic to execute SQL statements within that notebook cell:
%sql
-- Query Sum of Flight Delays by City and Origin Code (for Washington State)
select a.City, f.origin, sum(f.delay) as Delays
from FlightPerformance f
join airports a
    on a.IATA = f.origin
where a.State = 'WA'
group by a.City, f.origin
order by sum(f.delay) desc
This is the same as the previous query, but due to formatting, easier to read. In our Databricks notebook example, we can quickly visualize this data into a bar chart:
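You can also produce a similar chart from a Python cell: in Databricks, passing a DataFrame to the notebook's built-in display() function renders a table with the same plotting options. A minimal sketch, assuming the temporary views registered earlier:

# Render the query result with Databricks' built-in plotting;
# the bar chart can be selected below the rendered table
display(spark.sql("""
    select a.City, f.origin, sum(f.delay) as Delays
    from FlightPerformance f
    join airports a
        on a.IATA = f.origin
    where a.State = 'WA'
    group by a.City, f.origin
    order by sum(f.delay) desc"""))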
Let's continue visualizing our data, but broken down by all states in the continental US:
%sql
-- Query Sum of Flight Delays by State (for the US)
select a.State, sum(f.delay) as Delays
from FlightPerformance f
join airports a
    on a.IATA = f.origin
where a.Country = 'USA'
group by a.State
The output bar chart is as follows:
But it would be cooler to view this data as a map; click on the bar chart icon at the bottom-left of the chart, and you can choose from many different native visualizations, including a map:
One of the key benefits of DataFrames is that the information is structured similarly to a table. Therefore, whether you are using notebooks or your favorite BI tool, you will be able to quickly visualize your data.
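For example, outside of Databricks you could pull a small aggregated result down to the driver as a pandas DataFrame and chart it locally. This sketch assumes matplotlib is installed and that the aggregated result is small enough to collect:

import matplotlib.pyplot as plt

# Collect the small aggregated result to the driver as pandas
delays_pdf = spark.sql("""
    select a.State, sum(f.delay) as Delays
    from FlightPerformance f
    join airports a on a.IATA = f.origin
    where a.Country = 'USA'
    group by a.State""").toPandas()

# Plot a simple bar chart of total delays by state
delays_pdf.plot(x='State', y='Delays', kind='bar')
plt.show()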
You can find the full list of pyspark.sql.DataFrame methods at http://bit.ly/2bkUGnT.
You can find the full list of pyspark.sql.functions at http://bit.ly/2bTAzLT.