DataFrame scenario – on-time flight performance

To showcase the types of queries you can do with DataFrames, let's look at the use case of on-time flight performance. We will analyze the Airline On-Time Performance and Causes of Flight Delays: On-Time Data (http://bit.ly/2ccJPPM) and join it with the airports dataset, obtained from the OpenFlights airport, airline, and route data (http://bit.ly/2ccK5hw), to better understand the variables associated with flight delays.

Tip

For this section, we will be using Databricks Community Edition (a free offering of the Databricks product), which you can get at https://databricks.com/try-databricks. We will be using visualizations and pre-loaded datasets within Databricks to make it easier for you to focus on writing the code and analyzing the results.

If you would prefer to run this on your own environment, you can find the datasets available in our GitHub repository for this book at https://github.com/drabastomek/learningPySpark.

Preparing the source datasets

We will first process the source airports and flight performance datasets by specifying their file paths and importing them using the SparkSession:

# Set File Paths
flightPerfFilePath = "/databricks-datasets/flights/departuredelays.csv"
airportsFilePath = "/databricks-datasets/flights/airport-codes-na.txt"

# Obtain Airports dataset
airports = spark.read.csv(airportsFilePath, header='true', inferSchema='true', sep='\t')
airports.createOrReplaceTempView("airports")

# Obtain Departure Delays dataset
flightPerf = spark.read.csv(flightPerfFilePath, header='true')
flightPerf.createOrReplaceTempView("FlightPerformance")

# Cache the Departure Delays dataset 
flightPerf.cache()

Note that we're importing the data using the CSV reader (com.databricks.spark.csv), which works with any specified delimiter (the airports data is tab-delimited, while the flight performance data is comma-delimited). Finally, we cache the flight dataset so that subsequent queries will be faster.
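
Note that cache() is lazy: nothing is actually stored in memory until an action runs against the DataFrame. As a quick sanity check (a minimal sketch, assuming the DataFrames defined previously), you can trigger the cache with an action and review the schema that was inferred for the airports data:

# count() is an action; it materializes the cached Departure Delays data
flightPerf.count()

# Review the schema inferred by inferSchema='true'
airports.printSchema()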

Joining flight performance and airports

One of the more common tasks with DataFrames/SQL is to join two different datasets; it is often one of the more demanding operations (from a performance perspective). With DataFrames, a lot of the performance optimizations for these joins are included by default:

# Query Sum of Flight Delays by City and Origin Code 
# (for Washington State)
spark.sql("""
select a.City, 
f.origin, 
sum(f.delay) as Delays 
from FlightPerformance f 
join airports a 
on a.IATA = f.origin
where a.State = 'WA'
group by a.City, f.origin
order by sum(f.delay) desc"""
).show()

In our scenario, we are querying the total delays by city and origin code for the state of Washington. This will require joining the flight performance data with the airports data by International Air Transport Association (IATA) code. The output of the query is as follows:

(Figure: query output showing the total delays by city and origin code for Washington State)
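
The same query can also be expressed with the DataFrame API instead of SQL. The following is a sketch that assumes the flightPerf and airports DataFrames from the previous section are still in scope (wa_delays is a name introduced here for illustration):

from pyspark.sql import functions as F

# Sum of flight delays by city and origin code (for Washington State),
# expressed with the DataFrame API
wa_delays = (flightPerf
  .join(airports, airports.IATA == flightPerf.origin)
  .where(airports.State == 'WA')
  .groupBy(airports.City, flightPerf.origin)
  .agg(F.sum(flightPerf.delay).alias('Delays'))
  .orderBy(F.desc('Delays')))
wa_delays.show()

You can also call wa_delays.explain() to inspect the physical plan that Spark's Catalyst optimizer produces for the join.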

Using notebooks (such as Databricks, IPython, Jupyter, and Apache Zeppelin), you can more easily execute and visualize your queries. In the following examples, we will be using the Databricks notebook. Within our Python notebook, we can use the %sql magic command to execute SQL statements within that notebook cell:

%sql
-- Query Sum of Flight Delays by City and Origin Code (for Washington State)
select a.City, f.origin, sum(f.delay) as Delays
  from FlightPerformance f
    join airports a
      on a.IATA = f.origin
 where a.State = 'WA'
 group by a.City, f.origin
 order by sum(f.delay) desc

This is the same as the previous query but, thanks to the formatting, it is easier to read. In our Databricks notebook example, we can quickly visualize this data as a bar chart:

(Figure: bar chart of the total delays by city and origin code for Washington State)
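
If you are not running inside Databricks, a common pattern is to collect a small aggregated result to the driver as a pandas DataFrame and plot it locally. The following is a minimal sketch, assuming pandas and matplotlib are installed in your environment:

import matplotlib.pyplot as plt

# Collect the aggregated result (only a handful of rows) to the driver
pdf = spark.sql("""
  select a.City, f.origin, sum(f.delay) as Delays
    from FlightPerformance f
    join airports a on a.IATA = f.origin
   where a.State = 'WA'
   group by a.City, f.origin
   order by Delays desc""").toPandas()

pdf.plot(kind='bar', x='City', y='Delays')
plt.show()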

Visualizing our flight-performance data

Let's continue visualizing our data, but broken down by all states in the continental US:

%sql
-- Query Sum of Flight Delays by State (for the US)
select a.State, sum(f.delay) as Delays
  from FlightPerformance f
    join airports a
      on a.IATA = f.origin
 where a.Country = 'USA'
 group by a.State

The output bar chart is as follows:

(Figure: bar chart of the total delays by state for the US)

But it would be cooler to view this data as a map; click on the bar chart icon at the bottom-left of the chart, and you can choose from many different native visualizations, including a map:

(Figure: map visualization of the total delays by state)

One of the key benefits of DataFrames is that the information is structured like a table. Therefore, whether you are using notebooks or your favorite BI tool, you will be able to quickly visualize your data.
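
For example, you could persist an aggregated result as a table that an external BI tool can then query. This is a sketch under the assumption that you have write access to a metastore; us_delays and the table name us_flight_delays are names introduced here for illustration:

# Aggregate the total delays by state and persist the result as a table
us_delays = spark.sql("""
  select a.State, sum(f.delay) as Delays
    from FlightPerformance f
    join airports a on a.IATA = f.origin
   where a.Country = 'USA'
   group by a.State""")

# saveAsTable registers the data in the metastore for external tools
us_delays.write.mode('overwrite').saveAsTable('us_flight_delays')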

Tip

You can find the full list of pyspark.sql.DataFrame methods at http://bit.ly/2bkUGnT.

You can find the full list of pyspark.sql.functions at http://bit.ly/2bTAzLT.
