Let's start with a set of simple graph queries to understand flight performance and departure delays.
For example, to determine the number of airports and trips, you can run the following commands:
print "Airports: %d" % tripGraph.vertices.count() print "Trips: %d" % tripGraph.edges.count()
As you can see from the results, there are 279 airports with 1.36 million trips:
To determine the longest delayed flight in the dataset, you can run the following query with the result of 1,642 minutes (that's more than 27 hours!):
tripGraph.edges.groupBy().max("delay") # Output +----------+ |max(delay)| +----------+ | 1642| +----------+
To determine the number of delayed versus on-time (or early) flights, you can run the following queries:
print "On-time / Early Flights: %d" % tripGraph.edges.filter("delay <= 0").count() print "Delayed Flights: %d" % tripGraph.edges.filter("delay > 0").count()
with the results nothing that almost 43% of the flights were delayed!
Digging further in this data, let's find out the top five destinations for flights departing from Seattle that are most likely to have significant delays. This can be achieved through the following query:
tripGraph.edges .filter("src = 'SEA' and delay > 0") .groupBy("src", "dst") .avg("delay") .sort(desc("avg(delay)")) .show(5)
As you can see in the following results: Philadelphia (PHL), Colorado Springs (COS), Fresno (FAT), Long Beach (LGB), and Washington D.C (IAD) are the top five cities with flights delayed originating from Seattle:
Let's find which states have the longest cumulative delays (with individual delays > 100 minutes) originating from Seattle. This time we will use the display
command to review the data:
# States with the longest cumulative delays (with individual # delays > 100 minutes) (origin: Seattle) display(tripGraph.edges.filter("src = 'SEA' and delay > 100"))
Using the Databricks display
command, we can also quickly change from this table view to a map view of the data. As can be seen in the following figure, the state with the most cumulative delays originating from Seattle (in this dataset) is California: