Executing simple queries

Let's start with a set of simple graph queries to understand flight performance and departure delays.

Determining the number of airports and trips

For example, to determine the number of airports and trips, you can run the following commands:

print "Airports: %d" % tripGraph.vertices.count()
print "Trips: %d" % tripGraph.edges.count()

As you can see from the results, there are 279 airports with 1.36 million trips:

Determining the number of airports and trips

Determining the longest delay in this dataset

To determine the longest delayed flight in the dataset, you can run the following query with the result of 1,642 minutes (that's more than 27 hours!):

tripGraph.edges.groupBy().max("delay")

# Output
+----------+
|max(delay)| 
+----------+ 
|      1642| 
+----------+

Determining the number of delayed versus on-time/early flights

To determine the number of delayed versus on-time (or early) flights, you can run the following queries:

print "On-time / Early Flights: %d" % tripGraph.edges.filter("delay <= 0").count()
print "Delayed Flights: %d" % tripGraph.edges.filter("delay > 0").count()

with the results nothing that almost 43% of the flights were delayed!

Determining the number of delayed versus on-time/early flights

What flights departing Seattle are most likely to have significant delays?

Digging further in this data, let's find out the top five destinations for flights departing from Seattle that are most likely to have significant delays. This can be achieved through the following query:

tripGraph.edges
  .filter("src = 'SEA' and delay > 0")
  .groupBy("src", "dst")
  .avg("delay")
  .sort(desc("avg(delay)"))
  .show(5)

As you can see in the following results: Philadelphia (PHL), Colorado Springs (COS), Fresno (FAT), Long Beach (LGB), and Washington D.C (IAD) are the top five cities with flights delayed originating from Seattle:

What flights departing Seattle are most likely to have significant delays?

What states tend to have significant delays departing from Seattle?

Let's find which states have the longest cumulative delays (with individual delays > 100 minutes) originating from Seattle. This time we will use the display command to review the data:

# States with the longest cumulative delays (with individual
# delays > 100 minutes) (origin: Seattle)
display(tripGraph.edges.filter("src = 'SEA' and delay > 100"))
What states tend to have significant delays departing from Seattle?

Using the Databricks display command, we can also quickly change from this table view to a map view of the data. As can be seen in the following figure, the state with the most cumulative delays originating from Seattle (in this dataset) is California:

What states tend to have significant delays departing from Seattle?
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset