Now that we've imported our data, let's build our graph. To do this, we're going to build the structure for our vertices and edges. At the time of writing, GraphFrames requires a specific naming convention for vertices and edges: every vertex DataFrame must have an id column, and every edge DataFrame must have a source (src) and destination (dst) column. In our case, the vertices of our flight data are the airports, so we will need to rename the IATA airport code column to id in our airports DataFrame. The edges are the flights, so src and dst correspond to the origin and destination columns of the departureDelays_geo DataFrame.

To simplify the edges for our graph, we will create the tripEdges DataFrame with a subset of the columns available within the departureDelays_geo DataFrame. We also create a tripVertices DataFrame that simply renames the IATA column to id to match the GraphFrames naming convention:
# Note, ensure you have already installed
# the GraphFrames spark-package
from pyspark.sql.functions import *
from graphframes import *

# Create Vertices (airports) and Edges (flights)
tripVertices = airports.withColumnRenamed("IATA", "id").distinct()
tripEdges = departureDelays_geo.select(
    "tripid", "delay", "src", "dst", "city_dst", "state_dst")

# Cache Vertices and Edges
tripEdges.cache()
tripVertices.cache()
Within Databricks, you can query the data using the display command. For example, to view the tripEdges DataFrame, the command is as follows:
display(tripEdges)
The output is a rendered table of the tripEdges rows (tripid, delay, src, dst, city_dst, and state_dst).
Now that we have the two DataFrames, we can create a GraphFrame using the GraphFrame command:
tripGraph = GraphFrame(tripVertices, tripEdges)