Now let's move on to extraction. The print command doesn't display the data in a very useful format, so instead of printing the Spark DataFrame directly, we can use pandas, the open source data analysis library, to create a pandas DataFrame that presents the data as a table.
Now we can look at an example that will make our SQL coders happy.
Import the pandas library and use the .toPandas() method to show the SQL query results:
import pandas as pd
sqlContext.sql("select STATION, METRIC from MyWeather limit 2").toPandas()
Running the preceding commands results in the following output:
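If you want to try the same query outside of a Spark environment, a minimal sketch using Python's built-in sqlite3 module behaves the same way. The table layout and the rows below are made up for illustration; only the SQL statement matches the example above:

```python
import sqlite3

# Build a hypothetical MyWeather table in memory; the station IDs
# and metric names are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("create table MyWeather (STATION text, METRIC text)")
conn.executemany(
    "insert into MyWeather values (?, ?)",
    [
        ("GHCND:US1CASF0001", "PRCP"),
        ("GHCND:US1CASF0001", "TMAX"),
        ("GHCND:US1NYNY0002", "PRCP"),
    ],
)

# Same SQL as the Spark example: fetch the first two rows.
rows = conn.execute(
    "select STATION, METRIC from MyWeather limit 2"
).fetchall()
print(rows)
```

In the Spark version, .toPandas() performs the equivalent of fetchall() and wraps the rows in a pandas DataFrame for tabular display.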
Here is another example of simple SQL query execution. This time we count the number of metrics recorded for each weather station, and then list the weather stations in descending order of their metric record counts:
query = """
select
STATION,
count(*) as metric_count
from MyWeather
group by STATION
order by count(*) desc
"""
sqlContext.sql(query).toPandas()
The preceding code gives us the following output:
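The grouped result can also be reproduced outside Spark. Here is a sketch with sqlite3 and made-up rows (the station names "A" and "B" are hypothetical), running the same GROUP BY query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table MyWeather (STATION text, METRIC text)")
# Hypothetical rows: station A records three metrics, station B one.
conn.executemany(
    "insert into MyWeather values (?, ?)",
    [("A", "PRCP"), ("A", "TMAX"), ("A", "TMIN"), ("B", "PRCP")],
)

query = """
select
  STATION,
  count(*) as metric_count
from MyWeather
group by STATION
order by count(*) desc
"""
# Station A comes first because it has the most metric records.
result = conn.execute(query).fetchall()
print(result)  # [('A', 3), ('B', 1)]
```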
You are encouraged to experiment with additional variations of SQL statements and then review the results in real time.
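For example, one variation worth trying adds a WHERE clause to filter the rows before grouping. Sketched again with sqlite3 and hypothetical data, this counts only the precipitation records per station:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table MyWeather (STATION text, METRIC text)")
# Invented rows for illustration.
conn.executemany(
    "insert into MyWeather values (?, ?)",
    [("A", "PRCP"), ("A", "TMAX"), ("B", "PRCP"), ("B", "SNOW")],
)

# Filter to precipitation records, then count per station.
query = """
select STATION, count(*) as prcp_count
from MyWeather
where METRIC = 'PRCP'
group by STATION
order by STATION
"""
result = conn.execute(query).fetchall()
print(result)  # [('A', 1), ('B', 1)]
```

The same SQL string could be passed to sqlContext.sql() against the real MyWeather table, with .toPandas() rendering the result as before.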