Let's start by loading, parsing, and viewing some simple flight data. First, download the NYC flights dataset as a CSV file from https://s3-us-west-2.amazonaws.com/sparkr-data/nycflights13.csv. Now let's load and parse the dataset using PySpark's spark.read API with the CSV data source:
# Creating DataFrame from data file in CSV format
df = (spark.read.format("com.databricks.spark.csv")
      .option("header", "true")  # treat the first line as column names
      .load("data/nycflights13.csv"))
This is pretty similar to reading the libsvm format. Now you can see the resulting DataFrame's structure as follows:
df.printSchema()
The output is as follows:
Figure 8: Schema of the NYC flight dataset
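Because the load above sets only the header option, Spark reads every column as a string. The following is a minimal sketch of how the same file could be loaded with inferred column types instead. The helper names csv_reader_options and load_flights are hypothetical and not part of the original example, and the sketch assumes an active SparkSession named spark and the same file path as before:

```python
def csv_reader_options(header=True, infer_schema=True):
    """Build CSV reader options as the lowercase strings used above."""
    return {
        "header": str(header).lower(),
        "inferSchema": str(infer_schema).lower(),
    }


def load_flights(spark, path="data/nycflights13.csv"):
    """Load the flights CSV with typed columns (hypothetical helper)."""
    # inferSchema makes Spark take an extra pass over the file to guess types
    return spark.read.options(**csv_reader_options()).csv(path)


# Usage, assuming an active SparkSession named `spark`:
# typed_df = load_flights(spark)
# typed_df.printSchema()
```

Schema inference costs an extra scan of the data, so for large files it may be preferable to pass an explicit schema rather than rely on inference.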
Now let's look at a snapshot of the dataset using the show() method, which prints the first 20 rows by default:
df.show()
The output shows a sample of the data:
Figure 9: Sample of the NYC flight dataset