Exploration

As we demonstrated in Chapter 5, Machine Learning Workouts on IBM Cloud, the pandas data structures offer plenty of essential functionality to support preprocessing and analysis of your data. In this example, though, we are going to look at data exploration again, this time using Spark DataFrame methods.

For example, earlier we loaded a data file using Insert pandas DataFrame; we can reload that file by following the same steps, but this time selecting Insert SparkSession DataFrame. The generated code will include the import ibmos2spark and from pyspark.sql import SparkSession statements and will load the data into a SparkSession DataFrame (rather than a pandas DataFrame):

import ibmos2spark
# @hidden_cell
credentials = {
'endpoint': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
'service_id': 'iam-ServiceId-f9f1f892-3a72-4bdd-9d12-32b5a616dbfa',
'iam_service_endpoint': 'https://iam.bluemix.net/oidc/token',
'api_key': 'D2NjbuA02Ra3Pq6OueNW0JZZU6S3MKXOookVfQsKfH3L'
}
configuration_name = 'os_f20250362df648648ee81858c2a341b5_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_data_2 = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .load(cos.url('2015.CSV', 'chapter6-donotdelete-pr-qy3imqdyi8jv3w'))
df_data_2.take(5)

Running the cell initiates Spark jobs, shows the progress/status of those jobs, and, eventually, displays the output generated by the .take(5) command:

SparkSession is the entry point to Spark SQL. It is one of the very first objects you create while developing a Spark SQL application. As a Spark developer, you create a SparkSession using SparkSession.builder (which gives you access to the Builder API that you use to configure the session).
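As a minimal sketch (the appName value here is purely illustrative; in Watson Studio, the generated code creates the session for you), building a session looks like this:

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; appName is optional and illustrative only
spark = SparkSession.builder \
    .appName('weather-exploration') \
    .getOrCreate()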

Of course, we can also use count() first, as well as other statements:
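For instance, a few standard DataFrame methods you could run against df_data_2 are shown below (the exact counts and statistics depend on the loaded file, so no output is reproduced here):

# Number of rows in the DataFrame
print(df_data_2.count())

# Column names of the DataFrame
print(df_data_2.columns)

# Basic summary statistics for each column
df_data_2.describe().show()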

Another interesting and handy analysis step is to show the schema of a DataFrame. You can use the printSchema() function to print the schema of a Spark DataFrame in a tree format, as follows:

df_data_2.printSchema()

The preceding command yields the following output:

A schema is the description of the structure of the data. It is described using StructType, which is a collection of StructField objects (which, in turn, are tuples of names, types, and nullability classifiers).
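For illustration only, an explicit schema with two hypothetical fields (these names are not taken from the 2015.CSV file) could be built like this and passed to spark.read.schema() instead of reading every column as a string:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Each StructField describes one column: name, data type, and nullability
example_schema = StructType([
    StructField('station_id', StringType(), True),
    StructField('temperature', DoubleType(), True)
])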

Using a Spark DataFrame also gives you the ability to navigate through the data and apply logic. For example, you might want to look at the first two (or first few) rows of your data by printing them; for readability, you can add a row of asterisks between the data rows by using the following code:

for row in df_data_2.take(2):
    print(row)
    print("*" * 104)

The preceding code generates the following output:

What if you would prefer to use your SQL skills to perform your analysis?

No problem! You can use SparkSQL with your SparkSession DataFrame object.

However, all SQL statements must be run against a table, so you need to define a table that acts like a pointer to the DataFrame (after you import the SQLContext module):

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df_data_2.registerTempTable("MyWeather")

Additionally, you'll need to define a new DataFrame object to hold the results of your SQL query and place the SQL statement inside the sqlContext.sql() method. Let's see how that works.

You can run the following cell to select all columns from the table we just created and then print information about the resulting DataFrame and schema of the data:

temp_df = sqlContext.sql("select * from MyWeather")
print(type(temp_df))
print("*" * 104)
print(temp_df)

This results in the following output:

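From here, you could apply the same pattern to more targeted queries, such as filters or aggregations. The column name in the following WHERE clause is hypothetical and would need to be replaced with a field shown by printSchema():

# Hypothetical filtered query; replace STATION with an actual column name
filtered_df = sqlContext.sql("select * from MyWeather where STATION is not null")
filtered_df.show(5)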