Spark SQL

Spark SQL is a Spark module for processing structured data. It started as a modest component, but it has since become the most important Spark library, as DataFrames and Datasets have largely replaced RDDs as the primary programming abstraction.

This chapter is divided into the following recipes:

  • Understanding the evolution of schema awareness
  • Understanding the Catalyst optimizer
  • Inferring schema using case classes
  • Programmatically specifying the schema
  • Understanding the Parquet format
  • Loading and saving data using the JSON format
  • Loading and saving data from relational databases
  • Loading and saving data from an arbitrary source
  • Understanding joins
  • Analyzing nested structures

We will start with a small journey down memory lane to see how schema awareness has gradually evolved within the Spark framework and has now become central to it. After this, we will discuss how the Catalyst optimizer, the query optimization engine at the heart of Spark SQL, works. In the next two recipes, we will focus on converting raw data into DataFrames. Then we will discuss how to seamlessly load and save data using the Parquet and JSON formats, relational databases, and other sources. Lastly, we will discuss joins and nested structures.
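As a taste of what the schema-inference and data-format recipes cover, the following sketch converts a local collection into a typed Dataset via a case class and round-trips it through Parquet. The `Person` case class and the file path are illustrative assumptions, not fixed by this chapter; the Spark APIs shown (`toDS`, `write.parquet`, `read.parquet`) are standard Spark SQL.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class; Spark infers the Dataset schema
// (name: string, age: int) from its fields.
case class Person(name: String, age: Int)

object SchemaInferenceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("schema-inference-sketch")
      .master("local[*]")   // local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Convert a local collection into a typed Dataset.
    val people = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()
    people.printSchema()

    // Round-trip through Parquet (the path is illustrative).
    people.write.mode("overwrite").parquet("/tmp/people.parquet")
    spark.read.parquet("/tmp/people.parquet").show()

    spark.stop()
  }
}
```

Swapping `parquet` for `json` in the write and read calls produces the JSON variant covered later in the chapter.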
