Spark SQL

Spark SQL is a Spark module for processing structured data. It started as a modest component, but it has since become the most important Spark library, as DataFrames and Datasets have largely replaced RDDs as the primary programming abstraction.

This chapter is divided into the following recipes:

  • Understanding the evolution of schema awareness
  • Understanding the Catalyst optimizer
  • Inferring schema using case classes
  • Programmatically specifying the schema
  • Understanding the Parquet format
  • Loading and saving data using the JSON format
  • Loading and saving data from relational databases
  • Loading and saving data from an arbitrary source
  • Understanding joins
  • Analyzing nested structures

We will start with a small journey down memory lane to see how schema awareness has gradually evolved within the Spark framework and has now become central to it. After this, we will discuss how the Catalyst optimizer, the query optimization engine at the heart of Spark SQL, works. In the next two recipes, we will focus on converting raw data into DataFrames. Then we will discuss how to seamlessly load and save data using the Parquet and JSON formats, relational databases, and other sources. Lastly, we will discuss joins and nested structures.
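As a taste of what the schema-inference and data-format recipes cover, the following sketch converts a local collection into a typed Dataset via a case class and round-trips it through Parquet. The `Person` case class and the file path are illustrative assumptions, not fixed by this chapter; the Spark APIs shown (`toDS`, `write.parquet`, `read.parquet`) are standard Spark SQL.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class; Spark infers the Dataset schema
// (name: string, age: int) from its fields.
case class Person(name: String, age: Int)

object SchemaInferenceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("schema-inference-sketch")
      .master("local[*]")   // local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Convert a local collection into a typed Dataset.
    val people = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()
    people.printSchema()

    // Round-trip through Parquet (the path is illustrative).
    people.write.mode("overwrite").parquet("/tmp/people.parquet")
    spark.read.parquet("/tmp/people.parquet").show()

    spark.stop()
  }
}
```

Swapping `parquet` for `json` in the write and read calls produces the JSON variant covered later in the chapter.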
