Chapter 4. Big Data Analytics with Spark SQL, DataFrames, and Datasets

As per the Spark Summit presentation by Matei Zaharia, the creator of Apache Spark (http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-zaharia-keynote), Spark SQL and DataFrames are the most used components of the entire Spark ecosystem. This indicates that Spark SQL is one of the key components companies use for Big Data Analytics.

Users of Spark have three different APIs for interacting with distributed collections of data (a short sketch contrasting them follows this list):

  • The RDD API allows users to work with objects of their choice and express transformations as lambda functions
  • The DataFrame API provides high-level relational operations and an optimized runtime, at the expense of type safety
  • The Dataset API combines the best of the RDD and DataFrame worlds
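
The following is a minimal sketch, intended for the Scala spark-shell on Spark 2.x (where a SparkSession named spark is already available), that expresses the same filter with each of the three APIs. The Person case class and the sample rows are illustrative assumptions, not data used elsewhere in this book:

// Illustrative case class and sample rows (assumptions for this sketch)
case class Person(name: String, age: Int)
val people = Seq(Person("Alice", 30), Person("Bob", 17), Person("Carol", 25))

import spark.implicits._

// RDD API: arbitrary objects, transformations expressed as lambda functions
val peopleRdd = spark.sparkContext.parallelize(people)
peopleRdd.filter(p => p.age >= 18).map(p => p.name).collect()

// DataFrame API: high-level relational operations on untyped Row objects,
// optimized at runtime, at the expense of compile-time type safety
val peopleDf = people.toDF()
peopleDf.filter("age >= 18").select("name").show()

// Dataset API: the same relational optimizations plus static typing
val peopleDs = people.toDS()
peopleDs.filter(p => p.age >= 18).map(p => p.name).show()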

We learned how to use the RDD API in Chapter 3, Deep Dive into Apache Spark. In this chapter, let's understand the concepts of Spark SQL in depth, including the Data Sources API, the DataFrame API, and the Dataset API, and how to perform big data analytics with the most common sources, such as JSON, XML, Parquet, ORC, Avro, CSV, JDBC, and Hive. Also, let's understand how to use Spark SQL as a distributed SQL engine.

This chapter is divided into the following sub-topics:

  • History of Spark SQL
  • Architecture of Spark SQL
  • Evolution of DataFrames and Datasets
  • Why Datasets and DataFrames?
  • Analytics with DataFrames
  • Analytics with Dataset API
  • Data Sources API with XML, Parquet, ORC, Avro, CSV, JDBC, and Hive
  • DataFrame based Spark-on-HBase connector
  • Spark SQL as a distributed SQL engine
  • Hive on Spark

History of Spark SQL

To address the performance issues of Hive queries, a new project called Shark was introduced into the Spark ecosystem in early versions of Spark. Shark used Spark, instead of the MapReduce engine, as the execution engine for Hive queries. Shark was built on the Hive codebase, using the Hive query compiler to parse Hive queries and generate an abstract syntax tree, which was converted to a logical plan with some basic optimizations. Shark applied additional optimizations, created a physical plan of RDD operations, and then executed them in Spark. This provided in-memory performance for Hive queries. However, Shark had three major problems to deal with:

  • Shark was suitable only for querying Hive tables; running relational queries on RDDs was not possible
  • Running HiveQL as a string within Spark programs was error-prone
  • The Hive optimizer was created for the MapReduce paradigm, and it was difficult to extend Spark to new data sources and new processing models

Shark was discontinued as of version 1.0, and Spark SQL was introduced in version 1.0 in May 2014. Spark SQL introduced a new type of RDD called SchemaRDD, which has an associated schema. SchemaRDD allowed both relational queries and Spark functions to be executed on top of it.

Starting with Spark 1.3, SchemaRDD was renamed DataFrame, which is similar to the DataFrames found in Python (pandas) or the R language. A DataFrame is similar to a relational table and offers rich Domain Specific Language (DSL) functions, RDD functions, and SQL.
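
As a minimal sketch of two of these styles (DSL and SQL), the following spark-shell snippet (Spark 2.x syntax) runs the same query through the DataFrame DSL and through SQL; the people.json file and its name and age columns are illustrative assumptions:

// Load a DataFrame from a JSON file (path and columns are assumptions for this sketch)
val df = spark.read.json("people.json")

// DSL style: relational operators expressed as method calls
df.select("name", "age").filter("age > 21").show()

// SQL style: register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()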

Starting with Spark 1.6, a new Dataset API was introduced that provides the best features of RDDs and the DataFrame API together. RDDs are useful because of their static typing and the ease of implementing user-defined functions, while the DataFrame API provides easier-to-use functionality and higher performance than RDDs but lacks static typing, which makes it error-prone.

Starting with Spark 2.0, the Dataset and DataFrame APIs were unified into a single Dataset abstraction; a DataFrame is simply a Dataset of Row objects (Dataset[Row]).
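
To make the unification concrete, the following spark-shell sketch (Spark 2.x, Scala) converts between the untyped DataFrame and a typed Dataset; the Person case class and its sample rows are illustrative assumptions:

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

// Illustrative case class and sample rows (assumptions for this sketch)
case class Person(name: String, age: Int)

val df: DataFrame = Seq(Person("Alice", 30), Person("Bob", 17)).toDF()   // a DataFrame is a Dataset[Row]
val ds: Dataset[Person] = df.as[Person]                                  // typed view of the same data

// Typed operations are checked at compile time; referring to a non-existent
// field such as p.agee would fail compilation rather than fail at runtime
ds.filter(p => p.age >= 18).show()

// Convert back to the untyped DataFrame representation when needed
val untyped: DataFrame = ds.toDF()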

The Spark SQL history is depicted in Figure 4.1:

Figure 4.1: Spark SQL history
