Getting ready

Spark SQL is a component of the Spark ecosystem, first introduced in Spark 1.0. It grew out of a project named Shark, which was an attempt to make Hive run on Spark.

Hive is essentially a relational abstraction layer over Hadoop: it converts SQL queries into MapReduce jobs. See the following figure:

Shark replaced the MapReduce part with Spark while retaining most of Hive's code base:

Initially it worked fine, but Spark developers soon hit a roadblock: Shark inherited Hive's query optimizer, which was designed around MapReduce, and it could not be optimized much further. Finally, they decided to write the SQL engine from scratch, and this gave birth to Spark SQL. Refer to the following figure for a better understanding:

Spark SQL addressed these performance challenges, but it also had to remain compatible with Hive; for that reason, a new wrapper context, HiveContext, was created on top of SQLContext.
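Since HiveContext is a superset of SQLContext, using it costs nothing even if you never touch Hive. The following is a minimal sketch of how the two contexts are created in a Spark 1.x shell, assuming a SparkContext is already available as sc (as it is in spark-shell):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

// Plain SQLContext: supports the SQL 92 subset
val sqlContext = new SQLContext(sc)

// HiveContext wraps SQLContext and adds HiveQL, Hive UDFs,
// and access to data stored in the Hive metastore
val hiveContext = new HiveContext(sc)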

Spark SQL supports accessing data using standard SQL queries and HiveQL, the SQL-like query language that Hive uses. In this chapter, we will explore the different features of Spark SQL. It supports a subset of HiveQL as well as a subset of SQL 92, and it can run SQL/HiveQL queries alongside existing Hive deployments or replace them entirely.
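As a quick, illustrative sketch (the people table, its schema, and the people.txt path are hypothetical placeholders, not part of this recipe), a HiveQL query is issued by passing the query string to hiveContext.sql:

// Hypothetical table and file names, for illustration only
hiveContext.sql("CREATE TABLE IF NOT EXISTS people (name STRING, age INT)")
hiveContext.sql("LOAD DATA LOCAL INPATH 'people.txt' INTO TABLE people")

// The result comes back as a DataFrame that can be processed further
val teenagers = hiveContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagers.collect().foreach(println)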

Running SQL is only part of the reason Spark SQL was created. An equally big reason is that it helps you create and run Spark programs faster: it lets developers write less code, lets programs read less data, and lets the Catalyst optimizer do all the heavy lifting.
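To see what that means in practice, consider a rough sketch (the users.parquet path is hypothetical, and sqlContext.read requires Spark 1.4 or later). When you filter and project a Parquet-backed DataFrame, Catalyst pushes the column pruning and the predicate down to the data source, so the program reads only the columns and rows it actually needs:

// Hypothetical Parquet file; Catalyst prunes unused columns and pushes
// the age > 21 predicate down to the Parquet reader, minimizing I/O
val users = sqlContext.read.parquet("users.parquet")
users.filter(users("age") > 21).select("name").show()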
