Understanding the Catalyst optimizer

Most of the power of Spark SQL comes from the Catalyst optimizer, so it is worth spending some time understanding it. The following diagram shows where the optimization occurs in the processing of a query:

The Catalyst optimizer primarily leverages functional programming constructs of Scala, such as pattern matching. It offers a general framework for transforming trees, which we use to perform analysis, optimization, planning, and runtime code generation.
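To give a feel for this style, here is a minimal, self-contained sketch of a constant-folding rule over a toy expression tree, written with Scala pattern matching. The classes below (`Expr`, `Literal`, `Attribute`, `Add`) are illustrative stand-ins, not Catalyst's actual internal classes:

```scala
// A toy expression tree, standing in for Catalyst's internal tree of expressions.
sealed trait Expr
case class Literal(value: Int) extends Expr
case class Attribute(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

object ConstantFolding {
  // Recursively rewrite the tree, folding Add(Literal, Literal) into a single Literal.
  def fold(expr: Expr): Expr = expr match {
    case Add(l, r) =>
      (fold(l), fold(r)) match {
        case (Literal(a), Literal(b)) => Literal(a + b) // the actual folding rule
        case (fl, fr)                 => Add(fl, fr)    // keep the node, children rewritten
      }
    case other => other
  }
}

// Example: x + (1 + 2) is rewritten to x + 3
// ConstantFolding.fold(Add(Attribute("x"), Add(Literal(1), Literal(2))))
```

Catalyst applies the same idea at scale: each optimization rule is a partial function that matches a tree shape and returns a rewritten tree, and rules are applied repeatedly until the plan stops changing.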

This optimizer has two primary goals:

  • To make adding new optimization techniques easy
  • To enable external developers to extend the optimizer

Spark SQL uses Catalyst's transformation framework in four phases, whose output you can inspect on a real query as shown in the sketch after this list:

  1. Analysis, to resolve references in the logical plan.
  2. Logical plan optimization.
  3. Physical planning.
  4. Code generation, to compile parts of the query to Java bytecode.
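
As a quick illustration of these phases, the sketch below runs a small SQL query and prints the plans produced at each stage using the DataFrame `explain` API and the `queryExecution` field. The application name, data, and view name (`people`) are placeholders chosen for this example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CatalystPhases")   // hypothetical app name
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Hypothetical data registered as a temporary view.
val people = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")
people.createOrReplaceTempView("people")

val query = spark.sql("SELECT name FROM people WHERE age > 30")

// explain(true) prints the parsed, analyzed, and optimized logical plans
// as well as the physical plan.
query.explain(true)

// The same plans are also available programmatically:
println(query.queryExecution.analyzed)      // after analysis (references resolved)
println(query.queryExecution.optimizedPlan) // after logical optimization
println(query.queryExecution.executedPlan)  // after physical planning
```

Comparing the analyzed and optimized plans in this output is an easy way to see individual Catalyst rules (such as predicate pushdown or constant folding) at work.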