Catalyst Optimizer refresh

As noted in Chapter 1, Understanding Spark, one of the primary reasons the Spark SQL engine is so fast is because of the Catalyst Optimizer. For readers with a database background, this diagram looks similar to the logical/physical planner and cost model/cost-based optimization of a relational database management system (RDBMS):

Catalyst Optimizer refresh

The significance of this is that, as opposed to immediately processing the query, the Spark engine's Catalyst Optimizer compiles and optimizes a logical plan and has a cost optimizer that determines the most efficient physical plan generated.

Note

As noted in earlier chapters, while the Spark SQL Engine has both rules-based and cost-based optimizations that include (but are not limited to) predicate push down and column pruning. Targeted for the Apache Spark 2.2 release, the jira item [SPARK-16026] Cost-based Optimizer Framework at https://issues.apache.org/jira/browse/SPARK-16026 is an umbrella ticket to implement a cost-based optimizer framework beyond broadcast join selection. For more information, please refer to the Design Specification of Spark Cost-Based Optimization at http://bit.ly/2li1t4T.

As part of Project Tungsten, there are further improvements to performance by generating byte code (code generation or codegen) instead of interpreting each row of data. Find more details on Tungsten in the Project Tungsten section in Chapter 1, Understanding Spark.

As previously noted, the optimizer is based on functional programming constructs and was designed with two purposes in mind: to ease the adding of new optimization techniques and features to Spark SQL, and to allow external developers to extend the optimizer (for example, adding data-source-specific rules, support for new data types, and so on).

Note

For more information, please refer to Michael Armbrust's excellent presentation, Structuring Spark: SQL DataFrames, Datasets, and Streaming at http://bit.ly/2cJ508x.

For further understanding of the Catalyst Optimizer, please refer to Deep Dive into Spark SQL's Catalyst Optimizer at http://bit.ly/2bDVB1T.

Also, for more information on Project Tungsten, please refer to Project Tungsten: Bringing Apache Spark Closer to Bare Metal at http://bit.ly/2bQIlKY, and Apache Spark as a Compiler: Joining a Billion Rows per Second on a Laptop at http://bit.ly/2bDWtnc.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset