Speeding up PySpark with DataFrames

The significance of DataFrames and the Catalyst Optimizer (and Project Tungsten) is the increase in performance of PySpark queries when compared to non-optimized RDD queries. As shown in the following figure, prior to the introduction of DataFrames, Python query speeds were often twice as slow as the same Scala queries using RDD. Typically, this slowdown in query performance was due to the communications overhead between Python and the JVM:

Speeding up PySpark with DataFrames

Source: Introducing DataFrames in Apache-spark for Large Scale Data Science at http://bit.ly/2blDBI1

With DataFrames, not only was there a significant improvement in Python performance, there is now performance parity between Python, Scala, SQL, and R.

Tip

It is important to note that while, with DataFrames, PySpark is often significantly faster, there are some exceptions. The most prominent one is the use of Python UDFs, which results in round-trip communication between Python and the JVM. Note, this would be the worst-case scenario which would be similar if the compute was done on RDDs.

Python can take advantage of the performance optimizations in Spark even while the codebase for the Catalyst Optimizer is written in Scala. Basically, it is a Python wrapper of approximately 2,000 lines of code that allows PySpark DataFrame queries to be significantly faster.

Altogether, Python DataFrames (as well as SQL, Scala DataFrames, and R DataFrames) are all able to make use of the Catalyst Optimizer (as per the following updated diagram):

Speeding up PySpark with DataFrames

Note

For more information, please refer to the blog post Introducing DataFrames in Apache Spark for Large Scale Data Science at http://bit.ly/2blDBI1, as well as Reynold Xin's Spark Summit 2015 presentation, From DataFrames to Tungsten: A Peek into Spark's Future at http://bit.ly/2bQN92T.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset