Evolution of DataFrames and Datasets

A DataFrame represents data as rows and columns, much like a table in a Relational Database Management System (RDBMS). DataFrames are a common data analytics abstraction that was introduced in the R statistical language and later brought to Python with the proliferation of the pandas library and the PyData ecosystem. DataFrames provide an easy way to develop applications and offer higher developer productivity.

Spark SQL DataFrames have richer optimizations under the hood than R or pandas DataFrames. They can be created from files, pandas DataFrames, tables in Hive, external databases such as MySQL, or RDDs. The DataFrame API is available in Scala, Java, Python, and R.
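As a quick illustration of these creation paths, the following is a minimal PySpark sketch. It assumes a Spark 2.x SparkSession named spark; the file path, Hive table name, and JDBC connection settings are placeholders.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-sources").getOrCreate()

# From a file (JSON here; the CSV and Parquet readers work the same way)
df_from_file = spark.read.json("/data/people.json")

# From a pandas DataFrame
pdf = pd.DataFrame({"name": ["alice", "bob"], "age": [30, 25]})
df_from_pandas = spark.createDataFrame(pdf)

# From a table in the Hive metastore (requires Hive support to be enabled)
df_from_hive = spark.table("employees")

# From an external database such as MySQL over JDBC
# (the MySQL JDBC driver must be on the classpath)
df_from_jdbc = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://localhost:3306/hr") \
    .option("dbtable", "employees") \
    .option("user", "spark") \
    .option("password", "secret") \
    .load()

# From an RDD of tuples, by supplying column names
rdd = spark.sparkContext.parallelize([("alice", 30), ("bob", 25)])
df_from_rdd = spark.createDataFrame(rdd, ["name", "age"])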

While DataFrames provided relational operations and higher performance, they lacked type safety, which led to run-time errors. Although it was possible to convert between RDDs and DataFrames, doing so required a fair amount of boilerplate code and could be expensive. So, the Dataset API was introduced in version 1.6 to combine the best of both worlds, RDDs and DataFrames: the static typing and easier implementation of the functional features of RDDs, and the superior performance features of DataFrames.

However, Dataset and DataFrame were separate classes in version 1.6. In version 2.0, the Dataset and DataFrame APIs were unified to provide a single API for developers. A DataFrame is simply a Dataset of Row objects (Dataset[Row]), so a DataFrame shares the same methods as a Dataset.

The Dataset API is offered in Scala and Java; it is not supported in Python or R. However, many of the benefits of the Dataset API are already available naturally in Python and R because of their dynamic nature. The DataFrame API is available in all four languages: Java, Scala, Python, and R.

What's wrong with RDDs?

Let's understand why RDDs were not enough and what led to the creation of DataFrames and Datasets.

Spark computes the closure of an RDD operation, serializes it, and ships it to the executors. This means your code is shipped essentially in raw form, without much optimization. There is also no built-in way to represent structured data in an RDD or to query it. Working with RDDs is fairly easy, but the code can get messy when dealing with tuples. The following code computes the average age of people by name, first using RDDs and then using DataFrames. DataFrames are easier to use and still provide superior performance to conventional RDDs because of the optimizations done by Catalyst.

# Each input line is "name<TAB>age"
input = sc.textFile("hdfs://localhost:8020/data/input") \
          .map(lambda line: line.split("\t"))
input.map(lambda a: (a[0], [int(a[1]), 1])) \
     .reduceByKey(lambda a, b: [a[0] + b[0], a[1] + b[1]]) \
     .map(lambda a: [a[0], a[1][0] / a[1][1]]) \
     .collect()

spark.table("emp_table")  
      .groupBy("emp_name")  
      .agg("emp_name", avg("emp_age"))  
      .collect()
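To see what Catalyst does with a DataFrame query, you can ask Spark to print its query plans. The following is a small sketch reusing the emp_table example above (the table is assumed to exist):

from pyspark.sql.functions import avg

df = spark.table("emp_table").groupBy("emp_name").agg(avg("emp_age"))

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan
df.explain(True)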

As a comparison from the Hadoop world, a handwritten MapReduce job might be slower than an equivalent Hive or Pig job because of the optimizations Hive and Pig perform under the hood. DataFrames relate to RDDs in a similar way.

RDD Transformations versus Dataset and DataFrame Transformations

Let's understand the differences between RDD transformations and Dataset/DataFrame transformations in detail. The following table shows these differences:

| RDD Transformations | Dataset and DataFrame Transformations |
| --- | --- |
| Transformations are lazily evaluated and triggered by a Spark action. | Dataset and DataFrame transformations are lazy as well. |
| The operators in a Spark job are used to construct a Directed Acyclic Graph (DAG), with optimizations such as combining or rearranging operators. | A Dataset or DataFrame creates an Abstract Syntax Tree (AST), which is parsed by Catalyst and improved using both rule-based and cost-based optimizations. |
| Spark computes the closure, serializes it, and ships it to the executors. Optimizations must be done by the user. | Optimized, generated code for the transformations is shipped to the executors. |
| The lowest-level API in Spark. | A high-level API on top of Spark. |
| A schema is not needed for RDDs. | A schema is imposed while creating Datasets and DataFrames. |
| No optimizations such as predicate pushdown; Hadoop sources are read through the Hadoop InputFormat with no built-in optimizations. | Makes use of smart sources through the Data Sources API, which enables predicate pushdown optimizations at the source (see the sketch after this table). |
| Performance varies across languages. | The same performance in Scala, Java, Python, or R. |
| Reading and combining multiple sources is difficult. | Reading and combining multiple sources is easy because of the underlying Data Sources API. |
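As a concrete sketch of the predicate pushdown point in the table, the following reads a Parquet file and prints the physical plan; the path and column name are hypothetical. With a smart source, the filter is pushed down to the scan instead of being applied after a full read.

df = spark.read.parquet("/data/employees.parquet")

# Look for a PushedFilters entry on the FileScan in the physical plan
df.filter(df.emp_age > 30).explain()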
