A DataFrame organizes data into rows and columns, just like a table in a Relational Database Management System (RDBMS). DataFrames are a common data analytics abstraction that was introduced in the R statistical language and later brought to Python through the pandas library and the PyData ecosystem. DataFrames make applications easier to develop and raise developer productivity.
Spark SQL DataFrames have richer optimizations under the hood than R or pandas DataFrames. They can be created from files, pandas DataFrames, Hive tables, external databases such as MySQL, or RDDs. The DataFrame API is available in Scala, Java, Python, and R.
While DataFrames provided relational operations and higher performance, they lacked type safety, which led to run-time errors. It was possible to regain type safety by converting a DataFrame back to an RDD, but this required a fair amount of boilerplate code and was expensive. So, the Dataset API was introduced in version 1.6, combining the best of both worlds, RDDs and DataFrames: static typing, easier implementation of the functional features of RDDs, and the superior performance of DataFrames.
However, Dataset and DataFrame were separate classes in version 1.6. In version 2.0, the Dataset and DataFrame APIs were unified to provide a single API for developers: a DataFrame is simply a Dataset[T] where T is the Row type, so DataFrame shares the same methods as Dataset.
The Dataset API is offered only in Scala and Java; it is not supported in Python or R. However, because those languages are dynamically typed, many of the benefits of the Dataset API are already available to them naturally. The DataFrame API is available in all four languages: Java, Scala, Python, and R.
Let's understand why RDDs were not enough and led to the creation of DataFrames and Datasets.
Spark computes the closure of each task, serializes it, and ships it to the executors. This means your code is shipped in raw form without much optimization. There is no built-in way to represent structured data in an RDD or to query it. Working with RDDs is fairly easy, but the code can get messy when dealing with tuples. The following code illustrates computing the average age of people, grouped by name, first with an RDD and then with a DataFrame. DataFrames are easier to use and still provide superior performance over conventional RDDs because of the optimizations done by the Catalyst optimizer.
# RDD version: parse (name, age) pairs and average ages per name
input = sc.textFile("hdfs://localhost:8020/data/input") \
    .map(lambda line: line.split(" "))
input.map(lambda a: (a[0], [int(a[1]), 1])) \
    .reduceByKey(lambda a, b: [a[0] + b[0], a[1] + b[1]]) \
    .map(lambda a: [a[0], a[1][0] / a[1][1]]) \
    .collect()

# DataFrame version: the same aggregation via the DataFrame API
from pyspark.sql.functions import avg

spark.table("emp_table") \
    .groupBy("emp_name") \
    .agg(avg("emp_age")) \
    .collect()
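To see exactly what the RDD pipeline computes, here is a plain-Python emulation of the map and reduceByKey steps on a few made-up (name, age) records; this is only a sketch of the aggregation logic, not Spark itself:

```python
# Sample input lines, standing in for the HDFS file
lines = ["alice 30", "bob 25", "alice 40"]

# map: each line becomes (name, [age, 1]), pairing a sum with a count
pairs = [(a[0], [int(a[1]), 1]) for a in (line.split(" ") for line in lines)]

# reduceByKey: merge the [sum, count] accumulators per key
sums = {}
for name, (age, count) in pairs:
    if name in sums:
        sums[name] = [sums[name][0] + age, sums[name][1] + count]
    else:
        sums[name] = [age, count]

# final map: divide sum by count to get the average per name
averages = {name: s / c for name, (s, c) in sums.items()}
print(averages)  # {'alice': 35.0, 'bob': 25.0}
```

Carrying the [sum, count] pair through the reduction is what makes the average decomposable; a naive "average of averages" would give the wrong answer when counts differ.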
As a comparison from the Hadoop world, a handwritten MapReduce job might well be slower than an equivalent Hive or Pig job because of the optimizations Hive and Pig perform under the hood. DataFrames can be seen in a similar way.
Let's understand the differences between RDD transformations and DataFrame transformations in detail. The following table shows the differences: