Let's understand the four components of Spark SQL: SQL, the Data Sources API, the DataFrame API, and the Dataset API.
Spark SQL can read data from and write data to Hive tables using the SQL language. SQL can be used from within Java, Scala, Python, and R, over JDBC/ODBC, or from the command-line interface. When SQL is used within a programming language, the results are returned as DataFrames.
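As a minimal sketch of embedding SQL in Scala, assume a Hive table named `people` already exists and that the `SparkSession` entry point from Spark 2.x is used (in the 1.x line the equivalent entry point is `HiveContext`):

```scala
import org.apache.spark.sql.SparkSession

// Entry point with Hive support, so SQL can read from and write to Hive tables.
val spark = SparkSession.builder()
  .appName("SparkSQLExample")
  .enableHiveSupport()
  .getOrCreate()

// SQL embedded in Scala: the result comes back as a DataFrame.
val adultsDF = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adultsDF.show()

// Writing the result back out as a new Hive table, again via SQL.
adultsDF.createOrReplaceTempView("adults")
spark.sql("CREATE TABLE IF NOT EXISTS adults_hive AS SELECT * FROM adults")
```

The same query could equally be issued over JDBC/ODBC or from the `spark-sql` command line; only the embedded-language form returns a DataFrame object to work with programmatically.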
The Data Sources API provides a single interface for reading and writing data using Spark SQL. In addition to the built-in sources that come prepackaged with the Apache Spark distribution, the Data Sources API allows external developers to add custom data sources. All external data sources and other packages can be viewed at http://spark-packages.org/.
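A minimal sketch of this single interface, reusing the `spark` session from the previous snippet (the file paths and the spark-csv package are only illustrative assumptions):

```scala
// Built-in sources (JSON, Parquet, JDBC, ...) and external packages share the
// same read/write interface: format(), option(), and load()/save().

// Reading JSON with a built-in source.
val eventsDF = spark.read.format("json").load("/data/events.json")

// Writing the same data out as Parquet through the same API.
eventsDF.write.format("parquet").save("/data/events_parquet")

// An external package published on spark-packages.org plugs in through the
// same format() hook; this assumes the spark-csv package is on the classpath.
val csvDF = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/data/events.csv")
```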
Advantages of the Data Sources API are:
The DataFrame API is designed to make big data analytics easier for a variety of users. This API is inspired by DataFrames in R and Python (Pandas), but designed for distributed processing of massive datasets to support modern big data analytics. DataFrames can be seen as an extension of the existing RDD API and as an abstraction over RDDs.
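A minimal sketch of the DataFrame API in Scala, reusing the `spark` session from above (the `Person` case class and the sample data are hypothetical):

```scala
import spark.implicits._

// A hypothetical case class; DataFrames can be built on top of RDDs of such objects.
case class Person(name: String, age: Int)

// Start from an RDD and view it relationally as a DataFrame.
val peopleRDD = spark.sparkContext.parallelize(
  Seq(Person("Alice", 29), Person("Bob", 35), Person("Carol", 17)))
val peopleDF = peopleRDD.toDF()

// Relational-style operations expressed through the DataFrame API.
peopleDF.filter($"age" >= 18)
  .groupBy($"age")
  .count()
  .show()
```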
Advantages of the DataFrame API are:
The Dataset API, introduced in version 1.6, combines the best of RDDs and DataFrames. Datasets use encoders to convert JVM objects to and from a tabular representation, which is stored using Spark's Tungsten binary format.
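A minimal sketch of the Dataset API in Scala, again reusing the `spark` session from above (the `Person` case class and the JSON path are hypothetical):

```scala
import spark.implicits._   // brings implicit Encoders for case classes into scope

// The Encoder derived for this case class converts Person objects
// to and from Tungsten's binary representation.
case class Person(name: String, age: Int)

// Create a strongly typed Dataset directly, or convert a DataFrame with as[T].
val peopleDS = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()
val fromJson = spark.read.json("/data/people.json").as[Person]

// Lambdas operate on typed Person objects and are checked by the Scala compiler.
peopleDS.filter(p => p.age >= 18).map(p => p.name).show()
```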
Advantages of the Dataset API are:
The following table shows the differences between SQL, DataFrames, and Datasets in terms of compile-time and runtime safety.
| | SQL | DataFrames | Datasets |
|---|---|---|---|
| Syntax errors | Runtime | Compile time | Compile time |
| Analysis errors | Runtime | Runtime | Compile time |
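As a rough illustration of the table, the following sketch (using the hypothetical `Person` case class and a `people` view built from it) shows where each API catches an error; the failing calls are left commented out so the snippet itself runs:

```scala
import spark.implicits._

case class Person(name: String, age: Int)
val peopleDS = Seq(Person("Alice", 29)).toDS()
val peopleDF = peopleDS.toDF()
peopleDF.createOrReplaceTempView("people")

// SQL: syntax and analysis errors both surface only when the query is executed.
// spark.sql("SELEECT name FROM people")        // syntax error, thrown at runtime
// spark.sql("SELECT missing_col FROM people")  // analysis error, thrown at runtime

// DataFrames: the Scala compiler catches syntax errors,
// but an unknown column name is only detected at runtime.
// peopleDF.select("missing_col")               // compiles, fails during analysis

// Datasets: fields are part of the Person type, so the compiler rejects
// both bad syntax and references to non-existent fields.
// peopleDS.map(p => p.missingField)            // does not compile
```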