Spark 2.x supports a different way of defining schemas for complex datatypes. First, let's look at a simple example. To use Encoders, they must be brought into scope with an import statement:
import org.apache.spark.sql.Encoders
Let's look at a simple example where a tuple is used as the datatype in the Dataset APIs:
scala> Encoders.product[(Integer, String)].schema.printTreeString
root
|-- _1: integer (nullable = true)
|-- _2: string (nullable = true)
The preceding code is cumbersome to write every time, so we can instead define a case class for our needs and then use it.
We can define a case class Record with two fields, an Integer and a String:
scala> case class Record(i: Integer, s: String)
defined class Record
Using Encoders, we can easily create a schema on top of the case class, thus allowing us to use the various APIs with ease:
scala> Encoders.product[Record].schema.printTreeString
root
|-- i: integer (nullable = true)
|-- s: string (nullable = true)
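Beyond printing the tree, the schema derived from a case class can also be inspected programmatically. The following is a minimal sketch (it assumes only the spark-sql library on the classpath, no running SparkSession):

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.{IntegerType, StringType}

case class Record(i: Integer, s: String)

// The product encoder derives the schema from the case class fields
val schema = Encoders.product[Record].schema

// Field names and types come straight from the case class definition
println(schema.fieldNames.mkString(", "))
println(schema("i").dataType == IntegerType)
println(schema("s").dataType == StringType)
```

Each field of the resulting StructType can be looked up by name, which is handy when validating data against an expected schema.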
All datatypes of Spark SQL are located in the package org.apache.spark.sql.types.
You can access them by using:
import org.apache.spark.sql.types._
You should use the DataTypes object in your code in order to create complex Spark SQL types such as arrays or maps as shown in the following:
scala> import org.apache.spark.sql.types.DataTypes
import org.apache.spark.sql.types.DataTypes
scala> val arrayType = DataTypes.createArrayType(IntegerType)
arrayType: org.apache.spark.sql.types.ArrayType = ArrayType(IntegerType,true)
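The DataTypes object offers similar factory methods for the other complex types. The following sketch builds a map type and a struct type the same way (the field names `id` and `tags` are illustrative choices, not part of any Spark API):

```scala
import org.apache.spark.sql.types._

// ArrayType with non-nullable elements: pass containsNull explicitly
val strictArray = DataTypes.createArrayType(IntegerType, false)

// MapType: key type and value type (values are nullable by default)
val mapType = DataTypes.createMapType(StringType, LongType)

// StructType: built from an array of StructFields
val structType = DataTypes.createStructType(Array(
  DataTypes.createStructField("id", IntegerType, false),
  DataTypes.createStructField("tags", strictArray, true)
))

println(strictArray)
println(mapType)
```

Note that the nullability flags (`containsNull`, `valueContainsNull`) default to true when omitted, which is why the earlier output showed `ArrayType(IntegerType,true)`.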
The following data types are supported in the Spark SQL APIs:
| Data type | Value type in Scala | API to access or create a data type |
| --- | --- | --- |
| ByteType | Byte | ByteType |
| ShortType | Short | ShortType |
| IntegerType | Int | IntegerType |
| LongType | Long | LongType |
| FloatType | Float | FloatType |
| DoubleType | Double | DoubleType |
| DecimalType | java.math.BigDecimal | DecimalType |
| StringType | String | StringType |
| BinaryType | Array[Byte] | BinaryType |
| BooleanType | Boolean | BooleanType |
| TimestampType | java.sql.Timestamp | TimestampType |
| DateType | java.sql.Date | DateType |
| ArrayType | scala.collection.Seq | ArrayType(elementType, [containsNull]) |
| MapType | scala.collection.Map | MapType(keyType, valueType, [valueContainsNull]) |
| StructType | org.apache.spark.sql.Row | StructType(fields). Note: fields is a Seq of StructFields; two fields with the same name are not allowed. |
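To tie the table together, here is a sketch of a schema exercising several of its entries, paired with a Row that uses the corresponding Scala value types from the middle column (the field names and values are made up for illustration):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// A StructType combining simple and complex types from the table above
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("born", DateType, nullable = true),
  StructField("scores", ArrayType(DoubleType, containsNull = false), nullable = true),
  StructField("attrs", MapType(StringType, StringType), nullable = true)
))

// A Row matching the schema: String, java.sql.Date, Seq, and Map
val row = Row("Alice", java.sql.Date.valueOf("2000-01-01"),
              Seq(1.5, 2.0), Map("team" -> "blue"))

println(schema.treeString)
```

Such a schema can then be passed to readers (for example, `spark.read.schema(schema)`) so that incoming data is parsed with exactly these types instead of being inferred.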