Spark 2.x supports a different way of defining schemas for complex datatypes. First, let's look at a simple example. To use Encoders, they must be brought into scope with an import statement:
import org.apache.spark.sql.Encoders
Let's look at a simple example where a tuple is used as the datatype in the Dataset APIs:
scala> Encoders.product[(Integer, String)].schema.printTreeString
root
|-- _1: integer (nullable = true)
|-- _2: string (nullable = true)
The preceding code is cumbersome to write every time, so we can instead define a case class for our needs and then use it.
We can define a case class Record with two fields, an Integer and a String:
scala> case class Record(i: Integer, s: String)
defined class Record
Using Encoders, we can easily create a schema on top of the case class, thus allowing us to use the various APIs with ease:
scala> Encoders.product[Record].schema.printTreeString
root
|-- i: integer (nullable = true)
|-- s: string (nullable = true)
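Beyond printing the tree, the schema derived from a case class can also be inspected programmatically. The following is a minimal sketch (it assumes only the spark-sql library on the classpath, no running SparkSession):

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.{IntegerType, StringType}

case class Record(i: Integer, s: String)

// The product encoder derives the schema from the case class fields
val schema = Encoders.product[Record].schema

// Field names and types come straight from the case class definition
println(schema.fieldNames.mkString(", "))
println(schema("i").dataType == IntegerType)
println(schema("s").dataType == StringType)
```

Each field of the resulting StructType can be looked up by name, which is handy when validating data against an expected schema.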
All datatypes of Spark SQL are located in the package org.apache.spark.sql.types.
You can access them by using:
import org.apache.spark.sql.types._
You should use the DataTypes object in your code in order to create complex Spark SQL types such as arrays or maps as shown in the following:
scala> import org.apache.spark.sql.types.DataTypes
import org.apache.spark.sql.types.DataTypes
scala> val arrayType = DataTypes.createArrayType(IntegerType)
arrayType: org.apache.spark.sql.types.ArrayType = ArrayType(IntegerType,true)
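The DataTypes object offers similar factory methods for the other complex types. The following sketch builds a map type and a struct type the same way (the field names `id` and `tags` are illustrative choices, not part of any Spark API):

```scala
import org.apache.spark.sql.types._

// ArrayType with non-nullable elements: pass containsNull explicitly
val strictArray = DataTypes.createArrayType(IntegerType, false)

// MapType: key type and value type (values are nullable by default)
val mapType = DataTypes.createMapType(StringType, LongType)

// StructType: built from an array of StructFields
val structType = DataTypes.createStructType(Array(
  DataTypes.createStructField("id", IntegerType, false),
  DataTypes.createStructField("tags", strictArray, true)
))

println(strictArray)
println(mapType)
```

Note that the nullability flags (`containsNull`, `valueContainsNull`) default to true when omitted, which is why the earlier output showed `ArrayType(IntegerType,true)`.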
The following data types are supported in the Spark SQL APIs:
| Data type | Value type in Scala | API to access or create a data type |
| --- | --- | --- |
| ByteType | Byte | ByteType |
| ShortType | Short | ShortType |
| IntegerType | Int | IntegerType |
| LongType | Long | LongType |
| FloatType | Float | FloatType |
| DoubleType | Double | DoubleType |
| DecimalType | java.math.BigDecimal | DecimalType |
| StringType | String | StringType |
| BinaryType | Array[Byte] | BinaryType |
| BooleanType | Boolean | BooleanType |
| TimestampType | java.sql.Timestamp | TimestampType |
| DateType | java.sql.Date | DateType |
| ArrayType | scala.collection.Seq | ArrayType(elementType, [containsNull]) |
| MapType | scala.collection.Map | MapType(keyType, valueType, [valueContainsNull]) |
| StructType | org.apache.spark.sql.Row | StructType(fields). Note: fields is a Seq of StructFields; two fields with the same name are not allowed. |
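To tie the table together, here is a sketch of a schema exercising several of its entries, paired with a Row that uses the corresponding Scala value types from the middle column (the field names and values are made up for illustration):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// A StructType combining simple and complex types from the table above
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("born", DateType, nullable = true),
  StructField("scores", ArrayType(DoubleType, containsNull = false), nullable = true),
  StructField("attrs", MapType(StringType, StringType), nullable = true)
))

// A Row matching the schema: String, java.sql.Date, Seq, and Map
val row = Row("Alice", java.sql.Date.valueOf("2000-01-01"),
              Seq(1.5, 2.0), Map("team" -> "blue"))

println(schema.treeString)
```

Such a schema can then be passed to readers (for example, `spark.read.schema(schema)`) so that incoming data is parsed with exactly these types instead of being inferred.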