Encoders

Spark 2.x supports a different way of defining schemas for complex data types. First, let's look at a simple example. The Encoders object must be imported before you can use it:

import org.apache.spark.sql.Encoders

Let's look at a simple example that uses a tuple as the data type in the Dataset APIs:

scala> Encoders.product[(Integer, String)].schema.printTreeString
root
 |-- _1: integer (nullable = true)
 |-- _2: string (nullable = true)
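
This encoder is what backs a typed Dataset of such tuples. The following is a minimal sketch, assuming a spark-shell session (importing spark.implicits._ brings the tuple encoders into scope; the values are purely illustrative):

scala> import spark.implicits._
import spark.implicits._

scala> // the product encoder for (Int, String) supplies the schema, as above
scala> val ds = Seq((1, "one"), (2, "two")).toDS()
ds: org.apache.spark.sql.Dataset[(Int, String)] = [_1: int, _2: string]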

Spelling out the tuple type every time quickly becomes cumbersome, so we can instead define a case class for our needs and then use it.

We can define a case class Record with two fields, an Integer and a String:

scala> case class Record(i: Integer, s: String)
defined class Record

Using Encoders, we can derive a schema from the case class, which lets us use the various APIs with ease:

scala> Encoders.product[Record].schema.printTreeString
root
 |-- i: integer (nullable = true)
 |-- s: string (nullable = true)
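
The same encoder is picked up implicitly when we build a Dataset of Record values. Again a minimal sketch, assuming a spark-shell session with spark.implicits._ imported and illustrative data:

scala> import spark.implicits._
import spark.implicits._

scala> val records = Seq(Record(1, "a"), Record(2, "b")).toDS()
records: org.apache.spark.sql.Dataset[Record] = [i: int, s: string]

scala> // typed operations work directly with the case class fields
scala> records.filter(_.i > 1).show()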

All Spark SQL data types are located in the package org.apache.spark.sql.types.

You can access them by using:

import org.apache.spark.sql.types._

Use the DataTypes object in your code to create complex Spark SQL types, such as arrays or maps, as shown in the following:

scala> import org.apache.spark.sql.types.DataTypes
import org.apache.spark.sql.types.DataTypes
scala> val arrayType = DataTypes.createArrayType(IntegerType)
arrayType: org.apache.spark.sql.types.ArrayType = ArrayType(IntegerType,true)
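
The trailing true in the output is the containsNull flag of the array type. createArrayType also takes an explicit containsNull argument if you want an array type whose elements cannot be null:

scala> val arrayTypeNoNulls = DataTypes.createArrayType(IntegerType, false)
arrayTypeNoNulls: org.apache.spark.sql.types.ArrayType = ArrayType(IntegerType,false)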

The following data types are supported in the Spark SQL APIs:

Data type        Value type in Scala         API to access or create a data type
ByteType         Byte                        ByteType
ShortType        Short                       ShortType
IntegerType      Int                         IntegerType
LongType         Long                        LongType
FloatType        Float                       FloatType
DoubleType       Double                      DoubleType
DecimalType      java.math.BigDecimal        DecimalType
StringType       String                      StringType
BinaryType       Array[Byte]                 BinaryType
BooleanType      Boolean                     BooleanType
TimestampType    java.sql.Timestamp          TimestampType
DateType         java.sql.Date               DateType
ArrayType        scala.collection.Seq        ArrayType(elementType, [containsNull])
MapType          scala.collection.Map        MapType(keyType, valueType, [valueContainsNull])
                                             Note: the default value of valueContainsNull is true.
StructType       org.apache.spark.sql.Row    StructType(fields)
                                             Note: fields is a Seq of StructFields; two fields with the same name are not allowed.
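
Putting a few of these together, here is a minimal sketch of composing a nested schema by hand (the field names are purely illustrative):

import org.apache.spark.sql.types._

// a schema mixing scalar, array, and map fields built from the types above
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("tags", ArrayType(StringType), nullable = true),
  StructField("attributes", MapType(StringType, StringType), nullable = true)
))

schema.printTreeString // prints the nested tree, in the same format as above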
