The first step is to define a trait and method that describe the transformation of data by the computation units of a workflow. The data transformation is the foundation of any workflow for processing and classifying a dataset, training and validating a model, and displaying results.
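Two symbolic models are used to define a data transformation: an explicit model, which the user fully defines through a configuration, and an implicit model, which is derived from a training set.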
The simplest form of data transformation is a morphism between two types, U and V. The data transformation enforces a contract for validating an input and returning either a value or an error. From now on, we use the following convention:
- Input value: The validation is implemented by the PartialFunction type that is returned by the data transformation. A MatchError is thrown if the input value does not meet the required condition (contract).
- Output value: The type of the return value is Try[V], for which an exception is returned in case of an error.
Partial functions enable developers to implement methods that address the most common (primary) use cases for which input values have been tested. All other nontrivial use cases (or input values) generate a MatchError exception. At a later stage in the development cycle, the developer can implement the code to handle the less common use cases.
Runtime validation of a partial function
It is good practice to check whether a partial function is defined for a specific value of the argument before applying it:
if (pfn.isDefinedAt(input))
  for { value <- pfn(input) } yield { … }
This preemptive approach allows the developer to select an alternative method or a full function. It is an efficient alternative to catching a MatchError exception. The validation of a partial function is omitted throughout the book for the sake of clarity.
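As a minimal sketch of this check, the following example guards a partial function with isDefinedAt and also shows the standard lift method, which turns a partial function into a total one returning an Option. The names PartialFunctionSketch, logTransform, safeLog, and liftedLog are hypothetical and are not part of the book's library.

```scala
import scala.util.Try

object PartialFunctionSketch {
  // Hypothetical transformation: defined only for strictly positive inputs.
  val logTransform: PartialFunction[Double, Try[Double]] = {
    case x if x > 0.0 => Try(math.log(x))
  }

  // Check the domain first and fall back to a default when the input is rejected
  // or when the computation itself fails.
  def safeLog(x: Double, default: Double): Double =
    if (logTransform.isDefinedAt(x)) logTransform(x).getOrElse(default)
    else default

  // lift returns None for inputs outside the domain instead of throwing MatchError.
  def liftedLog(x: Double): Option[Try[Double]] = logTransform.lift(x)
}
```

Calling logTransform directly with a non-positive value would throw a scala.MatchError, which is what the guard in safeLog avoids.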
Therefore, the signature of a data transformation is defined as follows:
def |> : PartialFunction[U, Try[V]]
The objective is to define a symbolic representation of the transformation of different types of data without exposing the internal state of the algorithm implementing the data transformation. The transformation on a dataset is performed using a model or configuration that is fully defined by the user, which is illustrated in the following diagram:
The transformation of an explicit configuration or model, config, is defined as an ETransform abstract class parameterized by the T type of the model:
abstract class ETransform[T](val config: T) {   // explicit model
  type U   // type of input
  type V   // type of output
  def |> : PartialFunction[U, Try[V]]   // data transformation
}
The input U type and output V type have to be defined in the subclasses of ETransform. The |> transform operator returns a partial function that can be reused for different input values.
The creation of a class that implements a specific transformation using an explicit configuration is quite simple: all you need is the definition of the input/output U/V types and an implementation of the |> transformation method.
Let's consider the extraction of data from a financial source, DataSource, that takes a list of functions that convert some text fields, Fields, into a Double value as the input and produces a list of observations of the XSeries[Double] type. The extraction parameters are defined in the DataSourceConfig class:
class DataSource(
    config: DataSourceConfig,   //1
    srcFilter: Option[Fields => Boolean] = None)
  extends ETransform[DataSourceConfig](config) {   //2
  type U = List[Fields => Double]   //3
  type V = List[XSeries[Double]]   //4
  override def |> : PartialFunction[U, Try[V]] = {   //5
    case u: U if(!u.isEmpty) => …
  }
}
The DataSourceConfig configuration is explicitly provided as an argument of the constructor for DataSource (line 1). The class extends ETransform, which supplies the basic types and data transformation associated with an explicit model (line 2). The class defines the U type of input values (line 3), the V type of output values (line 4), and the |> transformation method that returns a partial function (line 5).
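The same pattern can be sketched on a smaller, self-contained case. The example below repeats the ETransform definition from the text and adds a hypothetical ScaleConfig configuration and Scaler transformation, neither of which belongs to the book's library; they only illustrate how a subclass fixes U and V and implements |>.

```scala
import scala.util.Try

object ETransformSketch {
  // Same shape as the ETransform class described in the text.
  abstract class ETransform[T](val config: T) {
    type U   // type of input
    type V   // type of output
    def |> : PartialFunction[U, Try[V]]
  }

  // Hypothetical explicit configuration: a single scaling factor.
  case class ScaleConfig(factor: Double)

  // Hypothetical transformation: multiplies every value by the configured factor.
  class Scaler(config: ScaleConfig) extends ETransform[ScaleConfig](config) {
    type U = Vector[Double]
    type V = Vector[Double]

    // Defined only for non-empty inputs; empty inputs are outside the contract.
    override def |> : PartialFunction[U, Try[V]] = {
      case xs if xs.nonEmpty => Try(xs.map(_ * config.factor))
    }
  }
}
```

The partial function returned by |> can be stored once and applied to several inputs, which matches the reuse described earlier for the transform operator.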
The DataSource class
The Data extraction section of Appendix A, Basic Concepts, describes the DataSource class functionality. The DataSource class is used throughout the book.
Data transformations using an explicit model or configuration constitute a category with monadic operations. The monad associated with the ETransform class subclasses the definition of the higher kind, _Monad:
private val eTransformMonad = new _Monad[ETransform] {
  override def unit[T](t: T) = eTransform(t)   //6
  override def map[T, U](m: ETransform[T])(f: T => U): ETransform[U] =   //7
    eTransform(f(m.config))
  override def flatMap[T, U](m: ETransform[T])(f: T => ETransform[U]): ETransform[U] =   //8
    f(m.config)
}
The singleton eTransformMonad implements the following basic monadic operators introduced in the Monads section under Abstraction in Chapter 1, Getting Started:
- The unit method is used to instantiate ETransform (line 6)
- map is used to transform an ETransform object by morphing its elements (line 7)
- flatMap is used to transform an ETransform object by instantiating its elements (line 8)
For practical purposes, an implicit class is created to convert an ETransform object to its associated monad, allowing transparent access to the unit, map, and flatMap methods:
implicit class eTransform2Monad[T](fct: ETransform[T]) {
  def unit(t: T) = eTransformMonad.unit(t)
  final def map[U](f: T => U): ETransform[U] =
    eTransformMonad.map(fct)(f)
  final def flatMap[U](f: T => ETransform[U]): ETransform[U] =
    eTransformMonad.flatMap(fct)(f)
}
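To make the effect of the implicit conversion concrete, here is a self-contained sketch that composes explicit transformations in a for comprehension. The shape of the _Monad trait and the eTransform factory are assumptions, since the text references them without showing their definitions; the factory here simply wraps a configuration and returns a transformation whose |> is defined for no input.

```scala
import scala.util.Try

object ETransformMonadSketch {
  abstract class ETransform[T](val config: T) {
    type U
    type V
    def |> : PartialFunction[U, Try[V]]
  }

  // Assumed shape of the higher-kinded _Monad trait.
  trait _Monad[M[_]] {
    def unit[T](t: T): M[T]
    def map[T, U](m: M[T])(f: T => U): M[U]
    def flatMap[T, U](m: M[T])(f: T => M[U]): M[U]
  }

  // Hypothetical factory: wraps a configuration; the partial function is empty.
  def eTransform[T](t: T): ETransform[T] = new ETransform[T](t) {
    type U = Nothing
    type V = Nothing
    override def |> : PartialFunction[U, Try[V]] = PartialFunction.empty[U, Try[V]]
  }

  val eTransformMonad: _Monad[ETransform] = new _Monad[ETransform] {
    override def unit[T](t: T): ETransform[T] = eTransform(t)
    override def map[T, U](m: ETransform[T])(f: T => U): ETransform[U] =
      eTransform(f(m.config))
    override def flatMap[T, U](m: ETransform[T])(f: T => ETransform[U]): ETransform[U] =
      f(m.config)
  }

  implicit class ETransform2Monad[T](fct: ETransform[T]) {
    def map[U](f: T => U): ETransform[U] = eTransformMonad.map(fct)(f)
    def flatMap[U](f: T => ETransform[U]): ETransform[U] = eTransformMonad.flatMap(fct)(f)
  }

  // A single generator desugars to map on the configuration.
  val scaled: ETransform[Double] = for { c <- eTransform(2.0) } yield c * 10.0

  // Two generators desugar to flatMap followed by map.
  val combined: ETransform[Int] =
    for { a <- eTransform(3); b <- eTransform(4) } yield a + b
}
```

Because ETransform itself has no map or flatMap members, the compiler resolves both calls through the implicit class, which is the transparent access the text describes.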
Supervised learning models are extracted from a training set. Transformations such as classification or regression use these implicit models to process the input data, as illustrated in the following diagram:
The transformation for a model implicitly extracted from the training data is defined as an abstract ITransform class parameterized by the T type of observations, xt:
abstract class ITransform[T](val xt: Vector[T]) {   // model input
  type V   // type of output
  def |> : PartialFunction[T, Try[V]]   // data transformation
}
The type of the data collection is Vector, which is an immutable and effective container. An ITransform type is created by defining the T type of the observation, the V output of the data transformation, and the |> method that implements the transformation, usually a classification or regression. Let's consider the support vector machine algorithm, SVM, to illustrate the implementation of a data transformation using an implicit model:
class SVM[T <: AnyVal](   //9
    config: SVMConfig,
    xt: Vector[Array[T]],
    expected: Vector[Double])(implicit f: T => Double)
  extends ITransform[Array[T]](xt) {   //10
  type V = Double   //11
  override def |> : PartialFunction[Array[T], Try[V]] = {   //12
    case x: Array[T] if(x.length == data.size) => ...
  }
}
The support vector machine is a discriminative supervised learning algorithm described in Chapter 8, Kernel Models and Support Vector Machines. A support vector machine, SVM, is instantiated with a configuration and a training set: the xt observations and expected data (line 9). Contrary to the explicit model, the config configuration does not define the model used in the data transformation; the model is implicitly generated from the training set of the xt input data and expected values. An SVM instance is created as an ITransform (line 10) by specifying the V output type (line 11) and overriding the |> transformation method (line 12).
The |> classification method produces a partial function that takes an x observation as an input and returns the prediction value of a Double type.
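A much simpler implicit model can illustrate the same structure. In the sketch below, the hypothetical MeanThreshold classifier (not part of the book's library) derives its model, a single threshold, from the training observations in its constructor, and |> classifies a new observation against that threshold.

```scala
import scala.util.Try

object ITransformSketch {
  // Same shape as the ITransform class described in the text.
  abstract class ITransform[T](val xt: Vector[T]) {
    type V   // type of output
    def |> : PartialFunction[T, Try[V]]
  }

  // Hypothetical classifier: the model is the mean of the training observations.
  class MeanThreshold(xt: Vector[Double]) extends ITransform[Double](xt) {
    require(xt.nonEmpty, "MeanThreshold needs at least one observation")
    type V = Int

    // The model is computed once from the training set and never modified.
    private val threshold: Double = xt.sum / xt.size

    // Defined for any observation that is a number; returns class 1 or 0.
    override def |> : PartialFunction[Double, Try[V]] = {
      case x if !x.isNaN => Try(if (x > threshold) 1 else 0)
    }
  }
}
```

As with the SVM, no model is passed in: the constructor arguments are only training data, and the transformation is fixed once the instance is created.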
Similar to the explicit transformation, we define the monadic operation for the ITransform by overriding the unit (line 13), map (line 14), and flatMap (line 15) methods:
private val iTransformMonad = new _Monad[ITransform] {
  override def unit[T](t: T) = iTransform(Vector[T](t))   //13
  override def map[T, U](m: ITransform[T])(f: T => U): ITransform[U] =   //14
    iTransform(m.xt.map(f))
  override def flatMap[T, U](m: ITransform[T])(f: T => ITransform[U]): ITransform[U] =   //15
    iTransform(m.xt.flatMap(t => f(t).xt))
}
Finally, let's create an implicit class to automatically convert an ITransform object into its associated monad so that it can access the unit, map, and flatMap monad methods transparently:
implicit class iTransform2Monad[T](fct: ITransform[T]) {
  def unit(t: T) = iTransformMonad.unit(t)
  final def map[U](f: T => U): ITransform[U] =
    iTransformMonad.map(fct)(f)
  final def flatMap[U](f: T => ITransform[U]): ITransform[U] =
    iTransformMonad.flatMap(fct)(f)
  def filter(p: T => Boolean): ITransform[T] =   //16
    iTransform(fct.xt.filter(p))
}
The filter method is not strictly an operator of the monad (line 16). However, it is commonly included to constrain (or guard) a sequence of transformations (for example, in a for comprehension). As stated in the Presentation section under Source code in Chapter 1, Getting Started, code related to exceptions, error checking, and validation of arguments is omitted.
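The guard role of filter can be sketched with a self-contained example. The iTransform factory below is an assumption (the text uses it without showing its definition), and the sketch adds a withFilter method alongside filter, because a guard in a for comprehension desugars to withFilter; that method is an addition for illustration and does not appear in the original class.

```scala
import scala.util.Try

object ITransformMonadSketch {
  abstract class ITransform[T](val xt: Vector[T]) {
    type V
    def |> : PartialFunction[T, Try[V]]
  }

  // Hypothetical factory: wraps observations; the partial function is empty.
  def iTransform[T](xt: Vector[T]): ITransform[T] = new ITransform[T](xt) {
    type V = Nothing
    override def |> : PartialFunction[T, Try[V]] = PartialFunction.empty[T, Try[V]]
  }

  implicit class ITransform2Monad[T](fct: ITransform[T]) {
    def map[U](f: T => U): ITransform[U] = iTransform(fct.xt.map(f))
    def flatMap[U](f: T => ITransform[U]): ITransform[U] =
      iTransform(fct.xt.flatMap(t => f(t).xt))
    def filter(p: T => Boolean): ITransform[T] = iTransform(fct.xt.filter(p))
    // Added for this sketch so that `if` guards in for comprehensions compile.
    def withFilter(p: T => Boolean): ITransform[T] = filter(p)
  }

  // The guard keeps positive observations only, then map doubles them.
  val positives: ITransform[Double] =
    for { x <- iTransform(Vector(-1.0, 2.0, 3.0)) if x > 0.0 } yield x * 2.0

  // Two generators combine every observation of the first set with the second.
  val sums: ITransform[Int] =
    for { x <- iTransform(Vector(1, 2)); y <- iTransform(Vector(10, 20)) } yield x + y
}
```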
Immutable transformations
The model for a data transformation (or a processing unit or classifier) class should be immutable. Any modification will alter the integrity of the model or parameters used to process data. In order to ensure that the same model is used in processing the input data for the entire lifetime of a transformation, we do the following:
- The model for an ETransform is defined as an argument of its constructor.
- The constructor of an ITransform generates the model from a given training set. The model has to be rebuilt from the training set (not altered) if it provides an incorrect outcome or prediction.
Models are created by the constructor of classifiers or data transformation classes to ensure their immutability. The design of an immutable transformation is described in the Design template for immutable classifiers section under Scala programming of Appendix A, Basic Concepts.