A workflow computational model

Monads are very useful for manipulating and chaining data transformations that rely on an implicit model or an explicit configuration. However, they are restricted to a single morphism type, T => U. More complex and flexible workflows require weaving together transformations of different types, which calls for a generic factory pattern.

Traditional factory patterns rely on a combination of composition and inheritance and do not provide developers with the same level of flexibility as stackable traits.

In this section, we introduce the concept of modeling with mixins and a variant of the Cake pattern to provide a workflow with three degrees of configurability.

Supporting mathematical abstractions

Stackable traits enable developers to follow a strict mathematical formalism while implementing a model in Scala. Scientists use a universally accepted template to solve a mathematical problem:

  1. Declare the variables relevant to the problem.
  2. Define a model (equations, algorithms, formulas, and so on) as the solution to the problem.
  3. Instantiate the variables and execute the model to solve the problem.

Let's consider the example of kernel functions (described in the Kernel functions section of Chapter 8, Kernel Models and Support Vector Machines): a model that consists of the composition of two mathematical functions, and its potential implementation in Scala.

Step 1 – variable declaration

The implementation consists of wrapping (scoping) the two functions in traits and defining these functions as abstract values.

The mathematical formalism is as follows:

  f: v ∈ R^n → f(v) ∈ R^n
  g: v ∈ R^n → g(v) ∈ R

The Scala implementation is as follows:

type V = Vector[Double]
trait F { val f: V => V }
trait G { val g: V => Double }

Step 2 – model definition

The model is defined as the composition of the two functions. The stack of the F and G traits describes the types of functions that can be composed, a constraint enforced through the self-reference self: G with F.

The formalism will be h = g o f, that is, h(v) = g(f(v)).

The Scala implementation is as follows:

class H { self: G with F => def apply(v: V): Double = g(f(v)) }

Step 3 – instantiation

The model is executed once the f and g variables are instantiated.

The formalism will be as follows:

  f: v → (exp(v0), exp(v1), …)
  g: v → v0 + v1 + …

The Scala implementation is as follows:

val h = new H with G with F {
  val f: V => V = (v: V) => v.map(Math.exp(_))
  val g: V => Double = (v: V) => v.sum
}
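As a quick check, applying h to a vector of illustrative values returns the sum of the element-wise exponentials:

val v = Vector(0.0, 1.0, 2.0)
h(v)  // Math.exp(0.0) + Math.exp(1.0) + Math.exp(2.0), approximately 11.107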

Note

Lazy value triggers

In the preceding example, the value of h(v) = g(f(v)) can be computed automatically as soon as g and f are initialized, by declaring h as a lazy value.
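Here is a minimal sketch of this idea; the H2 trait name is hypothetical, and the lazy value is evaluated only on first access, once f and g have been defined by the mixins:

trait H2 { self: G with F =>
  lazy val h: V => Double = (v: V) => g(f(v))  // initialized on first access
}

val model = new H2 with G with F {
  val f: V => V = (v: V) => v.map(Math.exp(_))
  val g: V => Double = (v: V) => v.sum
}
model.h(Vector(1.0, 2.0))  // Math.exp(1.0) + Math.exp(2.0)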

Clearly, Scala preserves the formalism of mathematical models, making it easier for scientists and developers to migrate their existing projects written in scientific-oriented languages, such as R.

Note

Emulation of R

Most data scientists use the R language to create models and apply learning strategies. They may consider Scala as an alternative to R in some cases, as Scala preserves the mathematical formalism used in models implemented in R.

Let's extend the concept of preserving the mathematical formalism to the dynamic creation of workflows using traits. The design pattern described in the next section is sometimes referred to as the Cake pattern.

Composing mixins to build a workflow

This section presents the key constructs behind the Cake pattern. A workflow composed of configurable data transformations requires a dynamic modularization (substitution) of the different stages of the workflow.

Note

Traits and mixins

Mixins are traits that are stacked against a class. The composition of mixins and the Cake pattern described in this section are important for defining the sequences of data transformations. However, the topic is not directly related to machine learning and so you can skip this section.

The Cake pattern is an advanced class composition pattern that uses mixin traits to meet the demands of a configurable computation workflow. It is also known as stackable modification traits [2:4].

This section is not an in-depth analysis of stackable trait injection and self-references in Scala. There are a few interesting articles on dependency injection that are worth a look [2:5].

Java relies on packages, which are tightly coupled with the directory structure and package prefixes, to modularize the code base. Scala provides developers with a more flexible and reusable approach to creating and organizing modules: traits. Traits can be nested, mixed with classes, stacked, and inherited.

Understanding the problem

Dependency injection is a fancy name for a reverse look-up and binding to dependencies. Let's consider a simple application that requires data preprocessing, classification, and validation. A simple implementation using traits looks like this:

val app = new Classification with Validation with PreProcessing { 
   val filter = .. 
}

If, at a later stage, you need to use an unsupervised clustering algorithm instead of a classifier, then the application has to be rewired:

val app = new Clustering with Validation with PreProcessing { 
    val filter = ..  
}

This approach results in code duplication and a lack of flexibility. Moreover, the filter class member needs to be redefined for each new class in the composition of the application. The problem arises when there is a dependency between the traits used in the composition. Let's consider the case in which the filter depends on the validation methodology.

Note

Mixins linearization [2:6]

The linearization, or method invocation order between mixins, proceeds from right to left through the mixins and from the subtype up to the base types:

  • Trait B extends A
  • Trait C extends A
  • Class M extends N with C with B

The Scala compiler implements the linearization as M => B => C => A => N.
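A minimal sketch (the id method is added purely for illustration) makes this resolution order visible through chained super calls:

trait A { def id: String = "A" }
trait B extends A { override def id: String = "B -> " + super.id }
trait C extends A { override def id: String = "C -> " + super.id }
class N
class M extends N with C with B

val m = new M
m.id  // "B -> C -> A": methods are resolved from the rightmost mixin down to the base trait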

Although you can define filter as an abstract value, it still has to be redefined each time a new validation type is introduced. The solution is to use the self type in the definition of the newly composed PreProcessingWithValidation trait:

trait PreProcessingWithValidation extends PreProcessing {
  self: Validation =>
  val filter = ..
}

The application is built by stacking the PreProcessingWithValidation mixin against the Classification class; the Validation trait also has to be mixed in to satisfy the self type:

val app = new Classification with PreProcessingWithValidation 
    with Validation { /* Validation members, if any, are defined here */ }
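For illustration only, here is a self-contained sketch of this wiring with hypothetical members (a threshold supplied by Validation and used by the filter):

trait Validation { def threshold: Double }
trait PreProcessing { def filter: Double => Boolean }
trait PreProcessingWithValidation extends PreProcessing { self: Validation =>
  val filter: Double => Boolean = (x: Double) => x > threshold
}
class Classification

val app = new Classification with PreProcessingWithValidation with Validation {
  val threshold = 0.25
}
app.filter(0.5)  // true: the filter uses the threshold supplied by Validation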

Note

Overriding def with val

It is advantageous to override the declaration of a method with the declaration of a value of the same signature. Contrary to a value, which is assigned once and for all during instantiation, a method may return a different value for each invocation. A def can be redefined as a def, a val, or a lazy val, but the reverse does not hold: you cannot override a value declaration with a method of the same signature:

trait Validator { val g = (n: Int) => … }
trait MyValidator extends Validator { override def g = (n: Int) => … } //WRONG: a val cannot be overridden by a def
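The legal direction, sketched here with hypothetical traits, is to override a parameterless def with a val (or a lazy val):

trait Validator2 { def g: Int => Boolean = (n: Int) => n >= 0 }
trait MyValidator2 extends Validator2 {
  override val g: Int => Boolean = (n: Int) => n > 0  // legal: a val may override a def
}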

Let's adapt and generalize this pattern to construct a boilerplate template in order to create dynamic computational workflows.

Defining modules

The first step is to generate different modules to encapsulate different types of data transformation.

Note

Use case for describing the Cake pattern

It is difficult to build an example of a real-world workflow using classes and algorithms introduced later in the book. The following simple example is realistic enough to illustrate the different components of the Cake pattern.

Let's define a sequence of three parameterized modules, each of which defines a specific data transformation using the explicit configuration of the ETransform type:

  • Sampling: This is used to extract a sample from raw data
  • Normalization: This is used to normalize the sampled data over [0, 1]
  • Aggregation: This is used to aggregate or reduce the data

The code will be as follows:

trait Sampling[T,A,B] { 
  val sampler: ETransform[T] { type U = A; type V = B }
}
trait Normalization[T,A,B] { 
  val normalizer: ETransform[T] { type U = A; type V = B }
}
trait Aggregation[T,A,B] { 
  val aggregator: ETransform[T] { type U = A; type V = B }
}

Each module contains a single abstract value. One characteristic of the Cake pattern is that it enforces strict modularity by initializing the abstract values with the type encapsulated in the module. One of the objectives in building the framework is to allow developers to create data transformations (derived from ETransform) independently of any workflow.
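The ETransform explicit transformation is introduced in the Explicit models section of this chapter; purely for readability, here is a minimal sketch consistent with its usage in this section (the actual implementation in the book's source code differs in its details):

import scala.util.Try

abstract class ETransform[T](val config: T) {
  type U  // input type of the transformation
  type V  // output type of the transformation
  def |> : PartialFunction[U, Try[V]]
}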

Note

Scala traits and Java packages

There is a major difference between Scala and Java in terms of modularity. Java packages constrain developers into following a strict syntax that requires, for instance, the source file to have the same name as the class it contains. Scala modules based on stackable traits are far more flexible.

Instantiating the workflow

The next step is to write the different modules into a workflow. This is achieved by using the self reference to the stack of the three traits defined in the previous section:

class Workflow[T,U,V,W,Z] {
  self: Sampling[T,U,V] with 
         Normalization[T,V,W] with 
           Aggregation[T,W,Z] =>
    def |> (u: U): Try[Z] = for {
      v <- sampler |> u
      w <- normalizer |> v
      z <- aggregator |> w
    } yield z
}

A picture is worth a thousand words; the following UML class diagram illustrates the workflow factory (or Cake) design pattern:

The UML class diagram of the workflow factory

Finally, the workflow is instantiated by dynamically initializing the abstract values sampler, normalizer, and aggregator, as long as the signatures (input and output types) of the transformations match the parameterized types defined in each module (line 1):

type Dbl_F = Function1[Double, Double]
val samples = 100; val normRatio = 10; val splits = 4

val workflow = new Workflow[Int, Dbl_F, DblVector, DblVector, Int] 
      with Sampling[Int, Dbl_F, DblVector] 
         with Normalization[Int, DblVector, DblVector] 
            with Aggregation[Int, DblVector, Int] {
    val sampler = new ETransform[Int](samples) { /* .. */} //1
    val normalizer = new ETransform[Int](normRatio) { /*  .. */}
    val aggregator = new ETransform[Int](splits) {/*  .. */}
}

Let's implement the data transformation function for each of the three modules/traits by assigning a transformation to the abstract values.

The first transformation, sampler, samples a function f at a frequency of 1/samples over the interval [0, 1]. The second transformation, normalizer, normalizes the sampled data over the range [0, 1] using the Stats class introduced in the next chapter. The last transformation, aggregator, extracts the index of the largest sample (value 1.0):

val sampler = new ETransform[Int](samples) { //2
  type U = Dbl_F  //3
  type V = DblVector  //4
  override def |> : PartialFunction[U, Try[V]] = { 
    case f: U => 
      Try(Vector.tabulate(samples)(n => f(1.0*n/samples))) //5
  }
}

The sampler transformation uses a single model or configuration parameter, samples (line 2). The input type U is defined as Double => Double (line 3) and the output type V is defined as a vector of floating-point values, DblVector (line 4). In this particular case, the transformation consists of applying the input function f to a vector of increasing normalized values (line 5).

The normalizer and aggregator transforms follow the same design pattern as sampler:

val normalizer = new ETransform[Int](normRatio) {
  type U = DblVector; type V = DblVector
  override def |> : PartialFunction[U, Try[V]] = { 
    case x: U if x.size > 0 => Try(Stats[Double](x).normalize)
  }
}
val aggregator = new ETransform[Int](splits) {
  type U = DblVector; type V = Int
  override def |> : PartialFunction[U, Try[V]] = { 
    case x: U if x.size > 0 => Try(Range(0, x.size).find(i => x(i) == 1.0).get)
  }
}

The instantiation of the transformation function follows the template described in the Explicit models section in this chapter.

The workflow is now ready to process any function as an input:

val g = (x: Double) => Math.log(x+1.0) + Random.nextDouble
Try( workflow |> g )  //6

The workflow is executed by providing the input function g to the first mixin, sampler (line 6).

Scala's strong type checking catches inconsistent data types at compilation time, which shortens the development cycle, as runtime errors are more difficult to track down.
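For instance, a hypothetical mismatch such as the following, in which the output type of Sampling no longer matches the input type of Normalization, is rejected by the compiler rather than failing at runtime:

// val wrongWorkflow = new Workflow[Int, Dbl_F, DblVector, DblVector, Int]
//     with Sampling[Int, Dbl_F, Int]  // output type Int instead of DblVector
//     with Normalization[Int, DblVector, DblVector]
//     with Aggregation[Int, DblVector, Int] { /* .. */ }
// Does not compile: the mixins do not conform to the self type declared by Workflow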

Note

Mixins composition for ITransform

We arbitrarily selected a data transformation using an explicit ETransform configuration to illustrate the concept of mixins composition. The same pattern applies to the implicit ITransform data transformation.

Modularization

The last step is the modularization of the workflow. For complex scientific computations, you need to be able to do the following:

  1. Select the appropriate workflow as a sequence of modules or tasks according to the objective of the execution (regression, classification, clustering, and so on).
  2. Select the appropriate algorithm to fulfill a task according to the data (noisy data, an incomplete training set, and so on).
  3. Select the appropriate implementation of the algorithm according to the environment (distributed with a high-latency network, single host, and so on).
An illustration of the dynamic creation of a workflow from modules/traits

Let's consider a simple preprocessing task defined in the PreprocessingModule module. The module (or task) is declared as a trait to hide its internal workings from other modules. The preprocessing task is executed by a preprocessor of the Preprocessor type. We arbitrarily list two algorithms as potential preprocessors: the exponential moving average of the ExpMovingAverage type and the discrete Fourier transform low-pass filter of the DFTFilter type:

trait PreprocessingModule[T] {
  trait Preprocessor[T] { //7
    def execute(x: Vector[T]): Try[DblVector] 
  } 
  val preprocessor: Preprocessor[T] //8

  class ExpMovingAverage[T <: AnyVal](p: Int) //9
      (implicit num: Numeric[T], f: T => Double) 
    extends Preprocessor[T] {

    val expMovingAvg = filtering.ExpMovingAverage[T](p) //10
    val pfn = expMovingAvg |>  //11
    override def execute(x: Vector[T]): Try[DblVector] = 
      pfn(x).map(_.toVector)
  }

  class DFTFilter[T <: AnyVal](fc: Double)
      (g: (Double, Double) => Double)
      (implicit f: T => Double)
    extends Preprocessor[T] { //12

    val filter = filtering.DFTFir[T](g, fc, 1e-5)
    val pfn = filter |>
    override def execute(x: Vector[T]): Try[DblVector] =
      pfn(x).map(_.toVector)
  }
}

The generic Preprocessor trait declares a single method, execute, whose purpose is to filter an input vector x of elements of type T for noise (line 7). The preprocessor instance is declared as an abstract value, to be initialized with one of the filtering algorithms (line 8).

The first filtering algorithm, of the ExpMovingAverage type, implements the Preprocessor trait and overrides the execute method (line 9). The class declares the algorithm but delegates its implementation to the class of the same name, org.scalaml.filtering.ExpMovingAverage (line 10). The partial function returned by the |> method is instantiated as the pfn value, so it can be applied multiple times (line 11). The same design pattern is used for the discrete Fourier transform filter (line 12).

The filtering algorithm (ExpMovingAverage or DFTFir) is selected according to the profile or characteristics of the input data. Its implementation in the org.scalaml.filtering package depends on the environment (a single host, a cluster, Apache Spark, and so on).
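As an illustration, selecting the exponential moving average as the preprocessor amounts to initializing the module's abstract value (the period of 12 is hypothetical, and the implicit T => Double conversion required by ExpMovingAverage is assumed to be in scope):

val preprocessing = new PreprocessingModule[Double] {
  val preprocessor: Preprocessor[Double] = new ExpMovingAverage[Double](12)
}
preprocessing.preprocessor.execute(Vector(1.0, 2.5, 3.0, 2.0))  // Try[DblVector]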

Note

Filtering algorithms

The filtering algorithms used to illustrate the concept of modularization in the context of the Cake pattern are described in detail in Chapter 3, Data Preprocessing.
