Appendix B. Basic Concepts

Machine learning algorithms make significant use of linear algebra and optimization techniques. Describing the concepts and the implementation of linear algebra, calculus, and optimization algorithms in detail would have added significant complexity to the book and distracted the reader from the essence of machine learning.

This appendix lists a set of basic elements of linear algebra and optimization mentioned throughout the book. It also summarizes the coding practices that have been covered, and acquaints the reader with basic knowledge of financial analysis.

Scala programming

The following is a partial list of coding practices and design techniques used throughout the book.

List of libraries

The libraries directory contains the JAR files related to the third-party libraries or frameworks used in this book. Not all libraries are needed for every chapter. The list is as follows:

  • Apache Commons Math 3.3 in Chapter 3, Data Preprocessing; Chapter 4, Unsupervised Learning; and Chapter 6, Regression and Regularization
  • JFreeChart 1.0.1 in Chapter 1, Getting Started; Chapter 2, Hello World!; Chapter 5, Naïve Bayes Classifiers; and Chapter 9, Artificial Neural Networks
  • Iitb CRF 0.2 (including L-BFGS and Colt libraries) in Chapter 7, Sequential Data Models
  • LIBSVM 0.1.6 in Chapter 8, Kernel Models and Support Vector Machines
  • Akka framework 2.2.4 in Chapter 12, Scalable Frameworks
  • Apache Spark/MLlib 1.1 in Chapter 12, Scalable Frameworks

Tip

Note for Spark developers

The assembly JAR file of Apache Spark bundles its own version of the Scala standard library and compiler JAR files, which may conflict with an existing Scala library (for example, the default Scala IDE library in Eclipse).

Format of code snippets

For the sake of readability of the implementation of algorithms, all non-essential pieces of code such as error checking, comments, exceptions, or imports have been omitted. The following code elements have been discarded in the code snippets presented in the book:

  • Comments:
    // The MathRuntime exception has to be caught here!
  • Validation of class parameters and method arguments:
    class BaumWelchEM(val lambda: HMMLambda ...) {
      require(lambda != null, "Lambda model is undefined")
  • Class qualifiers such as final, private, and so on:
    final protected class MLP[T <% Double] …
  • Method qualifiers and access controls (final, private, and so on):
    final def inputLayer: MLPLayer
    private def recurse: Unit =
  • Java-style exceptions:
    try { … }
    catch { case e: ArrayIndexOutOfBoundsException  => … }
    if (y < EPS)
       throw new IllegalStateException( … )
  • Scala-style exceptions:
    Try(process(args)) match {
       case Success(results) => …
       case Failure(e) => …
    }
  • Non-essential annotations:
    @inline def mean = { … }
  • Logging and debugging code:
    m_logger.debug( …)
    Console.println( … )
  • Auxiliary methods not essential to the understanding of an algorithm

Encapsulation

One important objective while creating an API is reducing access to supporting or helper classes. There are two options to encapsulate helper classes, as follows:

  • Package scope: In this option, the supporting classes are first-level classes with protected access
  • Class or object scope: In this option, the supporting classes are nested in the main class

The algorithms presented in this book follow the first encapsulation pattern.
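
The following is a minimal sketch of the two options; the class names Helper, Predictor, and Classifier are hypothetical and used for illustration only:

// Option 1 - package scope: the supporting class is a first-level
// class whose access is restricted to the enclosing package.
package mypackage {
  private[mypackage] class Helper(val weights: Array[Double]) {
    def dot(x: Array[Double]): Double =
      weights.zip(x).map { case (w, v) => w*v }.sum
  }

  // The public API delegates to the helper.
  class Predictor(weights: Array[Double]) {
    private[this] val helper = new Helper(weights)
    def predict(x: Array[Double]): Double = helper.dot(x)
  }

  // Option 2 - class scope: the supporting class is nested in the
  // main class and invisible to client code.
  class Classifier(weights: Array[Double]) {
    private class Score {
      def apply(x: Array[Double]): Double =
        weights.zip(x).map { case (w, v) => w*v }.sum
    }
    private[this] val score = new Score
    def classify(x: Array[Double]): Int = if (score(x) > 0.0) 1 else -1
  }
}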

Class constructor template

The constructors of a class are defined in the companion object using apply and the class has package scope (protected):

protected class MyClass[T](val x: X, val y: Y, …) { … }

object MyClass {
  def apply[T](x: X, y: Y, …): MyClass[T] = new MyClass(x, y, …)
  final val y0 = …
  def apply[T](x: X, …): MyClass[T] = new MyClass(x, y0, …)
}

For example, the configuration of the support vector machine classifier is defined as follows:

protected class SVMConfig(val formulation: SVMFormulation, val kernel: SVMKernel, val svmExec: SVMExecution) extends Config

Its constructors are defined as follows:

object SVMConfig {
   val DEFAULT_CACHE = 25000
   val DEFAULT_EPS = 1e-15
   …
   def apply(svmType: SVMFormulation, kernel: SVMKernel, svmExec: SVMExecution): SVMConfig = new SVMConfig(svmType, kernel, svmExec)
   def apply(svmType: SVMFormulation, kernel: SVMKernel): SVMConfig = new SVMConfig(svmType, kernel, new SVMExecution(DEFAULT_CACHE, DEFAULT_EPS, -1))
}
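
For instance, client code can instantiate the configuration through either apply method. In the following sketch, formulation and kernel stand for concrete SVMFormulation and SVMKernel instances created elsewhere:

val fullConfig = SVMConfig(formulation, kernel,
                           new SVMExecution(25000, 1e-15, -1))
val defaultConfig = SVMConfig(formulation, kernel)  // default SVMExecution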

Companion objects versus case classes

In the preceding example, the constructors are explicitly defined in the companion object. Although the invocation of the constructor is very similar to the instantiation of case classes, there is a major difference: the Scala compiler generates several methods to manipulate an instance as regular data (equals, copy, hashCode, and so on).

Case classes should be reserved for single-state data objects, that is, objects with no methods.
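
For instance, a simple immutable data object is a natural fit for a case class, as the compiler-generated methods come for free:

// A single-state data object: the compiler generates apply, equals,
// hashCode, copy, and toString automatically.
case class DataPoint(x: Double, y: Double)

val p = DataPoint(1.0, 2.0)
val q = p.copy(y = 3.0)           // compiler-generated copy method
assert(p == DataPoint(1.0, 2.0))  // structural equality via equals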

Enumerations versus case classes

It is not uncommon to read or hear discussions regarding the relative merits of enumerations and pattern matching with case classes in Scala [A:1]. As a very general guideline, enumeration values can be regarded as lightweight case classes, and conversely, case classes can be considered heavyweight enumeration values.

Let's take an example of a Scala enumeration used to evaluate the uniform distribution of scala.util.Random:

object MyEnum extends Enumeration {
  type TMyEnum = Value
  val A, B, C = Value
}

import MyEnum._
import scala.util.Random

val counters = Array.fill(MyEnum.maxId+1)(0)
Range(0, 1000).foreach( _ => Random.nextInt(10) match {
  case 3 => counters(A.id) += 1
  …
  case _ =>
})

The previous pattern matching is very similar to the switch statement of Java.

Let's consider the following example of pattern matching using case classes that selects a mathematical formula according to the input:

package MyPackage {
  sealed abstract class MyEnum(val level: Int)
  case class A() extends MyEnum(3) { def f = (x: Double) => 23*x }
  …
}

import MyPackage._
def compute(myEnum: MyEnum, x: Double): Double = myEnum match {
   case a: A => a.f(x)
   …
}

The previous pattern matching is performed using the default equals method, which the Scala compiler generates automatically for each case class. This approach is far more flexible than a simple enumeration, at the cost of extra computation cycles.

The advantages of using enumerations over case classes are as follows:

  • Enumerations involve less code for a single attribute comparison
  • Enumerations are more readable, especially for Java developers

The advantages of using case classes are as follows:

  • Case classes are data objects and support more attributes than enumeration IDs
  • Pattern matching is optimized for sealed classes as the Scala compiler is aware of the number of cases

In a nutshell, you should use enumerations for single-value constants and case classes to match data objects.

Overloading

Contrary to C++, Scala does not actually overload operators: the operators that appear in the code snippets are ordinary methods with symbolic names. Here is the meaning of the operators used in the code snippets:

  • +=: This adds an element to a collection or container.
  • +: This sums two elements of the same type.
  • |>: This transforms a collection of data. It is also known as the pipe operator. The type of the output collection and elements can differ from that of the input.
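
For reference, such a data transformation can be declared as a trait with a |> method. The following is a simplified sketch consistent with the snippets in this appendix, not the book's full definition:

trait PipeOperator[-T, +U] {
  def |> (data: T): Option[U]
}

// Example implementation: a transformation that doubles each value.
class Doubler extends PipeOperator[Array[Double], Array[Double]] {
  override def |> (data: Array[Double]): Option[Array[Double]] =
    if (data.nonEmpty) Some(data.map(2.0 * _)) else None
}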

Design template for classifiers

The machine learning algorithms described in this book use the following design pattern:

  • A model instance that implements the Model trait is created through training during the initialization of the classifier
  • All configuration parameters are encapsulated into a single configuration class inheriting the Config trait
  • The predictive or classification routine is implemented as a data transformation extending the PipeOperator trait
  • The classifier takes at least three parameters: a configuration instance, a feature set or time series, and a labeled dataset

Have a look at the following diagram:

A generic UML class diagram for classifiers

For example, the key components of the support vector machine package are as follows:

final protected class SVM[T <% Double](val config: SVMConfig, val xt: XTSeries[Array[T]], val labels: DblVector) 
           extends PipeOperator[Array[T], Double] {
  val model: Option[SVMModel] = { … }
  override def |> (x: Feature): Option[Double] = { prediction }
  …
}

final protected class SVMConfig(val formulation: SVMFormulation, val kernel: SVMKernel, val svmExec: SVMExecution) extends Config
protected class SVMModel(val params: (svm_model, Double)) extends Model

The two data inputs required to train a model are the configuration of the classifier (config) and the training set (xt and labels). Once trained and validated, the model is available for prediction or classification.
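
For instance, a client might train and use the classifier as follows. This is a sketch that assumes a companion object apply similar to the constructor template described earlier; config, xt, labels, and testFeature stand for data created elsewhere:

val svc = SVM[Double](config, xt, labels)  // the model is trained here
val prediction: Option[Double] = svc |> testFeature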

This design has the main advantage of simplifying the life cycle of the classifier: a model is either defined and available for classification, or not created at all.

Note

Implementation considerations

The validation phase is omitted in most of the practical examples throughout this book for the sake of readability.

Data extraction

A CSV file is the most common format used to store historical financial data. It is the default format used to import data throughout this book:

type Fields = Array[String]

class DataSource(pathName: String,
                 normalize: Boolean,
                 reverseOrder: Boolean,
                 headerLines: Int = 1,
                 srcFilter: Option[Fields => Boolean])
  extends PipeOperator[List[Fields => Double], List[DblVector]]

The parameters for the DataSource class are as follows:

  • pathName: This is the relative pathname of a data file to be loaded if the argument is a file, or the directory containing multiple input data files. Most of the files are CSV files.
  • normalize: This is a flag to specify if the data has to be normalized over [0, 1].
  • reverseOrder: This is a flag to specify whether the order of the data in the file (for example, a time series) has to be reversed.
  • headerLines: This specifies the number of lines for column headers and comments.
  • srcFilter: This is a filter or condition on the fields of a row, used to skip rows in the dataset, for example, rows with missing data or an incorrect format.
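
As an illustration, a data source for a single CSV file could be instantiated as follows; the path and the filter are assumptions for the sake of the example:

// Hypothetical instantiation: skip rows whose first field is empty.
val filter = Some((fields: Fields) => fields(0).nonEmpty)
val src = new DataSource("resources/data/CSCO.csv", true, false, 1, filter)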

The most important method of DataSource is the following data transformation from a file to a typed time series (XTSeries[T]), implemented as the pipe operator method. The method takes as its argument a list of extractors, each converting a row of literal values to a Double floating-point value:

def |> : PartialFunction[List[Fields => Double], List[DblVector]] = {
  case extr: List[Fields => Double] if extr != null && extr.size > 0 =>
    load match {  //1
      case Some(data) => {
        if (normalize)  //2
          extr.map(t => Stats[Double](data._2.map(t(_)))  //3
                          .normalize)  //4
        else extr.map(t => data._2.map(t(_)))
      }
      …
    }
}

The data is loaded from the file and converted into a list of vectors using the extractors, extr (line 1). If normalization is required (line 2), each literal is converted to a floating-point value and a Stats object is created (line 3). Finally, the Stats instance normalizes the sequence of floating-point values (line 4).

A second data transformation consists of transforming a single literal per row to create a time series of single variables:

def |> (extr: Fields => Double): Option[XTSeries[Double]]
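
For example, the adjusted close price can be extracted as a single-variable time series; the file path in this sketch is an assumption:

val src = DataSource("resources/data/CSCO.csv", true, true, 1)
val close: Option[XTSeries[Double]] = src |> YahooFinancials.adjClose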

Data sources

The examples in this book rely on three different sources of financial data using CSV format:

  • YahooFinancials for the Yahoo schema for historical stock and ETF prices
  • GoogleFinancials for the Google schema for historical stock and ETF prices
  • Fundamentals for fundamental financial analysis ratios (CSV file)

Let's illustrate the extraction from a data source using YahooFinancials as an example:

object YahooFinancials extends Enumeration {
  type YahooFinancials = Value
  val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, ADJ_CLOSE = Value

  val adjClose = (s: Fields) => s(ADJ_CLOSE.id).toDouble
  …
  def toDouble(v: Value): Fields => Double =
    (s: Fields) => s(v.id).toDouble

  def vol: Fields => Double = (s: Fields) =>
    (s(HIGH.id).toDouble/s(LOW.id).toDouble - 1.0) * s(VOLUME.id).toDouble
  …
}

Let's look at an example of application of a DataSource transformation: loading historical stock data from the Google finance website. The data is downloaded as a CSV-formatted file. The first step is to specify the column names using an enumeration singleton, GoogleFinancials:

object GoogleFinancials extends Enumeration {
  type GoogleFinancials = Value
  val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME = Value
  val close = (s: Fields) => s(CLOSE.id).toDouble  //5
  …
}

Each column is associated with an extractor function (line 5). Consider the following code:

val symbols = Array[String]("CSCO", ...)  //6
val prices = symbols
  .map(s => DataSource(path + s + ".csv", true, true, 1))  //7
  .map(_ |> GoogleFinancials.close)  //8

The list of stocks for which the historical data has to be downloaded is defined as an array of symbols (line 6). Each symbol is associated with a CSV file (for example, CSCO is associated with resources/CSCO.csv) (line 7). Finally, the GoogleFinancials extractor for the close price is invoked (line 8).

Extraction of documents

The DocumentsSource class is responsible for extracting the date, title, and content of a list of text documents or text files. This class does not support HTML documents:

class DocumentsSource(val pathName: String)

The extraction of terms is performed by the data transformation |>, as follows:

def |> : Corpus = {
  filesList.map(fName => {
    val src = Source.fromFile(pathName + fName)  //1
    val fieldIter = src.getLines  //2

    val date = nextField(fieldIter)
    val title = nextField(fieldIter)
    val content = fieldIter.foldLeft(new StringBuilder)((b, str) =>
                      b.append(str.trim))  //3
    src.close  //4
    if (date == None || title == None)
      throw new IllegalStateException( … )  //6
    (date.get, title.get, content.toString)  //5
  })
}

This method loads the text files for each filename in the list, filesList (line 1). It gets a reference to the document lines iterator, fieldIter (line 2). The iterator is used to extract (line 3) and return the tuple (document date, document title, document content) (line 5) once the file handle is closed (line 4). An IllegalStateException is thrown and caught if the text file is malformed. The nextField method moves the iterator forward to the next non-null line:

def nextField(iter: Iterator[String]): Option[String] =
  iter.find(s => s != null && s.length > 1)
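
As an illustration, a corpus could be extracted from a directory of text documents as follows; the directory path is an assumption:

val docs = new DocumentsSource("resources/text/")
val corpus: Corpus = docs.|>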

Matrix class

Some discriminative learning models require operations on the rows and columns of a matrix. The parameterized Matrix class facilitates read/write operations on columns and rows:

class Matrix[@specialized(Double, Int) T: ClassTag](
    val nRows: Int,
    val nCols: Int,
    val data: Array[T])(implicit f: T => Double) {

  def apply(i: Int, j: Int): T = data(i*nCols + j)

  def cols(i: Int): Array[T] =
    (i until data.size by nCols).map(data(_)).toArray
  ...
  def += (i: Int, j: Int, t: T): Unit = data(i*nCols + j) = t

  def += (iRow: Int, t: T): Unit = {
    val i = iRow*nCols
    Range(0, nCols).foreach(k => data(i + k) = t)
  }

  def /= (iRow: Int, t: T)(implicit g: Double => T): Unit = {
    val i = iRow*nCols
    Range(0, nCols).foreach(k => data(i + k) /= t)
  }
}

The apply method returns an element of the matrix and the cols method returns a column. The write methods update a single element or an entire row (+=) with a value, and divide the elements of a row by a value (/=). The matrix is specialized with the type Double in order to generate dedicated byte code for this type.
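
For instance (the values are arbitrary):

val m = new Matrix[Double](2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
val elem = m(1, 2)   // element at row 1, column 2: 6.0
val col = m.cols(1)  // second column: Array(2.0, 5.0)
m += (0, 2, 7.0)     // update the element at row 0, column 2
m += (1, 0.0)        // set all elements of row 1 to 0.0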

The generation of the transpose matrix is performed by the transpose method. It is an alternative to the Scala methods Array.transpose and List.transpose:

def transpose: Matrix[T] = {
  val m = Matrix[T](nCols, nRows)
  Range(0, nRows).foreach(i => {
    val col = i*nCols
    Range(0, nCols).foreach(j => m += (j, i, data(col+j)))
  })
  m
}

The constructors of the Matrix class are defined by its companion object:

def apply[T: ClassTag](nR: Int, nC: Int, data: Array[T])
    (implicit f: T => Double): Matrix[T] = new Matrix(nR, nC, data)