Machine learning algorithms make significant use of linear algebra and optimization techniques. Describing the concept and the implementation of linear algebra, calculus, and optimization algorithms in detail would have added significant complexity to the book and distracted the reader from the essence of machine learning.
The appendix lists a basic set of elements of linear algebra and optimization mentioned throughout the book. It also summarizes the coding practices used in the book and introduces the basics of financial analysis.
Here is a partial list of coding practices and design techniques used throughout the book.
The precompiled Scala for Machine Learning code, ScalaMl-2.11-0.99.jar, is located in the $ROOT/project/target/scala-2.11 directory. Not all the libraries are needed for every chapter. The list is as follows:
The lib directory contains the following JAR files related to the third-party libraries or frameworks used in the book: colt, CRF, LBFGS, and LIBSVM.
For the sake of readability of the implementation of algorithms, all nonessential code such as error checking, comments, exception handling, or import statements has been omitted. The following code elements are discarded in the code snippets presented in the book:
Code documentation and comments:
/** This class is defined as … */
// The MathRuntime exception has to be caught here!
Validation of class parameters and method arguments:
class BaumWelchEM(val lambda: HMMLambda ...) {
  require( lambda != null, "Lambda model is undefined")
Class qualifiers such as final and private:
final protected class MLP[T <% Double] …
Method qualifiers and access control (final, private, and so on):
final def inputLayer: MLPLayer
private def recurse: Unit =
Serialization:
class Config extends Serializable { … }
Validation of partial functions:
val pfn: PartialFunction[U, V]
pfn.isDefinedAt(u)
Validation of intermediate states:
assert( p != None, " … ")
Java-style exception handling:
try { … }
catch { case e: ArrayIndexOutOfBoundsException => … }
if (y < EPS) throw new IllegalStateException( … )
Scala-style exception handling:
Try(process(args)) match {
  case Success(results) => …
  case Failure(e) => …
}
Nonessential annotations:
@inline def mean = { … }
@implicitNotFound("Conversion $T to Array[Int] undefined")
@throws(classOf[IllegalStateException])
Logging and debugging code:
m_logger.debug( …)
Console.println( … )
One important objective while creating an API is to reduce access to supporting or helper classes. There are two options to encapsulate helper classes, as follows:
The algorithms presented in this book follow the first encapsulation pattern.
The constructors of a class are defined in the companion object using apply, and the class has a package scope (protected):
protected class A[T](val x: X, val y: Y, …) { … }

object A {
  def apply[T](x: X, y: Y, ...): A[T] = new A(x, y, …)
  def apply[T](x: X, ...): A[T] = new A(x, y0, …)
}
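For illustration, here is a minimal, self-contained sketch of this pattern using a hypothetical Point class (not part of the book's library); the package-scope qualifier is omitted so the snippet runs as a standalone script:

// Hypothetical example of the companion-object constructor pattern
class Point(val x: Double, val y: Double) {
  override def toString = s"($x, $y)"
}

object Point {
  // Instance factories defined in the companion object
  def apply(x: Double, y: Double): Point = new Point(x, y)
  def apply(x: Double): Point = new Point(x, 0.0)   // default ordinate
}

val p = Point(1.5, -2.0)   // invokes Point.apply(1.5, -2.0), never new Point(...)
val q = Point(3.0)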
For example, the SVM class that implements the support vector machine is defined as follows:
final protected class SVM[T <: AnyVal](
    config: SVMConfig,
    xt: XVSeries[T],
    expected: DblVector)(implicit f: T => Double)
  extends ITransform[Array[T]](xt) {
The SVM companion object is responsible for defining all the constructors (instance factories) relevant to the protected SVM class:
def apply[T <: AnyVal](
    config: SVMConfig,
    xt: XVSeries[T],
    expected: DblVector)(implicit f: T => Double): SVM[T] =
  new SVM[T](config, xt, expected)
In the preceding example, the constructors are explicitly defined in the companion object. Although the invocation of the constructor is very similar to the instantiation of a case class, there is a major difference: for case classes, the Scala compiler generates several methods to manipulate an instance as regular data (equals, copy, hashCode, and so on).
Case classes should be reserved for single-state data objects (that is, classes with no methods).
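A minimal, hypothetical example (not part of the book's code) shows the methods the compiler generates for such a data object:

case class Observation(time: Long, value: Double)

val obs = Observation(42L, 1.5)
val shifted = obs.copy(value = 2.5)                      // compiler-generated copy
assert(obs == Observation(42L, 1.5))                     // structural equality via equals
assert(obs.hashCode == Observation(42L, 1.5).hashCode)   // compiler-generated hashCode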
It is quite common to read or hear discussions regarding the relative merits of enumerations and pattern matching with case classes in Scala [A:1]. As a very general guideline, enumeration values can be regarded as lightweight case classes, or case classes can be considered as heavyweight enumeration values.
Let's consider an example of a Scala enumeration used to evaluate the uniform distribution of the scala.util.Random library:
import scala.util.Random

object A extends Enumeration {
  type TA = Value
  val A, B, C = Value
}

import A._
val counters = Array.fill(A.maxId+1)(0)
Range(0, 1000).foreach( _ =>
  Random.nextInt(10) match {
    case 3 => counters(A.id) += 1
    …
    case _ => { }
  })
The pattern matching is very similar to Java's switch statement.
Let's consider the following example of pattern matching using case classes that selects a mathematical formula according to the input:
package AA {
  sealed abstract class A(val level: Int)
  case class AA() extends A(3) { def f = (x: Double) => 23*x }
  …
}

import AA._
def compute(a: A, x: Double): Double = a match {
  case a: AA => a.f(x)
  …
}
The pattern matching is performed using the default equals method, which the compiler automatically generates for each case class. This approach is far more flexible than the simple enumeration, at the cost of extra computation cycles.
The advantages of using enumerations over case classes are as follows:
The advantages of using case classes are as follows:
In short, you should use enumerations for single-value constants and case classes for matching data objects.
Unlike C++, Scala does not actually overload operators. Here is the definition of the few operators used in the code snippets:
The machine learning algorithms described in this book use the following design pattern and components:
- A configuration class of the Config type (that is, SVMConfig).
- A classifier of the ITransform type for which the model is implicitly generated from a training set (that is, SVM[T]). The classifier requires at least three parameters, which are as follows:
  - A configuration of the Config type
  - An input dataset, xt, of the Vector[T] type
  - A vector of labels or expected values
- A model of the Model type. The constructor is responsible for creating the model through training (that is, SVMModel).
Let's take a look at the following diagram:
For example, the key component of the support vector machine package is the SVM classifier:
final protected class SVM[T <: AnyVal](
    val config: SVMConfig,
    val xt: XTSeries[Array[T]],
    val labels: DblVector)(implicit f: T => Double)
  extends ITransform[Array[T]](xt) with Monitor[Double] {

  type V = Double
  val model: Option[SVMModel] = { … }
  override def |> : PartialFunction[Array[T], Try[V]]
  …
}
The training set is created by combining or zipping the input dataset xt with the labels or expected values expected. Once trained and validated, the model is available for prediction or classification.
This design has the main advantage of reducing the life cycle of a classifier: a model is either defined and available for classification, or is not created at all.
The configuration and model classes are implemented as follows:
final class SVMConfig(
    val formulation: SVMFormulation,
    val kernel: SVMKernel,
    val svmExec: SVMExecution) extends Config

class SVMModel(val svmmodel: svm_model) extends Model
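The following self-contained sketch illustrates the same life cycle with hypothetical MyConfig, MyModel, and MyClassifier types; it is only a schematic of the pattern, not the book's SVM implementation:

import scala.util.Try

case class MyConfig(maxIters: Int)                 // plays the role of Config
class MyModel(val weights: Vector[Double])         // plays the role of Model

class MyClassifier(
    config: MyConfig,
    xt: Vector[Array[Double]],
    expected: Vector[Double]) {

  // The model is generated once, from the training set, during construction
  val model: Option[MyModel] = Try(train()).toOption
  private def train(): MyModel = new MyModel(Vector.fill(xt.head.length)(0.0))

  // Prediction is defined only if the model was successfully created
  def |> : PartialFunction[Array[Double], Try[Double]] = {
    case x if model.isDefined && x.length == model.get.weights.length =>
      Try( x.zip(model.get.weights).map { case (a, b) => a*b }.sum )
  }
}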
A CSV file is the most common format used to store historical financial data. It is the default format used to import data throughout the book. The data source relies on the DataSourceConfig configuration class, as follows:
case class DataSourceConfig(
    pathName: String,
    normalize: Boolean,
    reverseOrder: Boolean,
    headerLines: Int = 1)
The parameters of the DataSourceConfig class are as follows:
- pathName: This is the relative pathname of the data file to be loaded, if the argument is a file, or of the directory containing multiple input data files. Most of the files are CSV files.
- normalize: This flag specifies whether the data has to be normalized over [0, 1].
- reverseOrder: This flag specifies whether the order of the data in the file has to be reversed (for example, for a time series) if its value is true.
- headerLines: This specifies the number of lines reserved for column headers and comments.
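For example, a configuration to load a single CSV file (the path shown here is hypothetical), normalize the data over [0, 1], reverse the row order, and skip one header line could be written as follows:

// Hypothetical path; the remaining arguments match the case class above
val config = DataSourceConfig("resources/data/CSCO.csv",
    normalize = true, reverseOrder = true, headerLines = 1)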
The DataSource data source implements a data transformation of the ETransform type using an explicit DataSourceConfig configuration, as described in the Monadic data transformation section in Chapter 2, Hello World!:
final class DataSource(
    config: DataSourceConfig,
    srcFilter: Option[Fields => Boolean] = None)
  extends ETransform[DataSourceConfig](config) {

  type Fields = Array[String]
  type U = List[Fields => Double]
  type V = XVSeries[Double]

  override def |> : PartialFunction[U, Try[V]]
  ...
}
The srcFilter argument specifies a filter or condition on some of the row fields, used to skip rows of the dataset (that is, missing data or an incorrect format). Being an explicit data transformation, the constructor of the DataSource class has to initialize the input type U and the output type V of the extracting method |>. The method applies an extractor that converts a row of literal values into double floating point values:
override def |> : PartialFunction[U, Try[V]] = {
  case fields: U if !fields.isEmpty => load.map(data => {    //1
    val convert = (f: Fields => Double) => data._2.map(f(_))
    if (config.normalize)                                     //2
      fields.map(t => new MinMax[Double](convert(t))          //3
          .normalize(0.0, 1.0).toArray ).toVector             //4
    else
      fields.map(convert(_)).toVector
  })
}
The data is loaded from the file using the load helper method (line 1). The data is normalized if required (line 2) by converting each literal to a floating point value using an instance of the MinMax class (line 3). Finally, the MinMax instance normalizes the sequence of floating point values (line 4).
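As a reference, min-max scaling over [0, 1] can be sketched as follows; this is an assumption about the behavior of the MinMax class, whose actual implementation is documented in the source code:

// Sketch of min-max scaling to [0, 1]; not the book's MinMax implementation
def minMaxNormalize(values: Array[Double]): Array[Double] = {
  val min = values.min
  val max = values.max
  if (max > min) values.map(x => (x - min)/(max - min))
  else values.map(_ => 0.0)        // degenerate case: constant series
}

minMaxNormalize(Array(2.0, 4.0, 6.0))    // Array(0.0, 0.5, 1.0)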
The DataSource class implements a significant set of methods that are documented in the source code available online.
The examples in the book rely on three different sources of financial data using the CSV format:
Let's illustrate the extraction from a data source using YahooFinancials as an example:
object YahooFinancials extends Enumeration {
  type YahooFinancials = Value
  val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, ADJ_CLOSE = Value

  val adjClose = (s: Array[String]) => s(ADJ_CLOSE.id).toDouble   //5
  val volume = (s: Fields) => s(VOLUME.id).toDouble
  …
  def toDouble(value: Value): Array[String] => Double =
    (s: Array[String]) => s(value.id).toDouble
}
Let's take a look at an example of an application of the DataSource transformation: loading historical stock data from the Yahoo finance site. The data is downloaded as a CSV-formatted file. Each column is associated with an extractor function (line 5):
val symbols = Array[String]("CSCO", ...)                        //6
val prices = symbols
    .map(s => DataSource(s"$path$s.csv", true, true, 1))        //7
    .map( _ |> adjClose )                                        //8
The list of stocks for which the historical data has to be downloaded is defined as an array of symbols (line 6). Each symbol is associated with a CSV file (that is, CSCO => resources/CSCO.csv) (line 7). Finally, the YahooFinancials extractor for the adjClose price is invoked (line 8).
The format of the financial data extracted from the Google finance pages is similar to the format used in the Yahoo finance pages:
object GoogleFinancials extends Enumeration {
  type GoogleFinancials = Value
  val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME = Value

  val close = (s: Array[String]) => s(CLOSE.id).toDouble   //5
  …
}
The YahooFinancials, GoogleFinancials, and Fundamentals classes implement a significant number of methods that are documented in the source code available online.
The DocumentsSource class is responsible for extracting the date, title, and content of a list of text documents or text files. The class does not support HTML documents. The DocumentsSource class implements a monadic data transformation of the ETransform type with an explicit configuration of the SimpleDateFormat type:
class DocumentsSource(
    dateFormat: SimpleDateFormat,
    val pathName: String)                              //1
  extends ETransform[SimpleDateFormat](dateFormat) {

  type U = Option[Long]                                //2
  type V = Corpus[Long]                                //3

  override def |> : PartialFunction[U, Try[V]] = {     //4
    case date: U if filesList != None =>
      Try( if (date == None) getAll else get(date) )
  }
  def get(t: U): V = getAll.filter( _.date == t.get)
  def getAll: V                                        //5
  ...
}
The DocumentsSource class takes two arguments: the format of the date associated with the documents and the name of the path in which the documents are located (line 1). Being an explicit data transformation, the constructor of the DocumentsSource class has to initialize the input type U (line 2) as a date converted into a Long, and the output type V (line 3) as a Corpus, for the |> extraction method.
The |> extractor generates the corpus associated with a specific date, converted into a Long type (line 4). The getAll method does the heavy lifting of extracting or sorting the documents (line 5).
The implementation of the getAll method, as well as the other methods of the DocumentsSource class, is described in the documented source code available online.
Some discriminative learning models require operations to be performed on the rows and columns of a matrix. The DMatrix class facilitates read and write operations on columns and rows:
class DMatrix(val nRows: Int, val nCols: Int, val data: DblArray) {
  def apply(i: Int, j: Int): Double = data(i*nCols+j)

  def row(iRow: Int): DblArray = {
    val idx = iRow*nCols
    data.slice(idx, idx + nCols)
  }
  def col(iCol: Int): IndexedSeq[Double] =
    (iCol until data.size by nCols).map( data(_) )

  def diagonal: IndexedSeq[Double] =
    (0 until data.size by nCols+1).map( data(_) )

  def trace: Double = diagonal.sum
  …
}
The apply method returns an element of the matrix. The row method returns a row as an array, and the col method returns the indexed sequence of elements in a column. The diagonal method returns the indexed sequence of diagonal elements, and the trace method sums the diagonal elements.
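A brief sketch of how these methods behave for a hypothetical 2 x 3 matrix stored in row-major order:

val m = new DMatrix(2, 3, Array[Double](1, 2, 3, 4, 5, 6))
m(1, 2)      // 6.0: element at row 1, column 2
m.row(0)     // Array(1.0, 2.0, 3.0)
m.col(1)     // IndexedSeq(2.0, 5.0)
m.trace      // 1.0 + 5.0 = 6.0: sum of the diagonal elements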
The DMatrix class supports the normalization of elements, rows, and columns; transposition; and the update of elements, columns, and rows. The DMatrix class implements a significant number of methods that are documented in the source code available online.
The Counter class implements a generic mutable counter for which the key is of a parameterized type. The number of occurrences of a key is managed by a mutable hash map:
class Counter[T] extends mutable.HashMap[T, Int] {
  def += (t: T): Counter[T] = { super.put(t, getOrElse(t, 0)+1); this }

  def + (t: T): Counter[T] = { super.put(t, getOrElse(t, 0)+1); this }

  def ++ (cnt: Counter[T]): Counter[T] = {
    cnt./:(this)((c, t) => c + t._1); this
  }
  def / (cnt: Counter[T]): mutable.HashMap[T, Double] = map {
    case (str, n) => (str,
      if( !cnt.contains(str) )
        throw new IllegalStateException(" ... ")
      else n.toDouble/cnt.get(str).get )
  }
  …
}
The += operator updates the counter for the t key and returns the counter itself. The + operator updates the count and then duplicates the updated counter. The ++ operator updates this counter with the content of another counter. The / operator divides the count for each key by the count in another counter.
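A brief sketch of the operators in action (the keys are hypothetical):

val c1 = new Counter[String]
c1 += "A"; c1 += "A"; c1 += "B"            // c1: A -> 2, B -> 1

val c2 = new Counter[String] + "A" + "B"   // c2: A -> 1, B -> 1
val ratios = c1 / c2                       // A -> 2.0, B -> 1.0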
The Counter class implements a significant set of methods that are documented in the source code available online.
The Monitor trait has two purposes:
- To collect data related to an algorithm at each iteration or recursion
- To display information and error messages using the show and error methods
The data is collected at each iteration or recursion and then displayed as a time series, with the iterations as the x axis values:
trait Monitor[T] {
  protected val logger: Logger
  lazy val _counters = new mutable.HashMap[String, mutable.ArrayBuffer[T]]()

  def counters(key: String): Option[mutable.ArrayBuffer[T]]
  def count(key: String, value: T): Unit
  def display(key: String, legend: Legend)(implicit f: T => Double): Boolean
  def show(msg: String): Int = DisplayUtils.show(msg, logger)
  def error(msg: String): Int = DisplayUtils.error(msg, logger)
  ...
}
The counters method returns the array buffer associated with a specific key. The count method updates the data associated with a key. The display method plots the time series. Finally, the show and error methods send information and error messages to the standard output.
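A hypothetical sketch of mixing the trait into a training class; it assumes the Logger type is Apache log4j's and relies on the concrete implementations of count and counters whose bodies are elided from the snippet above; the display call and its Legend argument are omitted:

// Hypothetical class; not part of the book's library
class Trainer extends Monitor[Double] {
  protected val logger = org.apache.log4j.Logger.getLogger("Trainer")

  def train(errors: Seq[Double]): Unit = {
    errors.foreach( count("error", _) )      // collect the error profile
    show(s"Collected ${counters("error").map(_.size).getOrElse(0)} values")
  }
}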
The documented source code for the implementation of the Monitor trait is available online.