Machine learning algorithms make significant use of linear algebra and optimization techniques. Describing the concepts and the implementation of linear algebra, calculus, and optimization algorithms in detail would have added significant complexity to the book and distracted the reader from the essence of machine learning.
This appendix lists a set of basic elements of linear algebra and optimization mentioned throughout the book. It also summarizes the coding practices that have been covered, and acquaints the reader with basic knowledge of financial analysis.
The following is a partial list of coding practices and design techniques used throughout the book.
The libraries directory contains the JAR files related to the third-party libraries or frameworks used in this book. Not all libraries are needed for every chapter. The list is as follows:
For the sake of readability of the implementation of algorithms, all non-essential pieces of code such as error checking, comments, exceptions, or imports have been omitted. The following code elements have been discarded in the code snippets presented in the book:
// The MathRuntime exception has to be caught here!
class BaumWelchEM(val lambda: HMMLambda ...) {
  require( lambda != null, "Lambda model is undefined")
final protected class MLP[T <% Double] …
final def inputLayer: MLPLayer
private def recurse: Unit =
try { … }
catch { case e: ArrayIndexOutOfBoundsException => … }

if (y < EPS) throw new IllegalStateException( … )
Try(process(args)) match {
  case Success(results) => …
  case Failure(e) => …
}
@inline def mean = { … }
m_logger.debug( … )
Console.println( … )
One important objective while creating an API is reducing access to supporting or helper classes. There are two options to encapsulate helper classes, as follows:
The algorithms presented in this book follow the first encapsulation pattern.
The constructors of a class are defined in the companion object using apply, and the class has package scope (protected):
protected class MyClass[T](val x: X, val y: Y, …) { … }

object MyClass {
  def apply[T](x: X, y: Y, …): MyClass[T] = new MyClass(x, y, …)
  final val y0 = …
  def apply[T](x: X, …): MyClass[T] = new MyClass(x, y0, …)
}
For example, the configuration of the support vector machine classifier is defined as follows:
protected class SVMConfig(
    val formulation: SVMFormulation,
    val kernel: SVMKernel,
    val svmExec: SVMExecution) extends Config
Its constructors are defined as follows:
object SVMConfig {
  val DEFAULT_CACHE = 25000
  val DEFAULT_EPS = 1e-15
  …
  def apply(svmType: SVMFormulation, kernel: SVMKernel,
      svmExec: SVMExecution): SVMConfig =
    new SVMConfig(svmType, kernel, svmExec)

  def apply(svmType: SVMFormulation, kernel: SVMKernel): SVMConfig =
    new SVMConfig(svmType, kernel,
      new SVMExecution(DEFAULT_CACHE, DEFAULT_EPS, -1))
}
In the preceding example, the constructors are explicitly defined in the companion object. Although the invocation of such a constructor is very similar to the instantiation of a case class, there is a major difference: for case classes, the Scala compiler also generates several methods to manipulate an instance as regular data (equals, copy, hashCode, and so on).
Case classes should be reserved for single-state data objects, that is, objects with no methods.
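As a minimal illustration of this guideline, a single-state data object can be declared as a case class; the LabeledPoint name below is hypothetical and not part of the book's code base:

```scala
// Hypothetical single-state data object: immutable state and no methods.
case class LabeledPoint(features: Vector[Double], label: Double)

// The compiler-generated methods (equals, copy, and so on) let the
// instance be manipulated as regular data.
val p1 = LabeledPoint(Vector(0.5, 1.2), 1.0)
val p2 = p1.copy(label = 0.0)   // p2.label is 0.0; features are shared
```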
It is not uncommon to read or hear discussions regarding the relative merits of enumerations and pattern matching with case classes in Scala [A:1]. As a very general guideline, enumeration values can be regarded as lightweight case classes, or, conversely, case classes can be considered heavyweight enumeration values.
Let's take an example of a Scala enumeration that consists of evaluating the uniform distribution of scala.util.Random:
object MyEnum extends Enumeration {
  type TMyEnum = Value
  val A, B, C = Value
}

import MyEnum._
val counters = Array.fill(MyEnum.maxId+1)(0)
Range(0, 1000).foreach( _ =>
  Random.nextInt(10) match {
    case 3 => counters(A.id) += 1
    …
    case _ => { }
  })
The previous pattern matching is very similar to the switch statement of Java.
Let's consider the following example of pattern matching using case classes that selects a mathematical formula according to the input:
package MyPackage {
  sealed abstract class MyEnum(val level: Int)
  case class A() extends MyEnum(3) { def f = (x: Double) => 23*x }
  …
}

import MyPackage._
def compute(myEnum: MyEnum, x: Double): Double = myEnum match {
  case a: A => a.f(x)
  …
}
The previous pattern matching is performed using the default equals method, which is automatically generated for each case class. This approach is far more flexible than a simple enumeration, at the cost of extra computation cycles.
The advantages of using enumerations over case classes are as follows:
The advantages of using case classes are as follows:
In a nutshell, you should use enumeration for single value constants and case classes to match data objects.
Contrary to C++, Scala does not actually overload operators. Here is the meaning of the operators used in code snippets:
The machine learning algorithms described in this book use the following design pattern:
- The Model trait, created through training during the initialization of the classifier
- The Config trait
- The PipeOperator trait

Have a look at the following diagram:
For example, the key components of the support vector machine package are as follows:
final protected class SVM[T <% Double](
    val config: SVMConfig,
    val xt: XTSeries[Array[T]],
    val labels: DblVector) extends PipeOperator[Array[T], Double] {

  val model: Option[SVMModel] = { … }
  override def |> (x: Feature): Option[Double] = { prediction }
  …
}

final protected class SVMConfig(
    val formulation: SVMFormulation,
    val kernel: SVMKernel,
    val svmExec: SVMExecution) extends Config

protected class SVMModel(val params: (svm_model, Double)) extends Model
The two data inputs required to train a model are the configuration of the classifier (config) and the training set (xt and labels). Once trained and validated, the model is available for prediction or classification.
This design has the main advantage of simplifying the life cycle of the classifier: a model is either defined and available for classification, or it is not created at all.
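This life cycle can be sketched with simplified stand-ins for the book's Config, Model, and PipeOperator traits; the MeanClassifier class below is a hypothetical example, not one of the book's algorithms:

```scala
trait Config
trait Model

// Simplified stand-in for the book's PipeOperator trait: a data
// transformation from an input of type T to an output of type U.
trait PipeOperator[T, U] { def |> (t: T): Option[U] }

// Hypothetical classifier: the model is created (or not) at initialization.
class MeanClassifier(val config: Config, val xt: Array[Double])
    extends PipeOperator[Double, Int] {

  // The model is either defined or not created; Option captures the life cycle.
  val model: Option[Double] =
    if (xt.isEmpty) None else Some(xt.sum / xt.length)

  // Classify x as above (1) or below (0) the mean of the training set.
  override def |> (x: Double): Option[Int] =
    model.map(m => if (x > m) 1 else 0)
}

val clf = new MeanClassifier(new Config {}, Array(1.0, 2.0, 3.0))
// clf.model is Some(2.0), so clf |> 2.5 returns Some(1)
```

If the training set is empty, model is None and every classification request returns None, so no partially initialized classifier can ever be used.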
A CSV file is the most common format used to store historical financial data. It is the default format used to import data throughout this book:
type Fields = Array[String]

class DataSource(
    pathName: String,
    normalize: Boolean,
    reverseOrder: Boolean,
    headerLines: Int = 1,
    srcFilter: Option[Fields => Boolean])
  extends PipeOperator[List[Fields => Double], List[DblVector]]
The parameters for the DataSource class are as follows:
- pathName: This is the relative pathname of the data file to be loaded if the argument is a file, or of the directory containing multiple input data files. Most of the files are CSV files.
- normalize: This is a flag to specify whether the data has to be normalized over [0, 1].
- reverseOrder: This is a flag to specify whether the order of the data in the file has to be reversed (for example, a time series) if its value is true.
- headerLines: This specifies the number of lines for column headers and comments.
- srcFilter: This is a filter or condition on some of the row fields used to skip rows of the data set, for example, in case of missing data or an incorrect format.

The most important method of DataSource is the following data transformation from a file to a typed time series (XTSeries[T]), implemented as the pipe operator method. The method takes an extractor from a row of literal values to Double floating-point values:
def |> : PartialFunction[List[Fields => Double], List[DblVector]] = {
  case extr: List[Fields => Double] if(extr != null && extr.size > 0) =>
    load match {  //1
      case Some(data) => {
        if( normalize )  //2
          extr.map(t => Stats[Double](data._2.map(t(_)))  //3
                          .normalize)  //4
        else
          extr.map(t => data._2.map(t(_)))
      }
      …
    }
}
The data is loaded from the file and converted into a list of vectors using the extractor, extr (line 1). The data is normalized if required (line 2) by converting each literal to a floating-point value and creating a Stats object (line 3). Finally, the Stats instance normalizes the sequence of floating-point values (line 4).
A second data transformation consists of transforming a single literal per row to create a time series of single variables:
def |> (extr: Fields => Double): Option[XTSeries[Double]]
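The extractor mechanism behind these transformations can be sketched without any file I/O. In the following hypothetical example, rows stands for the already parsed content of a CSV file, and the transform method only mimics the behavior of the first |> method on in-memory data:

```scala
type Fields = Array[String]

// Hypothetical in-memory rows mimicking a parsed CSV file (open, close).
val rows: List[Fields] = List(Array("10.0", "11.0"), Array("11.0", "12.5"))

// Extractors from a row of literal values to Double floating-point values.
val open:  Fields => Double = (s: Fields) => s(0).toDouble
val close: Fields => Double = (s: Fields) => s(1).toDouble

// Mimics DataSource.|>: one vector of values per extractor.
def transform(extr: List[Fields => Double]): List[Vector[Double]] =
  extr.map(e => rows.map(e).toVector)

val series = transform(List(open, close))
// series is List(Vector(10.0, 11.0), Vector(11.0, 12.5))
```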
The examples in this book rely on three different sources of financial data using CSV format:
- YahooFinancials for the Yahoo schema for historical stock and ETF prices
- GoogleFinancials for the Google schema for historical stock and ETF prices
- Fundamentals for fundamental financial analysis ratios (CSV file)

Let's illustrate the extraction from a data source using YahooFinancials as an example:
object YahooFinancials extends Enumeration {
  type YahooFinancials = Value
  val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, ADJ_CLOSE = Value

  val adjClose = (s: Fields) => s(ADJ_CLOSE.id).toDouble
  …
  def toDouble(v: Value): Fields => Double = (s: Fields) => s(v.id).toDouble

  def vol: Fields => Double = (s: Fields) =>
    (s(HIGH.id).toDouble/s(LOW.id).toDouble - 1.0) * s(VOLUME.id).toDouble
  …
}
Let's look at an example of the application of a DataSource transformation: loading historical stock data from the Google finance website. The data is downloaded as a CSV-formatted file. The first step is to specify the column names using an enumeration singleton, GoogleFinancials:
object GoogleFinancials extends Enumeration {
  type GoogleFinancials = Value
  val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME = Value

  val close = (s: Fields) => s(CLOSE.id).toDouble  //5
  …
}
Each column is associated with an extractor function (line 5). Consider the following code:
val symbols = Array[String]("CSCO", ...)  //6
val prices = symbols
  .map(s => DataSource(path+s+".csv", true, true, 1))  //7
  .map( _ |> GoogleFinancials.close )  //8
The list of stocks for which the historical data has to be downloaded is defined as an array of symbols (line 6). Each symbol is associated with a CSV file (for example, CSCO is associated with resources/CSCO.csv) (line 7). Finally, the GoogleFinancials extractor for the close price is invoked (line 8).
The DocumentsSource class is responsible for extracting the date, title, and content of a list of text documents or text files. This class does not support HTML documents:
class DocumentsSource(val pathName: String)
The extraction of terms is performed by the data transformation |>, as follows:
def |> : Corpus = {
  filesList.map( fName => {
    val src = Source.fromFile(pathName + fName)  //1
    val fieldIter = src.getLines  //2

    val date = nextField(fieldIter)
    val title = nextField(fieldIter)
    val content = fieldIter.foldLeft(new StringBuilder)((b, str) =>
        b.append(str.trim))  //3

    src.close  //4
    if( date == None || title == None )
      throw new IllegalStateException( … )  //6
    (date.get, title.get, content.toString)  //5
  })
}
This method loads the text files for each filename in the list, filesList (line 1). It gets a reference to the document lines iterator, fieldIter (line 2). The iterator is used to extract (line 3) and return the tuple (document date, document title, document content) (line 5) once the file handle is closed (line 4). An IllegalStateException is thrown and caught if the text file is malformed (line 6). The nextField method moves the iterator forward to the next non-null line:
def nextField(iter: Iterator[String]): Option[String] =
  iter.find(s => s != null && s.length > 1)
Some discriminative learning models require operations performed on the rows and columns of a matrix. The parameterized Matrix class facilitates read/write operations on columns and rows:
class Matrix[@specialized(Double, Int) T: ClassTag](
    val nRows: Int,
    val nCols: Int,
    val data: Array[T])(implicit f: T => Double) {

  def apply(i: Int, j: Int): T = data(i*nCols + j)

  def cols(i: Int): Array[T] =
    (i until data.size by nCols).map(data(_)).toArray
  …
  def += (i: Int, j: Int, t: T): Unit = data(i*nCols + j) = t

  def += (iRow: Int, t: T): Unit = {
    val i = iRow*nCols
    Range(0, nCols).foreach(k => data(i + k) = t)
  }

  def /= (iRow: Int, t: T)(implicit g: Double => T): Unit = {
    val i = iRow*nCols
    Range(0, nCols).foreach(k => data(i + k) = g(f(data(i + k)) / f(t)))
  }
}
The apply method returns an element of the matrix. Similarly, the cols method returns a column. The write methods consist of updating a single element or a whole row of elements (+=) with a value, and dividing the elements of a row by a value (/=). The matrix is specialized with the type Double in order to generate dedicated byte code for this type.
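These operations can be exercised with a trimmed-down, Double-only sketch of the Matrix class (the MatrixSketch name is hypothetical; it keeps only the element and column accessors discussed above):

```scala
// Trimmed-down, Double-only sketch of the Matrix class.
class MatrixSketch(val nRows: Int, val nCols: Int, val data: Array[Double]) {
  // Read a single element (row-major storage).
  def apply(i: Int, j: Int): Double = data(i*nCols + j)

  // Read the i-th column.
  def cols(i: Int): Array[Double] =
    (i until data.size by nCols).map(data(_)).toArray

  // Update a single element.
  def += (i: Int, j: Int, t: Double): Unit = data(i*nCols + j) = t
}

val m = new MatrixSketch(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
// m(1, 2) is 6.0 and m.cols(1) is Array(2.0, 5.0)
```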
The generation of the transpose matrix is performed by the transpose method. It is an alternative to the Scala methods Array.transpose and List.transpose:
def transpose: Matrix[T] = {
val m = Matrix[T](nCols, nRows)
Range(0, nRows).foreach(i => {
val col = i*nCols
Range(0, nCols).foreach(j => m += (j, i, data(col+j)))
})
m
}
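The logic of transpose can be checked in isolation with a hypothetical, Double-only helper that operates directly on a row-major array:

```scala
// Hypothetical, Double-only version of the transpose logic above,
// operating directly on a row-major array.
def transposed(nRows: Int, nCols: Int, data: Array[Double]): Array[Double] = {
  val m = new Array[Double](data.length)
  Range(0, nRows).foreach(i => {
    val col = i*nCols
    // Element (i, j) moves to position (j, i) of the transposed matrix.
    Range(0, nCols).foreach(j => m(j*nRows + i) = data(col + j))
  })
  m
}

val t = transposed(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
// t is Array(1.0, 4.0, 2.0, 5.0, 3.0, 6.0): the 3x2 transpose
```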
The constructors of the Matrix class are defined by its companion object:
def apply[T: ClassTag](nR: Int, nC: Int, data: Array[T])
    (implicit f: T => Double): Matrix[T] = new Matrix(nR, nC, data)