In the first chapter, we briefly introduced the binary logistic regression (the binomial logistic regression for a single variable) as our first test case. The purpose was to illustrate the concept of discriminative classification. There are many more regression models, starting with the ubiquitous ordinary least squares linear regression and the logistic regression [6:1].
The purpose of regression is to minimize a loss function, with the residual sum of squares (RSS) being one that is commonly used. The problem of overfitting described in the Overfitting section under Assessing a model in Chapter 2, Hello World!, can be addressed by adding a penalty term to the loss function. The penalty term is an element of the larger concept of regularization.
The first section of this chapter will describe and implement the linear least-squares regression. The second section will introduce the concept of regularization with an implementation of the ridge regression. Finally, the logistic regression will be revisited in detail from the perspective of a classification model.
Linear regression is by far the most widely used, or at least the most commonly known, regression method. The terminology is usually associated with the concept of fitting a model to data by minimizing the errors between the expected and predicted values, measured as the sum of squared errors, also known as the residual sum of squares or least-squares error.
The least squares problems fall into the following two categories:
- Ordinary (linear) least squares, in which the model is a linear function of its parameters
- Nonlinear least squares, in which the model is a nonlinear function of its parameters
Let's start with the simplest form of linear regression, the single-variable regression, in order to introduce the terms and concepts behind linear regression. In its simplest interpretation, the single-variable linear regression consists of fitting a line to a set of data points {x, y}.
The RSS is also known as the sum of squared errors (SSE). The mean squared error (MSE) for n observations is defined as the ratio RSS/n.
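In standard notation (assumed here; the book's M-numbered formulas are not reproduced), for n observations (xi, yi) and a fitted line, these quantities are:

\hat{y}_i = w_1 x_i + w_0, \qquad \mathrm{RSS} = \mathrm{SSE} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \mathrm{MSE} = \frac{\mathrm{RSS}}{n}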
Terminology
The terminology used in the scientific literature regarding regression is a bit confusing at times. Regression weights are also known as regression coefficients or regression parameters. The weights are referred to as w in formulas and the source code throughout the chapter, although β is also used in reference books.
Let's create a SingleLinearRegression parameterized class to implement the M1 formula. The linear regression is a data transformation that uses a model implicitly derived or built from data. Therefore, the simple linear regression implements the ITransform trait, as described in the Monadic data transformation section in Chapter 2, Hello World!

The SingleLinearRegression class takes the following two arguments:
- xt: The vector of single variable observations
- expected: The values or labels (line 1)

The code will be as follows:
class SingleLinearRegression[T <: AnyVal](
    xt: XSeries[T],
    expected: Vector[T])(implicit f: T => Double)
  extends ITransform[T](xt) with Monitor[Double] { //1

  type V = Double //2
  val model: Option[DblPair] = train //3

  def train: Option[DblPair]
  override def |> : PartialFunction[T, Try[V]]
  def slope: Option[Double] = model.map(_._1)
  def intercept: Option[Double] = model.map(_._2)
}
The Monitor trait is used to collect the profiling information during training (refer to the Monitor section under Utility classes in the Appendix A, Basic Concepts).

The class has to define the type of the output of the |> prediction method, which is a Double (line 2).
Model instantiation
The model parameters are computed through training and the model is instantiated regardless of whether the model is actually validated. A commercial application requires the model to be validated using a methodology such as the K-fold validation, as described in the Design template for immutable classifiers section in the Appendix A, Basic Concepts.
The training generates the model defined as the regression weights (slope and intercept) (line 3). The model is set as None if an exception is thrown during training:
def train: Option[DblPair] = {
  val regr = new SimpleRegression(true) //4
  regr.addData(zipToSeries(xt, expected).toArray) //5
  Some((regr.getSlope, regr.getIntercept)) //6
}
The regression weights or coefficients, that is, the model tuple, are computed using the SimpleRegression class from the stats.regression package of the Apache Commons Math library, with the true argument triggering the computation of the intercept (line 4). The input time series and the labels (or expected values) are zipped to generate an array of two values (input and expected) (line 5). The model is initialized with the slope and intercept computed during the training (line 6).
The zipToSeries method of the XTSeries object is described in the Time series in Scala section in Chapter 3, Data Preprocessing.
For our first test case, we compute the single-variable linear regression of the price of the Copper ETF (ticker symbol: CU) over a period of six months (January 1, 2013 to June 30, 2013):
val path = "resources/data/chap6/CU.csv" for { price <- DataSource(path, false, true, 1) get adjClose //7 days <- Try(Vector.tabulate(price.size)(_.toDouble)) //8 linRegr <- SingleLinearRegression[Double](days, price) //9 } yield { if( linRegr.isModel ) { val slope = linRegr.slope.get val intercept = linRegr.intercept.get val error = mse(days, price, slope, intercept)//10 } … }
The daily closing price of the ETF CU is extracted from a CSV file (line 7) as the expected values using a DataSource instance, as described in the Data extraction and Data sources sections in the Appendix A, Basic Concepts. The x values, days, are automatically generated as a linear function (line 8). The expected values (price) and sessions (days) are the inputs to the instantiation of the simple linear regression (line 9).
Once the model is created successfully, the test code computes the mse mean squared error of the predicted and expected values (line 10):
def mse(
    xt: DblVector,
    expected: DblVector,
    slope: Double,
    intercept: Double): Double = {
  val predicted = xt.map( slope*_ + intercept)
  XTSeries.mse(predicted, expected) //11
}
The mean squared error is computed using the mse method of XTSeries (line 11). The original stock price and the linear regression equation are plotted on the following chart:
The total least squares error is 0.926.
Although the single variable linear regression is convenient, it is limited to a scalar time series. Let's consider the case of multiple variables.
The ordinary least squares regression computes the parameters w of a linear function y = f(x0, x1, …, xd) by minimizing the residual sum of squares. The optimization problem is solved by performing vector and matrix operations (transposition, inversion, and substitution).
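For reference, and independently of any particular library, the closed-form solution of the ordinary least squares problem for a design matrix X and a vector of expected values y is given by the normal equations, assuming X^T X is invertible:

w^{*} = \arg\min_{w}\,\lVert y - Xw \rVert^{2} = \left(X^{T}X\right)^{-1}X^{T}y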
There are several methodologies to minimize the residual sum of squares for a linear regression:
- QR decomposition of the matrix of observations
- Singular value decomposition (SVD)
- Cholesky factorization of the normal equations
- Iterative, gradient-based optimization techniques
An overview of these matrix decompositions and optimization techniques can be found in the Linear algebra and Summary of optimization techniques sections in the Appendix A, Basic Concepts.
The QR decomposition produces the smallest relative error (MSE) for the most common least squares problems. The technique is used in our implementation of the least squares regression.
The implementation of the least squares regression leverages the Apache Commons Math library implementation of the ordinary least squares regression [6:6].
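To illustrate the underlying technique, the following sketch solves a small least squares problem directly with the QRDecomposition class of the Apache Commons Math linear algebra package; the data values are made up for the example and this is not the implementation used in the chapter:

import org.apache.commons.math3.linear.{Array2DRowRealMatrix, ArrayRealVector, QRDecomposition}

// Design matrix: a column of 1s for the intercept followed by two features (hypothetical values)
val x = new Array2DRowRealMatrix(Array(
  Array(1.0, 0.5, 1.2),
  Array(1.0, 0.8, 0.9),
  Array(1.0, 1.1, 1.4),
  Array(1.0, 1.6, 1.1)
))
// Expected values (hypothetical)
val y = new ArrayRealVector(Array(1.1, 1.3, 1.9, 2.2))

// Minimize ||y - Xw||^2 through the QR decomposition of X
val w = new QRDecomposition(x).getSolver.solve(y) // intercept followed by the two coefficients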
This chapter describes several types of regression algorithms. It makes sense to define a generic Regression trait that defines the key components of a regression algorithm:
- A model of the RegressionModel type (line 1)
- The weights and rss methods (lines 2 and 3)
- A train polymorphic method that implements the training of this specific regression algorithm (line 4)
- A training protected method that wraps train into a Try monad

The code will be as follows:
trait Regression {
  val model: Option[RegressionModel] = training //1
  def weights: Option[DblArray] = model.map(_.weights) //2
  def rss: Option[Double] = model.map(_.rss) //3
  def isModel: Boolean = model != None
  protected def train: RegressionModel //4

  def training: Option[RegressionModel] = Try(train) match {
    case Success(_model) => Some(_model)
    case Failure(e) => e match {
      case err: MatchError => { … }
      case _ => { … }
    }
  }
}
The model is simply defined by its weights and its residual sum of squares (line 5):
case class RegressionModel( //5
    weights: DblArray,
    rss: Double) extends Model

object RegressionModel {
  def dot[T <: AnyVal](
      x: Array[T],
      weights: DblArray)(implicit f: T => Double): Double =
    x.zip(weights.drop(1))
      .map{ case (_x, w) => _x*w }.sum + weights.head //6
}
The RegressionModel companion object implements the computation of the dot product of the regression weights and an observation, x (line 6). The dot method is used throughout the chapter.
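As a quick illustration with made-up values (assuming the dot signature shown above), the head of the weights array acts as the intercept:

val weights = Array[Double](0.5, 1.5, -1.0)   // hypothetical model: intercept, w1, w2
val x = Array[Double](2.0, 3.0)               // a single observation
val y = RegressionModel.dot(x, weights)       // 0.5 + 2.0*1.5 + 3.0*(-1.0) = 0.5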
Let's create a MultiLinearRegression class as a data transformation whose model is implicitly derived from the input data (training set), as described in the Monadic data transformation section in Chapter 2, Hello World!:
class MultiLinearRegression[T <: AnyVal](
    xt: XVSeries[T],
    expected: DblVector)(implicit f: T => Double)
  extends ITransform[Array[T]](xt) with Regression with Monitor[Double] { //7

  type V = Double //8
  override def train: RegressionModel //9
  override def |> : PartialFunction[Array[T], Try[V]] //10
}
The MultiLinearRegression class takes two arguments: the multidimensional time series of the xt observations and the vector of expected values (line 7). The class implements the ITransform trait and needs to define the type of the output value for the prediction or regression, V, as a Double (line 8). The constructor for MultiLinearRegression creates the model through training (line 9). The ITransform trait's |> method implements the runtime prediction for the multilinear regression (line 10).
The Monitor trait is used to collect the profiling information during training (refer to the Monitor section under Utility classes in the Appendix A, Basic Concepts).
The relationship between the different components of the multilinear regression is described in the following UML class diagram:
The UML diagram omits the helper traits and classes such as Monitor or the Apache Commons Math components.
The training is performed during the instantiation of the MultiLinearRegression class (refer to the Design template for immutable classifiers section in the Appendix A, Basic Concepts):
def train: RegressionModel = {
  val olsMlr = new MultiLinearRAdapter //11
  olsMlr.createModel(expected, data) //12
  RegressionModel(olsMlr.weights, olsMlr.rss) //13
}
The functionality of the ordinary least squares regression in the Apache Commons Math library is accessed through an olsMlr reference to the MultiLinearRAdapter adapter class (line 11).
The train method creates the model by invoking the OLSMultipleLinearRegression Apache Commons Math class (line 12) and returns the regression model (line 13). The various methods of the class are accessed through the MultiLinearRAdapter adapter class:
class MultiLinearRAdapter extends OLSMultipleLinearRegression {
def createModel(y: DblVector, x: Vector[DblArray]): Unit =
super.newSampleData(y.toArray, x.toArray)
def weights: DblArray = estimateRegressionParameters
def rss: Double = calculateResidualSumOfSquares
}
The createModel, weights, and rss methods route the request to the corresponding methods in OLSMultipleLinearRegression.
The Try{} Scala exception handling monad is used as the return type for the train method in order to catch the different types of exceptions thrown by the Apache Commons Math library, such as MathIllegalArgumentException, MathRuntimeException, or OutOfRangeException.
Exception handling
Wrapping the invocation of third-party library methods in a Try {} Scala exception handler matters for a couple of reasons: it isolates the origin of a failure thrown by the library from errors in the client code, and it lets the client code handle or recover from the failure in a purely functional manner.
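A minimal sketch of the pattern, using a direct call to the Apache Commons Math ordinary least squares class (the recovery strategy is illustrative, not the book's):

import scala.util.{Try, Success, Failure}
import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression

def safeFit(y: Array[Double], x: Array[Array[Double]]): Option[Array[Double]] =
  Try {
    val ols = new OLSMultipleLinearRegression
    ols.newSampleData(y, x)             // may throw MathIllegalArgumentException
    ols.estimateRegressionParameters    // may throw a runtime exception
  } match {
    case Success(weights) => Some(weights)
    case Failure(e) => None             // the caller decides how to recover
  }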
The predictive algorithm for the ordinary least squares regression is implemented by the |> data transformation. The method predicts the output value, given a model and an input value, x:
def |> : PartialFunction[Array[T], Try[V]] = {
  case x: Array[T] if isModel &&
    x.length == model.get.weights.length - 1 =>
      Try( dot(x, model.get.weights) ) //14
}
The predictive value is computed using the dot method defined in the RegressionModel singleton, which was introduced earlier in this section (line 14).
Trending consists of extracting the long-term movement in a time series. Trend lines are detected using a multivariate least squares regression. The objective of this first test is to evaluate the filtering capability of the ordinary least squares regression.
The regression is performed on the relative price variation of the Copper ETF (ticker symbol: CU). The selected features are volatility and volume, and the label or target variable, y, is the price change between two consecutive trading sessions.
The naming convention for the trading data and metrics is described in the Trading data section under Technical analysis in the Appendix A, Basic Concepts.
The volume, volatility, and price variation for CU between January 1, 2013 and June 30, 2013 are plotted on the following chart:
Let's write the client code to compute the multivariate linear regression, price change = w0 + w1.volatility + w2.volume:
import YahooFinancials._

val path = "resources/data/chap6/CU.csv" //15
val src = DataSource(path, true, true, 1) //16
for {
  price <- src.get(adjClose) //17
  volatility <- src.get(volatility) //18
  volume <- src.get(volume) //19
  (features, expected) <- differentialData(volatility, volume, price, diffDouble) //20
  regression <- MultiLinearRegression[Double](features, expected) //21
} yield {
  if( regression.isModel ) {
    val trend = features.map(dot(_, regression.weights.get))
    display(expected, trend) //22
  }
}
Let's take a look at the steps required for the execution of the test: it consists of collecting data, extracting the features and expected values, and training the multilinear regression model:
1. Define the path to the CSV file containing the CU trading data (line 15).
2. Create a data source, DataSource, for the trading session closing price, the session volatility, and the session volume for the ETF CU (line 16).
3. Extract the closing price (line 17), its volatility within a trading session (line 18), and the trading volume during the session (line 19) using the DataSource transform.
4. Generate the features and the expected values, where 1 represents the increase in the price and 0 represents the decrease in the price (line 20). The differentialData generic method of the XTSeries singleton is described in the Time series in Scala section in Chapter 3, Data Preprocessing.
5. Instantiate the multilinear regression with the features set and the expected change in the daily ETF price (line 21).
6. Display the expected values and the predicted trend if the model has been created successfully (line 22).

The time series of expected values and the data predicted by the regression are plotted on the following chart:
The least squares regression model is defined by the linear function for the estimation of price variation as follows:
price(t+1)-price(t) = -0.01 + 0.014 volatility – 0.0042.volume
The estimated price change (the dotted line in the preceding chart) represents the long-term trend from which the noise is filtered out. In other words, the least squares regression operates as a simple low-pass filter as an alternative to some of the filtering techniques such as the discrete Fourier transform or the Kalman filter, as described in Chapter 3, Data Preprocessing [6:7].
Although trend detection is an interesting application of the least squares regression, the method has limited filtering capabilities for time series [6:8].
The second test case is related to feature selection. The objective is to discover which subset of initial features generates the most accurate regression model, that is, the model with the smallest residual sum of squares on the training set.
Let's consider an initial set of D features {xi}. The objective is to estimate the subset of features {xid} that are most relevant to the set of observations using a least squares regression. Each subset of features is associated with an fj(x|wj) model. The ordinary least squares regression is used to compute the model parameters w when the feature set is small; performing a regression on every subset of a large original feature set is not practical.
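In generic notation (assumed here, not the book's M-numbered formula), for a subset of features indexed by Sj, the model and its residual sum of squares are:

f_j(x \mid w_j) = w_{j,0} + \sum_{k \in S_j} w_{j,k}\,x_k, \qquad \mathrm{RSS}_j = \sum_{i=1}^{n}\left(y_i - f_j(x_i \mid w_j)\right)^{2}

The selected subset is the one whose model produces the smallest RSSj on the training set.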
Let's consider the following four financial time series over the period from January 1, 2009 to December 31, 2013:
- The exchange rate of the Chinese Yuan to the US dollar
- The S&P 500 index
- The spot price of gold
- The price of the 10-year Treasury bond
The problem is to estimate which combination of the S&P 500 index, gold price, and 10-year Treasury bond price variables is the most correlated to the exchange rate of the Yuan. For practical reasons, we use the Exchange Traded Fund CNY as the proxy for the Yuan/US dollar exchange rate (similarly, SPY, GLD, and TLT for the S&P 500 index, spot price of gold, and 10-year Treasury bond price, respectively).
The number of models to evaluate is relatively small, so an ad hoc approach to compute the RSS for each combination is acceptable. Let's take a look at the following graph:
The getRss method implements the computation of the RSS value given the xt observations, the expected (smoothed) values, and the featureLabels feature labels; it returns a textual result:
def getRss(
    xt: Vector[DblArray],
    expected: DblVector,
    featureLabels: Array[String]): String = {

  val regression = MultiLinearRegression[Double](xt, expected) //23
  val modelStr = regression.weights.get.view
    .zipWithIndex.map{ case (w, n) =>
      val weights_str = format(w, emptyString, SHORT)
      if( n == 0 ) s"${featureLabels(n)} = $weights_str"
      else s"$weights_str.${featureLabels(n)}"
    }.mkString(" + ")
  s"model: $modelStr RSS = ${regression.rss.get}" //24
}
The getRss method merely trains the model by instantiating the multilinear regression class (line 23). Once the regression model is trained during the instantiation of the MultiLinearRegression class, the regression weights and the RSS value are formatted as a string (line 24). The getRss method is invoked for each combination of the GLD, SPY, and TLT ETF variables against the CNY label.
Let's take a look at the following test code:
val SMOOTHING_PERIOD: Int = 16 //25
val path = "resources/data/chap6/"
val symbols = Array[String]("CNY", "GLD", "SPY", "TLT") //26

val movAvg = SimpleMovingAverage[Double](SMOOTHING_PERIOD) //27
for {
  pfnMovAve <- Try(movAvg |>) //28
  smoothed <- filter(pfnMovAve) //29
  models <- createModels(smoothed) //30
  rsses <- Try(getModelsRss(models, smoothed)) //31
  (mses, tss) <- totalSquaresError(models, smoothed.head) //32
} yield {
  s"""${rsses.mkString(" ")} ${mses.mkString(" ")}
     | Residual error = $tss""".stripMargin
}
The dataset is large (1,260 trading sessions) and noisy enough to warrant filtering using a simple moving average with a period of 16 trading sessions (line 25). The purpose of the test is to evaluate the possible correlation between the four ETFs: CNY, GLD, SPY, and TLT (line 26). The execution test instantiates the simple moving average (line 27), as described in The simple moving average section in Chapter 3, Data Preprocessing.
The workflow executes the following steps:
1. Retrieve the pfnMovAve partial function from the moving average transform (line 28).
2. Smooth the closing prices of the four ETFs with the filter function (line 29):

type PFNMOVAVE = PartialFunction[DblVector, Try[DblVector]]

def filter(pfnMovAve: PFNMOVAVE): Try[Array[DblVector]] = Try {
  symbols.map(s => DataSource(s"$path$s.csv", true, true, 1))
    .map( _.get(adjClose) )
    .map( pfnMovAve(_) )
    .map(_.get)
}
3. Generate the models from combinations of the smoothed time series with the createModels method (line 30):

type Models = List[(Array[String], DblMatrix)]

def createModels(smoothed: Array[DblVector]): Try[Models] = Try {
  val features = smoothed.drop(1).map(_.toArray) //33
  List[(Array[String], DblMatrix)]( //34
    (Array[String]("CNY","SPY","GLD","TLT"), features.transpose),
    (Array[String]("CNY","GLD","TLT"), features.drop(1).transpose),
    (Array[String]("CNY","SPY","GLD"), features.take(2).transpose),
    (Array[String]("CNY","SPY","TLT"), features.zipWithIndex
        .filter( _._2 != 1).map( _._1).transpose),
    (Array[String]("CNY","GLD"), features.slice(1,2).transpose)
  )
}
The smoothed values for CNY are used as the expected values. Therefore, they are removed from the features list (line 33). The five models are evaluated by adding or removing elements from the features list (line 34).
4. Compute the RSS value for each model with getModelsRss (line 31). The method invokes getRss, which was introduced earlier in this section, for each model (line 35):

def getModelsRss(
    models: Models,
    y: Array[DblVector]): List[String] =
  models.map{ case (labels, m) => s"${getRss(m.toVector, y.head, labels)}" } //35
5. Compute the mses mean squared errors for each model and the total squared error (line 32):

def totalSquaresError(
    models: Models,
    expected: DblVector): Try[(List[String], Double)] = Try {
  val errors = models.map{ case (labels, m) => rssSum(m, expected)._1 } //36
  val mses = models.zip(errors)
    .map{ case (f, e) => s"MSE: ${f._1.mkString(" ")} = $e" }
  (mses, Math.sqrt(errors.sum)/models.size) //37
}
The totalSquaresError method computes the error for each model by summing the RSS values returned by rssSum (line 36). The method returns a pair of the list of mean squared errors for each model and the total squared error (line 37).
The RSS does not always provide an accurate visualization of the fitness of the regression model. The fitness of the regression model is commonly assessed using the r2 statistic. The r2 value is a number that indicates how well the data fits a statistical model.
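For reference, the standard definition of the coefficient of determination, written in generic notation, relates the RSS to the total sum of squares (TSS) around the mean of the expected values:

r^{2} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^{2}}{\sum_{i}\left(y_i - \bar{y}\right)^{2}}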
The implementation of the computation of the r2 statistic is simple. For each model fj, the rssSum method computes the (RSS, sum of squared errors) tuple, as defined in the M4 formula:
def rssSum(xt: DblMatrix, expected: DblVector): DblPair = {
  val regression = MultiLinearRegression[Double](xt, expected) //38
  val pfnRegr = regression |> //39
  val results = sse(expected.toArray, xt.map(pfnRegr(_).get))
  (regression.rss.get, results) //40
}
The rssSum method instantiates the MultiLinearRegression class (line 38), retrieves the RSS value, and then validates the pfnRegr regression model (line 39) against the expected values (line 40). The output of the test is presented in the following screenshot:
The output results clearly show that the three-variable regression, CNY = f(SPY, GLD, TLT), is the most accurate (fittest) model for the CNY time series, followed by CNY = f(SPY, TLT). Therefore, the feature selection process generates the feature set {SPY, GLD, TLT}.
Let's plot the model against the raw data:
The regression model smoothed the original CNY time series. It weeded out all but the most significant price variation. The graph plotting the r2 value for each of the models confirms that the three-feature model, CNY = f(SPY, GLD, TLT), is the most accurate:
The general linear regression
The concept of a linear regression is not restricted to polynomial fitting models such as y = w0 + w1.x + w2.x^2 + … + wn.x^n. Regression models can also be defined as a linear combination of basis functions ϕj: y = w0 + w1.ϕ1(x) + w2.ϕ2(x) + … + wn.ϕn(x) [6:9].
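For instance, a polynomial of degree two in a single variable is still linear in its weights. The following sketch, with made-up data and using the same Apache Commons Math class as earlier in the chapter, fits y against the basis functions ϕ1(x) = x and ϕ2(x) = x^2:

import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression

val x = Array(1.0, 2.0, 3.0, 4.0, 5.0)       // raw single-variable observations (hypothetical)
val y = Array(1.9, 4.1, 8.2, 13.8, 22.1)     // expected values (hypothetical)

// Basis function expansion: each observation becomes (phi1(x), phi2(x)) = (x, x^2)
val design = x.map(v => Array(v, v*v))

val ols = new OLSMultipleLinearRegression
ols.newSampleData(y, design)                   // the intercept w0 is added by default
val weights = ols.estimateRegressionParameters // w0, w1, w2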