In the first chapter, we briefly introduced the binary logistic regression (binomial logistic regression for a single variable) as our first test case. The purpose was to illustrate the concept of discriminative classification. There are many more regression models, starting with the ubiquitous ordinary least-square linear regression and the logistic regression [6:1].
The purpose of regression is to minimize a loss function, with the residual sum of squares (RSS) being one that is commonly used. The problem of overfitting described in the Overfitting section of Chapter 2, Hello World!, can be addressed by adding a penalty term to the loss function. The penalty term is an element of the larger concept of regularization.
The first section of this chapter will describe and implement the linear least-squares regression. The second section will introduce the concept of regularization with an implementation of the Ridge regression.
Finally, the logistic regression will be revisited in detail from the perspective of a classification model.
Linear regression is by far the most widely used, or at least the most commonly known, regression method. The terminology is usually associated with the concept of fitting a model to data. Linear regression can be implemented using the least squares method. Practically, the least squares method entails the minimization of the sum of the squares of the error between the observed data and the actual model.
The least squares problems fall into two categories:
Let's start with the simplest form of linear regression, which is the single variable regression, in order to introduce the terms and concepts behind linear regression. In its simplest interpretation, the one-variate linear regression consists of fitting a line to a set of data points {x, y}.
The RSS is also known as the sum of squared errors (SSE). The mean squared error (MSE) for n observations is defined as the ratio RSS/n.
Terminology
The terminology used in the scientific literature regarding regression is a bit confusing at times. Regression weights are also known as regression coefficients or regression parameters. The weights are referred to as w
in formulas and the source code throughout the chapter, although β
is also used in reference books.
Let's create a parameterized class SingleLinearRegression[T]
to implement the formula described in the previous section. The class implements the data transformation PipeOperator
(refer to the Design template for classifiers section in Appendix A, Basic Concepts).
class SingleLinearRegression[T <% Double](xt: XTSeries[(T, T)])(implicit g: Double => T) extends PipeOperator[Double, T] { type XY = (Double, Double) … }
Model instantiation
The model parameters are computed through training and the value model is instantiated regardless of whether the model is actually validated. A commercial application requires the model to be validated using a methodology such as the K-fold validation. (Refer to the Design template for classifiers section Appendix A, Basic Concepts.)
The application code must provide an implicit conversion g
from Double
to the class type parameter, T
. The training generates the model defined as the regression weights, the tuple (slope, intercept), in the case of single variable linear regression:
val model: Option[XY] = { val data = xt.toArray .map(x => Array[Double](x._1, x._2)) //1 val regr = new SimpleRegression(true) regr.addData(data) //2 Some((regr.getSlope, regr.getIntercept)) //3 }
The tuple of regression weights or coefficients for the model are computed using the SimpleRegression
class from the stats.regression
package of the Apache Commons Math library. The time series is converted to a matrix of double values, data
(line 1
), which is used to initialize the instance of SimpleRegression
(line 2
). The model is initialized with the slope and intercept computed during the training (line 3
).
For our first test case, we compute the single variate linear regression of the price of the copper ETF (the ticker symbol: CU) over a period of 6 months (January 1, 2013 to June 30, 2013):
val price = DataSource(path, false, true, 1) |> adjClose //1 val xy = price.zipWithIndex .map(x => (x._2.toDouble, x._1.toDouble)) //2 val linRegr = SingleLinearRegression(xy) //3 val w1 = linRegr.slope val w0 = linRegr.intercept if( w1 != None ) //4 Display.show(lsErr(xy.toArray, w1.get, w0.get), logger) …
The closing price for the CU ETF is extracted from a CSV file (line 1
) using a DataSource
instance (refer to the Data extraction section Appendix A, Basic Concepts). The 2-dimension time series is generated by converting the indexes of the time series into the x
values using the zipWithIndex
Scala method (line 2
). The regression model, linRegr
, is trained during instantiation of the SingleLinearRegression
class (line 3
). Once the model is created successfully, the least squared error lsErr
of the predicted values and the actual values is computed, as follows:
def lsErr(xyt: Array[XY], w1: Double, w0: Double): Double =
Math.sqrt(xyt.foldLeft(0.0)((err, xy) => {
val diff = xy._2 – w1*xy._1 – w0; err + diff*diff
})/xyt.size)
The original stock price and the linear regression equation are plotted in the following chart:
Although the single variable linear regression is convenient, it is limited to scalar time series. Let's consider the case of multiple variables.
The ordinary least squares regression computes the parameters w of a linear function, y = f(x0, x2 … xd), by minimizing the residual sum of squares. The optimization problem is solved by performing vector and matrix operations (transposition, inversion, and substitution).
There are several methodologies to minimize the residual sum of squares (RSS) for a linear regression:
An overview of these matrix decompositions and optimization techniques can be found in the Linear algebra and Summary of optimization techniques sections in Appendix A, Basic Concepts.
The QR decomposition generates the smallest relative error MSE for the most common least squares problem. The technique is used in our implementation of the least squares regression.
The following implementation of the least squares regression leverages the Apache Commons Math library implementation of the ordinary least squares regression [6:6].
Let's create a class, MultiLinearRegression
, which inherits the implementation of the ordinary least square computation of the Apache Commons Math library OLSMultipleLinearRegression
. The class is defined as a data transformation implementing the PipeOperator
, as follows:
class MultiLinearRegression[T <% Double](xt: XTSeries[Array[T]], y: DblVector) extends OLSMultipleLinearRegression with PipeOperator[Array[T], Double]
The parameterized class takes the following two parameters:
xt
(input matrix)y
, used in trainingThe model for the linear regression is defined by its weights
(or parameters) and its residual sum of squares, rss
. The RSS is included in the model because it provides the client code with important information regarding the accuracy of the underlying technique used to minimize the loss function:
case class RegressionModel(val weights: DblVector, val rss: Double)
The relationship between the different components of the least squares regression is described in the following UML class diagram:
The training is performed during the instantiation of the class MultiLinearRegression
(refer to the Design template for classifiers section in Appendix A, Basic Concepts):
val model: Option[RegressionModel] = { newSampleData(labels, xt.toDblMatrix) //1 val weights = estimateRegressionParameters val wRss =(weights, calculateResidualSumOfSquares) //2 Some(RegressionModel(wRss._1, wRss._2)) }
The least squares algorithm is initialized with the feature observations, xt
, and the target data, labels
, using the newSampleData
method of OLSMultipleLinearRegression
(line 1
).
The model weights
are retrieved using estimateRegressionParameters
(similarly, rss
using calculateResidualSumOfSquares
) (line 2
).
Exception handling
Wrapping up invocation of methods in a third party with a Scala exception handler matters for a couple of reasons: it makes debugging easier by segregating your code from the third party and it allows your code to recover from the exception by re-executing the same function with alternative third-party library methods, whenever possible.
The predictive algorithm for the ordinary least squares regression is implemented by the data transformation |>
. The method predicts the output value given model
and an input value x
:
def |> : PartialFunction[Feature, Double] = { case x: Feature if(model!=None && x.size==model.get.size-1) =>{ val w = model.get.weights x.zip(w.drop(1)).foldLeft(w(0))((s, z) => s + z._1*z._2)) } }
The predictive value is computed by zipping the weight w1
to wn
with the input vector x
and then folding the zipped array.
Trending consists of extracting the long-term movement in a time series. Trend lines can be identified using a multivariate least squares regression. The objective of this first test is to evaluate the filtering capability of the ordinary least squares regression.
The regression is performed on the relative price variation of the Copper ETF (ticker symbol: CU). The selected features are volatility and volume, and the label or target variable is the price change between two consecutive trading sessions y
. The volume, volatility, and price variation for CU between January 1, 2013 and June 30, 2013 are plotted in the following chart:
Let's write the client code to compute the multivariate linear regression, price change = w0 + volatility.w1 + volume.w2:
val path = "resources/data/chap6/CU.csv" val src = DataSource(path, true, true, 1) //2 val price = (src |> YahooFinancials.adjClose).toArray //1 val volatility = src |> YahooFinancials.volatility //1 val volume = src |> YahooFinancials.volume //1 val deltaPrice = price.drop(1) .zip(price.take(price.size -1)) .map(z => z._1 - z._2) ) //3 val data = volatility.zip(volume) .map(z => Array[Double](z._1, z._2)) val features = XTSeries[DblVector](data.dropRight(1)) //4 val regression = MultiLinearRegression[Double](features, deltaPrice) //5 regression.weights match { case Some(w) => Display.show(w, logger) …
The daily session adjusted closing price
, the session volatility
, and the session volume
for the CU ETF is extracted from a CSV file (line 1
) using the DataSource
transformation (line 2
). The array, priceChange
, which is the daily price change between two consecutive trading sessions is computed by duplicating, shifting, and zipping the session closing prices (line 3
). The features are computed by zipping volatility and the volume time series (line 4
). The regression model is trained by instantiating the MultiLinearRegression
class (line 5
) and the model weights
are displayed using an auxiliary display
method (to the logger or standard output) (line 6
).
The original price change time series and the data predicted by the regression are plotted in the following chart:
The least squares regression model is defined by the linear function for the estimation of price variation as follows:
price(t+1)-price(t) = -0.01 + 0.014 volatility – 0.0042.volume
The estimated price change (the dotted line in the preceding chart) represents the long term trend from which the noise is filtered out. In other words, the least squares regression operates as a simple low-pass filter as an alternative to some of the filtering techniques such as discrete Fourier transform or the Kalman filter for dynamic systems (refer to Chapter 3, Data Preprocessing) [6:7].
Although trend detection is an interesting application of the least squares regression, the method has limited filtering capabilities for time series [6:8]:
The second test case is related to features selection. The objective is to discover which subset of initial features generates the most accurate regression model, that is, the model with the smallest residual sum of squares (RSS) on the training set.
Let's consider an initial set of D features {xi}. The objective is to estimate the subset of features {xid} that are the most relevant to the set of observations using a least squares regression. Each subset of features is associated to an fj(x|wj) model:
The OLS can be used to select the model parameters w
if the original set of features is small. Performing the regression of each subset of a large original features set is not practical.
Let's consider the following four financial time series over the period from January 1, 2009 to December 31, 2013:
The problem is to estimate which combination of the three variables S&P 500 index, gold price, and 10-year treasury bond price is the most correlated to the exchange rate of the Yuan. For practical reasons, we use the Exchange Trade Funds CYN as the proxy for the Yuan/US dollar exchange rate (similarly, SPY, GLD, and TLT for S&P 500 index, the spot price of gold, and the 10-year treasury bond price respectively).
Automation of features extraction
The code in this section implements an ad hoc extraction of features with an arbitrary fixed set of models. The process can be easily automated with an optimizer (gradient descent, genetic algorithm, and so on) using 1/RSS as the objective function to be maximized.
The number of models to evaluate is relatively small, so an ad hoc approach to compute the RSS for each combination is acceptable. Have a look at the following graph:
The getRss
method implements the computation of the RSS value given a set of observations xt
and labeled values y
:
def getRss(xt: XTSeries[DblVector], y: DblVector): String = { val regression = MultiLinearRegression[Double](xt, y) //1 val buf = new StringBuilder regression.weights.get .zipWithIndex //2 .foreach(w => { if(w._2 == 0) buf.append(w._1) else buf.append(s" + ${w._1}.x${w._2}") //3 buf.append(s"RSS: ${(regression.rss.get}").toString }
The getRss
method merely trains the model by instantiating the multilinear regression class (line 1
), indexes the array of weights (line 2
), and creates a text representation of the linear regression equation (line 3
).
Once the regression model is trained during the instantiation of the MultiLinearRegression
class, the coefficients of the regression weights
and the RSS value are printed. The rss
method is invoked for any combination of the variables ETF
, GLD
, SPY
, and TLT
against the label CNY
:
val symbols = Array[String]("CNY", "GLD", "SPY", "TLT") val smoothingPeriod = 16 val movAvg = SimpleMovingAverage[Double](smoothingPeriod) //4 val input= symbols.map(s=>DataSource(path+s+".csv",true,true, 1)) .map( _ |> YahooFinancials.adjClose ) //5 .map(x=> movAvg |> XTSeries[Double](x)) val features = input.drop(1) val featuresList = List[(String, DblMatrix)]( ("CNY=f(SPY,GLD,TLT)", features.map( _.toArray).transpose),//6 ("CNY=f(GLD,TLT)", features.drop(1).map( _.toArray).transpose), … } featuresList.foreach(x => Display.show(x._1 + getRss(XTSeries[DblVector](x._2), input(0)), logger)) //7
The dataset is large (1,260 trading sessions) and noisy enough to warrant filtering using a simple moving average with a period of 16 trading sessions, movAvg
(line 4
). The time series are extracted from CSV files using the DataSource
class, then smoothed using a sequence of Array.map
invocations (line 5
). The first map extracts the content of the files associated to the stock ticker symbol, assuming that the names of the files are formatted as path/symbol.csv
.
For the sake of simplicity, the option type returned by the pipe operator is not validated.
The first model using the three variables SPY
, GLD
, and TLT
is created by transposing them by the xt.size
matrix (line 6
). The RSS value is computed by invoking the rss
method (line 7
). The second model using two variables, SPY
and TLT
, is created by filtering out the GLD
time series. The process is repeated for all other models. Have a look at the following screenshot:
The output results clearly show that the three variable regression CNY=f(SPY, GLD, TLT) is the most accurate or fittest model for the CNY
time series, followed by CNY =f(SPY, TLT). Therefore, the feature selection process generates the features set, {SPY, GLD, TLT}
.
Let's plot the model against the raw data:
The regression model smoothed the original CNY
time series. It weeded out all but the most significant price variation.
However, the RSS does not always provide an accurate visualization of the fitness of the regression model. The fitness of the regression model is commonly assessed using r2 statistics. The r2 value is a number that indicates how well data fits a statistical model.
The implementation of the computation of the r2 statistics is fairly simple. For each model fj, the rssSum
method computes the tuple {rss, sum of predicted values}:
def rssSum(xt: XTSeries[DblVector], y: DblVector): XY = { val regression = MultiLinearRegression[Double](xt, y) (regression.rss.get, xt.toArray.zip(y).foldLeft(0.0)(s,x) => val d = (x._2 - (regression |> x._1)) s + d*d })
Finally, the process is repeated for each model and the sum of the predicted values for each model is summed (line 8
), averaged (line 9
), and then used in the r2 formula (line 10
):
var xsRss = new ListBuffer[Double]() val tss = featuresList.foldLeft(0.0)((s, x) => { //8 val _tss = rssSum(XTSeries[DblVector](x._2), input(0)) xsRss.append(_tss._1) s + _tss._2 //9 })/xsRss.size xsRss.map( 1.0 - _/tss) //10
The graph plotting the r2 value for each model confirms that the three features model is the most accurate: