Chapter 6. Regression and Regularization

In the first chapter, we briefly introduced binary logistic regression (binomial logistic regression for a single variable) as our first test case. The purpose was to illustrate the concept of discriminative classification. There are many more regression models, starting with the ubiquitous ordinary least squares linear regression and the logistic regression [6:1].

The purpose of regression is to minimize a loss function, with the residual sum of squares (RSS) being one that is commonly used. The problem of overfitting described in the Overfitting section of Chapter 2, Hello World!, can be addressed by adding a penalty term to the loss function. The penalty term is an element of the larger concept of regularization.

The first section of this chapter will describe and implement the linear least-squares regression. The second section will introduce the concept of regularization with an implementation of the Ridge regression.

Finally, the logistic regression will be revisited in detail from the perspective of a classification model.

Linear regression

Linear regression is by far the most widely used, or at least the most commonly known, regression method. The terminology is usually associated with the concept of fitting a model to data. Linear regression can be implemented using the least squares method. Practically, the least squares method entails the minimization of the sum of the squares of the errors between the observed data and the values predicted by the model.

The least squares problems fall into two categories:

  • Ordinary least squares
  • Nonlinear least squares

One-variate linear regression

Let's start with the simplest form of linear regression, which is the single variable regression, in order to introduce the terms and concepts behind linear regression. In its simplest interpretation, the one-variate linear regression consists of fitting a line to a set of data points {x, y}.

Note

Single variable linear regression is given by the following formula:

y = f(x) = w0 + w1.x
(w0, w1) = argmin Σj (yj - f(xj))², 0 ≤ j < n

Here, w1 is the slope, w0 is the intercept, f is the linear function that minimizes the RSS, and (xj, yj) is a set of n observations.

The RSS is also known as the sum of squared errors (SSE). The mean squared error (MSE) for n observations is defined as the ratio RSS/n.
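
For instance, a minimal, self-contained sketch of these definitions (independent of the classes introduced later in this chapter; the rss and mse helpers are illustrative) is as follows:

def rss(xy: Array[(Double, Double)], w0: Double, w1: Double): Double =
  xy.foldLeft(0.0){ case (sum, (x, y)) => {
    val residual = y - (w0 + w1*x)  // error between observation and model
    sum + residual*residual
  }}

def mse(xy: Array[(Double, Double)], w0: Double, w1: Double): Double =
  rss(xy, w0, w1)/xy.size  // mean squared error = RSS/n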

Note

Terminology

The terminology used in the scientific literature regarding regression is a bit confusing at times. Regression weights are also known as regression coefficients or regression parameters. The weights are referred to as w in formulas and the source code throughout the chapter, although β is also used in reference books.

Implementation

Let's create a parameterized class SingleLinearRegression[T] to implement the formula described in the previous section. The class implements the data transformation PipeOperator (refer to the Design template for classifiers section in Appendix A, Basic Concepts).

class SingleLinearRegression[T <% Double](xt: XTSeries[(T, T)])(implicit g: Double => T) extends PipeOperator[Double, T] {
  type XY = (Double, Double)
  …
}

Tip

Model instantiation

The model parameters are computed through training and the value model is instantiated regardless of whether the model is actually validated. A commercial application requires the model to be validated using a methodology such as K-fold validation. (Refer to the Design template for classifiers section in Appendix A, Basic Concepts.)

The application code must provide an implicit conversion g from Double to the class type parameter, T. The training generates the model defined as the regression weights, the tuple (slope, intercept), in the case of single variable linear regression:

val model: Option[XY] = {
  val data = xt.toArray
               .map(x => Array[Double](x._1, x._2))  //1
  val regr = new SimpleRegression(true)  
  regr.addData(data)  //2
  Some((regr.getSlope, regr.getIntercept))  //3
}

The tuple of regression weights or coefficients for the model is computed using the SimpleRegression class from the stats.regression package of the Apache Commons Math library. The time series is converted to a matrix of double values, data (line 1), which is used to initialize the instance of SimpleRegression (line 2). The model is initialized with the slope and intercept computed during the training (line 3).

Tip

private vs. private[this]

A private value or variable can be accessed by all the instances of a class. A value declared private[this] can be manipulated only by this instance. For example, the value model can be accessed only by this instance of SingleLinearRegression.
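
A minimal illustration of the difference (the Counter class is hypothetical):

class Counter(private[this] val seed: Int, private val tag: String) {
  // 'tag' is private but visible to any other Counter instance
  def sameTag(other: Counter): Boolean = tag == other.tag
  // 'seed' is private[this]: it can only be referenced through this instance,
  // so 'other.seed' would not compile
  def scaled(factor: Int): Int = seed * factor
}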

Test case

For our first test case, we compute the single-variable linear regression of the price of the copper ETF (ticker symbol: CU) over a period of 6 months (January 1, 2013 to June 30, 2013):

val price = DataSource(path, false, true, 1) |> adjClose //1
val xy = price.zipWithIndex
              .map(x => (x._2.toDouble, x._1.toDouble)) //2

val linRegr = SingleLinearRegression(xy) //3
val w1 = linRegr.slope
val w0 = linRegr.intercept
if( w1 != None ) //4
  Display.show(lsErr(xy.toArray, w1.get, w0.get), logger)
…

The closing price for the CU ETF is extracted from a CSV file (line 1) using a DataSource instance (refer to the Data extraction section in Appendix A, Basic Concepts). The two-dimensional time series is generated by converting the indexes of the time series into the x values using the zipWithIndex Scala method (line 2). The regression model, linRegr, is trained during the instantiation of the SingleLinearRegression class (line 3). Once the model is created successfully, the least squares error, lsErr, between the predicted values and the actual values is computed, as follows:

def lsErr(xyt: Array[XY], w1: Double, w0: Double): Double =
  Math.sqrt(xyt.foldLeft(0.0)((err, xy) => {
     val diff = xy._2 - w1*xy._1 - w0; err + diff*diff
  })/xyt.size)

The original stock price and the linear regression equation are plotted in the following chart:


Single variable linear regression – Copper ETF daily price

Although the single variable linear regression is convenient, it is limited to scalar time series. Let's consider the case of multiple variables.

Ordinary least squares (OLS) regression

The ordinary least squares regression computes the parameters w of a linear function, y = f(x1, x2, …, xd), by minimizing the residual sum of squares. The optimization problem is solved by performing vector and matrix operations (transposition, inversion, and substitution).

Note

Minimization of the loss function is given by the following formula:

f(x|w) = w0 + w1.x1 + … + wd.xd
w = argmin Σi (yi - f(xi|w))², 0 ≤ i < n

where w = {wj}, 0 ≤ j ≤ d, are the regression (or model) parameters (or weights), (xi, yi), 0 ≤ i < n, are the n observations of the vector x and the output value y, and f is the linear multivariate function, y = f(x1, x2, …, xd).

There are several methodologies to minimize the residual sum of squares (RSS) for a linear regression:

  • Resolution of the set of n equations with d variables (weights) using the QR decomposition of the n by d matrix representing the time series of n observations of a vector of d dimensions (d features), with n > d [6:2]
  • Singular value decomposition on the observations-features matrix, in the case where the dimension d exceeds the number of observations n [6:3]
  • Gradient descent [6:4]
  • Stochastic gradient descent [6:5]

An overview of these matrix decompositions and optimization techniques can be found in the Linear algebra and Summary of optimization techniques sections in Appendix A, Basic Concepts.

The QR decomposition generates the smallest relative error (MSE) for the most common least squares problems. The technique is used in our implementation of the least squares regression.
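
For reference, the following sketch shows how the QR decomposition of the design matrix solves the least squares problem with the Apache Commons Math linear algebra package (the qrLeastSquares helper is illustrative; the OLSMultipleLinearRegression class used in the next section performs an equivalent computation internally):

import org.apache.commons.math3.linear.{Array2DRowRealMatrix, ArrayRealVector, QRDecomposition}

// Sketch: solve min ||X.w - y|| by QR decomposition of the n x (d+1) design matrix X
def qrLeastSquares(x: Array[Array[Double]], y: Array[Double]): Array[Double] = {
  val design = new Array2DRowRealMatrix(x.map(1.0 +: _))  // prepend 1.0 for the intercept w0
  val solver = new QRDecomposition(design).getSolver      // least squares solver (requires n > d)
  solver.solve(new ArrayRealVector(y)).toArray            // weights (w0, w1, ..., wd)
}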

Design

The following implementation of the least squares regression leverages the Apache Commons Math library implementation of the ordinary least squares regression [6:6].

Let's create a class, MultiLinearRegression, which inherits the implementation of the ordinary least squares computation of the Apache Commons Math library class OLSMultipleLinearRegression. The class is defined as a data transformation implementing the PipeOperator, as follows:

class MultiLinearRegression[T <% Double](xt: XTSeries[Array[T]], y: DblVector) extends OLSMultipleLinearRegression with PipeOperator[Array[T], Double]

The parameterized class takes the following two parameters:

  • The time series of the variables vector xt (input matrix)
  • The labeled output values, y, used in training

The model for the linear regression is defined by its weights (or parameters) and its residual sum of squares, rss. The RSS is included in the model because it provides the client code with important information regarding the accuracy of the underlying technique used to minimize the loss function:

case class RegressionModel(weights: DblVector, rss: Double)

The relationship between the different components of the least squares regression is described in the following UML class diagram:

UML class diagram of the least squares regression components

Implementation

The training is performed during the instantiation of the class MultiLinearRegression (refer to the Design template for classifiers section in Appendix A, Basic Concepts):

val model: Option[RegressionModel] = {
   newSampleData(labels, xt.toDblMatrix) //1
   val weights = estimateRegressionParameters
   val wRss = (weights, calculateResidualSumOfSquares) //2
   Some(RegressionModel(wRss._1, wRss._2))
}

The least squares algorithm is initialized with the feature observations, xt, and the target data, labels, using the newSampleData method of OLSMultipleLinearRegression (line 1).

The model weights are retrieved using estimateRegressionParameters (similarly, rss using calculateResidualSumOfSquares) (line 2).

Tip

Exception handling

Wrapping invocations of third-party library methods in a Scala exception handler matters for a couple of reasons: it makes debugging easier by segregating your code from the third-party code, and it allows your code to recover from the exception by re-executing the same function with an alternative third-party library method, whenever possible.
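
For example, the computation of the model could be guarded with scala.util.Try. This is a sketch of the idea, assuming it is placed inside MultiLinearRegression where newSampleData, labels, and xt are in scope, and that Display.show and logger are available as in the test code:

import scala.util.{Try, Success, Failure}

val model: Option[RegressionModel] = Try {
  newSampleData(labels, xt.toDblMatrix)  // third-party call: may throw on malformed data
  RegressionModel(estimateRegressionParameters, calculateResidualSumOfSquares)
} match {
  case Success(m) => Some(m)
  case Failure(e) => Display.show(s"Regression training failed: ${e.getMessage}", logger); None
}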

The predictive algorithm for the ordinary least squares regression is implemented by the data transformation |>. The method predicts the output value given the model and an input value x:

def |> : PartialFunction[Feature, Double] = {
  case x: Feature if(model != None && x.size == model.get.weights.size-1) => {
    val w = model.get.weights
    x.zip(w.drop(1)).foldLeft(w(0))((s, z) => s + z._1*z._2)
  }
}

The predicted value is computed by zipping the weights w1 to wd with the input vector x and then folding the zipped array.
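
For instance, assuming regression is an instance of MultiLinearRegression[Double] trained on two features, a prediction could be obtained as follows (a hypothetical usage sketch; the numerical values are illustrative):

val x = Array[Double](0.23, 0.18)   // one observation with two features
val predictor = regression.|>       // partial function returned by the model
val yEst: Option[Double] =
  if( predictor.isDefinedAt(x) ) Some(predictor(x)) else None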

Test case 1 – trending

Trending consists of extracting the long-term movement in a time series. Trend lines can be identified using a multivariate least squares regression. The objective of this first test is to evaluate the filtering capability of the ordinary least squares regression.

The regression is performed on the relative price variation of the copper ETF (ticker symbol: CU). The selected features are volatility and volume, and the label or target variable, y, is the price change between two consecutive trading sessions. The volume, volatility, and price variation for CU between January 1, 2013 and June 30, 2013 are plotted in the following chart:


Chart for price variation, volatility, and trading volume for Copper ETF

Let's write the client code to compute the multivariate linear regression, price change = w0 + w1.volatility + w2.volume:

val path = "resources/data/chap6/CU.csv"
val src = DataSource(path, true, true, 1) //2
val price = (src |> YahooFinancials.adjClose).toArray  //1
val volatility = src |> YahooFinancials.volatility  //1
val volume = src |> YahooFinancials.volume  //1

val deltaPrice = price.drop(1)
                      .zip(price.take(price.size -1))
                      .map(z => z._1 - z._2)  //3
val data = volatility.zip(volume)
                     .map(z => Array[Double](z._1, z._2))
val features = XTSeries[DblVector](data.dropRight(1)) //4
val regression = MultiLinearRegression[Double](features, deltaPrice) //5
regression.weights match {
  case Some(w) => Display.show(w, logger) //6
 …

The daily session adjusted closing price, the session volatility, and the session volume for the CU ETF are extracted from a CSV file (line 1) using the DataSource transformation (line 2). The array, deltaPrice, which is the daily price change between two consecutive trading sessions, is computed by duplicating, shifting, and zipping the session closing prices (line 3). The features are computed by zipping the volatility and the volume time series (line 4). The regression model is trained by instantiating the MultiLinearRegression class (line 5) and the model weights are displayed using an auxiliary display method (to the logger or standard output) (line 6).

The original price change time series and the data predicted by the regression are plotted in the following chart:


Price variation and the least squares regression for copper ETF according to volatility and volume

The least squares regression model is defined by the linear function for the estimation of price variation as follows:

price(t+1) - price(t) = -0.01 + 0.014.volatility - 0.0042.volume

The estimated price change (the dotted line in the preceding chart) represents the long-term trend from which the noise is filtered out. In other words, the least squares regression operates as a simple low-pass filter, an alternative to some of the filtering techniques such as the discrete Fourier transform or the Kalman filter for dynamic systems (refer to Chapter 3, Data Preprocessing) [6:7].

Although trend detection is an interesting application of the least squares regression, the method has limited filtering capabilities for time series [6:8]:

  • It is sensitive to outliers
  • It puts a greater weight on the first and last few observations, which may need to be discarded
  • As a deterministic method, it does not support noise analysis (distribution, frequencies, and so on)

Test case 2 – features selection

The second test case is related to features selection. The objective is to discover which subset of initial features generates the most accurate regression model, that is, the model with the smallest residual sum of squares (RSS) on the training set.

Let's consider an initial set of D features {xi}. The objective is to identify the subset of features that is the most relevant to the set of observations using a least squares regression. Each subset of features is associated with a model fj(x|wj):

fj(x|wj) = wj0 + Σk wjk.xk, the sum being taken over the features xk of the j-th subset

The OLS can be used to select the model parameters w if the original set of features is small. Performing the regression on each subset of a large original features set is not practical, as the number of non-empty subsets of D features is 2^D - 1.

Note

The features selection can be expressed mathematically as follows:

j* = argminj RSSj, with RSSj = Σi (yi - fj(xi|wj))², 0 ≤ i < n

Let's consider the following four financial time series over the period from January 1, 2009 to December 31, 2013:

  • The exchange rate of Chinese Yuan to US Dollar
  • The S&P 500 index
  • The spot price of gold
  • The 10-year treasury bond price

The problem is to estimate which combination of the three variables S&P 500 index, gold price, and 10-year treasury bond price is the most correlated to the exchange rate of the Yuan. For practical reasons, we use the Exchange Traded Fund CNY as the proxy for the Yuan/US dollar exchange rate (similarly, SPY, GLD, and TLT for the S&P 500 index, the spot price of gold, and the 10-year treasury bond price respectively).

Note

Automation of features extraction

The code in this section implements an ad hoc extraction of features with an arbitrary fixed set of models. The process can be easily automated with an optimizer (gradient descent, genetic algorithm, and so on) using 1/RSS as the objective function to be maximized.
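
For a small number of features, the simplest automation is an exhaustive evaluation of all the non-empty subsets, selecting the one with the smallest RSS. The following is a minimal sketch reusing the MultiLinearRegression class (selectFeatures and its arguments are illustrative, not part of the book's source code):

// Sketch: exhaustive feature selection by minimizing the RSS over all non-empty subsets
// 'names' are the feature labels; 'series' holds one array of observations per feature
def selectFeatures(names: Array[String], series: Array[Array[Double]],
    y: DblVector): (List[String], Double) =
  (1 to names.size).flatMap( names.indices.combinations(_) )
    .map(subset => {
      val xt = XTSeries[DblVector](subset.map(series(_)).toArray.transpose)
      val rss = MultiLinearRegression[Double](xt, y).rss.get
      (subset.map(names(_)).toList, rss)
    })
    .minBy(_._2)  // the subset with the smallest residual sum of squares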

The number of models to evaluate is relatively small, so an ad hoc approach to compute the RSS for each combination is acceptable. Have a look at the following graph:


Graph of the Chinese Yuan exchange rate, gold, 10-year treasury bond price, and S&P 500 index

The getRss method implements the computation of the RSS value given a set of observations xt and labeled values y:

def getRss(xt: XTSeries[DblVector], y: DblVector): String = {
  val regression = MultiLinearRegression[Double](xt, y) //1
  val buf = new StringBuilder
  regression.weights.get
            .zipWithIndex  //2
            .foreach(w => {
               if (w._2 == 0) buf.append(w._1)
               else buf.append(s" + ${w._1}.x${w._2}") //3
            })
  buf.append(s" RSS: ${regression.rss.get}").toString
}

The getRss method merely trains the model by instantiating the multilinear regression class (line 1), indexes the array of weights (line 2), and creates a text representation of the linear regression equation (line 3).

Once the regression model is trained during the instantiation of the MultiLinearRegression class, the coefficients of the regression weights and the RSS value are printed. The getRss method is invoked for any combination of the ETFs GLD, SPY, and TLT against the label CNY:

val symbols = Array[String]("CNY", "GLD", "SPY", "TLT")
val smoothingPeriod = 16
val movAvg = SimpleMovingAverage[Double](smoothingPeriod)  //4

val input = symbols.map(s => DataSource(path+s+".csv", true, true, 1))
                   .map( _ |> YahooFinancials.adjClose ) //5
                   .map(x => movAvg |> XTSeries[Double](x))
val features = input.drop(1)
val featuresList = List[(String, DblMatrix)](
  ("CNY=f(SPY,GLD,TLT)", features.map( _.toArray).transpose),//6
  ("CNY=f(GLD,TLT)", features.drop(1).map( _.toArray).transpose),
   …
}
featuresList.foreach(x => Display.show(x._1 +   
    getRss(XTSeries[DblVector](x._2), input(0)), logger)) //7

The dataset is large (1,260 trading sessions) and noisy enough to warrant filtering using a simple moving average with a period of 16 trading sessions, movAvg (line 4). The time series are extracted from CSV files using the DataSource class, then smoothed using a sequence of Array.map invocations (line 5). The first map extracts the content of the files associated to the stock ticker symbol, assuming that the names of the files are formatted as path/symbol.csv.

For the sake of simplicity, the option type returned by the pipe operator is not validated.

The first model, using the three variables SPY, GLD, and TLT, is created by transposing the matrix of the three smoothed time series of xt.size observations each (line 6). The RSS value is computed by invoking the getRss method (line 7). The second model, using the two variables SPY and TLT, is created by filtering out the GLD time series. The process is repeated for all the other models.


The output results clearly show that the three-variable regression CNY=f(SPY,GLD,TLT) is the most accurate or fittest model for the CNY time series, followed by CNY=f(SPY,TLT). Therefore, the feature selection process generates the features set {SPY, GLD, TLT}.

Let's plot the model against the raw data:


Ordinary least squares regression on the Chinese Yuan ETF (CNY)

The regression model smoothed the original CNY time series. It weeded out all but the most significant price variation.

However, the RSS does not always provide an accurate visualization of the fitness of the regression model. The fitness of the regression model is commonly assessed using the r2 statistics. The r2 value is a number that indicates how well the data fits a statistical model.

Note

RSS and r2 statistics are related by the following formulae:

r2 = 1 - RSS/TSS, with RSS = Σi (yi - f(xi))² and TSS = Σi (yi - ȳ)², where ȳ is the mean of the observed values y

The implementation of the computation of the r2 statistics is fairly simple. For each model fj, the rssSum method computes the tuple {rss, sum of the squared errors of the predicted values}:

def rssSum(xt: XTSeries[DblVector], y: DblVector): XY = {
  val regression = MultiLinearRegression[Double](xt, y)
  (regression.rss.get,
   xt.toArray.zip(y).foldLeft(0.0)((s, x) => {
     val d = x._2 - (regression |> x._1)
     s + d*d
   }))
}

Finally, the process is repeated for each model: the sum of the squared prediction errors of each model is accumulated (line 8), averaged (line 9), and then used in the r2 formula (line 10):

val xsRss = new ListBuffer[Double]()
val tss = featuresList.foldLeft(0.0)((s, x) => { //8
  val _tss = rssSum(XTSeries[DblVector](x._2), input(0))
  xsRss.append(_tss._1)
  s + _tss._2 //9
})/xsRss.size
xsRss.map( 1.0 - _/tss) //10
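
As a sanity check, the r2 value of a single model can also be computed directly from its definition; rSquared is an illustrative helper that is not part of the book's source code:

// Sketch: r2 = 1 - RSS/TSS where TSS is the total sum of squares around the mean of y
def rSquared(xt: XTSeries[DblVector], y: DblVector): Double = {
  val rss = MultiLinearRegression[Double](xt, y).rss.get
  val mean = y.sum/y.size
  val tss = y.foldLeft(0.0)((s, v) => s + (v - mean)*(v - mean))
  1.0 - rss/tss
}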

The graph plotting the r2 value for each model confirms that the three features model is the most accurate:

r2 value for each of the regression models

Note

General linear regression

The concept of linear regression is not restricted to polynomial fitting models such as y = w0 + w1.x + w2.x² + … + wn.xⁿ. Regression models can also be defined as a linear combination of basis functions ϕj: y = w0 + w1.ϕ1(x) + w2.ϕ2(x) + … + wn.ϕn(x) [6:9].
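
For instance, a cubic polynomial fit can be implemented by expanding each observation with the basis functions ϕ1(x) = x, ϕ2(x) = x², and ϕ3(x) = x³, and reusing the MultiLinearRegression class (a sketch; the phi function and the input x and target y values are illustrative):

// Sketch: generalized linear regression through basis function expansion
val phi: Double => Array[Double] = v => Array(v, v*v, v*v*v)  // monomial basis (x, x^2, x^3)

val expanded = XTSeries[DblVector](x.map(phi(_)))  // one row (x, x^2, x^3) per observation
val polyRegression = MultiLinearRegression[Double](expanded, y)
// polyRegression.weights holds the weights (w0, w1, w2, w3) of y = w0 + w1.x + w2.x^2 + w3.x^3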
