Regularization

The ordinary least squares method for finding the regression parameters is a specific case of maximum likelihood estimation. Therefore, regression models are subject to the same overfitting challenge as any other discriminative model. You already know that regularization is used to reduce model complexity and avoid overfitting, as stated in the Overfitting section in Chapter 2, Hello World!

Ln roughness penalty

Regularization consists of adding a penalty function J(w) to the loss function (or RSS in the case of a regression model) in order to prevent the model parameters (also known as weights) from reaching large values. A model that fits a training set very well tends to have many feature variables with relatively large weights. This penalization process is known as shrinkage. Practically, shrinkage involves adding a function of the model parameters to the loss function (M5):

$\hat{w} = \arg\min_{w} \left\{ \sum_{i=1}^{n} \left(y_i - f(x_i \mid w)\right)^2 + \lambda\, J(w) \right\}, \quad \lambda > 0$

The penalty function is completely independent of the training set {x, y}. The penalty term is usually expressed as a power of the norm of the model parameters (or weights) wd. For a model of dimension D, the generic Lp-norm is defined as follows (M6):

$\left\| w \right\|_p = \left( \sum_{d=1}^{D} \left| w_d \right|^{p} \right)^{1/p}$
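
As an illustration, here is a minimal Scala sketch (not part of the book's source code; the lpNorm method name is hypothetical) of the generic Lp norm of a weights vector. It skips the intercept w0, consistent with the notation note that follows.

// Generic Lp norm of the weights w1 ... wD; the intercept weights(0) = w0 is
// excluded from the penalty. p = 1 and p = 2 yield the L1 and L2 norms.
def lpNorm(weights: Array[Double], p: Double): Double =
  math.pow(weights.drop(1).map(w => math.pow(math.abs(w), p)).sum, 1.0 / p)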

Note

Notation

Regularization applies to the parameters or weights associated with the observations. To be consistent with our notation, w0 being the intercept, regularization applies only to the parameters w1…wD.

The two most commonly used penalty functions for regularization are L1 and L2.

Note

Regularization in machine learning

The regularization technique is not specific to the linear or logistic regression. Any algorithm that minimizes the residual sum of squares, such as a support vector machine or feed-forward neural network, can be regularized by adding a roughness penalty function to the RSS.

The L1 regularization applied to the linear regression is known as the lasso regularization. The ridge regression is a linear regression that uses the L2 regularization penalty.

You may wonder which regularization makes sense for a given training set. In a nutshell, the L2 and L1 regularizations differ in terms of computation efficiency, estimation, and feature selection [6:10] [6:11]:

  • Model estimation: L1 generates a sparser estimation of the regression parameters than L2. For a large nonsparse dataset, L2 has a smaller estimation error than L1.
  • Feature selection: L1 is more effective than L2 at shrinking the regression weights of the less relevant features, driving some of them to exactly zero. Therefore, L1 is a reliable feature selection tool.
  • Overfitting: Both L1 and L2 reduce the impact of overfitting. However, L1 has a significant advantage in overcoming overfitting (or excessive complexity of a model); for the same reason, L1 is more appropriate for selecting features.
  • Computation: L2 is conducive to a more efficient computation model. The sum of the loss function and the L2 penalty $\sum_d w_d^2$ is a continuous and differentiable function for which the first and second derivatives can be computed (convex minimization). The L1 term is the sum of the absolute values $|w_d|$ and is therefore not differentiable (see the sketch after this list).
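
To make the last point concrete, here is a minimal sketch (not from the book's library; the method names are hypothetical) of the derivative of each penalty term with respect to a single weight wd: the L2 term has a smooth gradient, whereas the L1 term only admits a subgradient, which is not defined at wd = 0.

// Gradient of the L2 penalty term lambda * wd^2 with respect to wd: smooth everywhere.
def l2PenaltyGrad(wd: Double, lambda: Double): Double = 2.0 * lambda * wd

// Subgradient of the L1 penalty term lambda * |wd|: not differentiable at wd = 0
// (math.signum returns 0.0 there, which is one valid choice of subgradient).
def l1PenaltySubGrad(wd: Double, lambda: Double): Double = lambda * math.signum(wd)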

Note

Terminology

The ridge regression is sometimes called the penalized least squares regression. The L2 regularization is also known as weight decay.

Let's implement the ridge regression, and then evaluate the impact of the L2-norm penalty factor.

Ridge regression

The ridge regression is a multivariate linear regression with an L2-norm penalty term (M7):

$\hat{w} = \arg\min_{w} \left\{ \sum_{i=1}^{n} \left( y_i - w_0 - \sum_{d=1}^{D} w_d\, x_{id} \right)^{2} + \lambda \sum_{d=1}^{D} w_d^{2} \right\}$

The computation of the ridge regression parameters requires solving a system of linear equations similar to the one used for the ordinary linear regression.

Note

M8: The matrix representation of the ridge regression closed form, for an input dataset X, a regularization factor λ, and a vector of expected values y, is defined as follows (I is the identity matrix):

$\hat{w} = \left( X^{T} X + \lambda I \right)^{-1} X^{T} y$

M9: The matrix equation is solved using the QR decomposition as follows:

$X^{T} X + \lambda I = Q\,R \quad \Rightarrow \quad \hat{w} = R^{-1} Q^{T} X^{T} y$
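
The following is a minimal, self-contained sketch of the M8/M9 computation using the Apache Commons Math linear algebra package. It is independent of the RidgeRAdapter implementation described in the next section, it omits the handling of the intercept, and the ridgeWeights method name is hypothetical.

import org.apache.commons.math3.linear._

// Solves (X'X + lambda.I) w = X'y with a QR decomposition (formulas M8 and M9).
def ridgeWeights(x: Array[Array[Double]], y: Array[Double], lambda: Double): Array[Double] = {
  val xMat = new Array2DRowRealMatrix(x, false)                    // n x D design matrix X
  val xtx = xMat.transpose.multiply(xMat)                          // X'X (D x D)
  val penalized = xtx.add(
      MatrixUtils.createRealIdentityMatrix(xtx.getRowDimension)
                 .scalarMultiply(lambda))                          // X'X + lambda.I
  val xty = xMat.transpose.operate(new ArrayRealVector(y, false))  // X'y
  new QRDecomposition(penalized).getSolver.solve(xty).toArray      // w
}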

Design

The implementation of the ridge regression adds the L2 regularization term to the multiple linear regression computation of the Apache Commons Math Library. The methods of RidgeRegression have the same signature as their ordinary least squares counterparts except for the lambda L2 penalty term (line 1):

class RidgeRegression[T <: AnyVal](  //1
     xt: XVSeries[T], 
     expected: DblVector, 
     lambda: Double)(implicit f: T => Double)
   extends ITransform[Array[T]](xt) with Regression 
       with Monitor[Double] { //2

  type V = Double //3
  override def train: Option[RegressionModel]  //4
  override def |> : PartialFunction[Array[T], Try[V]]
}

The RidgeRegression class is implemented as an ITransform data transformation whose model is implicitly derived from the input data (training set), as described in the Monadic data transformation section in Chapter 2, Hello World! (line 2). The V type of the output of the |> predictive function is a Double (line 3). The model is created through training during the instantiation of the class (line 4).

The relationship between the different components of the ridge regression is described in the following UML class diagram:


The UML class diagram for the ridge regression

The UML diagram omits the helper traits or classes such as Monitor or the Apache Commons Math components.

Implementation

Let's take a look at the training method, train:

override def train: Option[RegressionModel] = Try {
  val mlr = new RidgeRAdapter(lambda, xt.head.size) //5
  mlr.createModel(data, expected) //6
  RegressionModel(mlr.getWeights, mlr.getRss)  //7
}.toOption

It is rather simple: it initializes and executes the regression algorithm implemented in the RidgeRAdapter class (line 5), which acts as an adapter to the AbstractMultipleLinearRegression class of the Apache Commons Math library, in the org.apache.commons.math3.stat.regression package (line 6). The method returns a fully initialized regression model, similar to the one produced by the ordinary least squares regression (line 7).

Let's take a look at the RidgeRAdapter adapter class:

class RidgeRAdapter(
    lambda: Double, 
    dim: Int) extends AbstractMultipleLinearRegression {
  var qr: QRDecomposition = _  //8
  
  def createModel(x: DblMatrix, y: DblVector): Unit = { //9
    this.newXSampleData(x) //10
    super.newYSampleData(y.toArray)
  }
  def getWeights: DblArray = calculateBeta.toArray //11
  def getRss: Double = rss
}

The constructor for the RidgeRAdapter class takes two parameters: the lambda L2 penalty factor and the number of features, dim, in an observation. The QR decomposition in the AbstractMultipleLinearRegression base class does not take the penalty term into account (line 8). Therefore, the creation of the model has to be redefined in the createModel method (line 9), which requires overriding the newXSampleData method (line 10):

override protected def newXSampleData(x: DblMatrix): Unit =  {
  super.newXSampleData(x)    //12
  val r: RealMatrix = getX
  Range(0, dim).foreach(i => 
        r.setEntry(i, i, r.getEntry(i,i) + lambda) ) //13
  qr = new QRDecomposition(r) //14
}

The newXSampleData method overrides the default observations-features matrix r (line 12) by adding the lambda coefficient to its diagonal elements (line 13), and then updates the QR decomposition components (line 14).

The weights of the ridge regression model are computed by implementing the M9 formula (line 11) in the overridden calculateBeta method (line 15):

override protected def calculateBeta: RealVector =
   qr.getSolver().solve(getY()) //15

The predictive algorithm for the ridge regression is implemented by the |> data transformation. The method predicts the output value, given the model and an input value x (line 16):

def |> : PartialFunction[Array[T], Try[V]] = {
  case x: Array[T] if(isModel && 
      x.length == model.get.size-1) => 
        Try( dot(x, model.get) ) //16
}
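
The dot helper of the RegressionModel singleton is used here but not listed in this section. A minimal sketch, under the assumption that the model's weights vector stores the intercept w0 in its first position followed by the D feature weights, could look like this:

// Weighted sum w0 + w1*x1 + ... + wD*xD, assuming weights = [w0, w1, ..., wD]
// and x holds the D features of a single observation.
def dot(x: Array[Double], weights: Array[Double]): Double =
  weights.head + x.zip(weights.tail).map { case (xi, wi) => xi * wi }.sum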

Test case

The objective of the test case is to identify the impact of the L2 penalization on the RSS value and then compare the predicted values with the original values.

Let's consider the first test case, related to the regression of the daily price variation of the Copper ETF (symbol: CU) on the stock's daily volatility and volume. The implementation of the extraction of observations is identical to that of the least squares regression, as described in the previous section:

val LAMBDA: Double = 0.5
val src = DataSource(path, true, true, 1)  //17

for {
  price <- src.get(adjClose)   //18
  volatility <- src.get(volatility) //19
  volume <- src.get(volume)  //20
  (features, expected) <- differentialData(volatility, 
              volume, price, diffDouble) //21
  regression <- RidgeRegression[Double](features, 
                    expected, LAMBDA)  //22
} yield {
  if( regression.isModel ) {
    val trend = features
               .map( dot(_, regression.weights.get) )  //23

    val y1 = predict(0.2, expected, volatility, volume) //24
    val y2 = predict(5.0, expected, volatility, volume)
    val output = (2 until 10 by 2).map( n => 
          predict(n*0.1, expected, volatility, volume) )
  }
}

Let's take a look at the steps required for the execution of the test. The steps consist of collecting data, extracting the features and expected values, and training the ridge regression model:

  1. Create a data source extractor for the trading session closing price, the session volatility, and the session volume of the CU ETF, using the DataSource transformation (line 17).
  2. Extract the closing price of the ETF (line 18), its volatility within a trading session (line 19), and the volume trading during the same session (line 20).
  3. Generate the labeled data as a pair of features (the relative volatility and relative volume for the ETF) and the expected outcome {0, 1} for training the model, where 1 represents the increase in the price and 0 represents the decrease in the price (line 21). The differentialData generic method of the XTSeries singleton is described in the Time series in Scala section in Chapter 3, Data Preprocessing.
  4. Instantiate the ridge regression using the features set and the expected change in the daily stock price (line 22).
  5. Compute the trend values using the dot function of the RegressionModel singleton (line 23).
  6. Execute the prediction using the ridge regression, which is implemented by the predict method (line 24).

The code is as follows:

def predict(
    lambda: Double, 
    deltaPrice: DblVector, 
    volatility: DblVector, 
    volume: DblVector): DblVector = {

  val observations = zipToSeries(volatility, volume)//25
  val regression = new RidgeRegression[Double](observations, 
          deltaPrice, lambda)
  val fnRegr = regression |> //26
  observations.map( fnRegr(_).get)  //27
}

The observations are extracted from the volatility and volume time series (line 25). The predictive method for the fnRegr ridge regression (line 26) is applied to each observation (line 27). The RSS value, rss, is plotted for different values of λ, as shown in the following chart:


The graph of RSS versus lambda for the Copper ETF

The residual sum of squares decreases as λ increases. The curve seems to reach a minimum around λ = 1. The case λ = 0 corresponds to the ordinary least squares regression.

Next, let's plot the RSS value for λ varying between 1 and 100:


The graph of RSS versus a large value Lambda for the Copper ETF

This time around, the value of RSS increases with λ before reaching a maximum for λ > 60. This behavior is consistent with other findings [6:12]. As λ increases, overfitting becomes more heavily penalized, and therefore the RSS value increases.

Let's plot the predicted price variation of the Copper ETF using the ridge regression with different values of lambda (λ):


The graph of ridge regression on the Copper ETF price variation with a variable, lambda

The original price variation of the Copper ETF, Δ = price(t + 1) - price(t), is plotted as the λ = 0 curve. Let's analyze the behavior of the predictive model for different values of λ:

  • The predicted values for λ = 0.8 are very similar to the original data.
  • The predicted values for λ = 2 follow the pattern of the original data with a reduction of the large variations (peaks and troughs).
  • The predicted values for λ = 5 correspond to a smoothed dataset. The pattern of the original data is preserved, but the magnitude of the price variation is significantly reduced.

The logistic regression, which was briefly introduced in the Let's kick the tires section in Chapter 1, Getting Started, is the next logical regression model to be discussed. The logistic regression relies on optimization methods. Let's go through a short refresher course in optimization before diving into the logistic regression.
