Regularization

The ordinary least squares method for finding the regression parameters is a specific case of maximum likelihood estimation. Therefore, regression models are subject to the same overfitting challenge as any other discriminative model. You already know that regularization is used to reduce model complexity and avoid overfitting, as stated in the Overfitting section in Chapter 2, Hello World!

Ln roughness penalty

Regularization consists of adding a penalty function J(w) to the loss function (or RSS in the case of a regression model) in order to prevent the model parameters (also known as weights) from reaching large values. A model that fits a training set very well tends to have many feature variables with relatively large weights. This penalization process is known as shrinkage. Practically, shrinkage involves adding a function of the model parameters to the loss function (M5):

$\hat{w} = \arg\min_{w} \left\{ \sum_{i=1}^{n} \left(y_i - f(x_i \mid w)\right)^2 + \lambda\, J(w) \right\}, \quad \lambda > 0$

The penalty function is completely independent of the training set {x, y}. The penalty term is usually expressed as a power of the norm of the model parameters (or weights) wd. For a model of dimension D, the generic Lp-norm is defined as follows (M6):

$\left\| w \right\|_p = \left( \sum_{d=1}^{D} \left| w_d \right|^{p} \right)^{1/p}$
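
As an illustration, here is a minimal Scala sketch (not part of the book's source code; the lpNorm method name is hypothetical) of the generic Lp norm of a weights vector. It skips the intercept w0, consistent with the notation note that follows.

// Generic Lp norm of the weights w1 ... wD; the intercept weights(0) = w0 is
// excluded from the penalty. p = 1 and p = 2 yield the L1 and L2 norms.
def lpNorm(weights: Array[Double], p: Double): Double =
  math.pow(weights.drop(1).map(w => math.pow(math.abs(w), p)).sum, 1.0 / p)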

Note

Notation

Regularization applies to the parameters or weights associated with the observations. To be consistent with our notation, w0 being the intercept, regularization applies only to the parameters w1…wD.

The two most commonly used penalty functions for regularization are L1 and L2.

Note

Regularization in machine learning

The regularization technique is not specific to the linear or logistic regression. Any algorithm that minimizes the residual sum of squares, such as a support vector machine or feed-forward neural network, can be regularized by adding a roughness penalty function to the RSS.

The L1 regularization applied to the linear regression is known as the lasso regularization. The ridge regression is a linear regression that uses the L2 regularization penalty.

You may wonder which regularization makes sense for a given training set. In a nutshell, the L2 and L1 regularizations differ in terms of computation efficiency, estimation, and feature selection [6:10] [6:11]:

  • Model estimation: L1 generates a sparser estimation of the regression parameters than L2. For a large nonsparse dataset, L2 has a smaller estimation error than L1.
  • Feature selection: L1 is more effective than L2 at shrinking the regression weights of the less relevant features, driving some of them to exactly zero. Therefore, L1 is a reliable feature selection tool.
  • Overfitting: Both L1 and L2 reduce the impact of overfitting. However, L1 has a significant advantage in overcoming overfitting (or excessive complexity of a model); for the same reason, L1 is more appropriate for selecting features.
  • Computation: L2 is conducive to a more efficient computation model. The sum of the loss function and the L2 penalty $\sum_d w_d^2$ is a continuous and differentiable function for which the first and second derivatives can be computed (convex minimization). The L1 term is the sum of the absolute values $|w_d|$ and is therefore not differentiable (see the sketch after this list).
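
To make the last point concrete, here is a minimal sketch (not from the book's library; the method names are hypothetical) of the derivative of each penalty term with respect to a single weight wd: the L2 term has a smooth gradient, whereas the L1 term only admits a subgradient, which is not defined at wd = 0.

// Gradient of the L2 penalty term lambda * wd^2 with respect to wd: smooth everywhere.
def l2PenaltyGrad(wd: Double, lambda: Double): Double = 2.0 * lambda * wd

// Subgradient of the L1 penalty term lambda * |wd|: not differentiable at wd = 0
// (math.signum returns 0.0 there, which is one valid choice of subgradient).
def l1PenaltySubGrad(wd: Double, lambda: Double): Double = lambda * math.signum(wd)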

Note

Terminology

The ridge regression is sometimes called the penalized least squares regression. The L2 regularization is also known as weight decay.

Let's implement the ridge regression, and then evaluate the impact of the L2-norm penalty factor.

Ridge regression

The ridge regression is a multivariate linear regression with an L2-norm penalty term (M7):

$\hat{w} = \arg\min_{w} \left\{ \sum_{i=1}^{n} \left( y_i - w_0 - \sum_{d=1}^{D} w_d\, x_{id} \right)^{2} + \lambda \sum_{d=1}^{D} w_d^{2} \right\}$

The computation of the ridge regression parameters requires solving a system of linear equations similar to the one used for the ordinary linear regression.

Note

M8: The matrix representation of the ridge regression closed form, for an input dataset X, a regularization factor λ, and a vector of expected values y, is defined as follows (I is the identity matrix):

$\hat{w} = \left( X^{T} X + \lambda I \right)^{-1} X^{T} y$

M9: The matrix equation is solved using the QR decomposition as follows:

$X^{T} X + \lambda I = Q\,R \quad \Rightarrow \quad \hat{w} = R^{-1} Q^{T} X^{T} y$
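
The following is a minimal, self-contained sketch of the M8/M9 computation using the Apache Commons Math linear algebra package. It is independent of the RidgeRAdapter implementation described in the next section, it omits the handling of the intercept, and the ridgeWeights method name is hypothetical.

import org.apache.commons.math3.linear._

// Solves (X'X + lambda.I) w = X'y with a QR decomposition (formulas M8 and M9).
def ridgeWeights(x: Array[Array[Double]], y: Array[Double], lambda: Double): Array[Double] = {
  val xMat = new Array2DRowRealMatrix(x, false)                    // n x D design matrix X
  val xtx = xMat.transpose.multiply(xMat)                          // X'X (D x D)
  val penalized = xtx.add(
      MatrixUtils.createRealIdentityMatrix(xtx.getRowDimension)
                 .scalarMultiply(lambda))                          // X'X + lambda.I
  val xty = xMat.transpose.operate(new ArrayRealVector(y, false))  // X'y
  new QRDecomposition(penalized).getSolver.solve(xty).toArray      // w
}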

Design

The implementation of the ridge regression adds the L2 regularization term to the multiple linear regression computation of the Apache Commons Math Library. The methods of RidgeRegression have the same signature as their ordinary least squares counterparts except for the lambda L2 penalty term (line 1):

class RidgeRegression[T <: AnyVal](  //1
     xt: XVSeries[T], 
     expected: DblVector, 
     lambda: Double)(implicit f: T => Double)
   extends ITransform[Array[T]](xt) with Regression 
       with Monitor[Double] { //2

  type V = Double //3
  override def train: Option[RegressionModel]  //4
  override def |> : PartialFunction[Array[T], Try[V]]
}

The RidgeRegression class is implemented as an ITransform data transformation whose model is implicitly derived from the input data (training set), as described in the Monadic data transformation section in Chapter 2, Hello World! (line 2). The V type of the output of the |> predictive function is a Double (line 3). The model is created through training during the instantiation of the class (line 4).

The relationship between the different components of the ridge regression is described in the following UML class diagram:


The UML class diagram for the ridge regression

The UML diagram omits the helper traits or classes such as Monitor or the Apache Commons Math components.

Implementation

Let's take a look at the training method, train:

override def train: Option[RegressionModel] = Try {
  val mlr = new RidgeRAdapter(lambda, xt.head.size) //5
  mlr.createModel(data, expected) //6
  RegressionModel(mlr.getWeights, mlr.getRss)  //7
}.toOption

It is rather simple: it initializes and executes the regression algorithm implemented in the RidgeRAdapter class (line 5), which acts as an adapter to the AbstractMultipleLinearRegression class of the Apache Commons Math library, in the org.apache.commons.math3.stat.regression package (line 6). The method returns a fully initialized regression model, similar to the one produced by the ordinary least squares regression (line 7).

Let's take a look at the RidgeRAdapter adapter class:

class RidgeRAdapter(
    lambda: Double, 
    dim: Int) extends AbstractMultipleLinearRegression {
  var qr: QRDecomposition = _  //8
  
  def createModel(x: DblMatrix, y: DblVector): Unit = { //9
    this.newXSampleData(x) //10
    super.newYSampleData(y.toArray)
  }
  def getWeights: DblArray = calculateBeta.toArray //11
  def getRss: Double = rss
}

The constructor for the RidgeRAdapter class takes two parameters: the lambda L2 penalty factor and the number of features, dim, in an observation. The QR decomposition in the AbstractMultipleLinearRegression base class does not take the penalty term into account (line 8). Therefore, the creation of the model has to be redefined in the createModel method (line 9), which requires overriding the newXSampleData method (line 10):

override protected def newXSampleData(x: DblMatrix): Unit =  {
  super.newXSampleData(x)    //12
  val r: RealMatrix = getX
  Range(0, dim).foreach(i => 
        r.setEntry(i, i, r.getEntry(i,i) + lambda) ) //13
  qr = new QRDecomposition(r) //14
}

The newXSampleData method overrides the default observations-features matrix r (line 12) by adding the lambda coefficient to its diagonal elements (line 13), and then updates the QR decomposition components (line 14).

The weights of the ridge regression model are computed by implementing the M9 formula (line 11) in the overridden calculateBeta method (line 15):

override protected def calculateBeta: RealVector =
   qr.getSolver().solve(getY()) //15

The predictive algorithm for the ridge regression is implemented by the |> data transformation. The method predicts the output value, given the model and an input value x (line 16):

def |> : PartialFunction[Array[T], Try[V]] = {
  case x: Array[T] if(isModel && 
      x.length == model.get.size-1) => 
        Try( dot(x, model.get) ) //16
}
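
The dot helper of the RegressionModel singleton is used here but not listed in this section. A minimal sketch, under the assumption that the model's weights vector stores the intercept w0 in its first position followed by the D feature weights, could look like this:

// Weighted sum w0 + w1*x1 + ... + wD*xD, assuming weights = [w0, w1, ..., wD]
// and x holds the D features of a single observation.
def dot(x: Array[Double], weights: Array[Double]): Double =
  weights.head + x.zip(weights.tail).map { case (xi, wi) => xi * wi }.sum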

Test case

The objective of the test case is to identify the impact of the L2 penalization on the RSS value and then compare the predicted values with the original values.

Let's consider the first test case, related to the regression of the daily price variation of the Copper ETF (symbol: CU) on the stock's daily volatility and volume. The implementation of the extraction of observations is identical to that of the least squares regression, as described in the previous section:

val LAMBDA: Double = 0.5
val src = DataSource(path, true, true, 1)  //17

for {
  price <- src.get(adjClose)   //18
  volatility <- src.get(volatility) //19
  volume <- src.get(volume)  //20
  (features, expected) <- differentialData(volatility, 
              volume, price, diffDouble) //21
  regression <- RidgeRegression[Double](features, 
                    expected, LAMBDA)  //22
} yield {
  if( regression.isModel ) {
    val trend = features
               .map( dot(_, regression.weights.get) )  //23

    val y1 = predict(0.2, expected, volatility, volume) //24
    val y2 = predict(5.0, expected, volatility, volume)
    val output = (2 until 10 by 2).map( n => 
          predict(n*0.1, expected, volatility, volume) )
  }
}

Let's take a look at the steps required for the execution of the test. The steps consist of collecting data, extracting the features and expected values, and training the ridge regression model:

  1. Create a data source extractor for the trading session closing price, the session volatility, and the session volume of the CU ETF, using the DataSource transformation (line 17).
  2. Extract the closing price of the ETF (line 18), its volatility within a trading session (line 19), and the volume trading during the same session (line 20).
  3. Generate the labeled data as a pair of features (the relative volatility and relative volume for the ETF) and the expected outcome {0, 1} for training the model, where 1 represents the increase in the price and 0 represents the decrease in the price (line 21). The differentialData generic method of the XTSeries singleton is described in the Time series in Scala section in Chapter 3, Data Preprocessing.
  4. Instantiate the ridge regression using the features set and the expected change in the daily stock price (line 22).
  5. Compute the trend values using the dot function of the RegressionModel singleton (line 23).
  6. Execute the prediction using the ridge regression, which is implemented by the predict method (line 24).

The code is as follows:

def predict(
    lambda: Double, 
    deltaPrice: DblVector, 
    volatility: DblVector, 
    volume: DblVector): DblVector = {

  val observations = zipToSeries(volatility, volume)//25
  val regression = new RidgeRegression[Double](observations, 
          deltaPrice, lambda)
  val fnRegr = regression |> //26
  observations.map( fnRegr(_).get)  //27
}

The observations are extracted from the volatility and volume time series (line 25). The predictive method for the fnRegr ridge regression (line 26) is applied to each observation (line 27). The RSS value, rss, is plotted for different values of λ, as shown in the following chart:


The graph of RSS versus lambda for the Copper ETF

The residual sum of squares decreases as λ increases. The curve seems to reach a minimum around λ = 1. The case λ = 0 corresponds to the ordinary least squares regression.

Next, let's plot the RSS value for λ varying between 1 and 100:


The graph of RSS versus a large value Lambda for the Copper ETF

This time around, the value of RSS increases with λ before reaching a maximum for λ > 60. This behavior is consistent with other findings [6:12]. As λ increases, overfitting becomes more heavily penalized, and therefore the RSS value increases.

Let's plot the predicted price variation of the Copper ETF using the ridge regression with different values of lambda (λ):


The graph of ridge regression on the Copper ETF price variation with a variable, lambda

The original price variation of the Copper ETF, Δ = price(t + 1) - price(t), is plotted as the λ = 0 curve. Let's analyze the behavior of the predictive model for different values of λ:

  • The predicted values for λ = 0.8 are very similar to the original data.
  • The predicted values for λ = 2 follow the pattern of the original data with a reduction of the large variations (peaks and troughs).
  • The predicted values for λ = 5 correspond to a smoothed dataset. The pattern of the original data is preserved, but the magnitude of the price variation is significantly reduced.

The logistic regression, which was briefly introduced in the Let's kick the tires section in Chapter 1, Getting Started, is the next logical regression model to be discussed. The logistic regression relies on optimization methods. Let's go through a short refresher course in optimization before diving into the logistic regression.
