The ordinary least squares method for finding the regression parameters is a specific case of maximum likelihood estimation. Therefore, regression models face the same overfitting challenge as any other discriminative model. As stated in the Overfitting section of Chapter 2, Hello World!, regularization is used to reduce model complexity and avoid overfitting.
Regularization consists of adding a penalty function J(w) to the loss function (or RSS in the case of a regression model) in order to prevent the model parameters (or weights) from reaching high values. A model that fits a training set very well tends to have many feature variables with relatively large weights. The process of constraining these weights is known as shrinkage. Practically, shrinkage involves adding a function with model parameters as an argument to the loss function:
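A common way to write this penalized objective is sketched below. The symbols are assumptions for illustration: L is the loss function, f(x|w) is the model output, and λ is a non-negative factor that weights the penalty J(w).

```latex
\hat{w} = \arg\min_{w} \left\{ L\big(y, f(x \mid w)\big) + \lambda\, J(w) \right\}
```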
The penalty function is completely independent from the training set {x,y}. The penalty term is usually expressed as a power of the norm of the model parameters (or weights), wd. For a model of D dimensions, the generic Lp-norm is defined as follows:
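The standard definition of the Lp-norm of a weight vector with components wd, d = 1, ..., D, is:

```latex
\lVert w \rVert_p = \left( \sum_{d=1}^{D} \lvert w_d \rvert^{p} \right)^{1/p}
```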
The two most commonly used penalty functions for regularization are L1 and L2.
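As a minimal sketch, the two penalties can be computed from the generic Lp-norm in plain Scala. The function names lpNorm, l1Penalty, and l2Penalty are hypothetical and not part of the book's library, and the penalties assume the usual forms λ·Σ|wd| for L1 and λ·Σwd² for L2.

```scala
// Generic Lp-norm of a weight vector: (sum |w_d|^p)^(1/p)
def lpNorm(w: Array[Double], p: Double): Double =
  math.pow(w.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)

// L1 penalty (hypothetical helper): lambda * sum of absolute weights
def l1Penalty(w: Array[Double], lambda: Double): Double =
  lambda * lpNorm(w, 1.0)

// L2 penalty (hypothetical helper): lambda * sum of squared weights
def l2Penalty(w: Array[Double], lambda: Double): Double = {
  val norm = lpNorm(w, 2.0)
  lambda * norm * norm
}

// Illustrative weights only
val w = Array(3.0, -4.0)
val l1 = l1Penalty(w, 0.5)
val l2 = l2Penalty(w, 0.5)
```

The L2 penalty grows quadratically with each weight, so it punishes a few large weights more than many small ones, while the L1 penalty grows linearly.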
Regularization in machine learning
The regularization technique is not specific to linear or logistic regression. Any algorithm that minimizes a loss function such as the residual sum of squares, for example a support vector machine or a feed-forward neural network, can be regularized by adding a roughness penalty function to that loss.
L1 regularization applied to linear regression is known as Lasso regularization. Ridge regression is a linear regression that uses the L2 regularization penalty.
You may wonder which regularization makes sense for a given training set. In a nutshell, L2 and L1 regularization differ in terms of computational efficiency, estimation, and feature selection [6:10] [6:11].
Let's implement the ridge regression, and then evaluate the impact of the L2-norm penalty factor.
The ridge regression is a multivariate linear regression with an L2-norm penalty term:
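One way to write the ridge objective is sketched below. The indexing (n observations, D features, intercept w0 left unpenalized) is an assumption made for illustration.

```latex
\hat{w}_{Ridge} = \arg\min_{w} \left\{ \sum_{i=0}^{n-1} \Big( y_i - w_0 - \sum_{d=1}^{D} w_d\, x_{i,d} \Big)^{2} + \lambda \sum_{d=1}^{D} w_d^{2} \right\}
```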
The computation of the ridge regression parameters requires the resolution of a system of linear equations similar to the linear regression.
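To make the system of linear equations concrete, here is a minimal, self-contained sketch that solves (XᵀX + λI)w = Xᵀy with Gaussian elimination. It does not use Apache Commons Math and is not the implementation described in this section; the names solve and ridgeWeights are hypothetical, and the model has no intercept term (the features are assumed to be centered).

```scala
type Vec = Array[Double]
type Mat = Array[Array[Double]]

// Solves a * x = b by Gaussian elimination with partial pivoting
def solve(a: Mat, b: Vec): Vec = {
  val n = b.length
  // Augmented matrix [a | b], copied so the inputs are left untouched
  val m = Array.tabulate(n, n + 1)((i, j) => if (j < n) a(i)(j) else b(i))
  for (col <- 0 until n) {
    // Bring the row with the largest pivot into position
    val pivot = (col until n).maxBy(r => math.abs(m(r)(col)))
    val tmp = m(col); m(col) = m(pivot); m(pivot) = tmp
    for (row <- col + 1 until n) {
      val f = m(row)(col) / m(col)(col)
      for (j <- col to n) m(row)(j) -= f * m(col)(j)
    }
  }
  // Back substitution
  val x = new Array[Double](n)
  for (i <- n - 1 to 0 by -1) {
    val s = (i + 1 until n).map(j => m(i)(j) * x(j)).sum
    x(i) = (m(i)(n) - s) / m(i)(i)
  }
  x
}

// Ridge weights: solve (X^T X + lambda * I) w = X^T y (no intercept)
def ridgeWeights(x: Mat, y: Vec, lambda: Double): Vec = {
  val d = x.head.length
  val xtx = Array.tabulate(d, d) { (i, j) =>
    x.map(row => row(i) * row(j)).sum + (if (i == j) lambda else 0.0)
  }
  val xty = Array.tabulate(d)(i => x.indices.map(k => x(k)(i) * y(k)).sum)
  solve(xtx, xty)
}

// Illustrative data only
val xs: Mat = Array(Array(1.0, 0.0), Array(0.0, 1.0), Array(1.0, 1.0))
val ys: Vec = Array(1.0, 2.0, 3.0)
val ols = ridgeWeights(xs, ys, 0.0)    // lambda = 0: ordinary least squares
val ridge = ridgeWeights(xs, ys, 1.0)  // lambda = 1: weights are shrunk
```

With λ = 0 the system reduces to the normal equations of ordinary least squares; a positive λ only changes the diagonal of the matrix to be solved.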
The implementation of the ridge regression adds the L2 regularization term to the multiple linear regression computation of the Apache Commons Math library.
The methods of RidgeRegression have the same signature as their ordinary least squares counterparts. However, the class has to inherit the abstract base class, AbstractMultipleLinearRegression, in the Apache Commons Math library and override the generation of the QR decomposition to include the penalty term:
class RidgeRegression[T <% Double](xt: XTSeries[Array[T]], y: DblVector, lambda: Double)
    extends AbstractMultipleLinearRegression
    with PipeOperator[Array[T], Double] {
  var qr: QRDecomposition = _
  val model: Option[RegressionModel] = …
  …
Besides the input time series xt and the labels y, the ridge regression requires the lambda factor of the L2 penalty term. The instantiation of the class trains the model. The steps to create the ridge regression models are as follows:
1. Generate the QR decomposition of the input values using newXSampleData (line 1).
2. Compute the weights using calculateBeta defined in the base class (line 2).
3. Return the regression weights, calculateBeta, and the residuals, calculateResiduals.
val model: Option[RegressionModel] = {
  this.newXSampleData(xt.toDblMatrix) //1
  newYSampleData(y)
  val _rss = calculateResiduals.toArray.map(x => x*x).sum
  val wRss = (calculateBeta.toArray, _rss) //2
  Some(RegressionModel(wRss._1, wRss._2))
}
The QR decomposition in the base class, AbstractMultipleLinearRegression, does not include the penalty term (line 3); the identity matrix with the lambda factor in the diagonal has to be added to the matrix to be decomposed (line 4):
override protected def newXSampleData(x: DblMatrix): Unit = {
  super.newXSampleData(x) //3
  val xtx: RealMatrix = getX
  val nFeatures = xt(0).size
  // Add lambda to each diagonal entry before the decomposition
  Range(0, nFeatures)
    .foreach(i => xtx.setEntry(i, i, xtx.getEntry(i, i) + lambda)) //4
  qr = new QRDecomposition(xtx)
}
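For comparison, a widely used alternative way to fold an L2 penalty into a least squares solver is to append sqrt(λ)·I as extra rows below the design matrix X (and D zeros to the labels). This is not the approach taken by the code in this section. The sketch below, which uses the hypothetical helpers augment and gram on illustrative data, only checks that the augmented matrix reproduces XᵀX + λI.

```scala
// Appends sqrt(lambda) * I below X, one extra row per feature
def augment(x: Array[Array[Double]], lambda: Double): Array[Array[Double]] = {
  val d = x.head.length
  val penaltyRows =
    Array.tabulate(d, d)((i, j) => if (i == j) math.sqrt(lambda) else 0.0)
  x ++ penaltyRows
}

// Gram matrix A^T A
def gram(a: Array[Array[Double]]): Array[Array[Double]] = {
  val d = a.head.length
  Array.tabulate(d, d)((i, j) => a.map(r => r(i) * r(j)).sum)
}

// Illustrative data only
val x = Array(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 6.0))
val lambda = 0.5

val augmentedGram = gram(augment(x, lambda)) // (X_aug)^T X_aug
val plainGram = gram(x)                      // X^T X

// Largest absolute difference between (X_aug)^T X_aug and X^T X + lambda * I
val maxDiff = (for {
  i <- plainGram.indices
  j <- plainGram.indices
} yield math.abs(
  augmentedGram(i)(j) - (plainGram(i)(j) + (if (i == j) lambda else 0.0))
)).max
```

Because the two matrices agree, an ordinary least squares solve on the augmented data minimizes the same penalized objective as the ridge regression.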
The regression weights are computed by solving the system of linear equations using substitution on the Q and R matrices. The class overrides calculateBeta from the base class:
override protected def calculateBeta: RealVector =
  qr.getSolver().solve(getY())
The objective of the test case is to identify the impact of the L2 penalization on the RSS value and then compare the predicted values with the original values.
Let's consider the first test case related to the regression on the daily price variation of the Copper ETF (symbol: CU) using the stock daily volatility and volume as features. The implementation of the extraction of observations is identical to that of the least squares regression:
val lambda = 0.5
val src = DataSource(path, true, true, 1)
val price = src |> YahooFinancials.adjClose
val volatility = src |> YahooFinancials.volatility
val volume = src |> YahooFinancials.volume //1

val deltaPrice = XTSeries[Double](price.drop(1)
    .zip(price.take(price.size -1))
    .map( z => z._1 - z._2)) //2
val data = volatility.zip(volume)
    .map(z => Array[Double](z._1, z._2)) //3
val features = XTSeries[DblVector](data.dropRight(1))
val regression = new RidgeRegression[Double](features, deltaPrice, lambda) //4
regression.rss match {
  case Some(rss) => Display.show(rss, logger) //5
  …
The observed data, that is, the ETF daily price and the features (volatility and volume) are extracted from the src source (line 1). The daily price change deltaPrice is computed using a combination of Scala take and drop methods (line 2). The features vector is created by zipping volatility and volume (line 3). The model is created by instantiating the RidgeRegression class (line 4). The RSS value, rss, is finally displayed (line 5).
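The drop, take, and zip combination used for the daily price change can be tried in isolation on a plain Scala collection. This is a minimal sketch with hypothetical prices; it uses Vector instead of the XTSeries class.

```scala
// Hypothetical closing prices, for illustration only
val price = Vector(10.0, 10.5, 10.2, 11.0)

// Pair each price with its predecessor, then subtract
val deltaPrice = price.drop(1)
  .zip(price.take(price.size - 1))
  .map { case (next, prev) => next - prev }

// Equivalent formulation with a sliding window of two consecutive prices
val deltaSliding = price.sliding(2).map(p => p(1) - p(0)).toVector
```

The resulting series has one element fewer than the price series, which is why the features are trimmed with dropRight(1) before training.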
The RSS value, rss, is plotted for different values of lambda less than 1.0, as shown in the following chart:
The residual sum of squares decreases as λ increases. The curve seems to be reaching for a minimum around λ = 1. The case of λ = 0 corresponds to the least squares regression.
Next, let's plot the RSS value for λ varying between 1 and 100:
This time around, the value of RSS increases with λ before reaching a maximum for λ > 60. This behavior is consistent with other findings [6:12]. As λ increases, large weights are penalized more heavily and therefore, the RSS value increases.
The regression weights can be retrieved as follows:
regression.weights.get
Let's plot the predicted price variation of the Copper ETF using the ridge regression with different values of lambda (λ):
The original price variation of the Copper ETF, Δ = price(t+1)-price(t), is plotted as λ = 0. The predicted values for λ = 0.8 are very similar to the original data. The predicted values for λ = 2 follow the pattern of the original data with a reduction of large variations (peaks and troughs). The predicted values for λ = 5 correspond to a smoothed dataset. The pattern of the original data is preserved but the magnitude of the price variation is significantly reduced.
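The reduction in magnitude can be illustrated on a much simpler model. For a single feature without intercept, the ridge weight has the closed form w = Σxy / (Σx² + λ). The sketch below uses hypothetical data and a hypothetical helper, ridge1d, and is unrelated to the Copper ETF dataset.

```scala
// Single-feature ridge without intercept: w = sum(x*y) / (sum(x*x) + lambda)
def ridge1d(x: Seq[Double], y: Seq[Double], lambda: Double): Double = {
  val sxy = x.zip(y).map { case (a, b) => a * b }.sum
  val sxx = x.map(a => a * a).sum
  sxy / (sxx + lambda)
}

// Hypothetical data following y = 2x
val x = Seq(1.0, 2.0, 3.0)
val y = Seq(2.0, 4.0, 6.0)

// Weight for each value of lambda used in the discussion above
val weights = Seq(0.0, 0.8, 2.0, 5.0).map(l => l -> ridge1d(x, y, l))
weights.foreach { case (l, w) => println(f"lambda = $l%.1f  weight = $w%.4f") }
```

The weight, and therefore every predicted value w·x, moves toward zero as λ grows, while the sign and relative shape of the predictions are preserved.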
The logistic regression, briefly introduced in the Let's kick the tires section of Chapter 1, Getting Started, is the next logical regression model to discuss. The logistic regression relies on optimization methods. Let's go through a short refresher on optimization before diving into the logistic regression.