6
Ridge Regression in Theory and Applications

The multiple linear regression model is one of the best known and most widely used models for statistical data analysis in every field of science and engineering as well as in the social sciences, economics, and finance. The subject of this chapter is the study of the ridge regression estimator (RRE) of the regression coefficients, its characteristic properties, and its relation to the least absolute shrinkage and selection operator (LASSO). Further, we consider the preliminary test estimator (PTE) and the Stein-type ridge estimator in low dimension and study their dominance properties. We conclude the chapter with the asymptotic distributional theory of ridge estimators following Knight and Fu (2000).

6.1 Multiple Linear Model Specification

Consider the multiple linear model with coefficient vector β = (β₁, …, β_p)^⊤ given by

(6.1) Y = Xβ + ε,

where Y = (y₁, …, y_n)^⊤ is a vector of n responses, X is an n × p design matrix of rank p, and ε = (ε₁, …, ε_n)^⊤ is an n-vector of independently and identically distributed (i.i.d.) random variables with distribution N_n(0, σ²I_n), with I_n the identity matrix of order n.

6.1.1 Estimation of Regression Parameters

Using the model (6.1) and the error distribution of ε, we obtain the maximum likelihood estimator/least squares estimator (MLE/LSE) of β by minimizing

(6.2) ‖Y − Xβ‖² = (Y − Xβ)^⊤(Y − Xβ)

to get β̃_n = C⁻¹X^⊤Y, where C = X^⊤X.
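As a small numerical illustration (a sketch only; the synthetic design, coefficients, and random seed below are our own assumptions, not part of the data discussed later), the LSE may be computed as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))                   # assumed synthetic design matrix
beta_true = np.array([1.0, -2.0, 0.5, 0.0])   # assumed true coefficients
y = X @ beta_true + rng.normal(size=n)        # model (6.1) with sigma = 1

C = X.T @ X                                   # C = X'X
beta_lse = np.linalg.solve(C, X.T @ y)        # LSE: solves C beta = X'Y
```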

Sometimes the model is written as

(6.3) Y = X₁β₁ + X₂β₂ + ε,

where X = (X₁, X₂) and β = (β₁^⊤, β₂^⊤)^⊤, where X₁ and X₂ are n × p₁ and n × p₂ submatrices, respectively; β₁ of dimension p₁ stands for the main effects, and β₂ of dimension p₂ stands for the interactions; and we would like to estimate β₁. Here, p = p₁ + p₂.

In this case, we may write

(6.4) β̃_n = (β̃₁^⊤, β̃₂^⊤)^⊤,

where β̃₁ and β̃₂ are the LSEs of β₁ and β₂. Hence

β̃₁ = (X₁^⊤M₂X₁)⁻¹X₁^⊤M₂Y and β̃₂ = (X₂^⊤M₁X₂)⁻¹X₂^⊤M₁Y,

where

M₁ = I_n − X₁(X₁^⊤X₁)⁻¹X₁^⊤ and M₂ = I_n − X₂(X₂^⊤X₂)⁻¹X₂^⊤,

respectively.

It is well known that E[β̃_n] = β and the covariance matrix of β̃_n is given by

(6.5) Cov(β̃_n) = σ²C⁻¹.

Further, an unbiased estimator of σ² is given by

(6.6) s²_n = (n − p)⁻¹(Y − Xβ̃_n)^⊤(Y − Xβ̃_n).

Using normal theory, it may be shown that (β̃_n, s²_n) are statistically independent, β̃_n ∼ N_p(β, σ²C⁻¹), and (n − p)s²_n/σ² follows a central chi-square distribution with n − p degrees of freedom (DF).

The L₂ risk of β̂_n, any estimator of β, is defined by

R(β̂_n; β) = E[‖β̂_n − β‖²].

Then, the L₂ risk of β̃_n is given by

(6.7) R(β̃_n; β) = σ² tr(C⁻¹).

If λ₁ ≥ λ₂ ≥ ⋯ ≥ λ_p > 0 are the eigenvalues of C, then we may write

(6.8) R(β̃_n; β) = σ² Σ_{j=1}^p λ_j⁻¹.

Similarly, one notes that

(6.9) Cov(β̃₁) = σ²(X₁^⊤M₂X₁)⁻¹.

In a similar manner, Cov(β̃₂) = σ²(X₂^⊤M₁X₂)⁻¹.

Now, we find the L₂ risks of β̃₁ and β̃₂ as given below:

(6.10) R(β̃₁; β₁) = σ² tr[(X₁^⊤M₂X₁)⁻¹].

Similarly, we obtain

(6.11) R(β̃₂; β₂) = σ² tr[(X₂^⊤M₁X₂)⁻¹].

6.1.2 Test of Hypothesis for the Coefficients Vector

Suppose we want to test the null hypothesis H₀: β = 0 vs. H_A: β ≠ 0. Then, we use the test statistic (likelihood ratio test)

(6.12) F_n = β̃_n^⊤Cβ̃_n / (p s²_n),

which follows a noncentral F-distribution with (p, n − p) DF and noncentrality parameter Δ² defined by

(6.13) Δ² = β^⊤Cβ / σ².
Similarly, if we want to test the subhypothesis H₀: β₂ = 0 vs. H_A: β₂ ≠ 0, then one may use the test statistic

(6.14) F_n = β̃₂^⊤C₂₂.₁β̃₂ / (p₂ s²_n),

where C₂₂.₁ = C₂₂ − C₂₁C₁₁⁻¹C₁₂ and

(6.15) C = X^⊤X = [C₁₁ C₁₂; C₂₁ C₂₂].

Note that F_n follows a noncentral F-distribution with (p₂, n − p) DF and noncentrality parameter Δ² given by

(6.16) Δ² = β₂^⊤C₂₂.₁β₂ / σ².

Let F_α be the upper α-level critical value for the test of H₀; then we reject H₀ whenever the statistic in (6.12) or (6.14) exceeds F_α; otherwise, we accept H₀.
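As a computational aside (a sketch under our own conventions: β₂ is taken to occupy the last p₂ coordinates, and C₂₂.₁ is recovered as the inverse of the corresponding block of C⁻¹), the subhypothesis test (6.14) may be carried out as follows:

```python
import numpy as np
from scipy.stats import f as f_dist

def f_test_subvector(X, y, p2, alpha=0.05):
    """Sketch of the test (6.14) for H0: beta_2 = 0 (last p2 coefficients)."""
    n, p = X.shape
    C = X.T @ X
    beta = np.linalg.solve(C, X.T @ y)           # LSE
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)                 # unbiased sigma^2, (6.6)
    # C_22.1 equals the inverse of the lower-right p2 x p2 block of C^{-1}
    C22_1 = np.linalg.inv(np.linalg.inv(C)[-p2:, -p2:])
    b2 = beta[-p2:]
    F = (b2 @ C22_1 @ b2) / (p2 * s2)            # statistic (6.14)
    crit = f_dist.ppf(1 - alpha, p2, n - p)      # upper alpha critical value
    return F, crit, F >= crit
```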

6.2 Ridge Regression Estimators (RREs)

From Section 6.1.1, we may see that

  1. E[β̃_n] = β;
  2. the covariance matrix of β̃_n is σ²C⁻¹ = σ²(X^⊤X)⁻¹.

Hence, the L₂-risk function of β̃_n is σ² tr(C⁻¹). Further, when C = I_p, the L₂ risk of β̃_n is pσ².

Set λ_p = min_{1≤j≤p} λ_j. Then, it is seen that the lower bounds of the L₂ risk and the total variance are both σ²λ_p⁻¹. The shape of the parameter space is such that reasonable data collection may result in a C matrix with one or more small eigenvalues, and the L₂ distance of β̃_n from β will then tend to be large. In particular, coefficients tend to become large in absolute value, may even have wrong signs, and can be unstable. Such difficulties increase as the prediction vectors deviate further from orthogonality: as the design matrix X deviates from orthogonality, λ_p becomes smaller and β̃_n will be farther away from β. The ridge regression method is a remedy designed to circumvent these problems with the LSE.

For the model (6.1), Hoerl and Kennard (1970) defined the RRE of β as

(6.17) β̂_n^RR(k) = (X^⊤X + kI_p)⁻¹X^⊤Y,  k ≥ 0.

The basic idea of this type goes back to Tikhonov (1963), where the tuning parameter k is to be determined from the data (X, Y). The expression (6.17) may be written as

(6.18) β̂_n^RR(k) = (I_p + kC⁻¹)⁻¹β̃_n.

The form (6.18) applies if C is invertible, whereas if C is ill-conditioned or near singular, then (6.17) is the appropriate expression for the estimator of β.

The derivation of (6.17) is very simple. Instead of minimizing the LSE objective function, one minimizes the objective function with penalty kβ^⊤β,

(6.19) (Y − Xβ)^⊤(Y − Xβ) + kβ^⊤β,

with respect to (w.r.t.) β, yielding the normal equations

(6.20) (X^⊤X + kI_p)β = X^⊤Y.

Here, we have given equal weight to the components of β.
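A minimal sketch of (6.17), solving the normal equations (6.20) directly (the function name and interface are ours):

```python
import numpy as np

def ridge(X, y, k):
    """RRE (6.17): solve (X'X + k I_p) beta = X'Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)
```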

6.3 Bias, MSE, and L₂ Risk of Ridge Regression Estimator

In this section, we consider the bias, mean squared error (MSE), and L₂-risk expressions of the RRE.

First, we consider the equal-weight RRE, β̂_n^RR(k). The bias, MSE matrix, and L₂ risk are obtained as

(6.21) b(β̂_n^RR(k)) = −k(C + kI_p)⁻¹β,
(6.22) M(β̂_n^RR(k)) = σ²(C + kI_p)⁻¹C(C + kI_p)⁻¹ + k²(C + kI_p)⁻¹ββ^⊤(C + kI_p)⁻¹,
(6.23) R(β̂_n^RR(k); β) = σ² tr[(C + kI_p)⁻¹C(C + kI_p)⁻¹] + k²β^⊤(C + kI_p)⁻²β,

respectively.

One may write

(6.24) C = ΓΛΓ^⊤,  Λ = diag(λ₁, …, λ_p),

where Γ is an orthogonal matrix. Let λ_j and α_j be the jth eigenvalue of C and the jth component of α = Γ^⊤β, respectively. Then

(6.25) R(β̂_n^RR(k); β) = σ² Σ_{j=1}^p λ_j/(λ_j + k)² + k² Σ_{j=1}^p α_j²/(λ_j + k)²,

for k ≥ 0.

Note that

  1. β̂_n^RR(k), for k > 0, is shorter than β̃_n, that is,
    (6.26) ‖β̂_n^RR(k)‖ < ‖β̃_n‖;
  2. (6.27) β̂_n^RR(k) = W(k)β̃_n,
    where W(k) = (I_p + kC⁻¹)⁻¹.

For the estimator β̂_n^RR(k), the residual sum of squares (RSS) is

(6.28) RSS(k) = (Y − Xβ̂_n^RR(k))^⊤(Y − Xβ̂_n^RR(k)) = RSS(0) + k²β̂_n^RR(k)^⊤C⁻¹β̂_n^RR(k).

The expression shows that the RSS in (6.28) is equal to the residual sum of squares of the LSE plus a modification depending on the squared (C⁻¹-weighted) length of β̂_n^RR(k).

Now, we explain the properties of the L₂ risk of the RRE given in (6.25), which may be written as

(6.29) R(β̂_n^RR(k); β) = γ₁(k) + γ₂(k),  γ₁(k) = σ² Σ_{j=1}^p λ_j/(λ_j + k)²,  γ₂(k) = k² Σ_{j=1}^p α_j²/(λ_j + k)².

Figure 6.1, the graph of the L₂ risk, depicts the two components as functions of k. It shows the relationship between the total variance γ₁(k), the quadratic bias γ₂(k), and the tuning parameter k.


Figure 6.1 Graph of the L₂ risk of the RRE as a function of k.

The total variance γ₁(k) decreases as k increases, while the quadratic bias γ₂(k) increases with k. As indicated by the dotted curve representing the L₂ risk of the RRE, there are values of k for which R(β̂_n^RR(k); β) is less than or equal to R(β̃_n; β).

It may be noted that γ₁(k) is a monotonically decreasing function of k, while γ₂(k) is a monotonically increasing function of k. Now, we find the derivatives of these two functions at the origin (k = 0⁺):

(6.30) γ₁′(0) = −2σ² Σ_{j=1}^p λ_j⁻²,
(6.31) γ₂′(0) = 0.

Hence, γ₁(k) has a negative derivative at the origin, which tends to −2pσ² as λ_j → 1 (j = 1, …, p) for an orthogonal X and approaches −∞ as X becomes ill-conditioned and λ_p → 0, while γ₂(k) is flat near the origin and zero at k = 0. These properties indicate that if we choose k appropriately, the estimator will incur a little bias while substantially reducing the variance, and thereby the L₂ risk will decrease, improving the prediction.

We now prove the existence of such a k, validating the commentary put forward, by the following theorems.

A direct consequence of the given result is the following existence theorem.

6.4 Determination of the Tuning Parameters

Suppose C = X^⊤X is nonsingular; then the LSE of β is β̃_n = C⁻¹X^⊤Y. Let β̂_{n,(−j)}^RR(k) be the equal-weight RRE of β computed after deleting the jth data point (x_j, y_j). If k is chosen properly, then the jth component of Xβ̂_{n,(−j)}^RR(k) predicts y_j well. The generalized cross-validation (GCV) criterion is defined as the weighted average of predicted squared errors

(6.35) GCV(k) = (1/n) Σ_{j=1}^n (y_j − x_j^⊤β̂_{n,(−j)}^RR(k))² w_j(k),

where w_j(k) = [(1 − h_jj(k))/(1 − n⁻¹ tr H(k))]² and h_jj(k) is the jth diagonal element of H(k) = X(X^⊤X + kI_p)⁻¹X^⊤. See Golub et al. (1979) for more details.

A computationally efficient version of the GCV function of the RRE is

(6.36) GCV(k) = (1/n)‖(I_n − H(k))Y‖² / [(1/n) tr(I_n − H(k))]².

The GCV estimator of k is then given by

(6.37) k̂_GCV = argmin_{k ≥ 0} GCV(k).

The GCV theorem (Golub et al. 1979) guarantees the asymptotic efficiency of the GCV estimator under n → ∞ (and also p → ∞) setups.

We later see that a statistical estimator of k is given by k̂_n = p s²_n/‖β̃_n‖², where s²_n is given by (6.6).
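A sketch of the efficient GCV form (6.36) and of the grid minimization (6.37) follows; the log-spaced grid is an assumed default and should bracket the minimizer in practice:

```python
import numpy as np

def gcv(X, y, k):
    """GCV score (6.36): mean squared residual over [tr(I - H(k))/n]^2."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + k * np.eye(p), X.T)  # hat matrix H(k)
    resid = y - H @ y
    return (resid @ resid / n) / (1.0 - np.trace(H) / n) ** 2

def k_gcv(X, y, ks=np.logspace(-4, 2, 200)):
    """GCV estimator (6.37) of k by grid search (grid is an assumption)."""
    return min(ks, key=lambda k: gcv(X, y, k))
```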

6.5 Ridge Trace

Ridge regression has two interpretations of its characteristics. The first is the ridge trace. It is a two-dimensional plot of the components of β̂_n^RR(k) and the RSS,

RSS(k) = (Y − Xβ̂_n^RR(k))^⊤(Y − Xβ̂_n^RR(k)),

for a number of values of k. The trace serves to depict the complex relationships that exist between nonorthogonal prediction vectors and the effect of these interrelationships on the estimation of β. The second is a way of choosing k that gives a better estimator of β by damping the effect of the lower bound mentioned earlier. The ridge trace is a diagnostic tool that gives a readily interpretable picture of the effects of nonorthogonality and may guide the user to a better point estimate.

Now, we discuss how ridge regression can be used by statisticians with the gasoline mileage data from Montgomery et al. (2012). The data contains the following variables: y = miles/gallon, x₁ = displacement (cubic in.), x₂ = horsepower (ft-lb), x₃ = torque (ft-lb), x₄ = compression ratio, x₅ = rear axle ratio, x₆ = carburetor (barrels), x₇ = number of transmission speeds, x₈ = overall length (in.), x₉ = width (in.), x₁₀ = weight (lb), x₁₁ = type of transmission.

Table 6.1 presents the data in correlation form for y and x₁, …, x₁₁, where the data are centered.

Table 6.1 Correlation coefficients for gasoline mileage data.

      y     x₁    x₂    x₃    x₄    x₅    x₆    x₇    x₈    x₉    x₁₀   x₁₁
y     1.0  −0.8  −0.7  −0.8   0.4   0.6  −0.4   0.7  −0.7  −0.7  −0.8  −0.7
x₁   −0.8   1.0   0.9   0.9  −0.3  −0.6   0.6  −0.7   0.8   0.7   0.9   0.8
x₂   −0.7   0.9   1.0   0.9  −0.2  −0.5   0.7  −0.6   0.8   0.7   0.8   0.7
x₃   −0.8   0.9   0.9   1.0  −0.3  −0.6   0.6  −0.7   0.8   0.7   0.9   0.8
x₄    0.4  −0.3  −0.2  −0.3   1.0   0.4   0.0   0.5  −0.3  −0.3  −0.3  −0.4
x₅    0.6  −0.6  −0.5  −0.6   0.4   1.0  −0.2   0.8  −0.5  −0.4  −0.5  −0.7
x₆   −0.4   0.6   0.7   0.6   0.0  −0.2   1.0  −0.2   0.4   0.3   0.5   0.3
x₇    0.7  −0.7  −0.6  −0.7   0.5   0.8  −0.2   1.0  −0.6  −0.6  −0.7  −0.8
x₈   −0.7   0.8   0.8   0.8  −0.3  −0.5   0.4  −0.6   1.0   0.8   0.9   0.6
x₉   −0.7   0.7   0.7   0.7  −0.3  −0.4   0.3  −0.6   0.8   1.0   0.8   0.6
x₁₀  −0.8   0.9   0.8   0.9  −0.3  −0.5   0.5  −0.7   0.9   0.8   1.0   0.7
x₁₁  −0.7   0.8   0.7   0.8  −0.4  −0.7   0.3  −0.8   0.6   0.6   0.7   1.0

We may find the eigenvalues λ₁ > λ₂ > ⋯ > λ₁₁ > 0 of the matrix X^⊤X.

The condition number λ₁/λ₁₁ is very large, which indicates serious multicollinearity. The eigenvalues are nonzero real numbers, and tr[(X^⊤X)⁻¹] = Σ_{j=1}^{11} λ_j⁻¹ is far larger than the value 11 it would take in the orthogonal case. Thus, the expected squared distance of the BLUE (best linear unbiased estimator) β̃_n from β, namely σ² Σ_{j=1}^{11} λ_j⁻¹, is correspondingly inflated. The parameter space of β is 11-dimensional, but most of the variation is due to the largest two eigenvalues.

Variance inflation factor (VIF): First, we want to find the VIF, which is defined as

VIF_j = 1/(1 − R_j²), j = 1, …, 11,

where R_j² is the coefficient of determination obtained when x_j is regressed on the remaining 10 regressors. A VIF_j greater than 5 or 10 indicates that the associated regression coefficient is poorly estimated because of multicollinearity.

The VIFs of the variables in this data set far exceed these thresholds and certainly indicate severe multicollinearity in the data. It is also evident from the correlation matrix in Table 6.1. This makes it a most appropriate data set with which to analyze ridge regression.
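The VIFs may be computed directly from the definition above, as in this sketch (our own helper; the columns are centered in keeping with the correlation-form setup):

```python
import numpy as np

def vif(X):
    """VIF_j = 1/(1 - R_j^2), regressing x_j on the remaining columns."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                       # center each column
    out = np.empty(p)
    for j in range(p):
        others = np.delete(Xc, j, axis=1)
        coef, *_ = np.linalg.lstsq(others, Xc[:, j], rcond=None)
        resid = Xc[:, j] - others @ coef
        r2 = 1.0 - resid @ resid / (Xc[:, j] @ Xc[:, j])  # R_j^2
        out[j] = 1.0 / (1.0 - r2)
    return out
```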

On the other hand, one may look at the ridge trace and discover many finer details about each factor and about the optimal value of the tuning parameter k of the ridge regression. The ridge trace gives a two-dimensional portrayal of the effect of factor correlations, making possible assessments that cannot be made even if all regressions are computed. Figure 6.2 depicts the analysis using the ridge trace.


Figure 6.2 Ridge trace for gasoline mileage data.

Notice that each coefficient tends to its LSE value as k goes to zero. From Figure 6.2 we can see that reasonable coefficient stability is achieved over a moderate range of k. For more comments, see Hoerl and Kennard (1970).
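A ridge trace such as Figure 6.2 can be produced in a few lines; the sketch below (ours, with an assumed log-spaced grid for k) plots every coefficient path:

```python
import numpy as np
import matplotlib.pyplot as plt

def ridge_trace(X, y, ks):
    """Plot each component of the RRE (6.17) as a function of k."""
    p = X.shape[1]
    paths = np.array([np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)
                      for k in ks])
    for j in range(p):
        plt.plot(ks, paths[:, j], label=f"beta_{j + 1}")
    plt.xscale("log")
    plt.xlabel("k")
    plt.ylabel("coefficient")
    plt.legend()
    plt.show()

# usage: ridge_trace(X, y, np.logspace(-4, 2, 100))
```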

6.6 Degrees of Freedom of RRE

The RRE constrains the flexibility of the model by adding a quadratic penalty term to the least squares objective function (see Section 6.2).

When k = 0, the RRE equals the LSE. The larger the value of k, the greater the penalty for having large coefficients.

In our study, we use the L₂-risk expression to assess the overall quality of an estimator. From (6.7) we see that the L₂ risk of the LSE β̃_n is σ² tr(C⁻¹), and from (6.23) we find that the L₂ risk of the RRE is

R(β̂_n^RR(k); β) = σ² tr[(C + kI_p)⁻¹C(C + kI_p)⁻¹] + k²β^⊤(C + kI_p)⁻²β.

These expressions give the expected squared distances of the LSE and the RRE from the true value of β, respectively. The LSE is the BLUE of β, but the RRE is a biased estimator of β: it modifies the LSE by introducing a little bias to improve its performance. The existence result in Theorem 6.2 shows that for 0 < k < σ²/α²_max, where α_max is the largest component of α = Γ^⊤β in absolute value, R(β̂_n^RR(k); β) < R(β̃_n; β). Thus, for an appropriate choice of k, the RRE is closer to β than the LSE is. In this sense, the RRE is more reliable than the LSE.

On top of the abovementioned property, the RRE improves the model performance over the LSE and variable subset selection. The RRE also reduces the model DF. For example, the LSE has p parameters and therefore uses p DF.

To see how the RRE reduces the number of parameters, we define a linear smoother given by

(6.38) Ŷ = HY,

where the linear operator H is a smoother matrix and Ŷ contains the predicted values of Y. The DF of the smoother is given by the trace of the smoother matrix, tr(H). The LSE and RRE are both linear smoothers, since the predicted values of either estimator are given by the product

(6.39) Ŷ(k) = X(X^⊤X + kI_p)⁻¹X^⊤Y = H(k)Y,

where H(k) is the hat matrix of the RRE. Let H = X(X^⊤X)⁻¹X^⊤ be the hat matrix of the LSE; then tr(H) = p is the DF of the LSE.

To find the DF of the RRE, let Λ = diag(λ₁, …, λ_p) be the eigenvalue matrix corresponding to X^⊤X. We consider the singular value decomposition (SVD) X = UΛ^{1/2}V^⊤, where U and V are column-orthogonal matrices; then

(6.40) DF(k) = tr(H(k)) = Σ_{j=1}^p λ_j/(λ_j + k),

where the λ_j/(λ_j + k) are called the shrinkage fractions. When k is positive, the shrinkage fractions are all less than 1 and, hence, DF(k) < p. Thus, the effective number of parameters in a ridge regression model is less than the actual number of predictors. The variable subset selection method explicitly drops variables from the model, while the RRE reduces the effects of the unwanted variables without dropping them.

A simple way of finding a value of k is to set the equation DF(k) = m, for a desired effective number of parameters m, and solve for k. It is easy to see that

(6.41) λ_p/(λ_p + k) ≤ λ_j/(λ_j + k) ≤ λ₁/(λ₁ + k), j = 1, …, p.

Then, solve Σ_{j=1}^p λ_j/(λ_j + k) = m. Hence, an optimum value of k falls in the interval

[λ_p(p − m)/m, λ₁(p − m)/m]

such that DF(k) = m.
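A sketch (ours) of the effective DF (6.40) and of solving DF(k) = m over the bracket just derived; it assumes 0 < m < p and distinct eigenvalues so that the root is strictly bracketed:

```python
import numpy as np
from scipy.optimize import brentq

def ridge_df(X, k):
    """Effective DF (6.40): sum of shrinkage fractions lambda_j/(lambda_j + k)."""
    lam = np.linalg.svd(X, compute_uv=False) ** 2  # eigenvalues of X'X
    return float(np.sum(lam / (lam + k)))

def k_for_df(X, m):
    """Solve DF(k) = m for k by root finding on the interval above."""
    lam = np.linalg.svd(X, compute_uv=False) ** 2
    p = lam.size
    lo = lam.min() * (p - m) / m                   # here DF(k) >= m
    hi = lam.max() * (p - m) / m                   # here DF(k) <= m
    return brentq(lambda k: ridge_df(X, k) - m, lo, hi)
```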

6.7 Generalized Ridge Regression Estimators

In Sections 6.2 and 6.3, we discussed the ordinary RRE and its associated bias and L₂-risk properties. In that RRE, all coefficients are equally weighted; but in reality, the coefficients may deserve unequal weights. To achieve this goal, we define the generalized ridge regression estimator (GRRE) of β as

(6.42) β̂_n^RR(K) = (X^⊤X + K)⁻¹X^⊤Y,

where K = diag(k₁, …, k_p), k_j ≥ 0, j = 1, …, p. If K = kI_p, we get (6.17). One may derive (6.42) by minimizing the objective function with β^⊤Kβ as the penalty function,

(6.43) (Y − Xβ)^⊤(Y − Xβ) + β^⊤Kβ,

w.r.t. β to obtain the normal equations

(X^⊤X + K)β = X^⊤Y.

Here, we have put unequal weights on the components of β.

The bias, MSE matrix, and L₂-risk expressions of the GRRE are given by

(6.44) b(β̂_n^RR(K)) = −(C + K)⁻¹Kβ,
(6.45) M(β̂_n^RR(K)) = σ²(C + K)⁻¹C(C + K)⁻¹ + (C + K)⁻¹Kββ^⊤K(C + K)⁻¹,
(6.46) R(β̂_n^RR(K); β) = σ² tr[(C + K)⁻¹C(C + K)⁻¹] + β^⊤K(C + K)⁻²Kβ,

respectively.

respectively.

In the next section, we show the application of the GRRE to obtain the adaptive RRE.

6.8 LASSO and Adaptive Ridge Regression Estimators

If the L₂-norm penalty in the LSE objective function

(Y − Xβ)^⊤(Y − Xβ) + k‖β‖₂²

is replaced by the L₁ norm ‖β‖₁ = Σ_{j=1}^p |β_j|, the resultant estimator is

(6.47) β̂_n^LASSO(λ) = argmin_β {(Y − Xβ)^⊤(Y − Xβ) + λ‖β‖₁}.

This estimator sets some coefficients exactly to zero, making the LASSO estimator different from the RRE, where most of the coefficients become small but stay nonzero. Thus, LASSO selects variables and estimates their coefficients simultaneously.

The LASSO estimator may be written explicitly as

(6.48) β̂_{n,j}^LASSO(λ) = sgn(β̃_{n,j})(|β̃_{n,j}| − λσ√(c^{jj}))⁺, j = 1, …, p,

where c^{jj} is the jth diagonal element of C⁻¹ and x⁺ = max(0, x).
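The rule (6.48) is soft thresholding of the LSE; a sketch follows (the threshold vector t mirrors the form above, with σ replaced by an estimate such as s_n when unknown):

```python
import numpy as np

def soft_threshold(b, t):
    """Componentwise sgn(b_j)(|b_j| - t_j)^+ as in (6.48)."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

# sketch of (6.48): thresholds scale with the square roots of diag(C^{-1})
# t = lam * sigma * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))
# beta_lasso = soft_threshold(beta_lse, t)
```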

Now, if we consider the estimation of K in the GRRE, it may be easy to see that the optimal weights are k_j = σ²/α_j², j = 1, …, p, where α = Γ^⊤β and Γ is an orthogonal matrix diagonalizing C; see Hoerl and Kennard (1970).

To avoid simultaneous estimation of the weights k₁, …, k_p, we assume that k is equal to the harmonic mean of k₁, …, k_p, i.e.

(6.49) p/k = Σ_{j=1}^p 1/k_j, k_j > 0.

The value of k controls the global complexity. This constraint is the link between the p tuning parameters k₁, …, k_p of the GRRE and the single tuning parameter k of the adaptive procedure.

The adaptive generalized ridge regression estimator (AGRRE) is obtained by minimizing the following objective function:

(6.50) L(β, k₁, …, k_p; μ, ν) = (Y − Xβ)^⊤(Y − Xβ) + Σ_{j=1}^p k_jβ_j² + μ(Σ_{j=1}^p 1/k_j − p/k) − Σ_{j=1}^p ν_j k_j,

which is the Lagrangian form of the objective function

(6.51) (Y − Xβ)^⊤(Y − Xβ) + Σ_{j=1}^p k_jβ_j², subject to the constraint (6.49),

where μ and ν = (ν₁, …, ν_p)^⊤ are the Lagrange multipliers for (6.50).

Following Avalos et al. (2007), a necessary condition for optimality is obtained by differentiating the Lagrangian w.r.t. k_j, giving

(6.52) β_j² − μ/k_j² = 0, i.e. k_j = √μ/|β_j|.

Putting this expression in the constraint (6.49), we obtain

(6.53) k_j = (k/p) Σ_{l=1}^p |β_l| / |β_j|, j = 1, …, p.

The optimal k_j is then obtained from k and the β_j's, so that (6.51) may be written as

(6.54) (Y − Xβ)^⊤(Y − Xβ) + (k/p)(Σ_{j=1}^p |β_j|)²,

which is equivalent to minimizing

(6.55) (Y − Xβ)^⊤(Y − Xβ) + λ Σ_{j=1}^p |β_j|,

for some λ > 0, which is exactly the LASSO problem.

Minimizing (6.55), we obtain the p normal equations for β̂_n^LASSO = (β̂_{n,1}^LASSO, …, β̂_{n,p}^LASSO)^⊤ as

(6.56) −2x_j^⊤(Y − Xβ) + λ sgn(β_j) = 0, β_j ≠ 0, j = 1, …, p.

The solution of (6.56) may be obtained following Grandvalet (1998) as given here. Let

(6.57) β_j = c_jγ_j, j = 1, …, p.

Then, the problem (6.51) with constraint (6.49) may be written as

(6.58) min_{c,γ} (Y − Σ_{j=1}^p c_jγ_jx_j)^⊤(Y − Σ_{j=1}^p c_jγ_jx_j) + k Σ_{j=1}^p γ_j²,

where β = (c₁γ₁, …, c_pγ_p)^⊤, subject to Σ_{j=1}^p c_j² = p, c_j ≥ 0, j = 1, …, p.

After some tedious algebra, as in Avalos et al. (2007), we obtain the RRE of images as

(6.59)equation
(6.60)equation

In the next section, we present the algorithm to obtain LASSO solutions.

6.9 Optimization Algorithm

For the computation of LASSO estimators, Tibshirani (1996) used quadratic programming. In this section, we use a fixed-point algorithm based on the expression (6.51), which is the GRRE form; thus, we solve a sequence of weighted ridge regression problems, as suggested by Knight and Fu (2000). First, we estimate the K-matrix based on estimators of β as well as σ². For the estimator of k, we use the one from Hoerl et al. (1975):

(6.61) k̂_n = p s²_n / (β̃_n^⊤β̃_n),

where

s²_n = (n − p)⁻¹(Y − Xβ̃_n)^⊤(Y − Xβ̃_n).

Hence, the estimate of the diagonal matrix K is given by K̂ = diag(k̂₁, …, k̂_p), where

(6.62) k̂_j = (k̂_n/p) Σ_{l=1}^p |β̃_{n,l}| / |β̃_{n,j}|, j = 1, …, p.

Alternatively, we may use the estimator K̂ = diag(k̂₁, …, k̂_p), where

(6.63) k̂_j = λ/(2|β̃_{n,j}|), j = 1, …, p.

The derivation of this estimator is explained here.

We consider the LASSO problem, which is

(6.64) min_β {(Y − Xβ)^⊤(Y − Xβ) + λ Σ_{j=1}^p |β_j|}.

Let β̂^L = β̂_n^LASSO(λ) be the LASSO solution. For any j such that β̂_j^L ≠ 0, the optimality conditions are

(6.65) 2x_j^⊤(Y − Xβ̂^L) = λ sgn(β̂_j^L),

where x_j is the jth column of X.

Take the equivalent adaptive ridge problem:

(6.66) min_β {(Y − Xβ)^⊤(Y − Xβ) + Σ_{j=1}^p k_jβ_j²},

whose solution is also β̂^L (since the problems are equivalent). For any j such that β̂_j^L ≠ 0, the optimality conditions are

(6.67) x_j^⊤(Y − Xβ̂^L) = k_jβ̂_j^L.

Hence, we have, for all j such that β̂_j^L ≠ 0:

(6.68) k_j = λ sgn(β̂_j^L)/(2β̂_j^L) = λ/(2|β̂_j^L|),

and since (see (6.49))

(6.69) p/k = Σ_{j=1}^p 1/k_j = (2/λ) Σ_{j=1}^p |β̂_j^L|,

we have

(6.70) k_j = (k/p) Σ_{l=1}^p |β̂_l^L| / |β̂_j^L|.

To obtain the LASSO estimator, we start the iterative process based on (6.61) as given here:

(6.71) β̂^(0) = (X^⊤X + k̂_nI_p)⁻¹X^⊤Y.

We can define successive estimates by

(6.72) β̂^(t+1) = (X^⊤X + K̂^(t))⁻¹X^⊤Y, t = 0, 1, 2, …,

where

(6.73) K̂^(t) = diag(k̂₁^(t), …, k̂_p^(t)), k̂_j^(t) = (k̂_n/p) Σ_{l=1}^p |β̂_l^(t)| / |β̂_j^(t)|,

with the convention that components with β̂_j^(t) = 0 remain at zero, so that the resulting estimate is a vector with some elements equal to zero. The expression (6.73) is similar to Knight and Fu (2000).

The sequence {β̂^(t)} does not necessarily converge to the global minimum, but the iteration seems to work well if multiple starting points are used. The resulting LASSO estimator produces p₁ nonzero coefficients and p₂ zero coefficients such that p = p₁ + p₂. This information will be useful for defining the PTE and the Stein-type shrinkage estimator to assess the performance characteristics of the LASSO estimators.
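A compact sketch of the fixed-point scheme (6.71)-(6.73) (our own implementation; the constants guarding division by zero and declaring a coefficient numerically zero are assumptions):

```python
import numpy as np

def lasso_via_adaptive_ridge(X, y, k_global, n_iter=200, eps=1e-10):
    """Iterate weighted ridge solves (6.72) with weights (6.73)."""
    n, p = X.shape
    beta = np.linalg.solve(X.T @ X + k_global * np.eye(p), X.T @ y)  # (6.71)
    for _ in range(n_iter):
        # k_j = (k/p) * sum_l |beta_l| / |beta_j|, cf. (6.53) and (6.73)
        kj = k_global * np.sum(np.abs(beta)) / (
            p * np.maximum(np.abs(beta), eps))
        beta_new = np.linalg.solve(X.T @ X + np.diag(kj), X.T @ y)   # (6.72)
        if np.max(np.abs(beta_new - beta)) < 1e-8:
            beta = beta_new
            break
        beta = beta_new
    beta[np.abs(beta) < 1e-6] = 0.0   # declare tiny coefficients as zero
    return beta
```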

An estimator of the MSE matrix is given by

(6.74) M̂ = s²_n(X^⊤X + K̂)⁻¹X^⊤X(X^⊤X + K̂)⁻¹,

where the rows and columns corresponding to the zero coefficients are zero.

6.9.1 Prostate Cancer Data

The prostate cancer data, attributed to Stamey et al. (1989), was given by Tibshirani (1996) and is used here for our analysis. They examined the correlation between the level of prostate-specific antigen and a number of clinical measures in men who were about to undergo a radical prostatectomy. The factors considered are as follows: log(cancer volume) (lcavol), log(prostate weight) (lweight), age, log(benign prostatic hyperplasia amount) (lbph), seminal vesicle invasion (svi), log(capsular penetration) (lcp), Gleason score (gleason), and percentage of Gleason scores 4 or 5 (pgg45). Tibshirani (1996) fitted a linear model to log(prostate-specific antigen) after first standardizing the predictors. Our data consist of a sample of size 97 instead of the 95 considered by Tibshirani (1996). The following results (Table 6.2) have been obtained for the prostate example.

Table 6.2 Estimated coefficients (standard errors) for prostate cancer data using LS, LASSO, and ARR estimators.

             LSE (s.e.)    LASSO (s.e.)   ARRE (s.e.)
Intercept    2.47 (0.07)   2.47 (0.07)    2.47 (0.07)
lcavol       0.66 (0.10)   0.55 (0.10)    0.69 (0.00)
lweight      0.26 (0.08)   0.17 (0.08)    0.15 (0.00)
age         −0.15 (0.08)   0.00 (0.08)    0.00 (0.00)
lbph         0.14 (0.08)   0.00 (0.08)    0.00 (0.00)
svi          0.31 (0.10)   0.18 (0.10)    0.13 (0.00)
lcp         −0.14 (0.12)   0.00 (0.12)    0.00 (0.00)
gleason      0.03 (0.11)   0.00 (0.12)    0.00 (0.00)
pgg45        0.12 (0.12)   0.00 (0.12)    0.00 (0.00)

LS, least squares; ARR, adaptive ridge regression.

6.10 Estimation of Regression Parameters for Low‐Dimensional Models

Consider the multiple regression model

(6.75) Y = Xβ + ε,

with the n × p design matrix X, where rank(X) = p ≤ n.

6.10.1 BLUE and Ridge Regression Estimators

The LSE of β is the value (or values) of β for which the L₂-norm ‖Y − Xβ‖² is least. In the case where p ≤ n and X has rank p, the p × p matrix X^⊤X is invertible and the LSE of β is unique and given by

(6.76) β̃_n = (X^⊤X)⁻¹X^⊤Y.

If X is not of rank p, (6.76) is no longer valid. In particular, (6.76) is not valid when p > n.

Now, we consider the model

(6.77) Y = X₁β₁ + X₂β₂ + ε,

where X_l is an n × p_l (l = 1, 2) matrix, and suspect that the sparsity condition β₂ = 0 may hold. Under this setup, the unrestricted ridge estimator (URE) of β is

(6.78) β̂_n^RR(k) = (X^⊤X + kI_p)⁻¹X^⊤Y.

Now, we consider the partition of the LSE, β̃_n = (β̃₁^⊤, β̃₂^⊤)^⊤, where β̃₁ is a p₁-vector and β̃₂ is a p₂-vector, so that p = p₁ + p₂. Note that β̃₁ and β̃₂ are given by (6.4). We know the marginal distributions β̃₁ ∼ N_{p₁}(β₁, σ²(X₁^⊤M₂X₁)⁻¹) and β̃₂ ∼ N_{p₂}(β₂, σ²(X₂^⊤M₁X₂)⁻¹), respectively. Thus, we may define the corresponding RREs as

(6.79) β̂₁^RR(k₁) = (X₁^⊤M₂X₁ + k₁I_{p₁})⁻¹X₁^⊤M₂Y,
(6.80) β̂₂^RR(k₂) = (X₂^⊤M₁X₂ + k₂I_{p₂})⁻¹X₂^⊤M₁Y,

respectively, for k₁ ≥ 0 and k₂ ≥ 0.

If we consider that β₂ = 0 holds, then we have the restricted regression parameter β = (β₁^⊤, 0^⊤)^⊤, which is estimated by the restricted ridge regression estimator (RRRE)

β̂_n^RR(R)(k₁) = (β̂₁^RR(k₁)^⊤, 0^⊤)^⊤,

where

(6.81) β̂₁^RR(k₁) = (X₁^⊤X₁ + k₁I_{p₁})⁻¹X₁^⊤Y.

On the other hand, if β₂ is suspected to be 0, then we may test the validity of H₀: β₂ = 0 based on the statistic

(6.82) L_n = β̃₂^⊤(X₂^⊤M₁X₂)β̃₂ / σ²,

where σ² is assumed to be known. Then, L_n follows a chi-squared distribution with p₂ DF under H₀. Let us then consider an α-level critical value χ²_{p₂}(α) from the null distribution of L_n.

Define the PTE as

(6.83) β̂₂^PT(k₂) = β̂₂^RR(k₂) I(L_n ≥ χ²_{p₂}(α)).

Similarly, we may define the SE as

(6.84) β̂₂^S(k₂) = β̂₂^RR(k₂)(1 − (p₂ − 2)L_n⁻¹),

and the positive-rule Stein-type estimator (PRSE) as

(6.85) β̂₂^S+(k₂) = β̂₂^RR(k₂)(1 − (p₂ − 2)L_n⁻¹)⁺.
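Before studying their risk properties, we note that (6.83)-(6.85) act on the subvector estimate through simple scalar factors, as in the following sketch (ours; it assumes p₂ > 2 and L_n > 0):

```python
import numpy as np
from scipy.stats import chi2

def pt_and_stein_ridge(b2_ridge, Ln, p2, alpha=0.05):
    """Sketch of the PTE (6.83), SE (6.84), and PRSE (6.85) for beta_2."""
    crit = chi2.ppf(1 - alpha, p2)                # chi-square critical value
    pte = b2_ridge * float(Ln >= crit)            # keep beta_2 if H0 rejected
    d = p2 - 2                                    # assumes p2 > 2
    se = b2_ridge * (1.0 - d / Ln)                # Stein-type shrinkage
    prse = b2_ridge * max(0.0, 1.0 - d / Ln)      # positive-rule version
    return pte, se, prse
```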

6.10.2 Bias and images‐risk Expressions of Estimators

In this section, we present the bias and L₂ risk of the estimators as follows:

  1. Unrestricted ridge estimator (URE)
    (6.86)equation

    The L₂ risk of β̂₁^RR(k₁) is obtained as follows:

    (6.87)equation

    where Δ₁² is the corresponding noncentrality parameter. Then the weighted risk of β̂₁^RR(k₁), with weight matrix W₁, is obtained as

    (6.88)equation

    Similarly, the risk of β̂₂^RR(k₂) is obtained as follows:

    (6.89)equation

    where Δ₂² is the corresponding noncentrality parameter. Then, the weighted risk of β̂₂^RR(k₂), with weight matrix W₂, is obtained as

    (6.90)equation

    Finally, the weighted L₂ risk of β̂_n^RR(k) is

    (6.91)equation
  2. Restricted ridge regression estimator (RRRE)
    (6.92)equation

    Then, the L₂ risk of the RRRE is

    (6.93)equation

    Now consider the following weighted risk function:

    equation

    Thus,

    (6.94)equation
  3. Preliminary test ridge regression estimator (PTRRE)

    The bias and weighted risk functions of the PTRRE are, respectively, obtained as follows:

    (6.95)equation

    The weighted risk, with weight matrix W, is obtained as follows:

    (6.96)equation
  4. Stein‐type ridge regression estimator (SRRE)

    The bias and weighted risk functions of SRRE are respectively obtained as follows:

    (6.97)equation

    The weighted risk function of the Stein-type ridge estimator is obtained as follows:

    (6.98)equation
  5. Positive‐rule Stein‐type ridge estimator (PRSRRE)

    The bias and weighted risk functions of the PRSRRE are, respectively, obtained as follows:

    (6.99)equation

    where

    equation

6.10.3 Comparison of the Estimators

Here, we compare the URE, SRRE, and PRSRRE using the weighted L₂-risk criterion given in Theorem 6.3.

6.10.4 Asymptotic Results of RRE

First, we assume that

(6.103) y_i = x_i^⊤β + ε_i, i = 1, …, n,

where for each n, ε₁, …, ε_n are i.i.d. random variables with mean zero and variance σ². Also, we assume that the x_i satisfy

(6.104) C_n = (1/n) Σ_{i=1}^n x_ix_i^⊤ → C

for some positive definite matrix C and

(6.105) (1/n) max_{1≤i≤n} x_i^⊤x_i → 0.

Suppose that

(6.106) k_n/√n → k₀ ≥ 0,

and define β̂_n^RR(k_n) to minimize

(6.107) (Y − Xβ)^⊤(Y − Xβ) + k_n Σ_{j=1}^p β_j².

Then, we have the following theorem from Knight and Fu (2000).

This theorem suggests that the advantages of ridge estimators are limited to situations where all coefficients are relatively small.

The next theorem gives the asymptotic distribution of √n(β̂_n^RR(k_n) − β).

From Theorem 6.5,

√n(β̂_n^RR(k_n) − β) →_d N_p(−k₀C⁻¹β, σ²C⁻¹),

where k₀ = lim_{n→∞} k_n/√n.

6.11 Summary and Concluding Remarks

In this chapter, we considered the unrestricted estimator and shrinkage estimators, namely, the restricted estimator, PTE, Stein-type estimator, and PRSE, and two penalty estimators, namely, the RRE and the LASSO, for estimating the regression parameters of linear regression models. We also discussed the determination of the tuning parameter. A detailed discussion of the LASSO and adaptive ridge regression estimators is given. The optimization algorithm for computing the LASSO estimator is discussed, and the prostate cancer data are used to illustrate it.

Problems

  1. 6.1 Show that C = X^⊤X approaches singularity as its minimum eigenvalue tends to 0.
  2. 6.2 Display the graph of ridge trace for the Portland cement data in Section 6.5.
  3. 6.3 Consider the model Y = Xβ + ε and suppose that we want to test the null hypothesis H₀: β = 0 vs. H_A: β ≠ 0. Then, show that the likelihood ratio test statistic is

    F_n = β̃_n^⊤Cβ̃_n / (p s²_n),

    which follows a noncentral F-distribution with (p, n − p) DF and noncentrality parameter Δ² = β^⊤Cβ/σ².

  4. 6.4 Similarly, if we want to test the subhypothesis H₀: β₂ = 0 vs. H_A: β₂ ≠ 0, then show that the appropriate test statistic is

    F_n = β̃₂^⊤C₂₂.₁β̃₂ / (p₂ s²_n),

    where

    C₂₂.₁ = C₂₂ − C₂₁C₁₁⁻¹C₁₂

    and β̃₂ = (X₂^⊤M₁X₂)⁻¹X₂^⊤M₁Y with M₁ = I_n − X₁(X₁^⊤X₁)⁻¹X₁^⊤.

  5. 6.5 Show that the risk function of the GRRE is
    (6.112) R(β̂_n^RR(K); β) = σ² tr[(C + K)⁻¹C(C + K)⁻¹] + β^⊤K(C + K)⁻²Kβ.
  6. 6.6 Verify (6.60).
  7. 6.7 Verify (6.74).
  8. 6.8 Consider a real data set where the design matrix elements are moderately to highly correlated; then find the efficiency of the estimators using unweighted risk functions. Find parallel formulas for the efficiency expressions and compare the results with those based on the weighted risk functions. Are the two results consistent?