Ridge regression

With ridge regression, all eight features will remain in the model, so this will make for an intriguing comparison with the best subsets model. The package that we will use, and which is in fact already loaded, is glmnet. The package requires that the input features are in a matrix rather than a data frame, and for ridge regression we follow the command sequence glmnet(x = our input matrix, y = our response, family = the distribution, alpha = 0). The alpha argument is set to 0 for ridge regression and 1 for LASSO.
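
To make the alpha convention concrete, here is a minimal sketch of the two calls side by side; X and y are placeholders for a numeric input matrix and a response vector (we build the real ones for our data next):

    > # sketch only: alpha = 0 gives ridge regression, alpha = 1 gives LASSO
    > fit.ridge <- glmnet(x = X, y = y, family = "gaussian", alpha = 0)
    > fit.lasso <- glmnet(x = X, y = y, family = "gaussian", alpha = 1)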

Getting the train set ready for use in glmnet is actually quite easy: use as.matrix() for the inputs and create a vector for the response, as follows:

    > x <- as.matrix(train[, 1:8])
    > y <- train[, 9]

Now, run the ridge regression by placing it in an object called, appropriately I might add, ridge. It is important to note here that the glmnet package will standardize the inputs before computing the lambda values and will then unstandardize the coefficients. You will need to specify the family as gaussian, as the response variable is continuous, and set alpha = 0 for ridge regression, as follows:

    > ridge <- glmnet(x, y, family = "gaussian", alpha = 0)

The object contains all the information that we need in order to evaluate the technique. The first thing to try is the print() command, which shows the number of nonzero coefficients, the percent of deviance explained, and the corresponding value of lambda. The package defaults to 100 steps in the algorithm. However, the algorithm will stop before reaching 100 steps if the percent of deviance explained does not improve meaningfully from one lambda to the next; that is, the algorithm converges to an optimal solution. To save space, I will present only the first five and last ten lambda results:

    > print(ridge)
    Call: glmnet(x = x, y = y, family = "gaussian", alpha = 0)

                 Df      %Dev    Lambda
      [1,]        8 3.801e-36 878.90000
      [2,]        8 5.591e-03 800.80000
      [3,]        8 6.132e-03 729.70000
      [4,]        8 6.725e-03 664.80000
      [5,]        8 7.374e-03 605.80000
    ...........................
     [91,]        8 6.859e-01   0.20300
     [92,]        8 6.877e-01   0.18500
     [93,]        8 6.894e-01   0.16860
     [94,]        8 6.909e-01   0.15360
     [95,]        8 6.923e-01   0.13990
     [96,]        8 6.935e-01   0.12750
     [97,]        8 6.946e-01   0.11620
     [98,]        8 6.955e-01   0.10590
     [99,]        8 6.964e-01   0.09646
    [100,]        8 6.971e-01   0.08789
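
If you would rather work with these values programmatically than read them off the printout, they are stored on the fitted object; lambda and dev.ratio are standard components of a glmnet fit, so a quick check might look like this:

    > # how many lambda steps were actually computed, and the last few values
    > length(ridge$lambda)
    > tail(cbind(lambda = ridge$lambda, dev.ratio = ridge$dev.ratio), 5)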

Look at row 100, for example. It shows us that the number of nonzero coefficients, or, said another way, the number of features included, is eight; recall that it will always be the same for ridge regression. We also see that the percent of deviance explained is 0.6971 and the lambda tuning parameter for this row is 0.08789. Here is where we can decide which lambda to select for the test set. The lambda of 0.08789 could be used, but to keep things a little simpler, let's try 0.10 on the test set. A couple of plots might help here, so let's start with the package's default plot, adding annotations to the curves with label = TRUE in the following syntax:

    > plot(ridge, label = TRUE)

The following is the output of the preceding command:

In the default plot, the y axis shows the coefficient values and the x axis is the L1 norm of the coefficients. The top of the plot contains a second x axis, which equates to the number of features in the model. Perhaps a better way to view this is to look at how the coefficient values change as lambda changes. We just need to tweak the following plot() command by adding xvar = "lambda". The other option is to plot against the percent of deviance explained by substituting dev for lambda:

    > plot(ridge, xvar = "lambda", label = TRUE)

The output of the preceding command is as follows:

This is a worthwhile plot, as it shows that as lambda decreases, the shrinkage penalty decreases and the absolute values of the coefficients increase. To see the coefficients at a specific lambda value, use the coef() command. Here, we specify the lambda value that we want with s = 0.1. We also state exact = TRUE, which tells glmnet to fit a model at that specific lambda rather than interpolating from the lambda values on either side, as follows:

    > ridge.coef <- coef(ridge, s = 0.1, exact = TRUE)
    > ridge.coef
    9 x 1 sparse Matrix of class "dgCMatrix"
                          1
    (Intercept)  0.13062197
    lcavol       0.45721270
    lweight      0.64579061
    age         -0.01735672
    lbph         0.12249920
    svi          0.63664815
    lcp         -0.10463486
    gleason      0.34612690
    pgg45        0.00428580
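
One version-dependent detail worth noting: in more recent releases of glmnet, exact = TRUE refits the model at the requested lambda and therefore requires the original data to be supplied as well (check ?coef.glmnet for your installed version). In that case, the call would look like this:

    > # newer glmnet versions need x and y passed along when exact = TRUE
    > ridge.coef <- coef(ridge, s = 0.1, exact = TRUE, x = x, y = y)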

It is important to note that age, lcp, and pgg45 are close to, but not quite, zero. Let's not forget to plot deviance versus coefficients as well:

    > plot(ridge, xvar = "dev", label = TRUE)

The output of the preceding command is as follows:

Comparing the two previous plots, we can see that as lambda decreases, the coefficients increase and the percent/fraction of deviance explained increases. If we were to set lambda equal to zero, there would be no shrinkage penalty and the model would be equivalent to OLS.
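
If you want to convince yourself of that claim, a rough check is to fit OLS with lm() and refit glmnet with lambda = 0; the coefficients should be nearly identical, with any small differences down to glmnet's convergence tolerance. This sketch assumes, as before, that the train data frame holds the eight inputs in columns 1 through 8 and lpsa in column 9:

    > # sketch: ordinary least squares versus an unpenalized (lambda = 0) glmnet fit
    > ols <- lm(lpsa ~ ., data = train)
    > ridge0 <- glmnet(x, y, family = "gaussian", alpha = 0, lambda = 0)
    > cbind(OLS = coef(ols), glmnet = as.vector(coef(ridge0)))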

To try the model out on the test set, we will have to transform the features just as we did for the training data:

    > newx <- as.matrix(test[, 1:8])

Then, we use the predict() function, specifying type = "response" and our lambda of 0.10, to create an object that we will call ridge.y, and plot the predicted values versus the actual values, as follows:

    > ridge.y <- predict(ridge, newx = newx, type = "response", s = 0.1)

    > plot(ridge.y, test$lpsa, xlab = "Predicted", ylab = "Actual",
           main = "Ridge Regression")

The output of the preceding command is as follows:

The plot of predicted versus actual values for ridge regression looks quite similar to that of best subsets, complete with two interesting outliers at the high end of the PSA measurements. In the real world, it would be advisable to explore these outliers further to understand whether they are truly unusual or whether we are missing something. This is where domain expertise would be invaluable. The MSE comparison with the benchmark may tell a different story. We first calculate the residuals and then take the mean of those squared residuals:

    > ridge.resid <- ridge.y - test$lpsa
    > mean(ridge.resid^2)
    [1] 0.4789913

Ridge regression has given us a slightly better MSE. It is now time to put LASSO to the test to see if we can decrease our errors even further.
