Principal component regression

How do we perform principal component regression in R? Fortunately, there is not much pain to suffer here, thanks to a simple function from the pls package by Bjørn-Helge Mevik and Ron Wehrens. This package provides facilities for both principal component regression and partial least squares regression in R. We are not going to apply the latter to our data, but you should be aware of its existence as an alternative to the ordinary least squares technique we applied for coefficient estimation.

The simple function I mentioned is pcr(). It is very similar to the lm() function we have already employed, and just requires you to pass the response variable and the explanatory variables you are going to employ in your principal component regression model:

library(pls)
pcr_regression <- pcr(as.numeric(default_numeric) ~ ., data = training_data)

Let's have a look inside by calling summary:

summary(pcr_regression)

What can you see here?

We get some descriptive information about the data: our X matrix is composed of 12 columns, that is, variables, and 11,523 rows, and our Y vector is composed of one column and the same number of rows. You then find information about the algorithm employed, which is singular value decomposition (I will point you to some good references if you want more information on this), and finally a really interesting table on the percentage of variance explained by different sets of components.
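If you would rather retrieve the explained-variance figures programmatically than read them off the summary, the pls package exposes the explvar() function. A quick sketch, using the pcr_regression object fitted above:

```r
library(pls)

# Percentage of X variance explained by each individual component
x_variance <- explvar(pcr_regression)

# Cumulative percentages, matching the "X" row of the summary table
cumsum(x_variance)
```

This is handy when you want to pick the smallest number of components that crosses a given variance threshold automatically.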

How do you read it?

You can see that by employing just the first principal component, the first column on the left of the table, 97.50% of the variance of X is explained, while only 0.22% of the variance of Y is explained. Moving on with the number of components, you can see that when employing six principal components we have explained all the variance of X, yet the variance of Y remains mainly unexplained. This remains true up to the end, when the threshold of twelve components is reached.

As I was saying, it doesn't make any sense to go further, since our starting set has twelve variables and we actually want to reduce the dimensions of our model.
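Once you have settled on a reduced number of components, you can ask the fitted model to use only those when predicting. As a sketch, assuming training_data is the same data frame used to fit the model, restricting the prediction to six components looks like this:

```r
# Predict using only the first six principal components;
# the ncomp argument limits how many components are employed
predictions <- predict(pcr_regression, newdata = training_data, ncomp = 6)
head(predictions)
```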

We can conclude that no great improvement was obtained through this technique. Nevertheless, we still have one more way to visualize its results, which is to plot the R-squared level associated with each set of components tried. We can conveniently do this through the R2() function, provided directly within the pls package. It is one of a full set of performance-related functions provided within the package, and we are going to have a closer look at them when talking about performance metrics. For now, let us just plot our R-squared:

plot(R2(pcr_regression))

As you can see, we get a substantial improvement of our R-squared as the number of components increases. You can notice that this metric holds stable from one to four components, then substantially improves around five components, and shows another relevant improvement around ten components. What is the final level? You can find the number by employing, once again, the R2() function:

R2(pcr_regression)
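The object returned by R2() also lets you pull the numbers out directly rather than reading them from the printed output. A sketch, assuming the fitted pcr_regression object from above; the val element stores one R-squared value per number of components:

```r
r2_object <- R2(pcr_regression)

# One R-squared value for each number of components tried
r2_values <- drop(r2_object$val)

# The value for the full twelve-component model
tail(r2_values, 1)
```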

This shows us the following:

As you can see, we finally come to a 0.08 R-squared, which, on a scale from zero to one, is not an impressive value. Nevertheless, keep it in mind so you can compare it with the result we will get from stepwise regression.
