Chapter 9. Principal Components Analysis

 

"Some people skate to the puck. I skate to where the puck is going to be."

 
 --Wayne Gretzky

This chapter is the second in which we focus on unsupervised learning techniques. In the previous chapter, we covered cluster analysis, which provides us with groupings of similar observations. In this chapter, we will see how to reduce the dimensionality of our data and improve our understanding of it by grouping the correlated variables with Principal Components Analysis (PCA). Then, we will use the principal components in supervised learning.

In many datasets, particularly in the social sciences, many of the variables are highly correlated with each other. Such datasets may also suffer from high dimensionality or, as it is known, the curse of dimensionality. This is a problem because the number of samples needed to estimate a function grows exponentially with the number of input features. In these datasets, it may be the case that some variables are redundant, as they end up measuring the same constructs, for example, income and poverty or depression and anxiety. The goal, then, is to use PCA to create a smaller set of variables that captures most of the information from the original set of variables, thus simplifying the dataset and often surfacing hidden insights. These new variables (the principal components) are uncorrelated with each other. In addition to supervised learning, it is also very common to use these components for data visualization.

From over a decade of either doing or supporting analytics that use PCA, my experience has been that it is widely used but poorly understood, especially among people who don't do the analysis but consume the results. It is intuitive to understand that you are creating a new variable from other, correlated variables. However, the technique itself is shrouded in potentially misunderstood terminology and mathematical concepts that often bewilder the layperson. The intention here is to provide a good foundation on what it is and how to use it.

An overview of the principal components

PCA is the process of finding the principal components. What exactly are these? A component can be thought of as a normalized linear combination of the features (James, 2012). The first principal component in a dataset is the linear combination that captures the maximum variance in the data. The second component is created by selecting another linear combination that maximizes the variance, with the constraint that its direction is perpendicular to the first component. The subsequent components (there will be as many as there are variables) follow this same rule.

A couple of things here. This definition describes the linear combination, which is one of the key assumptions in PCA. If you ever try to apply PCA to a dataset of variables having low correlation, you will likely end up with a meaningless analysis. Another key assumption is that the mean and variance of a variable are sufficient statistics. What this tells us is that the data should fit a normal distribution so that the covariance matrix fully describes our dataset, that is, multivariate normality. That said, PCA is fairly robust to non-normally distributed data and is even used in conjunction with binary variables, so the results are still interpretable.

Now, what is this direction described here and how is the linear combination determined? The best way to grasp this subject is with a visualization. Let's take a small dataset with two variables and plot it. PCA is sensitive to scale, so the data has been scaled with a mean of zero and standard deviation of one. You can see in the following figure that this data happens to form the shape of an oval with the diamonds representing each observation:

[Figure: scatterplot of the two scaled variables, with diamonds representing the observations]
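
As a brief aside before we interpret the plot, this kind of scaling is a one-liner in R with the scale() function. A minimal sketch with made-up numbers (not the plotted data):

> z = as.numeric(scale(c(10, 20, 30)))   # center and scale a made-up vector
> mean(z)   # 0
> sd(z)     # 1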

Looking at the plot, the data has the most variance along the x axis, so we can draw a dashed horizontal line to represent our first principal component, as shown in the following image. This component is the linear combination of our two variables, PC1 = α11X1 + α12X2, where the coefficient weights are the variable loadings on the principal component. They form the basis of the direction along which the data varies the most. The loadings are constrained so that the sum of their squared values equals one, in order to prevent the selection of arbitrarily high values. Another way to look at this is that the dashed line minimizes the total squared distance between itself and the data points. This distance is shown for a couple of points as arrows, as follows:

[Figure: the first principal component drawn as a dashed line, with arrows showing the distances from a couple of points]

The second principal component is then calculated in the same way, but it is uncorrelated with the first, that is, its direction is at a right angle or orthogonal to the first principal component. The following plot shows the second principal component added as a dotted line:

[Figure: the second principal component added as a dotted line, orthogonal to the first]

With the principal component loadings calculated for each variable, the algorithm will then provide us with the principal component scores. The scores are calculated for each observation on each principal component. For PC1 and the first observation, this equates to the formula Z11 = α11 * (X11 – average of X1) + α12 * (X12 – average of X2). For PC2 and the first observation, the equation is Z12 = α21 * (X11 – average of X1) + α22 * (X12 – average of X2). These principal component scores become the new feature space to be used in whatever analysis you undertake.
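
To make this concrete, here is a minimal sketch in base R using two made-up, correlated variables rather than any real data. prcomp() returns both the loadings and the scores, and the scores can be reproduced by hand from the scaled data, just as in the formulas above:

> set.seed(1)
> x1 = rnorm(30)                         # a made-up variable
> x2 = x1 + rnorm(30, sd = 0.5)          # correlated with x1 by construction
> toy = scale(cbind(X1 = x1, X2 = x2))   # mean of zero, standard deviation of one
> pc = prcomp(toy)
> pc$rotation                  # the loadings, that is, the alpha weights
> head(pc$x)                   # the principal component scores
> head(toy %*% pc$rotation)    # the same scores computed by hand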

Recall that the algorithm will create as many principal components as there are variables, accounting for 100 percent of the possible variance. So, how do we narrow down the components to achieve the original objective? There are a number of heuristics that one can use, and we will look at the specifics in the upcoming modeling process, but a common method is to retain a principal component if its eigenvalue is greater than one. While the algebra behind the estimation of eigenvalues and eigenvectors is outside the scope of this book, it is important to discuss what they are and how they are used in PCA.

Recall that the equation for the first principal component is PC1 = α11X1 + α12X2. The optimized linear weights are determined using linear algebra in order to create what is referred to as an eigenvector. They are optimal because no other possible combination of weights could explain the variation better. The eigenvalue for a principal component is then the total amount of variation that it explains in the entire dataset. As the first principal component accounts for the largest amount of variation, it will have the largest eigenvalue; the second component will have the second largest eigenvalue, and so forth. So, an eigenvalue greater than one indicates that the principal component accounts for more variance than any single original (standardized) variable does by itself. If you divide each eigenvalue by the sum of all the eigenvalues, you will have the percentage of the total variance that each component explains, which will also aid you in determining a proper cut-off point.
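
Continuing the toy sketch from above, the eigenvalues and the proportion of variance that each component explains can be read straight from the correlation matrix with base R:

> ev = eigen(cor(toy))$values   # eigenvalues of the toy correlation matrix
> ev / sum(ev)                  # proportion of the total variance per component
> ev > 1                        # the eigenvalue-greater-than-one heuristic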

The eigenvalue criterion is certainly not a hard-and-fast rule and must be balanced with your knowledge of the data and business problem at hand. Once you have selected the number of principal components, you can rotate them in order to simplify their interpretation.

Rotation

Should you rotate or not? As stated previously, rotation helps in the interpretation of the principal components by modifying the loadings of each variable. The overall variation explained by the retained components will not change, but the contribution of each component to the total variance explained will. What you will find with rotation is that the loading values move either closer to or farther from zero, which theoretically aids in identifying the variables that are important to each principal component. The aim is to associate each variable with primarily one principal component. Remember that this is unsupervised learning, so you are trying to understand your data, not test some hypothesis. In short, rotation aids you in this endeavor.

The most common form of principal component rotation is known as varimax. There are other forms such as quartimax and equimax, but we will focus on varimax rotation. In my experience, I've never seen the other methods provide better solutions. Trial and error on your part may be the best way to decide the issue.

With varimax, we are maximizing the sum of the variances of the squared loadings. The varimax procedure rotates the axes of the feature space, and with them the coordinates, without changing the locations of the data points. Perhaps the best way to demonstrate this is via another simple illustration. Let's assume that we have a dataset of variables A through G and two principal components. Plotting this data, we will end up with the following illustration:

[Figure: variables A through G plotted on the unrotated PC1 and PC2 axes]

For the sake of argument, let's say that variable A's loadings are -0.4 on PC1 and 0.1 on PC2. Now, let's say that variable D's loadings are 0.4 on PC1 and -0.3 on PC2. For variable E, the loadings are -0.05 and -0.7, respectively. Note that the loadings follow the direction of the principal component. After running a varimax procedure, the rotated components will look as follows:

[Figure: the same variables shown with the varimax-rotated PC1 and PC2 axes]

The following are the new loadings on PC1 and PC2 after rotation:

  • Variable A: -0.5 and 0.02
  • Variable D: 0.5 and -0.3
  • Variable E: 0.15 and -0.75

The loadings have changed but the data points have not. With this simple illustration, we can't say that we have simplified the interpretation, but this should help you understand what is happening during the rotation of the principal components.
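
If you would like to see the mechanics for yourself, the varimax() function in base R will rotate any loadings matrix. Here is a sketch using the illustrative loadings quoted above; note that those values were rounded for the sake of the illustration, so the function's output will not reproduce them exactly:

> L = matrix(c(-0.4, 0.4, -0.05, 0.1, -0.3, -0.7), ncol = 2)   # the loadings above
> dimnames(L) = list(c("A", "D", "E"), c("PC1", "PC2"))
> varimax(L)   # returns the rotated loadings and the rotation matrix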

Business understanding

For this example, we will delve into the world of sports; in particular, the National Hockey League (NHL). Much analytical work has been done on baseball (think of the book and movie Moneyball) and on football, both the American game and the one the rest of the world plays with its feet. For my money, there is no better spectator sport than hockey. Perhaps that is an artifact of growing up on the frozen prairie of North Dakota. Nonetheless, we can consider this analysis our effort to start a Moneypuck movement.

In this analysis, we will look at the statistics for the 30 NHL teams, originally drawn from a table at www.nhl.com. The goal is to build a model that predicts a team's total points from an input feature space developed using PCA, in order to provide us with some insight into what it takes to be a top professional team. There will be 14 variables included in the PCA, and we can reasonably expect to end up with fewer than half a dozen principal components.

It is important to understand how the NHL awards points to the teams. Unlike football or baseball, where only wins and losses count, professional hockey uses the following point system for each game:

  • The winner gets two points whether that is in regulation, overtime, or as a result of the post-overtime shootout
  • A regulation loser receives no points
  • An overtime or shootout loser receives one point; the so-called loser point

The NHL started this point system in 2005 and it is not without controversy, but it hasn't detracted from the game's elegant and graceful violence.
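
In terms of the variables that we will create shortly, this system implies that a team's total points should equal two times its wins plus its overtime/shootout losses. Once the data is loaded and the columns renamed in the next section, a quick sanity check along these lines is possible (a sketch, assuming those columns are read in as numbers):

> # run after the nhl data frame is built and renamed below
> all(nhl$pts == 2 * nhl$wins + nhl$OTL)   # TRUE if the scoring rule holds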

Data understanding and preparation

To begin with, we will load the necessary packages in order to download the data and conduct the analysis. Please ensure that you have these packages installed prior to loading:

> library(corrplot) #correlation plot

> library(FactoMineR) #additional PCA analysis

> library(ggplot2) #support scatterplot

> library(GPArotation) #supports rotation

> library(psych) #PCA package
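
If any of these packages are missing, a one-time install will take care of it; the command is shown commented out here as a reminder:

> # install.packages(c("corrplot", "FactoMineR", "ggplot2", "GPArotation", "psych"))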

The data is available online as a comma-delimited file. My original intention was to show how to use R to scrape a table from a website, in this case, nhl.com. However, given how often websites change, the code to do this may quickly become useless. Therefore, I've compiled the data and stored it on textuploader.com. The first thing to do is create an object containing the URL, as follows:

> url="http://textuploader.com/ae6t4/raw"

Now, just create the dataset, calling it nhl using the read.csv() function with as.data.frame() as a wrapper. Also, as no headings exist in the data, specify header = FALSE:

> nhl = as.data.frame(read.csv(url, header=FALSE))

To verify that we have the correct data, let's use the data structure function, str(). For brevity, I've included only the first few lines of the output of the command:

> str(nhl)
'data.frame': 30 obs. of  25 variables:
 $ V1 : int  1 2 3 4 5 6 7 8 9 10 ...
 $ V2 : Factor w/ 30 levels "ANAHEIM","ARIZONA",..: 20 25 1 16 26 7 28 17 19 15 ...
 $ V3 : int  82 82 82 82 82 82 82 82 82 82 ...
 $ V4 : int  53 51 51 50 50 48 48 47 47 46 ...

The next thing that we will need to do is look at the variable names. Downloading an HTML table can produce nonsensical names like the following, so we will need to specify our own:

> names(nhl)
[1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11" "V12"
[13] "V13" "V14" "V15" "V16" "V17" "V18" "V19" "V20" "V21" "V22" "V23" "V24"
[25] "V25"

To change all the variable names, we will use the following names() function, putting the new names in the c() function, which stands for combine:

> names(nhl) = c("rank","team","played","wins","losses","OTL","pts","ROW","HROW","RROW","ppc","gg","gag","five","PPP","PKP","shots","sag","sc1","tr1","lead1","lead2","wop","wosp","face")

With the variables renamed, let's go over what they mean:

  • rank: This is the rank of a team's total points
  • team: This is the team's city
  • played: This is the games that are played
  • wins: This is the number of total wins
  • losses: This is the number of total losses in regulation
  • OTL: This is the number of total losses in overtime
  • pts: This is the number of total points
  • ROW: This is the number of regulation plus overtime wins
  • HROW: This is the number of home regulation plus overtime wins
  • RROW: This is the number of road regulation plus overtime wins
  • ppc: This is the percentage of possible points earned in the games played (the points percentage)
  • gg: This is the number of goals per game
  • gag: This is the number of goals allowed per game
  • five: This is the ratio of goals scored to goals allowed when both teams are at full strength (5-on-5)
  • PPP: This is the power play percentage, that is, the percentage of power play opportunities on which the team scored a goal
  • PKP: This is the penalty kill percentage, that is, the percentage of the opponent's power plays on which the team did not allow a goal
  • shots: This is the number of shots on the goal per game
  • sag: This is the number of shots on the goal allowed per game
  • sc1: This is the winning percentage when scoring first
  • tr1: This is the winning percentage when trailing first
  • lead1: This is the winning percentage when leading after the first period
  • lead2: This is the winning percentage when leading after the second period
  • wop: This is the winning percentage when outshooting an opponent
  • wosp: This is the winning percentage when outshot by an opponent
  • face: This is the percentage of the faceoffs won

One of the things that you can do with data in R is sort it by one or more variables. Here, as an example, we will identify the team with the fewest goals per game and the team with the most goals per game. If the data were numeric, we could use the max() and min() functions; as the variables are currently characters, we will instead use the order() function to sort by our variable of interest and then call the appropriate row and column to identify the teams. The order() function defaults to ascending order; if your data is numeric, you can also sort in descending order by putting a minus sign in front of the variable. Here is the code to sort the data and identify our teams of interest:

> nhl=nhl[order(nhl$gg),]

> nhl[1,2]
[1] BUFFALO

> nhl[30,2]
[1] TAMPA BAY

Note that once the data is sorted, identifying a team is as simple as specifying the appropriate row (1 for the worst and 30 for the best) and column two for the team name. As you can see, BUFFALO is the worst and TAMPA BAY is the best for average goals per game.
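
As a quick aside on the minus-sign trick mentioned above, here is a tiny sketch with a made-up numeric vector:

> v = c(3.1, 1.8, 2.6)            # made-up numbers
> order(v)                        # ascending: 2 3 1
> order(-v)                       # descending via the minus sign: 1 3 2
> order(v, decreasing = TRUE)     # the same result, spelled out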

As we are concerned with creating the principal components in order to predict a team's points, we do not need to include a few of the variables in the data. In fact, let's focus only on variables 12 through 25 and create a new data frame called pca.df, dropping columns 1 through 11, as follows:

> pca.df = nhl[,c(-1:-11)]

The next order of business is to convert the data frame's values to numeric; recall that all the variables are currently characters. For this task, we will break out the lapply() function, which applies another function over a list or vector. What we want to apply over the data frame in order to convert the values to numeric is as.numeric(). Additionally, when we are done, we want pca.df to still be a data frame, so we will put lapply() and as.numeric() inside as.data.frame().

This is all easy for me to say, but it is actually rather elegant and powerful to transform the values in a data frame:

> pca.df = as.data.frame(lapply(pca.df, as.numeric))
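
One caution worth a short aside: if a column had been read in as a factor rather than as a character, as.numeric() would return the underlying level codes rather than the printed values. The safer idiom in that situation is as.numeric(as.character()):

> f = factor(c("2.15", "1.87"))   # a made-up factor of number-like strings
> as.numeric(f)                   # 2 1 -- the level codes, not the values
> as.numeric(as.character(f))     # 2.15 1.87 -- the safer conversion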

Anytime that you create a subset and/or make a major transformation, it is probably a good idea to check the structure after this transformation. In our case, we want to make sure that the values are indeed numeric, as follows:

> str(pca.df)
'data.frame':   30 obs. of  14 variables:
 $ gg   : num  1.87 2.01 2.15 2.23 2.35 2.42 2.51 2.55 2.55 2.58 ...
 $ gag  : num  3.28 3.26 2.55 2.67 3.37 2.6 3.13 2.45 2.72 2.72 ...
 $ five : num  0.61 0.62 0.93 0.76 0.68 1.03 0.8 1.04 0.99 1.01 ...
 $ PPP  : num  13.4 20 19.3 18.8 17.7 16.3 15.9 17.8 15 23.4 ...
 $ PKP  : num  75.1 76.7 80.6 84.7 76.7 80 80.4 82 84.6 77.1 ...
 $ shots: num  24.2 29.2 24.5 30.8 28.4 30.7 29.2 31 27.9 29.4 ...
 $ sag  : num  35.6 33.2 30.7 27.3 30 29.6 33.5 29.9 33.2 30.3 ...
 $ sc1  : num  0.528 0.424 0.533 0.611 0.471 0.697 0.677 0.674 0.676 0.688 ...
 $ tr1  : num  0.087 0.204 0.216 0.174 0.167 0.306 0.176 0.308 0.311 0.22 ...
 $ lead1: num  0.563 0.522 0.593 0.708 0.435 0.789 0.762 0.742 0.696 0.731 ...
 $ lead2: num  0.647 0.824 0.781 0.759 0.63 0.75 0.826 0.871 0.852 0.864 ...
 $ wop  : num  0.5 0.385 0.235 0.423 0.333 0.386 0.375 0.512 0.6 0.475 ...
 $ wosp : num  0.246 0.259 0.438 0.25 0.273 0.541 0.382 0.471 0.418 0.341 ...
 $ face : num  44.9 51.8 47.3 53 48.2 48.6 49 53.6 50.8 51.1 ...

There you have it! Now, let's do what we did in the prior chapters and build a correlation matrix and plot it using a function from the corrplot package. First, create an object using the cor() function, which builds the correlation matrix, as follows:

> nhl.cor = cor(pca.df)

You can call nhl.cor if you want, but I am very partial to plotting this data. We will do a matrix plot of the correlation values. In the syntax of the corrplot() function we will specify the method as ellipse:

> corrplot(nhl.cor, method="ellipse")

The following is the output of the preceding command:

[Figure: correlation matrix plot of the 14 variables, drawn with the ellipse method]

The first thing that jumps out, and actually takes me by surprise, is that the faceoff percentage and the power play percentage don't show a strong correlation with anything; they show only a small correlation with each other. Another thing of interest is how strongly and negatively correlated the 5-on-5 ratio and goals allowed per game are, although that relationship is not quite as strong as the positive correlation between the 5-on-5 ratio and goals scored per game. If you follow hockey, you may find other interesting correlations here, but one thing is clear: many of the variables are highly correlated. As such, this should be a good dataset from which to extract several principal components. If there were few, if any, correlations in this or any other dataset, then almost any effort to extract the components would be futile.
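
If you want the numbers behind a couple of the relationships called out above, you can index the correlation matrix directly by the variable names:

> nhl.cor["five", "gag"]    # 5-on-5 ratio versus goals allowed per game
> nhl.cor["five", "gg"]     # 5-on-5 ratio versus goals scored per game
> nhl.cor["face", "PPP"]    # faceoff percentage versus power play percentage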
