There are several terms that we must define before discussing how principal component analysis works.
Recall that variance is a measure of how spread out a set of values is. Variance is calculated as the sum of the squared differences between the values and their mean, divided by n - 1 for a sample, as per the following equation:

s^2 = Σ(Xi - X̄)^2 / (n - 1)
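As a quick check, the sample-variance formula can be computed directly with NumPy (the values here are the first column of the data set used below; `np.var` with `ddof=1` applies the same n - 1 denominator):

```python
import numpy as np

# First column of the example data set below; any sample works here.
values = np.array([2, 2.2, 2.4, 1.9])
mean = values.mean()

# Sample variance: squared deviations from the mean, divided by n - 1.
variance = ((values - mean) ** 2).sum() / (len(values) - 1)

print(variance)                 # ~0.0492
print(np.var(values, ddof=1))   # NumPy's equivalent
```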
Covariance is a measure of how much two variables change together; it is a measure of the strength of the correlation between two sets of variables. If the covariance of two variables is zero, the variables are uncorrelated. Note that uncorrelated variables are not necessarily independent, as correlation is only a measure of linear dependence. The covariance of two variables is calculated using the following equation:

cov(X, Y) = Σ(Xi - X̄)(Yi - Ȳ) / (n - 1)
If the covariance is nonzero, the sign indicates whether the variables are positively or negatively correlated. When two variables are positively correlated, one increases as the other increases. When variables are negatively correlated, one variable decreases relative to its mean as the other variable increases relative to its mean. A covariance matrix describes the covariance values between each pair of dimensions in a data set. The element C(i, j) indicates the covariance of the ith and jth dimensions of the data. For example, the covariance matrix for a three-dimensional data set is given by the following matrix:

C = | cov(x1, x1)  cov(x1, x2)  cov(x1, x3) |
    | cov(x2, x1)  cov(x2, x2)  cov(x2, x3) |
    | cov(x3, x1)  cov(x3, x2)  cov(x3, x3) |
Let's calculate the covariance matrix for the following data set:

x1 | x2 | x3
---|---|---
2 | 0 | -1.4
2.2 | 0.2 | -1.5
2.4 | 0.1 | -1
1.9 | 0 | -1.2
The means of the variables are 2.125, 0.075, and -1.275. We can then calculate the covariances of each pair of variables to produce the following covariance matrix:

C = |  0.0492   0.0142   0.0192 |
    |  0.0142   0.0092  -0.0058 |
    |  0.0192  -0.0058   0.0492 |
We can verify our calculations using NumPy:
```
>>> import numpy as np
>>> X = [[2, 0, -1.4],
...      [2.2, 0.2, -1.5],
...      [2.4, 0.1, -1],
...      [1.9, 0, -1.2]]
>>> print(np.cov(np.array(X).T))
[[ 0.04916667  0.01416667  0.01916667]
 [ 0.01416667  0.00916667 -0.00583333]
 [ 0.01916667 -0.00583333  0.04916667]]
```
A vector is described by a direction and a magnitude, or length. An eigenvector of a matrix is a non-zero vector that satisfies the following equation:

A·v = λ·v
In the preceding equation, v is an eigenvector, A is a square matrix, and λ is a scalar called an eigenvalue. The direction of an eigenvector remains the same after it has been transformed by A; only its magnitude changes, as indicated by the eigenvalue. That is, multiplying a matrix by one of its eigenvectors is equal to scaling the eigenvector. The prefix eigen is the German word for belonging to or peculiar to; the eigenvectors of a matrix are the vectors that belong to and characterize the structure of the data.
Eigenvectors and eigenvalues can only be derived from square matrices, and not all square matrices have eigenvectors or eigenvalues. If a matrix does have eigenvectors and eigenvalues, it will have a pair for each of its dimensions. The principal components of a matrix are the eigenvectors of its covariance matrix, ordered by their corresponding eigenvalues. The eigenvector with the greatest eigenvalue is the first principal component; the second principal component is the eigenvector with the second greatest eigenvalue, and so on.
Let's calculate the eigenvectors and eigenvalues of the following matrix:

A = | 1  -2 |
    | 2  -3 |
Recall that the product of A and any eigenvector of A must be equal to the eigenvector multiplied by its eigenvalue. We will begin by finding the eigenvalues, which we can find using the following characteristic equation:

det(A - λ·I) = 0
The characteristic equation states that the determinant of the matrix, that is, the difference between the data matrix and the product of the identity matrix and an eigenvalue, is zero. Substituting A gives the following:

det | 1 - λ    -2    | = (1 - λ)(-3 - λ) + 4 = λ^2 + 2λ + 1 = (λ + 1)^2 = 0
    |   2    -3 - λ  |
Both of the eigenvalues for this matrix are equal to -1. We can now use the eigenvalues to solve for the eigenvectors.
First, we set the equation equal to zero:

(A - λ·I)·v = 0
Substituting our values for A produces the following:

( | 1  -2 | - λ·| 1  0 | ) · | v1 | = 0
  | 2  -3 |     | 0  1 |     | v2 |
We can then substitute the first eigenvalue into the preceding equation to solve for the eigenvectors. With λ = -1, the equation becomes the following:

| 2  -2 | · | v1 | = | 0 |
| 2  -2 |   | v2 |   | 0 |
The preceding equation can be rewritten as a system of equations:

2·v1 - 2·v2 = 0
2·v1 - 2·v2 = 0
Any non-zero vector that satisfies the preceding equations, such as the following, can be used as an eigenvector:

v = | 1 |
    | 1 |
PCA requires unit eigenvectors, or eigenvectors that have a length equal to 1. We can normalize an eigenvector by dividing it by its norm, which is given by the following equation:

||x|| = √(x1^2 + x2^2 + ... + xn^2)
The norm of our vector is equal to the following:

||v|| = √(1^2 + 1^2) = √2 ≈ 1.4142
This produces the following unit eigenvector:

v / ||v|| = | 1/√2 | ≈ | 0.7071 |
            | 1/√2 |   | 0.7071 |
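The same normalization can be done with NumPy's `np.linalg.norm` (a quick sketch; the variable names are illustrative):

```python
import numpy as np

v = np.array([1.0, 1.0])         # the eigenvector found above
unit_v = v / np.linalg.norm(v)   # divide by its Euclidean norm, sqrt(2)

print(unit_v)  # [0.70710678 0.70710678]
```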
We can verify that our solutions for the eigenvectors are correct using NumPy. The eig function returns a tuple of the eigenvalues and eigenvectors:
```
>>> import numpy as np
>>> w, v = np.linalg.eig(np.array([[1, -2], [2, -3]]))
>>> w; v
array([-0.99999998, -1.00000002])
array([[ 0.70710678,  0.70710678],
       [ 0.70710678,  0.70710678]])
```
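We can also check the defining equation directly: multiplying A by each of its eigenvectors should equal the eigenvector scaled by its eigenvalue. A sketch using the same matrix as above:

```python
import numpy as np

A = np.array([[1, -2], [2, -3]])
w, v = np.linalg.eig(A)

# Each column of v is an eigenvector; A @ v_i should equal w_i * v_i.
for i in range(len(w)):
    assert np.allclose(A @ v[:, i], w[i] * v[:, i])
print("A @ v == w * v for every eigenpair")
```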
Let's use principal component analysis to reduce the following two-dimensional data set to one dimension:
x1 | x2
---|---
0.9 | 1
2.4 | 2.6
1.2 | 1.7
0.5 | 0.7
0.3 | 0.7
1.8 | 1.4
0.5 | 0.6
0.3 | 0.6
2.5 | 2.6
1.3 | 1.1
The first step of PCA is to subtract the mean of each explanatory variable from each observation:
x1 | x2
---|---
0.9 - 1.17 = -0.27 | 1 - 1.3 = -0.3
2.4 - 1.17 = 1.23 | 2.6 - 1.3 = 1.3
1.2 - 1.17 = 0.03 | 1.7 - 1.3 = 0.4
0.5 - 1.17 = -0.67 | 0.7 - 1.3 = -0.6
0.3 - 1.17 = -0.87 | 0.7 - 1.3 = -0.6
1.8 - 1.17 = 0.63 | 1.4 - 1.3 = 0.1
0.5 - 1.17 = -0.67 | 0.6 - 1.3 = -0.7
0.3 - 1.17 = -0.87 | 0.6 - 1.3 = -0.7
2.5 - 1.17 = 1.33 | 2.6 - 1.3 = 1.3
1.3 - 1.17 = 0.13 | 1.1 - 1.3 = -0.2
Next, we must calculate the principal components of the data. Recall that the principal components are the eigenvectors of the data's covariance matrix ordered by their eigenvalues. The principal components can be found using two different techniques. The first technique requires calculating the covariance matrix of the data. Since the covariance matrix will be square, we can calculate the eigenvectors and eigenvalues using the approach described in the previous section. The second technique uses singular value decomposition of the data matrix to find the eigenvectors and square roots of the eigenvalues of the covariance matrix. We will work through an example using the first technique, and then describe the second technique that is used by scikit-learn's implementation of PCA.
The following matrix is the covariance matrix for the data:

C = | 0.6868  0.6067 |
    | 0.6067  0.5978 |
Using the technique described in the previous section, the eigenvalues are 1.250 and 0.034. The following are the unit eigenvectors:

v1 = | 0.7325 |    v2 = | -0.6808 |
     | 0.6808 |         |  0.7325 |
Next, we will project the data onto the principal components. The first eigenvector has the greatest eigenvalue and is the first principal component. We will build a transformation matrix in which each column of the matrix is the eigenvector for a principal component. If we were reducing a five-dimensional data set to three dimensions, we would build a matrix with three columns. In this example, we will project our two-dimensional data set onto one dimension, so we will use only the eigenvector for the first principal component. Finally, we will find the dot product of the data matrix and transformation matrix. The following is the result of projecting our data onto the first principal component:

[-0.40, 1.79, 0.29, -0.90, -1.05, 0.53, -0.97, -1.11, 1.86, -0.04]
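The covariance-matrix technique can be sketched end to end with NumPy. Note that `np.linalg.eigh` is used here because the covariance matrix is symmetric, and the eigenvector signs may be flipped relative to the hand calculation; this is a sketch, not scikit-learn's implementation:

```python
import numpy as np

X = np.array([[0.9, 1], [2.4, 2.6], [1.2, 1.7], [0.5, 0.7], [0.3, 0.7],
              [1.8, 1.4], [0.5, 0.6], [0.3, 0.6], [2.5, 2.6], [1.3, 1.1]])

# Step 1: subtract each column's mean.
X_centered = X - X.mean(axis=0)

# Step 2: covariance matrix of the centered data.
C = np.cov(X_centered.T)

# Step 3: eigendecomposition, sorted by descending eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 4: project onto the first principal component.
projected = X_centered @ eigenvectors[:, :1]

print(eigenvalues)        # approximately [1.250, 0.034]
print(projected.ravel())  # the one-dimensional representation
```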
Many implementations of PCA, including the one in scikit-learn, use singular value decomposition to calculate the eigenvectors and eigenvalues. SVD is given by the following equation:

X = U·Σ·Vᵀ
The columns of U are called the left singular vectors of the data matrix, the columns of V are its right singular vectors, and the diagonal entries of Σ are its singular values. While the singular vectors and values of a matrix are useful in some applications of signal processing and statistics, we are only interested in them as they relate to the eigenvectors and eigenvalues of the covariance matrix. Specifically, the left singular vectors of the mean-centered data matrix (with variables as rows) are the eigenvectors of the covariance matrix, and the diagonal elements of Σ are the square roots of the eigenvalues of the covariance matrix, up to a factor of √(n - 1). Calculating the SVD is beyond the scope of this chapter; however, eigenvectors found using SVD should be similar to those derived from a covariance matrix.
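The SVD route can be sketched as follows. The centered data matrix is transposed so that variables are rows, matching the statement above that the left singular vectors are the eigenvectors of the covariance matrix; again, this is a sketch, not scikit-learn's implementation:

```python
import numpy as np

X = np.array([[0.9, 1], [2.4, 2.6], [1.2, 1.7], [0.5, 0.7], [0.3, 0.7],
              [1.8, 1.4], [0.5, 0.6], [0.3, 0.6], [2.5, 2.6], [1.3, 1.1]])
X_centered = X - X.mean(axis=0)
n = X.shape[0]

# SVD of the transposed (variables-as-rows) centered data matrix:
# the columns of U are the eigenvectors of the covariance matrix.
U, s, Vt = np.linalg.svd(X_centered.T, full_matrices=False)

# The singular values relate to the eigenvalues of the covariance
# matrix by eigenvalue = s**2 / (n - 1).
eigenvalues = s ** 2 / (n - 1)
print(eigenvalues)  # approximately [1.250, 0.034]

# Compare with the direct eigendecomposition of the covariance matrix.
w, v = np.linalg.eigh(np.cov(X.T))
print(np.allclose(sorted(w), sorted(eigenvalues)))  # True
```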