Performing Principal Component Analysis

There are several terms that we must define before discussing how principal component analysis works.

Variance, Covariance, and Covariance Matrices

Recall that variance is a measure of how spread out a set of values is. Variance is calculated as the sum of the squared differences between the values and their mean, divided by the number of values minus one, as per the following equation:

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
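As a minimal sketch of this calculation (using the first variable of the data set introduced later in this section), the equation can be evaluated directly and compared with NumPy's var function; passing ddof=1 makes NumPy divide by n - 1 as in the preceding equation:

>>> import numpy as np
>>> x = np.array([2, 2.2, 2.4, 1.9])
>>> np.sum((x - x.mean()) ** 2) / (len(x) - 1)   # variance from the equation, approximately 0.0492
>>> np.var(x, ddof=1)                            # NumPy's sample variance, the same value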

Covariance is a measure of how much two variables change together; it is a measure of the strength of the correlation between two sets of variables. If the covariance of two variables is zero, the variables are uncorrelated. Note that uncorrelated variables are not necessarily independent, as correlation is only a measure of linear dependence. The covariance of two variables is calculated using the following equation:

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
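As a sketch of this formula (using the first two variables of the data set introduced later in this section), the covariance can be computed directly and compared with NumPy's cov function, whose off-diagonal entries hold the pairwise covariances:

>>> import numpy as np
>>> x = np.array([2, 2.2, 2.4, 1.9])
>>> y = np.array([0, 0.2, 0.1, 0])
>>> np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)   # approximately 0.0142
>>> np.cov(x, y)[0, 1]                                       # the same value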

If the covariance is nonzero, the sign indicates whether the variables are positively or negatively correlated. When two variables are positively correlated, one increases as the other increases. When variables are negatively correlated, one variable decreases relative to its mean as the other variable increases relative to its mean. A covariance matrix describes the covariance values between each pair of dimensions in a data set. The element $C_{i,j}$ indicates the covariance of the $i$th and $j$th dimensions of the data. For example, the covariance matrix for a three-dimensional data set is given by the following matrix:

$$C = \begin{pmatrix} \mathrm{cov}(x_1, x_1) & \mathrm{cov}(x_1, x_2) & \mathrm{cov}(x_1, x_3) \\ \mathrm{cov}(x_2, x_1) & \mathrm{cov}(x_2, x_2) & \mathrm{cov}(x_2, x_3) \\ \mathrm{cov}(x_3, x_1) & \mathrm{cov}(x_3, x_2) & \mathrm{cov}(x_3, x_3) \end{pmatrix}$$

Let's calculate the covariance matrix for the following data set:

2       0       -1.4
2.2     0.2     -1.5
2.4     0.1     -1
1.9     0       -1.2

The means of the variables are 2.125, 0.075, and -1.275. We can then calculate the covariances of each pair of variables to produce the following covariance matrix:

$$C = \begin{pmatrix} 0.04916667 & 0.01416667 & 0.01916667 \\ 0.01416667 & 0.00916667 & -0.00583333 \\ 0.01916667 & -0.00583333 & 0.04916667 \end{pmatrix}$$

We can verify our calculations using NumPy:

>>> import numpy as np
>>> X = [[2, 0, -1.4],
...      [2.2, 0.2, -1.5],
...      [2.4, 0.1, -1],
...      [1.9, 0, -1.2]]
>>> print(np.cov(np.array(X).T))
[[ 0.04916667  0.01416667  0.01916667]
 [ 0.01416667  0.00916667 -0.00583333]
 [ 0.01916667 -0.00583333  0.04916667]]

Eigenvectors and eigenvalues

A vector is described by a direction and magnitude, or length. An eigenvector of a matrix is a non-zero vector that satisfies the following equation:

$$A\vec{\nu} = \lambda\vec{\nu}$$

In the preceding equation, $\vec{\nu}$ is an eigenvector, $A$ is a square matrix, and $\lambda$ is a scalar called an eigenvalue. The direction of an eigenvector remains the same after it has been transformed by $A$; only its magnitude changes, as indicated by the eigenvalue; that is, multiplying a matrix by one of its eigenvectors is equivalent to scaling the eigenvector. The prefix eigen is German for belonging to or peculiar to; the eigenvectors of a matrix are the vectors that belong to and characterize the structure of the data.

Eigenvectors and eigenvalues can only be derived from square matrices, and not all square matrices have real eigenvectors or eigenvalues. If a matrix does have eigenvectors and eigenvalues, it will have a pair for each of its dimensions. The principal components of a matrix are the eigenvectors of its covariance matrix, ordered by their corresponding eigenvalues. The eigenvector with the greatest eigenvalue is the first principal component; the second principal component is the eigenvector with the second greatest eigenvalue, and so on.
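For instance, assuming a data matrix with one observation per row (here the small three-variable data set from the previous example), a minimal NumPy sketch of this ordering is:

>>> import numpy as np
>>> X = np.array([[2, 0, -1.4],
...               [2.2, 0.2, -1.5],
...               [2.4, 0.1, -1],
...               [1.9, 0, -1.2]])
>>> values, vectors = np.linalg.eig(np.cov(X.T))   # eigendecomposition of the covariance matrix
>>> order = np.argsort(values)[::-1]               # indices that sort the eigenvalues, largest first
>>> principal_components = vectors[:, order]       # columns ordered from first to last principal component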

Let's calculate the eigenvectors and eigenvalues of the following matrix:

$$A = \begin{pmatrix} 1 & -2 \\ 2 & -3 \end{pmatrix}$$

Recall that the product of A and any eigenvector of A must equal the eigenvector multiplied by its eigenvalue. We will begin by finding the eigenvalues using the following characteristic equations:

$$(A - \lambda I)\vec{\nu} = 0$$
$$\left| A - \lambda I \right| = 0$$

The characteristic equation states that the determinant of the matrix that results from subtracting the product of an eigenvalue and the identity matrix from the data matrix is zero:

$$\left| \begin{pmatrix} 1 & -2 \\ 2 & -3 \end{pmatrix} - \lambda \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right| = \begin{vmatrix} 1 - \lambda & -2 \\ 2 & -3 - \lambda \end{vmatrix} = (1 - \lambda)(-3 - \lambda) + 4 = \lambda^2 + 2\lambda + 1 = 0$$
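As a quick numerical check, the roots of the characteristic polynomial $\lambda^2 + 2\lambda + 1$ can be found with NumPy's roots function, which takes the polynomial's coefficients:

>>> import numpy as np
>>> np.roots([1, 2, 1])   # both roots are -1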

Both of the eigenvalues for this matrix are equal to -1. We can now use the eigenvalues to solve for the eigenvectors:

$$A\vec{\nu} = \lambda\vec{\nu}$$

First, we set the equation equal to zero:

$$(A - \lambda I)\vec{\nu} = 0$$

Substituting our values for A produces the following:

$$\left( \begin{pmatrix} 1 & -2 \\ 2 & -3 \end{pmatrix} - \lambda \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right) \begin{pmatrix} \nu_1 \\ \nu_2 \end{pmatrix} = 0$$

We can then substitute our first eigenvalue into the preceding equation to solve for the eigenvectors.

$$\left( \begin{pmatrix} 1 & -2 \\ 2 & -3 \end{pmatrix} - (-1) \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right) \begin{pmatrix} \nu_1 \\ \nu_2 \end{pmatrix} = \begin{pmatrix} 2 & -2 \\ 2 & -2 \end{pmatrix} \begin{pmatrix} \nu_1 \\ \nu_2 \end{pmatrix} = 0$$

The preceding equation can be rewritten as a system of equations:

$$\begin{cases} 2\nu_1 - 2\nu_2 = 0 \\ 2\nu_1 - 2\nu_2 = 0 \end{cases}$$

Any non-zero vector that satisfies the preceding equations, such as the following, can be used as an eigenvector:

$$\vec{\nu} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$$

PCA requires unit eigenvectors, or eigenvectors that have a length equal to 1. We can normalize an eigenvector by dividing it by its norm, which is given by the following equation:

$$\|\vec{x}\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$

The norm of our vector is equal to the following:

$$\left\| \begin{pmatrix} 1 \\ 1 \end{pmatrix} \right\| = \sqrt{1^2 + 1^2} = \sqrt{2} \approx 1.4142$$

This produces the following unit eigenvector:

$$\begin{pmatrix} 0.70710678 \\ 0.70710678 \end{pmatrix}$$
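A brief NumPy sketch confirms that (1, 1) is an eigenvector of A with eigenvalue -1 and normalizes it using linalg.norm:

>>> import numpy as np
>>> A = np.array([[1, -2], [2, -3]])
>>> v = np.array([1, 1])
>>> A.dot(v)               # equals -1 * v, so v is an eigenvector with eigenvalue -1
array([-1, -1])
>>> v / np.linalg.norm(v)  # the corresponding unit eigenvector
array([ 0.70710678,  0.70710678])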

We can verify that our solutions for the eigenvectors are correct using NumPy. The eig function returns a tuple of the eigenvalues and eigenvectors:

>>> import numpy as np
>>> w, v = np.linalg.eig(np.array([[1, -2], [2, -3]]))
>>> w; v
array([-0.99999998, -1.00000002])
array([[ 0.70710678,  0.70710678],
       [ 0.70710678,  0.70710678]])

Dimensionality reduction with Principal Component Analysis

Let's use principal component analysis to reduce the following two-dimensional data set to one dimension:

x1      x2
0.9     1
2.4     2.6
1.2     1.7
0.5     0.7
0.3     0.7
1.8     1.4
0.5     0.6
0.3     0.6
2.5     2.6
1.3     1.1

The first step of PCA is to subtract the mean of each explanatory variable from each observation:

x1                      x2
0.9 - 1.17 = -0.27      1 - 1.3 = -0.3
2.4 - 1.17 = 1.23       2.6 - 1.3 = 1.3
1.2 - 1.17 = 0.03       1.7 - 1.3 = 0.4
0.5 - 1.17 = -0.67      0.7 - 1.3 = -0.6
0.3 - 1.17 = -0.87      0.7 - 1.3 = -0.6
1.8 - 1.17 = 0.63       1.4 - 1.3 = 0.1
0.5 - 1.17 = -0.67      0.6 - 1.3 = -0.7
0.3 - 1.17 = -0.87      0.6 - 1.3 = -0.7
2.5 - 1.17 = 1.33       2.6 - 1.3 = 1.3
1.3 - 1.17 = 0.13       1.1 - 1.3 = -0.2
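A minimal NumPy sketch of this centering step (the column means are 1.17 and 1.3):

>>> import numpy as np
>>> X = np.array([[0.9, 1], [2.4, 2.6], [1.2, 1.7], [0.5, 0.7], [0.3, 0.7],
...               [1.8, 1.4], [0.5, 0.6], [0.3, 0.6], [2.5, 2.6], [1.3, 1.1]])
>>> X_centered = X - X.mean(axis=0)   # subtract each column's mean from its observations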

Next, we must calculate the principal components of the data. Recall that the principal components are the eigenvectors of the data's covariance matrix ordered by their eigenvalues. The principal components can be found using two different techniques. The first technique requires calculating the covariance matrix of the data. Since the covariance matrix will be square, we can calculate the eigenvectors and eigenvalues using the approach described in the previous section. The second technique uses singular value decomposition of the data matrix to find the eigenvectors and square roots of the eigenvalues of the covariance matrix. We will work through an example using the first technique, and then describe the second technique that is used by scikit-learn's implementation of PCA.

The following matrix is the covariance matrix for the data:

$$C = \begin{pmatrix} 0.68677778 & 0.60666667 \\ 0.60666667 & 0.59777778 \end{pmatrix}$$

Using the technique described in the previous section, the eigenvalues are 1.250 and 0.034. The following are the unit eigenvectors:

$$\begin{pmatrix} 0.73251454 \\ 0.68075138 \end{pmatrix}, \quad \begin{pmatrix} -0.68075138 \\ 0.73251454 \end{pmatrix}$$

Next, we will project the data onto the principal components. The first eigenvector has the greatest eigenvalue and is the first principal component. We will build a transformation matrix in which each column of the matrix is the eigenvector for a principal component. If we were reducing a five-dimensional data set to three dimensions, we would build a matrix with three columns. In this example, we will project our two-dimensional data set onto one dimension, so we will use only the eigenvector for the first principal component. Finally, we will find the dot product of the data matrix and transformation matrix. The following is the result of projecting our data onto the first principal component:

$$\begin{pmatrix} -0.40200 \\ 1.78597 \\ 0.29428 \\ -0.89924 \\ -1.04574 \\ 0.52956 \\ -0.96731 \\ -1.11381 \\ 1.85922 \\ -0.04092 \end{pmatrix}$$
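We can verify these calculations with NumPy. The following sketch centers the data, finds the eigenvalues and eigenvectors of the covariance matrix, and projects the data onto the first principal component; note that the solver may flip the sign of an eigenvector, which would also flip the signs of the projections:

>>> import numpy as np
>>> X = np.array([[0.9, 1], [2.4, 2.6], [1.2, 1.7], [0.5, 0.7], [0.3, 0.7],
...               [1.8, 1.4], [0.5, 0.6], [0.3, 0.6], [2.5, 2.6], [1.3, 1.1]])
>>> X_centered = X - X.mean(axis=0)
>>> values, vectors = np.linalg.eig(np.cov(X_centered.T))
>>> order = np.argsort(values)[::-1]   # largest eigenvalue first
>>> values[order]                      # approximately [1.250, 0.034]
>>> first_pc = vectors[:, order[0]]    # approximately [0.733, 0.681]
>>> X_centered.dot(first_pc)           # the one-dimensional projection shown above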

Many implementations of PCA, including the one in scikit-learn, use singular value decomposition to calculate the eigenvectors and eigenvalues. SVD is given by the following equation:

$$X = U \Sigma V^{T}$$

The columns of $U$ are called the left singular vectors of the data matrix, the columns of $V$ are its right singular vectors, and the diagonal entries of $\Sigma$ are its singular values. While the singular vectors and values of a matrix are useful in some applications of signal processing and statistics, we are only interested in them as they relate to the eigenvectors and eigenvalues of the data's covariance matrix. Specifically, the left singular vectors are the eigenvectors of the covariance matrix, and the diagonal elements of $\Sigma$ are the square roots of its eigenvalues. Calculating the SVD is beyond the scope of this chapter; however, the eigenvectors found using the SVD should be similar to those derived from a covariance matrix.
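As a sketch of the second technique, the following compares NumPy's SVD of the centered data matrix (arranged with one variable per row, as in the earlier call to np.cov) with scikit-learn's PCA class, which performs the centering itself; the components and projections it returns may differ from the hand calculations by a sign:

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.array([[0.9, 1], [2.4, 2.6], [1.2, 1.7], [0.5, 0.7], [0.3, 0.7],
...               [1.8, 1.4], [0.5, 0.6], [0.3, 0.6], [2.5, 2.6], [1.3, 1.1]])
>>> X_centered = X - X.mean(axis=0)
>>> U, s, Vt = np.linalg.svd(X_centered.T, full_matrices=False)
>>> U                        # columns are approximately the unit eigenvectors, up to sign
>>> s ** 2 / (len(X) - 1)    # approximately [1.250, 0.034], the eigenvalues of the covariance matrix
>>> pca = PCA(n_components=1)
>>> pca.fit_transform(X)     # approximately the projection computed above, up to sign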
