PCA working methodology from first principles

The PCA working methodology is illustrated with the following sample data, which has two dimensions (X and Y) for each instance or data point. The objective here is to reduce the 2D data to one dimension (also known as the principal component):

Instance       X      Y
1              0.72   0.13
2              0.18   0.23
3              2.50   2.30
4              0.45   0.16
5              0.04   0.44
6              0.13   0.24
7              0.30   0.03
8              2.65   2.10
9              0.91   0.91
10             0.46   0.32
Column mean    0.83   0.69

The first step, prior to proceeding with any analysis, is to subtract the column mean from every observation. This centers each variable at zero, so that the covariance calculated in the next step reflects only how the observations vary around their means; a short NumPy sketch of this step follows the table below.

X                       Y
0.72 - 0.83 = -0.11     0.13 - 0.69 = -0.56
0.18 - 0.83 = -0.65     0.23 - 0.69 = -0.46
2.50 - 0.83 =  1.67     2.30 - 0.69 =  1.61
0.45 - 0.83 = -0.38     0.16 - 0.69 = -0.53
0.04 - 0.83 = -0.79     0.44 - 0.69 = -0.25
0.13 - 0.83 = -0.70     0.24 - 0.69 = -0.45
0.30 - 0.83 = -0.53     0.03 - 0.69 = -0.66
2.65 - 0.83 =  1.82     2.10 - 0.69 =  1.41
0.91 - 0.83 =  0.08     0.91 - 0.69 =  0.22
0.46 - 0.83 = -0.37     0.32 - 0.69 = -0.37
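The same centering step can be reproduced in NumPy. The following is a minimal sketch; the array names data and centered are illustrative and not part of the original text:

>>> import numpy as np
>>> # Original 2D data: one row per instance, columns are X and Y
>>> data = np.array([[0.72, 0.13], [0.18, 0.23], [2.50, 2.30], [0.45, 0.16],
...                  [0.04, 0.44], [0.13, 0.24], [0.30, 0.03], [2.65, 2.10],
...                  [0.91, 0.91], [0.46, 0.32]])
>>> # Subtract the column means so that each variable is centered at zero
>>> centered = data - data.mean(axis=0)
>>> print(centered)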

 

Principal components are calculated using two different techniques:

  • Covariance matrix of the data
  • Singular value decomposition

We will cover the singular value decomposition technique in the next section. In this section, we will solve for the eigenvectors and eigenvalues using the covariance matrix methodology.

Covariance is a measure of how much two variables change together; it reflects the strength of the linear relationship between two sets of variables. If the covariance of two variables is zero, we can conclude that there is no linear correlation between them (although they may still be related in a non-linear way). The formula for the sample covariance is as follows:

cov(X, Y) = Σ (X_i - X̄)(Y_i - Ȳ) / (n - 1)

where the sum runs over all n observations and X̄ and Ȳ are the means of X and Y.

A sample covariance calculation between the X and Y variables follows this formula. Computing all four entries, cov(X, X), cov(X, Y), cov(Y, X), and cov(Y, Y), from the unrounded deviations gives the entire covariance matrix; because there are only two variables here, it is a 2 x 2 matrix (and, like every covariance matrix, it is square).
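As a quick cross-check, the covariance matrix can also be computed directly in NumPy. The following is a minimal sketch; np.cov with rowvar=False treats each column as a variable and uses the n - 1 denominator by default:

>>> import numpy as np
>>> data = np.array([[0.72, 0.13], [0.18, 0.23], [2.50, 2.30], [0.45, 0.16],
...                  [0.04, 0.44], [0.13, 0.24], [0.30, 0.03], [2.65, 2.10],
...                  [0.91, 0.91], [0.46, 0.32]])
>>> # Sample covariance matrix (n - 1 denominator), columns treated as variables
>>> cov_matrix = np.cov(data, rowvar=False)
>>> print(cov_matrix)   # approximately [[0.913, 0.759], [0.759, 0.697]]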

Since the covariance matrix is square, we can calculate eigenvectors and eigenvalues from it. You can refer to the methodology explained in an earlier section.

By solving the characteristic equation of the covariance matrix C, det(C - λI) = 0, we can obtain its eigenvalues, and from them the corresponding eigenvectors.

These results can be obtained with the following Python syntax:

>>> import numpy as np
>>> # Eigendecomposition of the 2 x 2 covariance matrix
>>> w, v = np.linalg.eig(np.array([[0.91335, 0.75969], [0.75969, 0.69702]]))
>>> print("Eigenvalues:", w)
>>> print("Eigenvectors:", v)

Once we obtain the eigenvectors and eigenvalues, we can project the data onto the principal components. The eigenvector with the greatest eigenvalue is the first principal component; since we would like to reduce the original 2D data to 1D, we project the mean-centered data onto this single eigenvector.
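The projection step can be sketched as follows, reusing the centered data from earlier. The names order and pc1 are illustrative; because np.linalg.eig does not guarantee any particular ordering, the eigenpairs are sorted by descending eigenvalue first:

>>> import numpy as np
>>> data = np.array([[0.72, 0.13], [0.18, 0.23], [2.50, 2.30], [0.45, 0.16],
...                  [0.04, 0.44], [0.13, 0.24], [0.30, 0.03], [2.65, 2.10],
...                  [0.91, 0.91], [0.46, 0.32]])
>>> centered = data - data.mean(axis=0)
>>> w, v = np.linalg.eig(np.cov(data, rowvar=False))
>>> # Sort eigenpairs by descending eigenvalue; they come out at roughly 1.57 and 0.04
>>> order = np.argsort(w)[::-1]
>>> w, v = w[order], v[:, order]
>>> # Project the centered data onto the first eigenvector to get the 1D representation
>>> pc1 = centered.dot(v[:, 0])
>>> print(pc1)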

From the preceding result, we can see the 1D projection of the original 2D data onto the first principal component. Also, the leading eigenvalue of 1.5725 tells us that this single component carries more variance than either of the original variables on its own (0.91 for X and 0.70 for Y), and in fact accounts for almost all (roughly 98 percent) of the total variance. In the case of multi-dimensional data, the rule of thumb is that the eigenvalues, or principal components, with a value greater than 1 should be considered for projection.
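As a small illustration of this criterion, the fraction of the total variance explained by each component can be computed from the eigenvalues. A sketch, with explained_ratio as an illustrative name:

>>> import numpy as np
>>> w, v = np.linalg.eig(np.array([[0.91335, 0.75969], [0.75969, 0.69702]]))
>>> # Fraction of total variance carried by each principal component, largest first
>>> explained_ratio = np.sort(w / w.sum())[::-1]
>>> print(explained_ratio)   # roughly [0.977, 0.023]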
