Extracting the principal components

The first technique that we will look at is Principal Component Analysis (PCA). PCA is an unsupervised method. In multivariate problems, PCA is used to reduce the dimension of the data with minimal information loss, that is, while retaining the maximum variation in the data. By variation, we mean the direction in which the data is dispersed the most. Let's look at the following plot:

[Figure: scatter plot of two variables, x1 and x2, with the direction of maximum variation shown as a diagonal line]

We have a scatter plot of two variables, x1 and x2. The diagonal line indicates the direction of maximum variation. Using PCA, our intent is to capture this direction. So, instead of representing the data using the two variables x1 and x2, the quest is to find the vector represented by the blue line and represent the data along this vector alone. Essentially, we want to reduce the dimension of the data from two to one.

We will leverage the mathematical tools of Eigenvalues and Eigenvectors to find this blue line vector.

We saw in the previous chapter that variance measures the amount of dispersion or spread in the data. What we saw was an example in one dimension. With more than one dimension, the relationships among the variables can be expressed as a matrix, called the Covariance matrix. When the values of the Covariance matrix are normalized by the standard deviations, we get a Correlation matrix. In our case, the Covariance matrix is a 2 x 2 matrix for the two variables, x1 and x2, and it measures how much these two variables move in the same direction or generally vary together.

When we perform Eigenvalue decomposition, that is, get the Eigenvectors and Eigenvalues of the covariance matrix, the principal Eigenvector, which is the vector with the largest Eigenvalue, is in the direction of the maximum variance in the original data.
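To make this concrete, the following is a minimal NumPy sketch on synthetic two-dimensional data; the data and variable names are invented for illustration and are not part of the recipe:

import numpy as np

rng = np.random.RandomState(42)
# Two correlated variables, x1 and x2, similar to the scatter plot above
x1 = rng.normal(size=300)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=300)
data = np.column_stack([x1, x2])

# 2 x 2 covariance matrix of the two variables
cov = np.cov(data.T)

# Eigenvalue decomposition; eigh is appropriate for symmetric matrices
eig_vals, eig_vecs = np.linalg.eigh(cov)

# The principal Eigenvector (largest Eigenvalue) points along the
# direction of maximum variance -- the blue line in the plot
principal = eig_vecs[:, np.argmax(eig_vals)]
print("Principal direction: %s" % principal)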

In our example, this should be the vector represented by the blue line in our graph. We will then proceed to project our input data onto this blue line vector in order to obtain the reduced dimension.

Note

With a dataset (n x m) with n instances and m dimensions, PCA projects it onto a smaller subspace (n x d), where d << m.

A point to note is that PCA is computationally very expensive.
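As a quick illustration of the shape change described in this note, here is a small sketch using scikit-learn's PCA class, which this recipe does not otherwise use; the sizes are arbitrary:

import numpy as np
from sklearn.decomposition import PCA

# n = 100 instances, m = 10 dimensions of random data
x_demo = np.random.rand(100, 10)

# Project onto a smaller subspace with d = 3
pca = PCA(n_components=3)
x_demo_reduced = pca.fit_transform(x_demo)

print(x_demo.shape)          # (100, 10)
print(x_demo_reduced.shape)  # (100, 3)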

PCA can be performed on either the covariance or the correlation matrix. Remember that when the Covariance matrix of a dataset with unevenly scaled variables is used in PCA, the results may not be very useful. Curious readers can refer to the book A First Course in Multivariate Statistics by Bernhard Flury on the topic of using either the correlation or the covariance matrix for PCA:

http://www.springer.com/us/book/9780387982069.
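The effect of uneven scales can be seen in a small sketch with synthetic data; the features and their scales below are invented purely for illustration:

import numpy as np

rng = np.random.RandomState(0)
# Two correlated features measured on very different scales
f1 = rng.normal(size=500)
f2 = 100.0 * (0.6 * f1 + 0.8 * rng.normal(size=500))
data = np.column_stack([f1, f2])

# Covariance-based Eigenvalues are dominated by the large-scale feature
cov_eig = np.linalg.eigvalsh(np.cov(data.T))
# Correlation-based Eigenvalues treat both features on an equal footing
corr_eig = np.linalg.eigvalsh(np.corrcoef(data.T))

print("Covariance eigenvalue proportions: %s" % (cov_eig / cov_eig.sum()))
print("Correlation eigenvalue proportions: %s" % (corr_eig / corr_eig.sum()))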

Getting ready

Let's use the Iris dataset to understand how to use PCA efficiently in reducing the dimension of the dataset. The Iris dataset contains measurements for 150 iris flowers from three different species.

The three classes in the Iris dataset are as follows:

  • Iris Setosa
  • Iris Versicolor
  • Iris Virginica

The following are the four features in the Iris dataset:

  • The sepal length in cm
  • The sepal width in cm
  • The petal length in cm
  • The petal width in cm

Can we use, say, two columns instead of all four to express most of the variation in the data? Our quest is to reduce the dimension of the data. In this case, our instances have four columns. Let's say that we are building a classifier to predict the type of flower for a new instance; can we do this task using instances in the reduced-dimension space? Can we reduce the number of columns from four to two and still achieve good accuracy for our classifier?

PCA is done using the following steps:

  1. Standardize the dataset to have a zero mean and unit standard deviation.
  2. Find the correlation matrix of the standardized dataset.
  3. Decompose the Correlation matrix into its Eigenvectors and Eigenvalues.
  4. Select the top n Eigenvectors based on the Eigenvalues sorted in descending order.
  5. Project the input data matrix onto the new subspace formed by the selected Eigenvectors.

How to do it…

Let's load the necessary libraries and call the utility function load_iris from scikit-learn to get the Iris dataset:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import scale
import scipy.linalg
import matplotlib.pyplot as plt

# Load Iris data
data = load_iris()
x = data['data']
y = data['target']

# Since PCA is an unsupervised method, we will not be using the target variable y
# scale the data such that mean = 0 and standard deviation = 1
x_s = scale(x,with_mean=True,with_std=True,axis=0)

# Calculate correlation matrix
x_c = np.corrcoef(x_s.T)

# Find eigenvalues and eigenvectors of the correlation matrix
eig_val,r_eig_vec = scipy.linalg.eig(x_c)
print 'Eigen values \n%s'%(eig_val)
print '\n Eigen vectors \n%s'%(r_eig_vec)

# Select the first two eigenvectors.
w = r_eig_vec[:,0:2]

# Project the dataset from 4 dimensions to 2
# using the first two right eigenvectors
x_rd = x_s.dot(w)

# Scatter plot the new two dimensions
plt.figure(1)
plt.scatter(x_rd[:,0],x_rd[:,1],c=y)
plt.xlabel("Component 1")
plt.ylabel("Component 2")

Now, we will proceed to standardize this data to have a zero mean and a standard deviation of one, and we will leverage NumPy's corrcoef function to find the correlation matrix:

x_s = scale(x,with_mean=True,with_std=True,axis=0)
x_c = np.corrcoef(x_s.T)

We will then do the Eigenvalue decomposition and project our Iris data on the first two principal Eigenvectors. Finally, we will plot the dataset in the reduced space:

eig_val,r_eig_vec = scipy.linalg.eig(x_c)
print 'Eigen values \n%s'%(eig_val)
print '\n Eigen vectors \n%s'%(r_eig_vec)

# Select the first two eigenvectors.
w = r_eig_vec[:,0:2]

# Project the dataset from 4 dimensions to 2
# using the first two right eigenvectors
x_rd = x_s.dot(w)

# Scatter plot the new two dimensions
plt.figure(1)
plt.scatter(x_rd[:,0],x_rd[:,1],c=y)
plt.xlabel("Component 1")
plt.ylabel("Component 2")

A note on the scale function: it can perform centering, scaling, and standardization. Centering subtracts the mean from each value, scaling divides each value by the variable's standard deviation, and standardization performs centering followed by scaling. Using the with_mean and with_std parameters, the scale function can be used to perform any of these normalization techniques.
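The following minimal sketch shows the three variants on the Iris data; the variable names are only illustrative:

from sklearn.datasets import load_iris
from sklearn.preprocessing import scale

x_iris = load_iris()['data']

# Centering only: subtract each column's mean
x_centered = scale(x_iris, with_mean=True, with_std=False, axis=0)

# Scaling only: divide each column by its standard deviation
x_scaled_only = scale(x_iris, with_mean=False, with_std=True, axis=0)

# Standardization: centering followed by scaling (what the recipe uses)
x_standardized = scale(x_iris, with_mean=True, with_std=True, axis=0)

print("Column means after standardization: %s" % x_standardized.mean(axis=0))
print("Column stds after standardization:  %s" % x_standardized.std(axis=0))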

How it works…

The Iris dataset has four columns. Though there are not many columns, it will serve our purpose. We intend to reduce the dimensionality of the Iris dataset to two from four and still retain most of the information in the data.

We will load the Iris data into the x and y variables using the convenient load_iris function from scikit-learn. The x variable is our data matrix and we can inspect its shape as follows:

>>>x.shape
(150, 4)
>>>

We will scale the data matrix x to have a zero mean and unit standard deviation. The rule of thumb is that if all your columns are measured on the same scale and have the same unit of measurement, you don't have to scale the data; this allows PCA to capture the maximum variation in those original units:

x_s = scale(x,with_mean=True,with_std=True,axis=0)

We will proceed to build the correlation matrix of our input data:

The correlation matrix of n random variables X1, ..., Xn is the n × n matrix whose (i, j) entry is corr(Xi, Xj) (Wikipedia).
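Because the data has already been standardized, the correlation matrix coincides with the (biased) covariance matrix of the scaled data. A quick check, reusing x_s from the recipe, confirms this:

import numpy as np

# Correlation matrix of the standardized data
corr = np.corrcoef(x_s.T)

# Covariance matrix of the standardized data with 1/n normalization
cov_of_scaled = np.cov(x_s.T, bias=True)

print(np.allclose(corr, cov_of_scaled))  # True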

We will then use the SciPy library to calculate the Eigenvalues and Eigenvectors of the matrix. Let's look at our Eigenvalues and Eigenvectors:

print 'Eigen values \n%s'%(eig_val)
print '\n Eigen vectors \n%s'%(r_eig_vec)

The output looks as follows:

[Output: the four Eigenvalues followed by the 4 x 4 matrix of Eigenvectors]

In our case, the Eigenvalues are printed in descending order. A key question is: how many components should we choose? In the next section, we will explain a few ways of choosing the number of components.

You can see that we selected only the first two columns of our matrix of right Eigenvectors. The discrimination capability of the retained components with respect to the y variable is a good test of how much information or variation is retained in the data.

We will project the data to the new reduced dimension.

Finally, we will plot the components in the x and y axes and color them by the target variable:

[Figure: scatter plot of component 1 versus component 2, colored by iris species]

You can see that components 1 and 2 are able to discriminate the three classes of iris flowers. Thus, we have effectively used PCA to reduce the dimension from four to two and are still able to discriminate the instances belonging to different classes of Iris flower.
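One way to make the discrimination test mentioned earlier concrete is to compare the cross-validated accuracy of a simple classifier on the full and the reduced data. The following is a rough sketch reusing x_s, x_rd, and y from the recipe; the choice of a k-nearest neighbors classifier is arbitrary:

from sklearn.neighbors import KNeighborsClassifier
# In older scikit-learn versions this import lives in sklearn.cross_validation
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier(n_neighbors=5)

# Accuracy with all four standardized features
full_score = cross_val_score(knn, x_s, y, cv=5).mean()
# Accuracy with only the two retained principal components
reduced_score = cross_val_score(knn, x_rd, y, cv=5).mean()

print("Accuracy with 4 features:   %0.3f" % full_score)
print("Accuracy with 2 components: %0.3f" % reduced_score)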

There's more…

In the previous section, we said that we would outline a couple of ways to help us select how many components to include. In our recipe, we included only two. The following is a list of ways to select the components more empirically (a short sketch applying both criteria follows this list):

  1. The Eigenvalue criterion:

    An Eigenvalue of one means that the component explains about one variable's worth of variability. So, according to this criterion, a component should explain at least one variable's worth of variability; we will include only those components whose Eigenvalues are greater than or equal to one. Based on your dataset, you can set the threshold higher: in a very high-dimensional dataset, including components that explain only one variable's worth of variability may not be very useful.

  2. The proportion of the variance explained criterion:

    Let's run the following code:

    print "Component, Eigen Value, % of Variance, Cummulative %"
    cum_per = 0
    per_var = 0
    for i,e_val in enumerate(eig_val):
        per_var = round((e_val / len(eig_val)),3)
        cum_per+=per_var
    print ('%d, %0.2f, %0.2f, %0.2f')%(i+1, e_val, per_var*100,cum_per*100)
    The output is as follows:

    [Output: a table listing each component's Eigenvalue, the percentage of variance explained, and the cumulative percentage]

For each component, we printed the Eigenvalue, percentage of the variance explained by that component, and cumulative percentage value of the variance explained. For example, component 1 has an Eigenvalue of 2.91; 2.91/4 gives the percentage of the variance explained, which is 72.80%. Now, if we include the first two components, then we can explain 95.80% of the variance in the data.
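Both criteria can be applied programmatically to the Eigenvalues computed in the recipe. The following is a small sketch; the 95% cut-off is only an example:

import numpy as np

# scipy.linalg.eig returns complex Eigenvalues; work with the real parts
e_sorted = np.sort(np.real(eig_val))[::-1]

# Eigenvalue criterion: keep components with an Eigenvalue >= 1
n_by_eigenvalue = np.sum(e_sorted >= 1.0)

# Proportion of variance criterion: keep enough components to reach 95%
cum_var = np.cumsum(e_sorted) / e_sorted.sum()
n_by_variance = np.argmax(cum_var >= 0.95) + 1

print("Components by the Eigenvalue criterion: %d" % n_by_eigenvalue)
print("Components by the variance criterion:   %d" % n_by_variance)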

The decomposition of a correlation matrix into its Eigenvectors and values is a general technique that can be applied to any matrix. In this case, we will apply it to a correlation matrix in order to understand the principal axes of data distribution, that is, axes through which the maximum variation in the data is observed.

PCA can be used either as an exploratory technique or as a data preparation technique for a downstream algorithm. Document classification problems typically have very high-dimensional feature vectors. PCA can be used to reduce the dimension of the dataset so that only the most relevant features are retained before feeding the data to a classification algorithm.
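As an illustration of that workflow, here is a hedged sketch that chains scikit-learn's PCA with a classifier in a Pipeline; the number of components and the choice of logistic regression are arbitrary and not part of the recipe:

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Reduce to two components, then classify; x_s and y come from the recipe
clf = Pipeline([
    ("pca", PCA(n_components=2)),
    ("model", LogisticRegression()),
])
clf.fit(x_s, y)
print("Training accuracy: %0.3f" % clf.score(x_s, y))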

A drawback of PCA worth mentioning here is that it is a computationally expensive operation. Finally, a point about NumPy's corrcoef function: it standardizes your data internally as part of its calculation. However, since we want to explicitly show the reason for scaling, we have included it in our recipe.
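This can be verified directly by comparing the correlation matrices of the raw and the scaled data, reusing x and x_s from the recipe:

import numpy as np

# corrcoef standardizes internally, so both calls give the same matrix
print(np.allclose(np.corrcoef(x.T), np.corrcoef(x_s.T)))  # True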

Tip

When would PCA work?

The input dataset should have correlated columns for PCA to work effectively. Without correlation among the input variables, PCA cannot help us.
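A quick sketch with uncorrelated synthetic data shows why: every component ends up explaining roughly the same share of the variance, so no component can be dropped without losing information. The data below is invented for illustration:

import numpy as np

rng = np.random.RandomState(7)
# Four independent, uncorrelated columns
uncorrelated = rng.normal(size=(500, 4))

eig_vals = np.linalg.eigvalsh(np.corrcoef(uncorrelated.T))[::-1]
print("Proportion of variance per component: %s" % (eig_vals / eig_vals.sum()))
# Each proportion is close to 0.25 -- there is no dominant direction to keep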

See also

  • Performing Singular Value Decomposition recipe in Chapter 4, Analyzing Data - Deep Dive