Extracting features using singular value decomposition

After our discussion on PCA and kernel PCA, we can explain dimensionality reduction in the following way:

  • We can transform a set of correlated variables into a set of uncorrelated variables, so that fewer dimensions explain the relationships in the underlying data with minimal loss of information.
  • We can find the principal axes along which the data shows the most variation.

Singular Value Decomposition (SVD) is yet another matrix decomposition technique that can be used to tackle the curse of dimensionality. It can be used to find the best approximation of the original data using fewer dimensions. Unlike PCA, SVD works directly on the original data matrix.

Note

SVD does not need a covariance or correlation matrix. It works on the original data matrix.

SVD factors an m x n matrix A into a product of three matrices:

A = U S V^T

Here, U is an m x k matrix, V is an n x k matrix, and S is a k x k diagonal matrix. The columns of U are called the left singular vectors, and the columns of V are called the right singular vectors.

The values on the diagonal of the S matrix are called singular values.
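
As a quick illustration, the following sketch verifies that the three factors returned by scipy's svd multiply back to the original matrix; the 5 x 3 shape and random values are chosen only for the example:

import numpy as np
from scipy.linalg import svd

# An arbitrary 5 x 3 matrix, used only to illustrate the factorization
A = np.random.rand(5, 3)
U, S, Vt = svd(A, full_matrices=False)

print(U.shape, S.shape, Vt.shape)            # (5, 3) (3,) (3, 3)
# Multiplying the factors back together recovers the original matrix
print(np.allclose(A, U @ np.diag(S) @ Vt))   # True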

Getting ready

We will use the Iris dataset for this exercise. Our task at hand is to reduce the dimensionality of the dataset from four to two.

How to do it…

Let's load the necessary libraries and get the Iris dataset:

from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import scale
from scipy.linalg import svd

# Load Iris dataset
data = load_iris()
x = data['data']
y = data['target']

Now, we will demonstrate how to perform an SVD operation on the Iris dataset:

# Center the x variable by subtracting its column means (no scaling to unit variance).
x_s = scale(x, with_mean=True, with_std=False, axis=0)

# Decompose the centered matrix with the SVD implementation from scipy.
# Note that scipy returns the right singular vectors already transposed.
U, S, V = svd(x_s, full_matrices=False)

# Approximate the original matrix by keeping only the first two left singular vectors.
x_t = U[:, :2]

# Finally, plot the data using the two reduced components, colored by class.
plt.figure(1)
plt.scatter(x_t[:, 0], x_t[:, 1], c=y)
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()

How it works…

The Iris dataset has four columns. Though there are not many columns, it will serve our purpose. We intend to reduce the dimensionality of the Iris dataset from four to two while still retaining most of the information in the data.

We will load the Iris data to the x and y variables using the convenient load_iris function from scikit-learn. The x variable is our data matrix; we can inspect its shape in the following manner:

>>> x.shape
(150, 4)
>>>

We center the data matrix x using its mean. The rule of thumb is that if all your columns are measured on the same scale and have the same unit of measurement, you don't need to scale the data to unit variance. This allows SVD to capture the directions of maximum variation in the original units. Note that we used only the mean while invoking the scale function:

x_s = scale(x,with_mean=True,with_std=False,axis=0)
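
A quick sanity check (reusing the x and x_s arrays from the recipe) shows that this call is equivalent to subtracting each column's mean:

import numpy as np

# scale() with with_mean=True and with_std=False only subtracts the column means.
print(np.allclose(x_s, x - x.mean(axis=0)))   # True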
  1. Run the SVD method on our centered input dataset.
  2. Select the top two singular components. This matrix is a reduced approximation of the original input data (see the quick check after this list).
  3. Finally, plot the two components and color the points by their class value.
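
As a quick check, we can see how much of the total variation the first two components capture and confirm that U[:, :2], scaled by the singular values, is the projection of the centered data onto the first two right singular vectors. This reuses the x_s, U, S, and V arrays from the recipe:

import numpy as np

# Proportion of total variation carried by each singular value
var_explained = (S ** 2) / (S ** 2).sum()
print(var_explained[:2].sum())                        # roughly 0.98 for Iris

# scipy's svd returns the right singular vectors already transposed,
# so projecting the centered data onto the first two of them gives
# U[:, :2] scaled column-wise by S[:2].
print(np.allclose(x_s @ V[:2].T, U[:, :2] * S[:2]))   # True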

There's more…

SVD is a two-mode factor analysis: we start with an arbitrary rectangular matrix whose rows and columns represent two different types of entities. This is different from the previous recipe, where PCA took a correlation matrix as its input; PCA is a one-mode factor analysis because the rows and columns of that square matrix represent the same entity.

In text mining applications, the input is typically presented as a Term-Document Matrix (TDM). In a TDM, the rows correspond to words and the columns to documents, and each cell holds either the term frequency or the Term Frequency-Inverse Document Frequency (TF-IDF) score. It is thus a rectangular matrix with two entities, words and documents, in its rows and columns.
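
As a minimal sketch, such a matrix can be built with scikit-learn's CountVectorizer; the three documents below are invented purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer

# Three toy documents
docs = ["money and currency markets",
        "the bank raised the interest rate",
        "the river bank was flooded"]

vectorizer = CountVectorizer()
# CountVectorizer builds a document-term matrix (documents in rows);
# transposing it gives the term-document layout described above.
tdm = vectorizer.fit_transform(docs).toarray().T
print(sorted(vectorizer.vocabulary_))   # the terms, one per row of tdm
print(tdm.shape)                        # (number of terms, number of documents)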

SVD is widely used in text mining applications to uncover hidden (semantic) relationships between words and documents, between documents, and between words.

By applying SVD to a term-document matrix, we transform it into a new semantic space in which words that never occur together in the same document can still lie close to each other. The goal of SVD is to find a useful way to model the relationship between words and documents. After applying SVD, each document and word can be represented as a vector of factor values. We can choose to ignore the components with very low values and thereby avoid noise in the underlying dataset. This leads to an approximate representation of our text corpus, an approach known as Latent Semantic Analysis (LSA).
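
The following sketch applies this idea to a tiny, made-up term-document matrix: keeping only the top k singular triplets (k = 2 here is an arbitrary choice) yields low-dimensional word and document vectors and a rank-k approximation of the original matrix:

import numpy as np
from scipy.linalg import svd

# A toy term-document matrix: rows are terms, columns are documents
# (the counts are invented purely for illustration)
tdm = np.array([[2., 0., 1.],
                [1., 0., 2.],
                [0., 3., 0.],
                [0., 2., 1.]])

U, S, Vt = svd(tdm, full_matrices=False)

k = 2                                # number of latent dimensions to keep
term_vectors = U[:, :k] * S[:k]      # each word as a k-dimensional vector
doc_vectors = Vt[:k, :].T * S[:k]    # each document as a k-dimensional vector

# Rank-k approximation of the original term-document matrix
tdm_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]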

This idea has broad applicability in document indexing for search and information retrieval. Instead of indexing the original words in an inverted index, we can index the output of LSA. This helps avoid problems such as synonymy and polysemy. In synonymy, users may use different words to refer to the same entity. Normal indexing is vulnerable to such scenarios: as the underlying documents are indexed by their literal words, a search may not yield the relevant results. For example, if we indexed documents related to financial instruments, the typical words would be currency, money, and similar terms. Currency and money are synonyms, so when a user searches for money, the documents related to currency should be shown as well. With regular indexing, however, the search engine would retrieve only the documents containing money. With latent semantic indexing, the documents containing currency will also be retrieved, because in the latent semantic space currency and money lie close to each other; their neighboring words in the documents are similar.

Polysemy refers to words that have more than one meaning. For example, bank can refer to a financial institution or to a river bank. As with synonymy, polysemy can also be handled in the latent semantic space.

For more information on LSA and latent semantic indexing, refer to the paper by Deerwester et al. at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.8490.

For a comparative study of eigenvalues and singular values, refer to the book Numerical Computing with MATLAB by Cleve Moler, available at https://in.mathworks.com/moler/eigs.pdf. Though the examples are in MATLAB, with the help of our recipe you can redo them in Python.
