Reducing the data dimension with random projection

The methods that we saw previously for dimensionality reduction are computationally expensive. Random projection offers a faster way to perform dimensionality reduction.

Random projection rests on the Johnson-Lindenstrauss lemma. According to the lemma, a mapping from a high-dimensional to a low-dimensional Euclidean space exists such that the distances between points are preserved to within a factor controlled by epsilon. The goal is to preserve the pairwise distance between any two points in the data while still reducing the number of dimensions.

In other words, if we are given n-dimensional data in a Euclidean space, the lemma tells us we can map it to a Euclidean space of dimension k such that all the distances between points are preserved up to a multiplicative factor between (1 - epsilon) and (1 + epsilon).
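The lemma also gives a lower bound on k that depends only on the number of points and on epsilon, not on the original dimension. Scikit-learn exposes this bound through the johnson_lindenstrauss_min_dim function; the sample counts and epsilon values below are only illustrative:

from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Minimum number of components needed to preserve pairwise distances
# within a factor of (1 - eps, 1 + eps) for a given number of samples
print(johnson_lindenstrauss_min_dim(n_samples=1000, eps=0.1))
print(johnson_lindenstrauss_min_dim(n_samples=1000, eps=0.5))

Notice that a smaller epsilon (tighter distance preservation) requires a larger number of components.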

Getting ready

For this exercise, we will use the 20 newsgroup data (http://qwone.com/~jason/20Newsgroups/).

It is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different categories of news. Scikit-learn provides a convenient function to load this dataset:

from sklearn.datasets import fetch_20newsgroups
cat = ['sci.crypt']
data = fetch_20newsgroups(categories=cat)

You can load all the categories, or only those of interest, by providing a list of category names as strings. In our case, we will use the sci.crypt category.
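For example, to load several science-related categories at once, you can pass a longer list (the variable name data_multi below is used only for illustration; passing categories=None loads all 20 categories):

# Load multiple categories of interest at once
cats = ['sci.crypt', 'sci.med', 'sci.space']
data_multi = fetch_20newsgroups(categories=cats)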

We will load the input text as a term-document matrix, where each instance (row) is a document and the features (columns) are individual words. We will then apply random projection to this matrix in order to reduce the number of dimensions, and check whether the distances between the documents are preserved in the reduced space.

How to do it…

Let's start by loading the necessary libraries. Using scikit-learn's utility function fetch_20newsgroups, we will load the data, selecting only the sci.crypt category. We will then transform the text data into a vector representation:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import euclidean_distances
from sklearn.random_projection import GaussianRandomProjection
import matplotlib.pyplot as plt
import numpy as np

# Load 20 newsgroup dataset
# We select only sci.crypt category
# Other categories include
# 'sci.med', 'sci.space' ,'soc.religion.christian'
cat =['sci.crypt']
data = fetch_20newsgroups(categories=cat)

# Create a term document matrix, with term frequencies as the values
# from the above dataset.
vectorizer = TfidfVectorizer(use_idf=False)
vector = vectorizer.fit_transform(data.data)

# Perform the projection. In this case we reduce the dimension to 1000
gauss_proj = GaussianRandomProjection(n_components=1000)
gauss_proj.fit(vector)
# Transform the original data to the new space
vector_t = gauss_proj.transform(vector)

# Print the original and the transformed vector shapes
print(vector.shape)
print(vector_t.shape)

# To validate if the transformation has preserved the distance, we calculate the old and the new distance between the points
org_dist = euclidean_distances(vector)
red_dist = euclidean_distances(vector_t)

diff_dist = abs(org_dist - red_dist)

# We take the difference between these points and plot them 
# as a heatmap (only the first 100 documents).
plt.figure()
plt.pcolor(diff_dist[0:100,0:100])
plt.colorbar()
plt.show()

Let's now look at how this recipe works.

How it works…

After loading the newsgroup dataset, we convert it to a matrix using TfidfVectorizer(use_idf=False).

Notice that we have set use_idf to False, so no inverse-document-frequency weighting is applied. This creates our input matrix where the rows are documents, the columns are individual words, and the cell values are term frequencies (note that TfidfVectorizer L2-normalizes each document row by default, so these are not raw counts).
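As a quick sanity check, you can confirm the matrix shape against the size of the learned vocabulary and verify the default normalization (the variable names follow the recipe; the unit-length claim relies on TfidfVectorizer's default norm='l2'):

import numpy as np

# The number of columns equals the vocabulary size learned from the corpus
print(len(vectorizer.vocabulary_))
# With the default norm='l2', every document row has unit Euclidean length
print(np.linalg.norm(vector[0].toarray()))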

If we print the shape of our vector with print(vector.shape), we will get the following output:

(595, 16115)

We can see that our input matrix has 595 documents and 16115 words; each word is a feature and, hence, a dimension.

We perform the projection of the data using a dense Gaussian random matrix. This matrix is generated by sampling its elements from a normal distribution N(0, 1/n_components). In our case, n_components is 1000, so our intention is to reduce the dimension from 16115 to 1000. We then print the original and the reduced shapes in order to verify the reduction in dimension.
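Conceptually, the projection amounts to multiplying the data by a random matrix. The following NumPy sketch builds such a matrix by hand, assuming that N(0, 1/n_components) corresponds to a standard deviation of 1/sqrt(n_components); it illustrates the idea only and is not scikit-learn's internal code:

import numpy as np

n_features, n_components = 16115, 1000
# Random Gaussian matrix with zero mean and variance 1/n_components
R = np.random.normal(loc=0.0, scale=1.0 / np.sqrt(n_components),
                     size=(n_components, n_features))
# Projecting an (n_samples, n_features) matrix X would then be X.dot(R.T),
# which gives an (n_samples, n_components) matrix in the reduced space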

Finally, we want to validate whether the data characteristics are maintained after the projection. We calculate the pairwise Euclidean distances between the documents in the original space and in the projected space, take the absolute difference between the two distance matrices, and plot that difference (for the first 100 documents) as a heat map:

[Figure: heat map of the absolute differences between pairwise document distances in the original and projected spaces, shown for the first 100 documents]

As you can see, the color scale runs from 0.000 to 0.105 and indicates the difference between the distances of the vectors in the original and reduced spaces. The differences between the distances in the original and projected spaces stay within a very small range.
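If you prefer a numeric summary over a visual one, a quick check along the following lines shows how tightly the distances are preserved (it reuses org_dist and red_dist from the recipe; masking out zero distances is an assumption made here to avoid dividing by zero for identical documents):

# Ratio of projected to original distances for all distinct document pairs
mask = org_dist > 0
ratios = red_dist[mask] / org_dist[mask]
print(ratios.min(), ratios.max(), ratios.mean())

The ratios should lie close to 1, within the (1 - epsilon, 1 + epsilon) band promised by the Johnson-Lindenstrauss lemma.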

There's more…

Random projection is a very active field of research, and there are many references on the topic. Interested readers can refer to the following papers:

Experiments with random projections:

http://dl.acm.org/citation.cfm?id=719759

Experiments with random projections for machine learning:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.9205

In our recipe, we used the Gaussian random projection where a Gaussian random matrix was generated by sampling from a normal distribution, N(0,1/1000), where 1000 is the required dimension of the reduced space.

However, a dense matrix can create severe memory-related issues during processing. To avoid this, Achlioptas proposed sparse random projections. Instead of drawing from a standard normal distribution, the entries are picked from {-1, 0, 1} with probabilities {1/6, 2/3, 1/6}. As you can see, the probability of drawing 0 is two-thirds and, hence, the resulting matrix is sparse. Readers can refer to the seminal paper: Dimitris Achlioptas, Database-friendly random projections: Johnson-Lindenstrauss with binary coins, Journal of Computer and System Sciences, 66(4):671-687, 2003.

The scikit-learn implementation allows users to choose the density of the resulting matrix. If we specify the density as d and let s = 1/d, then the elements of the matrix are drawn as follows:

-sqrt(s)/sqrt(n_components)  with probability 1/(2s)
 0                           with probability 1 - 1/s
+sqrt(s)/sqrt(n_components)  with probability 1/(2s)
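Scikit-learn exposes this through the SparseRandomProjection class, which can be dropped into the recipe in place of GaussianRandomProjection. A brief sketch follows; the density value is illustrative, and density=1/3.0 corresponds to Achlioptas' {-1, 0, 1} scheme with s = 3:

from sklearn.random_projection import SparseRandomProjection

# Sparse alternative to the dense Gaussian projection used in the recipe
sparse_proj = SparseRandomProjection(n_components=1000, density=1/3.0)
vector_sparse = sparse_proj.fit_transform(vector)
print(vector_sparse.shape)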

See also

  • The Using kernel PCA recipe in Chapter 4, Analyzing Data - Deep Dive