Using Kernel PCA

PCA assumes that all the principal directions of variation in the data are straight lines. This is not true for many real-world datasets.

Note

PCA is a linear projection technique: it can only capture variation in the data that falls along straight lines. It cannot model datasets whose underlying structure is nonlinear.

In this section, we will look at kernel PCA, which helps us reduce the dimensionality of datasets whose variations are not straight lines. We will explicitly create such a dataset and apply kernel PCA to it.

In kernel PCA, a kernel function is applied to all the data points, which transforms the input data into kernel space. An ordinary PCA is then performed in the kernel space.
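To make this idea concrete, the following is a minimal, from-scratch sketch of what kernel PCA does internally, assuming an RBF kernel. The helper names rbf_kernel_matrix and kernel_pca are our own, not scikit-learn's; the KernelPCA class used later in this recipe wraps these steps for us:

import numpy as np

def rbf_kernel_matrix(X, gamma):
    # Pairwise squared Euclidean distances between all rows of X
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq_dists)

def kernel_pca(X, gamma=10, n_components=2):
    # 1. Apply the kernel function to every pair of data points
    K = rbf_kernel_matrix(X, gamma)
    # 2. Center the kernel matrix (the implicit feature space must be zero-mean)
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    K = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # 3. Eigendecompose the centered kernel matrix and keep the top eigenvectors
    eigvals, eigvecs = np.linalg.eigh(K)
    top = np.argsort(eigvals)[::-1][:n_components]
    # Coordinates of the training points along the kernel principal components
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0))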

Getting ready

Instead of the Iris dataset, we will generate a dataset whose variations are not straight lines, so a simple PCA will not be effective on it. Let's proceed to the recipe.

How to do it…

Let's load the necessary libraries, generate a dataset using the make_circles function from scikit-learn, plot this data, and run a normal PCA on it:

from sklearn.datasets import make_circles
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA


# Generate a dataset where the variations cannot be captured by a straight line.
np.random.seed(0)
x,y = make_circles(n_samples=400, factor=.2,noise=0.02)

# Plot the generated dataset
plt.close('all')
plt.figure(1)
plt.title("Original Space")
plt.scatter(x[:,0],x[:,1],c=y)
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")

# Try to fit the data using normal PCA
pca = PCA(n_components=2)
pca.fit(x)
x_pca = pca.transform(x)

We will then plot the first two principal components of this dataset, followed by a plot that uses only the first principal component:

plt.figure(2)
plt.title("PCA")
plt.scatter(x_pca[:,0],x_pca[:,1],c=y)
plt.xlabel("$Component_1$")
plt.ylabel("$Component_2$")

# Plot using the first component from normal pca
class_1_indx = np.where(y==0)[0]
class_2_indx = np.where(y==1)[0]

plt.figure(3)
plt.title("PCA- One component")
plt.scatter(x_pca[class_1_indx,0],np.zeros(len(class_1_indx)),color='red')
plt.scatter(x_pca[class_2_indx,0],np.zeros(len(class_2_indx)),color='blue')

Let's finish up by performing a kernel PCA and plotting the components:

# Create a KernelPCA object in scikit-learn, specifying the type of kernel as a parameter.
kpca = KernelPCA(kernel="rbf",gamma=10)
# Perform KernelPCA
kpca.fit(x)
x_kpca = kpca.transform(x)


# Plot the first two components.
plt.figure(4)
plt.title("Kernel PCA")
plt.scatter(x_kpca[:,0],x_kpca[:,1],c=y)
plt.xlabel("$Component_1$")
plt.ylabel("$Component_2$")
plt.show()

How it works…

In step 1, we generated a dataset using scikit-learn's make_circles data generation function. This function creates two concentric circles, a large one containing the smaller one, with each circle belonging to one class. Thus, we created a two-class problem with two concentric circles.

First, let's look at the data that we generated. The make_circles function generated a dataset of size 400 with two dimensions. A plot of the original data is as follows:

[Figure: "Original Space" scatter plot of the two concentric circles]

This chart shows how our data is distributed: the outer circle belongs to class one and the inner circle to class two. Could we take this data and use it with a linear classifier? No: the variations in the data are not straight lines, so a normal PCA cannot help us either. Hence, we will resort to kernel PCA in order to transform the data.

Before we venture into kernel PCA, let's see what happens if we apply a normal PCA on this dataset.

Let's look at the output plot of the first two components:

[Figure: "PCA" scatter plot of the first two principal components]

As you can see, the components of PCA are unable to distinguish between the two classes in a linear fashion.

Let's plot only the first component and see its class-distinguishing ability. The following graph shows that PCA is unable to differentiate the data:

[Figure: "PCA- One component" plot of the first principal component for both classes]

The normal PCA approach is a linear projection technique that works well if the data is linearly separable. In cases where the data is not linearly separable, a nonlinear technique is required for the dimensionality reduction of the dataset.

Note

Kernel PCA is a nonlinear technique for data reduction.

Let's proceed to create a kernel PCA object using the scikit-learn library. Here is our object creation code:

KernelPCA(kernel="rbf", gamma=10)

We selected the Radial Basis Function (RBF) kernel with a gamma value of 10. Gamma is the kernel coefficient, the parameter of the kernel that controls how it handles nonlinearity.

Before we go further, let's look at a little bit of theory about what kernels really are. As a simple definition, a kernel is a function that computes the dot product, that is, the similarity, between the two vectors passed to it as input, as if they had been mapped into a higher-dimensional feature space.

The RBF (Gaussian) kernel is defined as follows for two points, x and x', in some input space:

$K(x, x') = \exp\left(-\frac{\lVert x - x' \rVert^{2}}{2\sigma^{2}}\right)$

where

$\gamma = \frac{1}{2\sigma^{2}}$

The RBF kernel decreases with distance and takes values between 0 and 1; hence, it can be interpreted as a similarity measure. The feature space of the RBF kernel has an infinite number of dimensions. For more details, refer to http://en.wikipedia.org/wiki/Radial_basis_function_kernel.
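Substituting gamma, the kernel becomes exp(-gamma * ||x - x'||^2), which is the form that scikit-learn's gamma parameter corresponds to. As a quick, optional sanity check (not part of the original recipe), we can compare a hand-computed value against scikit-learn's rbf_kernel helper; the two example points below are arbitrary:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x1 = np.array([[1.0, 2.0]])
x2 = np.array([[1.5, 1.0]])
gamma = 10

# Direct evaluation of exp(-gamma * ||x - x'||^2)
manual = np.exp(-gamma * np.sum((x1 - x2) ** 2))

# The same kernel value computed by scikit-learn
from_sklearn = rbf_kernel(x1, x2, gamma=gamma)[0, 0]

print(manual, from_sklearn)  # the two values agree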

Let's now transform the input from the original input space into the kernel space and perform a PCA in the kernel space.

Finally, we will plot the first two principal components as a scatter plot. The points are colored based on their class value:

[Figure: "Kernel PCA" scatter plot of the first two kernel PCA components]

You can see in this graph that the points are linearly separated in the kernel space.
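To back up this observation, here is an extra check that is not part of the original recipe; it assumes the x_kpca and y arrays from the earlier code are still in scope. A plain linear classifier fitted on the kernel PCA components separates the two classes cleanly:

from sklearn.linear_model import LogisticRegression

# A linear model on the kernel PCA components can separate the two circles
clf = LogisticRegression()
clf.fit(x_kpca[:, :2], y)
print("Training accuracy in kernel space:", clf.score(x_kpca[:, :2], y))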

There's more…

Scikit-learn's kernel PCA object also supports other types of kernels, as follows (a short usage sketch is given after the list):

  • Linear
  • Polynomial
  • Sigmoid
  • Cosine
  • Precomputed
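Here is a brief, hypothetical sketch of switching kernels on the same circles data, x; only the kernel argument and its kernel-specific parameters change:

from sklearn.decomposition import KernelPCA

# Polynomial kernel of degree 3
kpca_poly = KernelPCA(kernel="poly", degree=3)
x_kpca_poly = kpca_poly.fit_transform(x)

# Cosine kernel
kpca_cosine = KernelPCA(kernel="cosine")
x_kpca_cosine = kpca_cosine.fit_transform(x)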

Scikit-learn also provides functions to generate other types of nonlinear data. The following is another example:

from sklearn.datasets import make_moons
x,y = make_moons(100)
plt.figure(5)
plt.title("Non Linear Data")
plt.scatter(x[:,0],x[:,1],c=y)
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.savefig('fig-7.png')
plt.show()

The data plot looks as follows:

[Figure: "Non Linear Data" scatter plot of the make_moons dataset]
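As a small follow-up exercise that is not shown in the original recipe, the same RBF kernel PCA can be applied to this moons data; the gamma value of 15 used below is only an assumption and may need tuning:

# Apply RBF kernel PCA to the moons data generated above
kpca_moons = KernelPCA(kernel="rbf", gamma=15)
x_kpca_moons = kpca_moons.fit_transform(x)

plt.figure(6)
plt.title("Kernel PCA - Moons")
plt.scatter(x_kpca_moons[:, 0], x_kpca_moons[:, 1], c=y)
plt.show()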