Most of the techniques in statistics are linear by nature, so in order to capture nonlinearity, we might need to apply some transformation. PCA is, of course, a linear transformation. In this recipe, we'll look at applying nonlinear transformations, and then apply PCA for dimensionality reduction.
Life would be so easy if data were always linearly separable, but unfortunately it's not. Kernel PCA can help to circumvent this issue. The data is first run through a kernel function that projects it onto a different space; then PCA is performed on the result.
To familiarize yourself with the kernel functions, it will be a good exercise to think of how to generate data that is separable by the kernel functions available in the kernel PCA. Here, we'll do that with the cosine kernel. This recipe will have a bit more theory than the previous recipes.
The cosine kernel works by comparing the angle between two samples represented in the feature space. It is useful when the magnitude of the vector perturbs the typical distance measure used to compare samples.
As a reminder, the cosine of the angle between two vectors A and B is given by the following:

cos(θ) = (A · B) / (‖A‖ ‖B‖)
This means that the cosine between A and B is the dot product of the two vectors normalized by the product of the individual norms. The magnitude of vectors A and B have no influence on this calculation.
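To make this concrete, here is a small sketch (with arbitrary made-up vectors) confirming that rescaling a vector leaves the cosine unchanged:

```python
import numpy as np

A = np.array([1.0, 2.0])
B = np.array([3.0, 1.0])

def cosine(u, v):
    # dot product normalized by the product of the individual norms
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# scaling A by any positive constant changes its magnitude,
# but not its angle to B, so the cosine is identical
print(cosine(A, B))
print(cosine(10 * A, B))
```

This magnitude invariance is exactly why the cosine kernel is attractive when vector lengths would otherwise dominate a distance-based comparison.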
So, let's generate some data and see how useful it is. First, we'll imagine there are two different underlying processes; we'll call them A and B:
>>> import numpy as np
>>> A1_mean = [1, 1]
>>> A1_cov = [[2, .99], [.99, 1]]  # covariance matrices must be symmetric
>>> A1 = np.random.multivariate_normal(A1_mean, A1_cov, 50)
>>> A2_mean = [5, 5]
>>> A2_cov = [[2, .99], [.99, 1]]
>>> A2 = np.random.multivariate_normal(A2_mean, A2_cov, 50)
>>> A = np.vstack((A1, A2))
>>> B_mean = [5, 0]
>>> B_cov = [[.5, -.45], [-.45, .5]]
>>> B = np.random.multivariate_normal(B_mean, B_cov, 100)
Once plotted, it will look like the following:
By visual inspection, it seems that the two classes are from different processes, but separating them in one slice might be difficult. So, we'll use the kernel PCA with the cosine kernel discussed earlier:
>>> from sklearn import decomposition
>>> kpca = decomposition.KernelPCA(kernel='cosine', n_components=1)
>>> AB = np.vstack((A, B))
>>> AB_transformed = kpca.fit_transform(AB)
Visualized in one dimension after the kernel PCA, the dataset looks like the following:
Contrast this with PCA without a kernel:
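For the comparison, a minimal self-contained sketch might look like the following (the means and covariances mirror the processes generated above, with a fixed seed so it runs standalone):

```python
import numpy as np
from sklearn import decomposition

rng = np.random.RandomState(0)
# stand-in data shaped like the A and B processes from this recipe
A = rng.multivariate_normal([1, 1], [[2, .99], [.99, 1]], 100)
B = rng.multivariate_normal([5, 0], [[.5, -.45], [-.45, .5]], 100)
AB = np.vstack((A, B))

# plain PCA: a purely linear projection onto one component
pca = decomposition.PCA(n_components=1)
AB_pca = pca.fit_transform(AB)

# kernel PCA with the cosine kernel, for contrast
kpca = decomposition.KernelPCA(kernel='cosine', n_components=1)
AB_kpca = kpca.fit_transform(AB)

print(AB_pca.shape, AB_kpca.shape)  # both (200, 1)
```

Plotting the two one-dimensional projections side by side makes the difference visible: the linear projection mixes the classes more than the cosine-kernel projection does.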
There are several other kernels available besides the cosine kernel, and you can even write your own kernel function. The available kernels are:

- poly (polynomial)
- rbf (radial basis function)
- sigmoid
- cosine
- precomputed
There are also options contingent on the kernel choice. For example, the degree argument specifies the degree of the poly kernel, while gamma affects the rbf, poly, and sigmoid kernels.
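A short sketch of how those arguments are passed (the data and parameter values here are arbitrary, chosen only for illustration):

```python
import numpy as np
from sklearn import decomposition

rng = np.random.RandomState(0)
X = rng.rand(20, 2)  # toy data

# degree applies to the poly kernel; gamma to rbf, poly, and sigmoid
kpca_poly = decomposition.KernelPCA(kernel='poly', degree=3, gamma=1.0,
                                    n_components=2)
kpca_rbf = decomposition.KernelPCA(kernel='rbf', gamma=0.5, n_components=2)

X_poly = kpca_poly.fit_transform(X)
X_rbf = kpca_rbf.fit_transform(X)
print(X_poly.shape, X_rbf.shape)  # (20, 2) (20, 2)
```

Arguments that don't apply to the chosen kernel are simply ignored, so it's worth checking which parameters your kernel actually uses.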
The recipe on SVM will cover the rbf kernel function in more detail.
A word of caution: kernel methods are great at creating separability, but they can also cause overfitting if used without care.