Applying PCA

Let's consider the following artificial dataset, which is visualized in the left-hand plot of the following figure:

>>> import numpy as np
>>> x1 = np.arange(0, 10, .2)
>>> x2 = x1 + np.random.normal(loc=0, scale=1, size=len(x1))
>>> X = np.c_[(x1, x2)]
>>> good = (x1 > 5) | (x2 > 5) # some arbitrary classes
>>> bad = ~good

Scikit-learn provides the PCA class in its decomposition package. In this example, we can clearly see that one dimension should be enough to describe the data. We can specify this using the n_components parameter:

>>> from sklearn import linear_model, decomposition, datasets
>>> pca = decomposition.PCA(n_components=1)  

As usual, we can use pca's fit() and transform() methods (or the fit_transform() shortcut) to analyze the data and project it into the transformed feature space:

>>> Xtrans = pca.fit_transform(X)  

As we have specified, Xtrans contains only one dimension. You can see the result in the right-hand plot of the preceding figure. The outcome is even linearly separable in this case, so we would not need a complex classifier to distinguish between the two classes.
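To make that point concrete, here is a minimal sketch, assuming the variables from the preceding snippets, that fits a plain logistic regression on the one-dimensional Xtrans and checks its training accuracy (the exact score depends on the random data):

>>> from sklearn.linear_model import LogisticRegression
>>> y = good.astype(int)      # encode the two classes as 0/1
>>> clf = LogisticRegression()
>>> clf.fit(Xtrans, y)
>>> clf.score(Xtrans, y)      # training accuracy; close to 1.0 if the classes are separable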

To get an understanding of the reconstruction error, we can have a look at the variance of the data that we have retained in the transformation:

>>> print(pca.explained_variance_ratio_)
[ 0.96393127]

This means that, after going from two dimensions to one, we are still left with 96 percent of the variance.
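If we want to look at the reconstruction error itself rather than the retained variance, a quick sketch, again assuming the variables from above, is to map the projected data back with inverse_transform() and compare it to the original (the exact value depends on the random data):

>>> X_reconstructed = pca.inverse_transform(Xtrans)   # map back to two dimensions
>>> error = ((X - X_reconstructed) ** 2).mean()       # mean squared reconstruction error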

Of course, it's not always this simple. Frequently, we don't know upfront how many dimensions are advisable. In that case, we leave the n_components parameter unspecified when initializing PCA to let it calculate the full transformation. After fitting the data, explained_variance_ratio_ contains an array of ratios in decreasing order: the first value is the ratio of the basis vector describing the direction of the highest variance, the second value is the ratio of the direction of the second highest variance, and so on. After plotting this array, we quickly get a feel for how many components we will need: the number of components immediately before the chart elbow is often a good guess.
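A minimal sketch of this approach, using matplotlib for the plot, could look as follows. On our toy dataset there are only two components, but the same recipe applies to higher-dimensional data:

>>> import matplotlib.pyplot as plt
>>> full_pca = decomposition.PCA()               # no n_components: keep all components
>>> full_pca.fit(X)
>>> plt.plot(full_pca.explained_variance_ratio_) # look for the elbow in this curve
>>> plt.xlabel("component")
>>> plt.ylabel("explained variance ratio")
>>> plt.show()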

Plots displaying the explained variance over the number of components are called scree plots. A nice example of combining a scree plot with a grid search to find the best setting for the classification problem can be found at http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html.
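In the spirit of that example, a sketch of such a grid search over n_components might look like the following. This is only an illustration: it combines PCA with a logistic regression in a Pipeline, and the higher-dimensional data X_hi, y_hi is a hypothetical placeholder, not something defined earlier in this section:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.linear_model import LogisticRegression
>>> pipe = Pipeline([('pca', decomposition.PCA()),
...                  ('clf', LogisticRegression())])
>>> param_grid = {'pca__n_components': [1, 2, 5, 10]}
>>> search = GridSearchCV(pipe, param_grid, cv=5)
>>> search.fit(X_hi, y_hi)                     # X_hi, y_hi: assumed higher-dimensional data
>>> search.best_params_['pca__n_components']   # best number of components found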