Principal component analysis

Principal component analysis (PCA) is a technique that helps define a smaller and more relevant set of features. The new features obtained from PCA are linear combinations (that is, a rotation) of the current features, even when the inputs are binary. After the rotation of the input space, the first vector of the output set contains most of the signal's energy (in other words, its variance). The second is orthogonal to the first and contains most of the remaining energy; the third is orthogonal to the first two and contains most of the energy that is still left, and so on. In effect, PCA restructures the information in the dataset by concentrating as much of it as possible onto the initial vectors it produces.

In the (ideal) case of additive white Gaussian noise (AWGN), the initial vectors contain all of the information of the input signal, while the ones toward the end contain only noise. Moreover, since the output basis is orthogonal, you can decompose the dataset and then synthesize an approximate version of it from a subset of the vectors. The key parameter used to decide how many basis vectors to keep is the energy. Since the algorithm is, under the hood, based on singular value decomposition, eigenvectors (the basis vectors) and eigenvalues (the variance associated with each vector) are two terms you will often come across when reading about PCA. Typically, the cardinality of the output set is the one that guarantees the presence of 95% (in some cases, 90% or 99% is required) of the input energy (or variance). A rigorous explanation of PCA is beyond the scope of this book; here, we will just provide guidelines on how to use this powerful tool in Python.
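
A quick, hands-on way to see these quantities is to fit a PCA that keeps all of the components and inspect its eigenvalues and the cumulative explained variance ratio. The following is just a minimal sketch of ours (the variable name pca_full is our own, and it assumes that the Iris dataset has already been loaded into the iris variable, as in the rest of this section):

In: import numpy as np
from sklearn.decomposition import PCA
# keep all of the components, just to inspect them
pca_full = PCA()
pca_full.fit(iris.data)
# eigenvalues: the variance along each eigenvector (basis vector)
print(pca_full.explained_variance_)
# cumulative share of the energy: the first position reaching 0.95
# tells you how many basis vectors to keep
print(np.cumsum(pca_full.explained_variance_ratio_))

The cumulative ratio grows quickly at first and then flattens out; the position where it crosses the 95% threshold is the number of components worth keeping.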

Here's an example of how to reduce the dataset to two dimensions. In the previous section, we deduced that 2 was a good choice for dimensionality reduction; let's check whether we were right:

In: from sklearn.decomposition import PCA
pca_2c = PCA(n_components=2)
X_pca_2c = pca_2c.fit_transform(iris.data)
X_pca_2c.shape

Out: (150, 2)

In: plt.scatter(X_pca_2c[:,0], X_pca_2c[:,1], c=iris.target, alpha=0.8,
s=60, marker='o', edgecolors='white')
plt.show()
pca_2c.explained_variance_ratio_.sum()

Out: 0.97763177502480336

When executing the code, you also get a scatterplot of the first two components:

Scatterplot of the first two components

We can immediately see that, after applying PCA, the output set has only two features. This is because the PCA() object was called with the n_components parameter set to 2. An alternative way to obtain the same result would be to run PCA() for 1, 2, and 3 components and then conclude, from the explained variance ratio and visual inspection, that n_components=2 gives the best result. We then have evidence that, using two basis vectors, the output dataset contains almost 98% of the energy of the input signal, and, in the scatterplot, the classes are quite neatly separable: each color occupies a different area of the two-dimensional Euclidean space.

Please note that this process is automatic and you don't need to provide labels while training PCA. In fact, PCA is an unsupervised algorithm, and it does not need the target variable (the labels) in order to rotate the projection basis.

For curious readers, the transformation matrix (which turns the initial dataset into the PCA-restructured one) can be inspected with the following code:

In: pca_2c.components_

Out: array([[ 0.36158968, -0.08226889, 0.85657211, 0.35884393],
[-0.65653988, -0.72971237, 0.1757674 , 0.07470647]])

The transformation matrix consists of four columns (the number of input features) and two rows (the number of reduced features).
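
As a quick check (this is our own sketch), you can verify that the reduced dataset is nothing more than the mean-centered input multiplied by the transpose of this matrix:

In: import numpy as np
# the PCA output is a linear combination of the centered features:
# X_pca = (X - mean) * components^T
X_centered = iris.data - pca_2c.mean_
print(np.allclose(X_centered.dot(pca_2c.components_.T), X_pca_2c))

Out: True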

Sometimes, you will find that PCA is not effective enough, especially with high-dimensional data, since the features may be strongly correlated while, at the same time, their variances are unbalanced. A possible solution in such a situation is to whiten the signal (that is, make it more spherical). In this case, the projected components are rescaled to have unit component-wise variance. Whitening removes some information, but it sometimes improves the accuracy of the machine learning algorithms applied after the PCA reduction. Here's what the code looks like when resorting to whitening (in our example, it changes nothing except the scale of the reduced output dataset):

In: pca_2cw = PCA(n_components=2, whiten=True)
X_pca_2cw = pca_2cw.fit_transform(iris.data)
plt.scatter(X_pca_2cw[:,0], X_pca_2cw[:,1], c=iris.target, alpha=0.8,
s=60, marker='o', edgecolors='white')
plt.show()
pca_2cw.explained_variance_ratio_.sum()

Out: 0.97763177502480336

You also get the scatterplot of the first two components of the PCA when using whitening:
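
To convince yourself that whitening did what we claimed, you can compare the per-component variances before and after (a small check of our own; it reuses the X_pca_2c and X_pca_2cw arrays computed above):

In: # without whitening, the component variances are the eigenvalues;
# with whitening, they are rescaled to (approximately) unit variance
print(X_pca_2c.var(axis=0, ddof=1))
print(X_pca_2cw.var(axis=0, ddof=1))

The first print shows the two eigenvalues; the second should show values very close to 1 for both components.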

Now, let's see what happens if we project the input dataset onto a 1-D space generated with PCA, as follows:

In: pca_1c = PCA(n_components=1)
X_pca_1c = pca_1c.fit_transform(iris.data)
plt.scatter(X_pca_1c[:,0], np.zeros(X_pca_1c.shape),
c=iris.target, alpha=0.8, s=60, marker='o', edgecolors='white')
plt.show()
pca_1c.explained_variance_ratio_.sum()

Out: 0.9246162071742684

The projection is distributed along a single horizontal line:

In this case, the retained energy is lower (92.4% of the original signal), and the output points now lie in a one-dimensional Euclidean space. This might not be a great feature reduction step, since many points with different labels end up mixed together.
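
Since the basis is orthogonal, you can also measure what was lost by synthesizing the data back with inverse_transform and looking at the reconstruction error. Here is a minimal sketch of ours that compares the one-component and two-component reconstructions (it reuses the pca_1c and pca_2c objects fitted earlier):

In: import numpy as np
# map the reduced points back to the original four-dimensional space
X_back_1c = pca_1c.inverse_transform(X_pca_1c)
X_back_2c = pca_2c.inverse_transform(X_pca_2c)
# mean squared reconstruction error: lower is better
print(np.mean((iris.data - X_back_1c)**2))
print(np.mean((iris.data - X_back_2c)**2))

The error for the single component is noticeably larger, which is another way of saying that the second basis vector still carries useful information.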

Finally, here's a trick. To ensure that you generate an output set containing at least 95% of the input energy, you can simply pass this value as the n_components parameter when the PCA object is first created. A result equal to the one obtained with two vectors can be produced by the following code:

In: pca_95pc = PCA(n_components=0.95)
X_pca_95pc = pca_95pc.fit_transform(iris.data)
print(pca_95pc.explained_variance_ratio_.sum())
print(X_pca_95pc.shape)

Out: 0.977631775025
(150, 2)