Principal component analysis

Principal component analysis (PCA) transforms the attributes of unlabeled data using a simple rearrangement and rotation into a new set of axes. In the rotated representation, directions along which the data shows little variation carry little information, and those dimensions can often be dropped; this is how PCA reduces dimensionality. For instance, if a dataset looks like an ellipse tilted at an angle to the axes, then in the rotated representation the data spreads mainly along the new x axis and shows almost no variation along the new y axis, so that second dimension can safely be ignored.
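As a concrete illustration, the following minimal sketch uses synthetic two-dimensional data (generated here purely for illustration) in which the two attributes are strongly correlated, so the point cloud forms a tilted ellipse. After the PCA rotation, almost all of the variance lies along the first axis and the second axis can be dropped with little loss:

import numpy as np
from sklearn.decomposition import PCA

# Synthetic, strongly correlated 2-D data: a point cloud shaped like
# an ellipse tilted at an angle to the coordinate axes
rng = np.random.RandomState(0)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.1, size=500)
data = np.column_stack([x, y])

# PCA rotates the data onto its directions of maximal variance
pca = PCA(n_components=2)
rotated = pca.fit_transform(data)

# Nearly all of the variance sits along the first rotated axis,
# so the second dimension can be discarded with little loss
print(pca.explained_variance_ratio_)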

k-means clustering is appropriate for clustering unlabeled data. Sometimes, PCA is used first to project the data to a much lower dimension, and other methods, such as k-means, are then applied in the smaller, reduced space, as the sketch below illustrates.
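A minimal sketch of that workflow, using the Iris data purely as a stand-in and illustrative parameter values (two components, three clusters), might look like this:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = load_iris().data                       # four-dimensional measurements

# Project the data down to two dimensions with PCA ...
X_reduced = PCA(n_components=2).fit_transform(X)

# ... and then run k-means in the smaller, reduced space
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_reduced)
print(labels[:10])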

However, it is important to perform dimension reduction carefully, because any reduction can lose information; the algorithm must preserve the useful part of the data while discarding the noise. Here, we will motivate PCA from two perspectives and explain why preserving maximal variability makes sense:

  • Correlation and redundancy
  • Visualization

Suppose that we collect data about students on a campus, with details such as gender, height, weight, TV time, sports time, study time, GPA, and so on. While surveying the students along these dimensions, we notice that height and weight are correlated: usually, the taller the student, the greater the weight (partly due to bone mass), and vice versa. This will probably not hold as strongly in a larger population (more weight does not necessarily mean taller). The correlation can be visualized as follows:

import matplotlib.pyplot as plt
import csv

gender = []
x = []   # heights
y = []   # weights

# Read the survey data; the first row is a header and is skipped
with open('/Users/kvenkatr/height_weight.csv', 'r') as csvf:
    reader = csv.reader(csvf, delimiter=',')
    for count, row in enumerate(reader):
        if count == 0:
            continue
        gender.append(0 if row[0] == "f" else 1)
        x.append(float(row[1]))   # height
        y.append(float(row[2]))   # weight

# Scatter plot of weight against height, colored by gender
plt.figure(figsize=(11, 11))
plt.scatter(y, x, c=gender, s=300)
plt.grid(True)
plt.xlabel('Weight', fontsize=18)
plt.ylabel('Height', fontsize=18)
plt.title("Height vs Weight (College Students)", fontsize=20)

plt.show()
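Beyond the plot, the redundancy between the two attributes can be quantified. The following minimal sketch reuses the x (height) and y (weight) lists built by the code above and computes the Pearson correlation coefficient with NumPy:

import numpy as np

# A coefficient close to 1 means height and weight carry largely
# overlapping (redundant) information for this sample of students
corr = np.corrcoef(x, y)[0, 1]
print("Pearson correlation between height and weight: %.2f" % corr)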

Using sklearn again, with its preprocessing, datasets, and decomposition packages, you can write a simple visualization as follows:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

data = load_iris()
X = data.data

# Convert the first feature (sepal length) from cm to inches
X[:, 0] /= 2.54
# Convert the second feature (sepal width) from cm to meters,
# so the features now live on very different scales
X[:, 1] /= 100

def scikit_pca(X):

    # Standardize the features to zero mean and unit variance
    X_std = StandardScaler().fit_transform(X)

    # PCA: project onto the first two principal components
    sklearn_pca = PCA(n_components=2)
    X_transf = sklearn_pca.fit_transform(X_std)

    # Plot the projected data
    plt.figure(figsize=(11, 11))
    plt.scatter(X_transf[:, 0], X_transf[:, 1], s=600, color='#8383c4', alpha=0.56)
    plt.title('PCA via scikit-learn (using SVD)', fontsize=20)
    plt.xlabel('Principal Component 1', fontsize=15)
    plt.ylabel('Principal Component 2', fontsize=15)
    plt.show()

scikit_pca(X)

This plot shows PCA using the scikit-learn package:

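To check how much of the original variability the two-dimensional projection preserves, you can inspect the explained_variance_ratio_ attribute of a fitted PCA object. The following short sketch repeats the standardization and PCA fit from scikit_pca above on the Iris data:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_std = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2)
pca.fit(X_std)

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
# Their sum is the share of information retained by the 2-D projection
print(pca.explained_variance_ratio_.sum())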

Installing scikit-learn

The following command installs the scikit-learn package:

$ conda install scikit-learn
Fetching package metadata: ....
Solving package specifications: .
Package plan for installation in environment /Users/myhomedir/anaconda:

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    nose-1.3.7                 |           py27_0         194 KB
    setuptools-18.0.1          |           py27_0         341 KB
    pip-7.1.0                  |           py27_0         1.4 MB
    scikit-learn-0.16.1        |       np19py27_0         3.3 MB
    ------------------------------------------------------------
                                           Total:         5.2 MB

The following packages will be UPDATED:

    nose:         1.3.4-py27_1      --> 1.3.7-py27_0     
    pip:          7.0.3-py27_0      --> 7.1.0-py27_0     
    scikit-learn: 0.15.2-np19py27_0 --> 0.16.1-np19py27_0
    setuptools:   17.1.1-py27_0     --> 18.0.1-py27_0    

Proceed ([y]/n)? y
Fetching packages ...

With Anaconda, everything on the command line goes through conda, so you can install scikit-learn with conda as shown above. Otherwise, the default route is pip install; in either case, check the scikit-learn documentation for installation instructions. Since scikit-learn is popular and has been around for a while, the installation procedure has not changed much. In the following section, we will explore k-means clustering to conclude this chapter.
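For example, outside of Anaconda the package can typically be installed with:

$ pip install scikit-learn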
