Unsupervised learning algorithms

There are two tasks that we are mostly concerned with in unsupervised learning: dimensionality reduction and clustering.

Dimensionality reduction

Dimensionality reduction is used to help visualize higher-dimensional data in a systematic way. This is useful because our human brains can visualize only three spatial dimensions (and, possibly, a temporal one), whereas most datasets involve far more dimensions than that.

The typical technique used for dimensionality reduction is Principal Component Analysis (PCA). PCA uses linear algebra to project higher-dimensional data onto a lower-dimensional space. This inevitably involves some loss of information, but by projecting along the right set (and number) of directions, that loss can be kept to a minimum. The standard approach is to find the combinations of variables that explain the most variance (a proxy for information) in the data and project along those directions.
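As a rough sketch of the underlying idea (using a small made-up 2-D array and plain NumPy rather than scikit-learn), the variance-maximizing directions are the leading eigenvectors of the data's covariance matrix:

import numpy as np

# toy 2-D data, purely for illustration
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
X_centered = X - X.mean(axis=0)                  # center each feature
cov = np.cov(X_centered, rowvar=False)           # 2 x 2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
top = eigvecs[:, np.argsort(eigvals)[::-1][:1]]  # direction of largest variance
X_projected = X_centered.dot(top)                # 1-D projection keeping the most variance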

In unsupervised learning problems, we do not have a set of labels (Y), so we call fit() on the input data X alone; for PCA, we then call transform() rather than predict(), since we are trying to transform the data into a new representation rather than predict anything.
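As a minimal sketch of this API pattern (on a small random array used purely as a stand-in for real data):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(10, 5)       # 10 samples, 5 features; no labels anywhere
pca = PCA(n_components=2)
pca.fit(X)                      # learn the projection from X alone
X_2d = pca.transform(X)         # new two-dimensional representation of X
# pca.fit_transform(X) would combine the two steps into a single call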

One of the datasets that we will be using to demonstrate unsupervised learning is the iris dataset, possibly the most famous dataset in all of machine learning.

The scikit-learn library provides a set of pre-packaged datasets, which are available via the sklearn.datasets module. The iris dataset is one of them.

The iris dataset consists of 150 samples from three different species of iris flowers (versicolor, setosa, and virginica), with 50 samples of each species. The dataset consists of four features/dimensions:

  • petal length
  • petal width
  • sepal length
  • sepal width

The length and width values are in centimeters. It can be loaded as follows:

from sklearn.datasets import load_iris
iris_data = load_iris()

In our examination of unsupervised learning, we will be focusing on how to visualize and cluster this data.

Before discussing unsupervised learning, let us examine the iris data a bit. The load_iris() call returns what is known as a Bunch object, which is essentially a dictionary that holds the data array itself along with several metadata keys. Hence, we have the following:

In [2]: iris_data.keys()
Out[2]: ['target_names', 'data', 'target', 'DESCR', 'feature_names']

Further, the data itself looks similar to the following:

In [3]: iris_data.data.shape
Out[3]: (150, 4)

This corresponds to 150 samples of four features. These four features are shown as follows:

In [4]: print iris_data.feature_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

We can also take a peek at the actual data:

In [9]: print iris_data.data[:2]
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]]

Our target names (the labels that we would be trying to predict in a supervised setting) are as follows:

In [10]: print iris_data.target_names
         ['setosa' 'versicolor' 'virginica']

As noted earlier, the iris feature set corresponds to four-dimensional data, which we cannot visualize directly in a single plot. One thing that we can do is pick two features at a time and plot them against each other, while using color to differentiate between the species. We do this next for all the possible combinations of features, selecting two at a time for a set of six different possibilities. These combinations are as follows:

  • Sepal width versus sepal length
  • Sepal width versus petal width
  • Sepal width versus petal length
  • Sepal length versus petal width
  • Sepal length versus petal length
  • Petal width versus petal length

The code for this may be found in the following file: display_iris_dimensions.py. From the preceding plots, we can observe that the setosa points tend to be clustered by themselves, while there is a bit of overlap between the virginica and the versicolor points. This may lead us to conclude that the latter two species are more closely related to one another than to the setosa species.
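The contents of display_iris_dimensions.py are not reproduced here, but a plot of this kind could be generated along the following lines (this loop over feature pairs is our own sketch, not necessarily the file's exact code):

import itertools
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris_data = load_iris()
X, y = iris_data.data, iris_data.target
pairs = list(itertools.combinations(range(4), 2))     # the six feature pairs

fig, axes = plt.subplots(2, 3, figsize=(12, 7))
for ax, (i, j) in zip(axes.ravel(), pairs):
    # plot each species in its own color for the current pair of features
    for target, color in zip(range(3), ['red', 'yellow', 'magenta']):
        ax.scatter(X[y == target, i], X[y == target, j], c=color,
                   label=iris_data.target_names[target])
    ax.set_xlabel(iris_data.feature_names[i])
    ax.set_ylabel(iris_data.feature_names[j])
axes[0, 0].legend()
plt.tight_layout()
plt.show()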

These are, however, two-dimensional slices of data. What if we wanted a somewhat more holistic view of the data, with some representation of all four sepal and petal dimensions?

What if there were some hitherto undiscovered connection between the dimensions that our two-dimensional plots weren't showing? Is there a means of visualizing this? Enter dimensionality reduction: we will use it to extract two combinations of the sepal and petal dimensions to help us visualize the data.

We can apply dimensionality reduction to do this as follows:

In [118]: X, y = iris_data.data, iris_data.target
          from sklearn.decomposition import PCA
          pca = PCA(n_components=2)
          pca.fit(X)
          X_red = pca.transform(X)
          print "Shape of reduced dataset:%s" % str(X_red.shape)

          Shape of reduced dataset:(150, 2)

Thus, we see that the reduced dataset is now two-dimensional. Let us display it visually as follows:

In [136]: figsize(8,6)
          fig=plt.figure()
          fig.suptitle("Dimensionality reduction on iris data")
          ax=fig.add_subplot(1,1,1)
          colors=['red','yellow','magenta']
          cols=[colors[i] for i in iris_data.target]
          ax.scatter(X_red[:,0], X_red[:,1], c=cols)
Out[136]:
<matplotlib.collections.PathCollection at 0x7fde7fae07d0>

We can examine the makeup of the PCA-reduced two dimensions as follows:

In [57]:
print "Dimension Composition:"
idx=1
for comp in pca.components_:
    print "Dim %s" % idx
    print " + ".join("%.2f x %s" % (value, name)
                     for value, name in zip(comp, iris_data.feature_names))
    idx += 1

Dimension Composition:
Dim 1
0.36 x sepal length (cm) + -0.08 x sepal width (cm) + 0.86 x petal length (cm) + 0.36 x petal width (cm)
Dim 2
-0.66 x sepal length (cm) + -0.73 x sepal width (cm) + 0.18 x petal length (cm) + 0.07 x petal width (cm)

Thus, we can see that the two reduced dimensions are a linear combination of all four sepal and petal dimensions.
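We can also check how much of the original variance these two components retain via the explained_variance_ratio_ attribute of the fitted PCA object; for the iris data, the first component alone accounts for roughly 92 percent of the variance, with the second adding about another 5 percent:

print(pca.explained_variance_ratio_)   # fraction of total variance per retained component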

The source of this information is at: https://github.com/jakevdp/sklearn_pycon2014.

K-means clustering

The idea behind clustering is to group similar points in a dataset together on the basis of a given criterion, thereby discovering clusters in the data.

The K-means algorithm aims to partition a set of data points into K clusters such that each data point belongs to the cluster with the nearest mean point or centroid.
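As a rough sketch of what the algorithm does internally, here is a simplified Lloyd's-style iteration written with plain NumPy for illustration; this is not scikit-learn's actual implementation, and it glosses over details such as empty clusters and smarter initialization:

import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # initialize the centroids at k randomly chosen data points
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point joins the cluster with the nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # update step: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break                       # centroids have stopped moving
        centroids = new_centroids
    return labels, centroids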

To illustrate K-means clustering, we apply it to the set of reduced iris data that we obtained via PCA. Note that, unlike in supervised learning, we do not pass the actual labels to the fit() method:

In [142]: from sklearn.cluster import KMeans
          k_means = KMeans(n_clusters=3, random_state=0)
          k_means.fit(X_red)
          y_pred = k_means.predict(X_red)

We now display the clustered data as follows:

In [145]: figsize(8,6)
          fig=plt.figure()
          fig.suptitle("K-Means clustering on PCA-reduced iris data, K=3")
          ax=fig.add_subplot(1,1,1)
          ax.scatter(X_red[:, 0], X_red[:, 1], c=y_pred);

Note that the clusters found by K-means do not correspond exactly to the actual species groupings visible in the PCA plot. The source code is available at https://github.com/jakevdp/sklearn_pycon2014.
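One way to quantify this mismatch (not part of the original example) is to compare the cluster assignments with the true species labels, for instance using the adjusted Rand index from sklearn.metrics; a value of 1.0 would mean a perfect match, while values near 0 indicate essentially random agreement:

from sklearn.metrics import adjusted_rand_score

print(adjusted_rand_score(iris_data.target, y_pred))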

Note

More information on K-means clustering in scikit-learn, and on the algorithm in general, can be found at:

http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html and http://en.wikipedia.org/wiki/K-means_clustering.
