Chapter 13

Creating Basic Examples of Unsupervised Predictions

IN THIS CHAPTER

Working with a sample dataset

Creating simple predictive models using clustering algorithms

Visualizing and evaluating your results

This chapter is about creating a few simple predictive models using unsupervised learning with clustering algorithms such as K-means, DBSCAN, and mean shift. These examples use the Python programming language, version 2.7.4, on a Windows machine. See Chapter 12 if you need instructions on installing Python and the scikit-learn machine-learning package.

No prior knowledge of supervised learning is required to understand the concepts of unsupervised learning. Supervised learning is when the output categories are known in the historical data; unsupervised learning is when the output categories are unknown. Chapter 12 covers examples of supervised learning with classification and regression algorithms.

You can read Chapters 12 and 13 independently. One advantage of reading both chapters in the same session is that you'll be able to reuse the work that you did to load the Iris dataset into the Python interpreter (the command line where you enter the code statements or commands). So if you're continuing from Chapter 12, you may skip the next section.

Getting the Sample Dataset

The sample Iris dataset is included in the installation of scikit-learn — along with a set of functions that load data into the Python session.

To load the Iris dataset, follow these steps:

Open a new Python interactive shell session.

Use a new Python session so there isn't anything left over in memory and you have a clean slate to work with.
Paste the following code at the prompt and press Enter:
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()

After you run those two statements, you shouldn't see any messages from the interpreter. The variable iris should contain all the data from the iris.csv file.
Enter the following command to confirm that variable iris contains the data:
>>> iris

The command prints out a verbose description of the Iris dataset, followed by a list of all the data members. Please refer to Table 12-3 for the main properties and descriptions of the iris variable.

You don't use a training dataset for an unsupervised learning task because you normally don't know the outcomes. Hence the dataset isn't labeled and the clustering algorithm doesn't accept a target value in its creation.

Using Clustering Algorithms to Make Predictions

Using a clustering algorithm to create an unsupervised learning model generally entails the following steps:

1. Prepare and load the data.
2. Fit the model.
3. Visualize the clusters.
4. Tune the parameters.
5. Repeat Steps 2 to 4 until you get the clustering output that you think yields the best results.
6. Evaluate the model.

Comparing clustering models

Unsupervised learning has many challenges, including not knowing what to expect when you run an algorithm. Each algorithm will produce different results; until you try a variety of solutions, you can't know which one will work. When you've really nailed it, though, you'll know it when you see it: viewed through the lens of the business problem, the solution either provides value or it doesn't.

In the case of the Iris dataset, you know what the outcomes should be; as a result, you can tweak the algorithms to produce the desired outcomes. In real-world datasets, you won't have this luxury. You'll have to depend on some prior knowledge of the data (or intuition by the domain expert) to decide which initialization parameters and algorithms to use as you create your model.

We are starting with the famous Iris dataset because it makes the patterns especially easy to see, but you will face more challenges in real-world situations, where the outcomes are unknown and the desired result is difficult to find. For example, in K-means, choosing the right number of clusters is the key problem. If you find the right number of clusters, your data will yield insights with which you can make highly accurate predictions. On the flip side, choosing the wrong number of clusters may yield subpar results.

The K-means algorithm is a good choice for datasets that have a small number of clusters with proportional sizes and linearly separable data, and you can scale it up to use the algorithm on very large datasets.

Think of linearly separable data as a bunch of points in a graph that can be separated using a straight line. If the data isn't linearly separable, then more advanced versions of K-means will have to be employed — which will become more expensive computationally and may not be suitable for very large datasets. In its standard implementation, the complexity to compute the cluster centers and distances is low.
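To make the standard implementation more concrete, here is a minimal pure-Python sketch of the assign-and-update loop that K-means repeats. It is an illustration only, not scikit-learn's code: the toy points, the starting centers, and the helper names (squared_distance, assign, update, kmeans_sketch) are all assumptions made for this example. Each pass compares every point with every center, which is why the cost of a pass grows with the number of points times the number of clusters.

```python
# Minimal sketch of the standard K-means loop on toy 2-D points.
# The data, starting centers, and helper names are illustrative assumptions.

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    # Label each point with the index of its nearest center.
    return [min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points]

def update(points, labels, centers):
    # Move each center to the mean of the points assigned to it.
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            new_centers.append(old)  # an empty cluster keeps its old center
            continue
        new_centers.append(tuple(sum(m[d] for m in members) / float(len(members))
                                 for d in range(len(old))))
    return new_centers

def kmeans_sketch(points, centers, max_iter=100):
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:  # converged: the centers stopped moving
            break
        centers = new_centers
    return labels, centers

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
toy_labels, toy_centers = kmeans_sketch(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(toy_labels)
print(toy_centers)
```

On these six points the centers stop moving on the second pass, and the two well-separated groups end up in different clusters.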

K-means is widely employed to solve big-data problems because it's simple to use, effective, and highly scalable. No wonder most commercial vendors use the K-means algorithm as a key component of their predictive analytics packages.

The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and mean-shift implementations in scikit-learn don't require any user-defined initialization parameters to create an instance. You can override the default parameters during initialization if you want. Unfortunately, if you're using the default parameters, the algorithms can't provide a close match to the desired outcome.

Neither DBSCAN nor mean shift performs well with the Iris dataset, however. Even after exhaustive tweaking of the initialization parameters, it's still very hard to get an output that mimics the known outcomes for Iris. DBSCAN is better suited for datasets that have disproportional cluster sizes, and whose data can be separated in a non-linear fashion. While mean shift can handle arbitrary shapes and sizes very well, its standard implementation isn't very scalable, because it is an O(n²) algorithm, and it may not work well on high-dimensional data. Like K-means, DBSCAN is scalable, but using it on very large datasets requires more memory and computing power. You get a closer look at DBSCAN and mean shift in action later in this chapter.

Creating an unsupervised learning model with K-means

The K-means algorithm normally expects one initialization parameter from the user in order to create an instance. It needs to know how many clusters (K) to use to perform its work. The K-means implementation in scikit-learn uses the default of K=8 (n_clusters=8) if the user doesn't provide a value.

Because you're using the Iris dataset, you already know that it has three clusters. As described in Chapter 12, the Iris dataset has three classes of the Iris flower (Setosa, Versicolor, and Virginica). In general, when you're creating an unsupervised learning task with a clustering algorithm, you wouldn't know how many clusters to specify. Some algorithms try to determine the best number of clusters by iterating through a range of clusters and then selecting the number of clusters that best fits its mathematical criteria.

The best way to get immediate results is to make an educated guess about the number of clusters to use — basing your estimate on features present in the data (whether one or multiple features), or on some other knowledge of the data you may have from the business domain expert.

This falling back on guesswork (even educated guesswork) is a major limitation of the K-means clustering algorithm. An upcoming section explores a couple of other clustering algorithms, DBSCAN and mean shift, that don't need the number of clusters in order to do their work.

Running the full dataset

To create an instance of the K-means clustering algorithm and run the data through it, type the following code in the interpreter.

>>> from sklearn.cluster import KMeans
>>> kmeans = KMeans(n_clusters=3, random_state=111)
>>> kmeans.fit(iris.data)

The first line of code imports the KMeans class into the session. The second line creates the model and stores it in a variable named kmeans. The model is created with the number of clusters set to 3. The third line fits the model to the Iris data. Fitting the model is the core part of the algorithm: it produces the three clusters from the given dataset by computing three cluster centers and assigning each observation to the nearest one. To see the clusters that the algorithm produces, type the following code.

>>> kmeans.labels_

The output should look similar to this:

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])

This is how the K-means algorithm labels the data as belonging to clusters, without input from the user about the target values. Here the only thing K-means knew was what we provided it: the number of clusters. This result shows how the algorithm viewed the data, and what it learned about the relationships of data items to each other — hence the term unsupervised learning.

You can see right away that some of the data points were mislabeled. You know, from the Iris dataset, what the target values should be:

The first 50 observations should be labeled the same (as 1s in this case).

This range is known as the Setosa class.
Observations 51 to 100 should be labeled the same (as 0s in this case).

This range is known as the Versicolor class.
Observations 101 to 150 should be labeled the same (as 2s in this case).

This range is known as the Virginica class.

It doesn't matter whether K-means labeled each set of 50 with a 0, 1, or 2. As long as each set of 50 has the same label, it accurately predicted the outcome. It's up to you to give each cluster a name and to find meaning in each cluster. If you run the K-means algorithm again, it may produce an entirely different number for each set of 50 — but the meaning would be the same for each set (class).
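The point that the label numbers themselves carry no meaning can be checked mechanically. The following is a small sketch, assuming a hypothetical helper named same_partition and short made-up label lists: two labelings count as the same clustering when they group the same observations together, whatever numbers they use.

```python
# Two labelings describe the same clustering if they group the same points together.
# The helper name and the short label lists are illustrative assumptions.

def same_partition(labels_a, labels_b):
    if len(labels_a) != len(labels_b):
        return False
    forward, backward = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Every label in one run must map to exactly one label in the other run.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True

run_one = [1, 1, 1, 0, 0, 2, 2]
run_two = [0, 0, 0, 2, 2, 1, 1]    # same groups, different numbers
run_three = [0, 0, 1, 2, 2, 1, 1]  # the third point moved to another group

print(same_partition(run_one, run_two))    # True
print(same_partition(run_one, run_three))  # False
```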

You can create a K-means model that can generate the same output each time by passing the random_state parameter with a fixed seed value to the function that creates the model. The algorithm depends on randomness to initialize the cluster centers. Providing a fixed seed value takes away the randomness. Doing so essentially tells K-means to select the same initial data points to initialize the cluster centers, every time you run the algorithm. It's possible to get a different outcome by removing the random_state parameter from the function.
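The effect of a fixed seed can be illustrated without scikit-learn. This sketch uses Python's standard random module and a hypothetical helper, pick_initial_centers, that simply samples starting centers from the data; scikit-learn's own initialization is more elaborate, so treat this only as a picture of why a fixed seed makes the starting points repeatable.

```python
import random

def pick_initial_centers(points, k, seed=None):
    # A seeded generator returns the same "random" choices on every call.
    rng = random.Random(seed)
    return rng.sample(points, k)

# A few made-up two-feature observations (illustrative values only).
points = [(5.1, 3.5), (4.9, 3.0), (6.2, 2.9), (5.9, 3.0), (6.7, 3.1), (7.0, 3.2)]

first = pick_initial_centers(points, k=3, seed=111)
second = pick_initial_centers(points, k=3, seed=111)
print(first == second)  # True: same seed, same starting centers
```

Calling the helper with seed=None draws from an unseeded generator, so repeated calls may pick different starting centers.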

Visualizing the clusters

As mentioned in Chapter 12, the Iris dataset isn't easy to graph in its original form. Therefore, you have to reduce the number of dimensions by applying a dimensionality reduction algorithm that operates on all four fields and outputs two new numbers (that represent the original four fields) that you can use to do the plot.

You can see that the current shape of the Iris dataset is 150 rows, with each row having 4 fields. You can find the shape by entering this line of code:

>>> iris.data.shape
(150, 4)

The following code will do the dimension reduction:

>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=2).fit(iris.data)
>>> pca_2d = pca.transform(iris.data)

After you run the dimensionality reduction code, the transformed Iris data is stored in a new variable named pca_2d. You can verify that the shape has been transformed to two dimensions:

>>> pca_2d.shape
(150, 2)

You can also type the pca_2d variable into the interpreter and it will output arrays (think of an array as a container that stores items in a list) with two fields instead of four. Now that you have the reduced feature set, you can plot the results with the following code:

>>> import pylab as pl
>>> pl.figure('Figure 13-1')
>>> for i in range(0, pca_2d.shape[0]):
...     if iris.target[i] == 0:
...         c1 = pl.scatter(pca_2d[i,0],pca_2d[i,1],c='r', marker='+')
...     elif iris.target[i] == 1:
...         c2 = pl.scatter(pca_2d[i,0],pca_2d[i,1],c='g', marker='o')
...     elif iris.target[i] == 2:
...         c3 = pl.scatter(pca_2d[i,0],pca_2d[i,1],c='b', marker='*')
...
>>> pl.legend([c1, c2, c3], ['Setosa', 'Versicolor', 'Virginica'])
>>> pl.title('Iris dataset with 3 clusters and known outcomes')
>>> pl.show()

FIGURE 13-1: Plotting data elements from the Iris dataset.

The output of this code is a plot that should be similar to Figure 13-1. The plot shows what the known outcomes of the Iris dataset look like; it's what you would like the K-means clustering to achieve. The figure shows a scatter plot (a graph in which each plotted point represents one observation) of all 150 observations. As indicated on the graph plots and legend:

There are 50 pluses that represent the Setosa class.
There are 50 circles that represent the Versicolor class.
There are 50 stars that represent the Virginica class.

Figure 13-2 shows a visual representation of the data that we are asking K-means to cluster: a scatter plot with 150 data points that haven't been labeled (hence all the data points are the same color and shape). The K-means algorithm doesn't know any target outcomes. The actual data that we're going to run through the algorithm hasn't had its dimensionality reduced yet.

FIGURE 13-2: Visual representation of data fed into the K-means algorithm.

The following lines of code create this scatter plot, using the X and Y values of pca_2d and coloring all the data points black (c='black' sets the color to black).

>>> pl.figure('Figure 13-2')
>>> pl.scatter(pca_2d[:,0],pca_2d[:,1],c='black')
>>> pl.title('Iris dataset without labels as seen by K-means')
>>> pl.show()

If you try fitting the two-dimensional data that was reduced by PCA, the K-means algorithm may cluster the Virginica and Versicolor classes less accurately, because preprocessing the data with PCA can discard information that K-means needs.

After K-means has fitted the Iris data, you can make a scatter plot of the clusters that the algorithm produced; just run the following code:

>>> pl.figure('Figure 13-3')
>>> for i in range(0, pca_2d.shape[0]):
...     if kmeans.labels_[i] == 1:
...         c1 = pl.scatter(pca_2d[i,0],pca_2d[i,1],c='r', marker='+')
...     elif kmeans.labels_[i] == 0:
...         c2 = pl.scatter(pca_2d[i,0],pca_2d[i,1],c='g', marker='o')
...     elif kmeans.labels_[i] == 2:
...         c3 = pl.scatter(pca_2d[i,0],pca_2d[i,1],c='b', marker='*')
...
>>> pl.legend([c1, c2, c3],['Cluster 1', 'Cluster 0', 'Cluster 2'])
>>> pl.title('K-means clusters the Iris dataset into 3 clusters')
>>> pl.show()

FIGURE 13-3: The K-means algorithm outputs three clusters.

Recall that K-means labeled the first 50 observations with the label of 1, the second 50 with the label of 0, and the last 50 with the label of 2. In the preceding code, the lines with the if, elif, and legend statements reflect those labels. This change was made to make it easy to compare with the actual results.

The output of the scatter plot is shown in Figure 13-3.

Compare the K-means clustering output (shown in Figure 13-3) to the original scatter plot (refer to Figure 13-1) — which provides labels because the outcomes are known. You can see that the two plots resemble each other. The K-means algorithm did a pretty good job with the clustering. Although the predictions aren't perfect, they come close. That's a win for the algorithm.

In unsupervised learning, you rarely get an output that's 100 percent accurate because real-world data is rarely that simple. You won't know for sure how many clusters to choose (or other initialization parameter(s) for other clustering algorithms). You will have to handle outliers (data points that don't seem consistent with others) and complex datasets that are dense, highly dimensional, and not linearly separable.

You can only get to this point if you know how many clusters the dataset has. You don't need to worry about which features to use or reducing the dimensionality of a dataset that has so few features (in this case, four) to fit the model. We only reduced the dimensions for the sake of visualizing the data on a graph. We didn't fit the model with the dimensionality reduced dataset.

Here's the full listing of the code that creates both scatter plots and color-codes the data points:

>>> from sklearn.decomposition import PCA
>>> from sklearn.cluster import KMeans
>>> from sklearn.datasets import load_iris
>>> import pylab as pl
>>> iris = load_iris()
>>> pca = PCA(n_components=2).fit(iris.data)
>>> pca_2d = pca.transform(iris.data)
>>> pl.figure('Reference Plot')
>>> pl.scatter(pca_2d[:, 0], pca_2d[:, 1], c=iris.target)
>>> kmeans = KMeans(n_clusters=3, random_state=111)
>>> kmeans.fit(iris.data)
>>> pl.figure('K-means with 3 clusters')
>>> pl.scatter(pca_2d[:, 0], pca_2d[:, 1], c=kmeans.labels_)
>>> pl.show()

Repeating the runs with a different K-value

A common outcome for clustering the Iris dataset is a two-cluster solution: one cluster contains the Setosa class and the other contains both the Versicolor and Virginica classes.

If you didn't have prior knowledge of how many clusters the Iris dataset has, you may have chosen to use two clusters with the K-means algorithm. With two clusters, K-means correctly clusters the Setosa class and combines the Virginica and the Versicolor classes into a single cluster.

The following code uses K-means to create two clusters, after which it displays a scatter plot of the results. Figure 13-4 shows the output of the K-means two-cluster solution.

>>> kmeans2 = KMeans(n_clusters=2, random_state=111)
>>> kmeans2.fit(iris.data)
>>> pl.figure('Figure 13-4')
>>> for i in range(0, pca_2d.shape[0]):
...     if kmeans2.labels_[i] == 1:
...         c1 = pl.scatter(pca_2d[i,0],pca_2d[i,1],c='r', marker='+')
...     elif kmeans2.labels_[i] == 0:
...         c2 = pl.scatter(pca_2d[i,0],pca_2d[i,1],c='g', marker='o')
...
>>> pl.legend([c1, c2], ['Cluster 1', 'Cluster 2'])
>>> pl.title('K-means clusters the Iris dataset into 2 clusters')
>>> pl.show()

FIGURE 13-4: Here’s a K-means output of two clusters.

At first glance, the results seem to be within reason, and a potential candidate you might use to create your model. In fact, if you did use those results to make your predictive model, your success rate would be around 67 percent, which isn't bad for a very basic model that uses unsupervised learning and a wrong guess for the number of clusters. You would expect the accuracy to be around 67 percent because the algorithm is very accurate at clustering the linearly separable Setosa class (33.3 percent of the data). The single cluster that holds the remaining data can match only one of the two remaining classes, so half of those observations (another 33.3 percent of the data) count as correct.
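The 67 percent figure can be checked with a few lines of arithmetic. This sketch assumes the idealized split described above (Setosa alone in one cluster, the other two classes merged in the second cluster); the variable names are illustrative.

```python
# Class sizes in the Iris dataset.
setosa, versicolor, virginica = 50, 50, 50
total = setosa + versicolor + virginica

# Assumed idealized two-cluster result: Setosa is clustered perfectly, and the
# merged cluster can be named after only one of the two remaining classes.
correct = setosa + max(versicolor, virginica)

accuracy = 100.0 * correct / total
print(round(accuracy, 1))  # 66.7
```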

A four-cluster solution may yield a result that has one large cluster on the left (Setosa) and one on the right that's separated into three clusters (as shown in Figure 13-5). As you start increasing the value of K (the number of clusters), however, your results become less meaningful.

FIGURE 13-5: Here's a K-means output of four clusters.

The following code creates a four-cluster model with K-means:

>>> kmeans4 = KMeans(n_clusters=4, random_state=111)
>>> kmeans4.fit(iris.data)
>>> pl.figure('Figure 13-5')
>>> for i in range(0, pca_2d.shape[0]):
...     if kmeans4.labels_[i] == 1:
...         c1 = pl.scatter(pca_2d[i,0],pca_2d[i,1],c='r', marker='+')
...     elif kmeans4.labels_[i] == 0:
...         c2 = pl.scatter(pca_2d[i,0],pca_2d[i,1],c='g', marker='o')
...     elif kmeans4.labels_[i] == 2:
...         c3 = pl.scatter(pca_2d[i,0],pca_2d[i,1],c='b', marker='*')
...     elif kmeans4.labels_[i] == 3:
...         c4 = pl.scatter(pca_2d[i,0],pca_2d[i,1],c='c', marker='^')
...
>>> pl.legend([c1, c2, c3, c4], ['Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4'])
>>> pl.title('K-means clusters the Iris dataset into 4 clusters')
>>> pl.show()

Evaluating the model

When you've chosen your number of clusters and have set up the algorithm to populate the clusters, you have a predictive model. The following example uses the three-cluster model previously built in this chapter. You can make predictions based on new incoming data by calling the predict function of the K-means instance and passing in an array of observations. It looks like this:

>>> # to call the predict function with a single observation
>>> kmeans.predict([[5.1, 3.5, 1.4, 0.2]])
array([1])

When the predict function finds the cluster center that the observation is closest to, it outputs the index of that center in the cluster_centers_ array. Python arrays are indexed from 0 (that is, the first item has index 0). Observations closest to a cluster center will be grouped into that cluster. In this example, the K-means algorithm predicts that the observation belongs to Cluster 1 (Setosa in this case). That's an easy prediction because the Setosa class is linearly separable and far away from the other two classes. Also, we just selected the very first observation from the dataset to make the prediction verifiable and easy to explain. You can see that the attributes of the observation we're trying to predict are very close to the second cluster center's attributes (kmeans.cluster_centers_[1]).

The new observation that we are trying to predict, [5.1, 3.5, 1.4, 0.2], is closest to the second cluster center, [5.006, 3.418, 1.464, 0.244].

To see all the cluster centers, type the following code:

>>> kmeans.cluster_centers_
array([[ 5.9016129 ,  2.7483871 ,  4.39354839,  1.43387097],
       [ 5.006     ,  3.418     ,  1.464     ,  0.244     ],
       [ 6.85      ,  3.07368421,  5.74210526,  2.07105263]])

You can also use the predict function to evaluate a set of observations, as shown here:

>>> # to call the predict method with a set of data points
>>> kmeans.predict([[ 5.1, 3.5, 1.4, 0.2 ], [ 5.9, 3.0, 5.1, 1.8 ]])
array([1, 0])

The result is an array with a list of predictions. The first observation is predicted to be Cluster 1, and the second is predicted to be Cluster 0.
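The nearest-center rule behind these predictions can be reproduced by hand. The sketch below copies the three cluster centers from the kmeans.cluster_centers_ output shown in this section and measures the Euclidean distance from each observation to each center. The helper name nearest_center is an assumption for illustration; it is not how scikit-learn implements predict internally.

```python
import math

# Cluster centers copied from the kmeans.cluster_centers_ output in this section.
centers = [
    [5.9016129, 2.7483871, 4.39354839, 1.43387097],
    [5.006, 3.418, 1.464, 0.244],
    [6.85, 3.07368421, 5.74210526, 2.07105263],
]

def nearest_center(observation, centers):
    # Euclidean distance from the observation to every center.
    distances = [math.sqrt(sum((x - c) ** 2 for x, c in zip(observation, center)))
                 for center in centers]
    # The prediction is the index of the closest center.
    return distances.index(min(distances))

observations = [[5.1, 3.5, 1.4, 0.2], [5.9, 3.0, 5.1, 1.8]]
predictions = [nearest_center(obs, centers) for obs in observations]
print(predictions)  # [1, 0]
```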

To see the cluster labels that the K-means algorithm produces, type the following code:

>>> kmeans.labels_ array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])

Although you know that the three-cluster solution is correct, don't be surprised if intuitively the two-cluster solution seems to look the best and the four-cluster solution also looks quite reasonable. If you increase the number of clusters beyond three, your predictions' success rate starts to break down. With a little bit of luck (and some educated guessing), you'll choose the best number of clusters. Consider the process as mixing a little bit of art with science. Even the algorithm itself uses randomness in its selection of the initial data points it uses to start each cluster. So even if you're guessing, you're in good company.

For this toy dataset, we can measure how well K-means clustered the 150 observations, because we know the outcomes. We can look at the labels K-means created and see how the clusters were formed. We know that the first 50 observations should be the same cluster (Cluster 1), the second 50 should be the same cluster (Cluster 0), and the last 50 should be the same cluster (Cluster 2). Here are the actual results:

K-means clustered the first 50 together.
For the second group of 50, two 2s were mixed in; for that group, the error was 2/50, or 4 percent.
For the third group of 50, there was a mix of 0s and 2s. The 0s belong to the second cluster, so 14 out of 50 (28 percent) of this group were incorrectly clustered.
The total error is 16/150, or 10.67 percent.
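The counts above can be reproduced from the cluster labels themselves. The sketch below transcribes the kmeans.labels_ output shown in this chapter in three groups of 50, treats the most common label in each group as the intended cluster, and counts everything else as an error. The variable names are illustrative, and the approach works only because the true grouping of this toy dataset is known.

```python
from collections import Counter

# Cluster labels transcribed from the kmeans.labels_ output shown in this chapter.
first_50 = [1] * 50
second_50 = [0] * 50
second_50[2] = 2   # the two observations in this group labeled 2
second_50[27] = 2
third_50 = [2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0,
            2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2,
            2, 2, 0, 2, 2, 2, 0, 2, 2, 0]

errors = 0
for group in (first_50, second_50, third_50):
    # The most common label in a group is taken as that group's cluster.
    majority_label, majority_count = Counter(group).most_common(1)[0]
    errors += len(group) - majority_count

error_rate = 100.0 * errors / 150
print("{0} errors, {1:.2f} percent".format(errors, error_rate))  # 16 errors, 10.67 percent
```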

Evaluating the performance of an algorithm requires a label that represents the expected value and a predicted value to compare it with. Remember that when you apply a clustering algorithm to an unsupervised learning model, you don't know what the expected values are — and you don't give labels to the clustering algorithm. The algorithm puts data points into clusters on the basis of which data points are most similar to one another by finding the closest cluster center for each observation. For the Iris dataset, K-means has no concept of Setosa, Versicolor, or Virginica classes; it only knows it's supposed to cluster the data into three clusters and name them randomly from 0 to 2.

The purpose of unsupervised learning with clustering is data exploration: finding meaningful relationships in the data, preferably where you couldn't have seen them otherwise. Did the model form distinct and interpretable groups of clusters for a market segmentation project? It's up to you to decide whether those relationships are a good basis for an actionable insight.

Creating an unsupervised learning model with DBSCAN

As mentioned earlier, DBSCAN is a popular clustering algorithm used as an alternative to K-means. It doesn't require that you input the number of clusters in order to run. But in exchange, you have to tune two other parameters. The scikit-learn implementation provides defaults for the two parameters, eps and min_samples, but you're generally expected to tune them. The eps parameter is the maximum distance between two data points for them to be considered in the same neighborhood. The min_samples parameter is the minimum number of data points in a neighborhood for it to be considered a cluster.

One advantage that DBSCAN has over K-means is that DBSCAN isn't restricted to a set number of clusters during initialization. The algorithm will determine a number of clusters based on the density of a region. Keep in mind, however, that the algorithm depends on the eps and min_samples parameters to figure out what the density of each cluster should be. The thinking is that these two parameters are much easier to choose for some clustering problems.

In practice, you should test with multiple clustering algorithms.

Because the DBSCAN algorithm has a built-in concept of noise, it's commonly used to detect outliers in the data — for example, fraudulent activity in credit cards, e-commerce, or insurance claims.
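To show how eps, min_samples, and the idea of noise fit together, here is a simplified pure-Python sketch of density-based clustering in the spirit of DBSCAN. It is not scikit-learn's implementation: the toy points, the helper names (region_query, dbscan_sketch), and the choice to count a point as a member of its own neighborhood are assumptions made for this illustration. Points that don't belong to any dense region receive the label -1, which is how an outlier would show up.

```python
NOISE = -1

def region_query(points, i, eps):
    # Indexes of all points within distance eps of point i (including i itself).
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

def dbscan_sketch(points, eps, min_samples):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # too sparse here; may still become a border point
            continue
        cluster += 1           # dense neighborhood found: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reachable from a dense point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # keep growing through dense points
    return labels

points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
          (9.0, 1.0)]  # the last point sits far away from both groups
dbscan_labels = dbscan_sketch(points, eps=0.5, min_samples=3)
print(dbscan_labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Notice that the number of clusters (two) was never passed in; it falls out of the density settings, and the isolated point is flagged as noise rather than forced into a cluster.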