Chapter 13
IN THIS CHAPTER
Working with a sample dataset
Creating simple predictive models using clustering algorithms
Visualizing and evaluating your results
This chapter is about creating a few simple predictive models using unsupervised learning with clustering algorithms such as K-means, DBSCAN, and mean shift. These examples use the Python programming language, version 2.7.4, on a Windows machine. See Chapter 12 if you need instructions on installing Python and the scikit-learn machine-learning package.
No prior knowledge of supervised learning is required to understand the concepts of unsupervised learning. Supervised learning is when the output categories are known in the historical data; unsupervised learning is when the output categories are unknown. Chapter 12 covers examples of supervised learning with classification and regression algorithms.
You can read Chapters 12 and 13 independently. One advantage of reading both chapters in the same session is that you'll be able to reuse the work that you did to load the Iris dataset into the Python interpreter (the command line where you enter the code statements or commands). So if you're continuing from Chapter 12, you may skip the next section.
The sample Iris dataset is included in the installation of scikit-learn, along with a set of functions that load data into the Python session.
To load the Iris dataset, follow these steps:
1. Open a new Python interactive shell session.
Use a new Python session so there isn't anything left over in memory and you have a clean slate to work with.
2. Load the dataset by entering the following code at the prompt:
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
After you run those two statements, you shouldn't see any messages from the interpreter. The variable iris should contain all the data from the iris.csv file.
3. Verify that iris contains the data:
>>> iris
The command prints out a verbose description of the Iris dataset, followed by a list of all the data members. Please refer to Table 12-3 for the main properties and descriptions of the iris variable.
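If the full printout is more than you want to read, a shorter check is to look at a few members of the loaded object. The following is a minimal sketch; the attribute names (data, target, target_names, feature_names) are the standard members of the object that load_iris returns, and the variable names are only illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The measurements: one row per flower, one column per feature.
n_rows, n_features = iris.data.shape
print("%d rows, %d features" % (n_rows, n_features))

# The known class of each flower, encoded as an integer code.
class_codes = sorted(set(iris.target.tolist()))
print(class_codes)

# Human-readable names for the class codes and the feature columns.
print(list(iris.target_names))
print(list(iris.feature_names))
```

The clustering examples in this chapter pass only iris.data to the algorithms; iris.target is kept aside for checking the results afterward.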
In general, the use of clustering algorithms to create an unsupervised learning model entails the following general steps:
Unsupervised learning has many challenges, including not knowing what to expect when you run an algorithm. Each algorithm produces different results, and until you try a variety of solutions, you can't know which one will work. When you've really nailed it, though, you will know it when you see it: judged through the lens of the business problem, the result either provides value or it doesn't.
In the case of the Iris dataset, you know what the outcomes should be; as a result, you can tweak the algorithms to produce the desired outcomes. In real-world datasets, you won't have this luxury. You'll have to depend on some prior knowledge of the data (or intuition by the domain expert) to decide which initialization parameters and algorithms to use as you create your model.
We are starting with the famous Iris dataset because it makes the patterns especially easy to see, but you will face more challenges in real-world situations, where the outcomes are unknown and the desired result is difficult to find. For example, in K-means, choosing the right number of clusters is the key problem. If you find the right number of clusters, your data will yield insights with which you can make highly accurate predictions. On the flip side, choosing the wrong number of clusters may yield subpar results.
The K-means algorithm is a good choice for datasets that have a small number of clusters with proportional sizes and linearly separable data, and you can scale it up to use the algorithm on very large datasets.
Think of linearly separable data as a bunch of points in a graph that can be separated using a straight line. If the data isn't linearly separable, then more advanced versions of K-means will have to be employed — which will become more expensive computationally and may not be suitable for very large datasets. In its standard implementation, the complexity to compute the cluster centers and distances is low.
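To see why the standard computation is cheap, it can help to write out one pass of it. The following is a minimal sketch of a single assignment-and-update step, assuming Euclidean distance; the function name kmeans_step and the toy points are hypothetical, and this is not the scikit-learn implementation.

```python
import numpy as np

def kmeans_step(points, centers):
    """Run one assignment-and-update pass of standard K-means."""
    # Distance from every point to every center: shape (n_points, n_centers).
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    # Assign each point to its nearest center.
    labels = distances.argmin(axis=1)
    # Move each center to the mean of the points assigned to it
    # (a center with no points stays where it is).
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = points[labels == k]
        if len(members) > 0:
            new_centers[k] = members.mean(axis=0)
    return labels, new_centers

# Hypothetical toy data: two well-separated groups on a plane.
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                   [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
centers = np.array([[0.0, 0.0], [10.0, 10.0]])

labels, centers = kmeans_step(points, centers)
print(labels)
print(centers)
```

Each pass needs only one distance per point-and-center pair, which is why the work grows in proportion to the number of observations times the number of clusters.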
K-means is widely employed to solve big-data problems because it's simple to use, effective, and highly scalable. No wonder most commercial vendors use the K-means algorithm as a key component of their predictive analytics packages.
The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and mean-shift implementations in scikit-learn don't require any user-defined initialization parameters to create an instance. You can override the default parameters during initialization if you want. Unfortunately, if you're using the default parameters, the algorithms can't provide a close match to the desired outcome.
That said, neither DBSCAN nor mean shift performs well with the Iris dataset. Even after exhaustive tweaking of the initialization parameters, it's still very hard to get an output that mimics the known outcomes for Iris. DBSCAN is better suited for datasets that have disproportional cluster sizes, and whose data can be separated in a non-linear fashion. While mean shift can handle arbitrary shapes and sizes very well, its standard implementation isn't very scalable, as it is an O(n²) algorithm, and may not work well on high-dimensional data. Like K-means, DBSCAN is scalable, but using it on very large datasets requires more memory and computing power. You get a closer look at DBSCAN and mean shift in action later in this chapter.
The K-means algorithm normally expects one initialization parameter from the user in order to create an instance. It needs to know how many clusters (K) to use to perform its work. The K-means implementation in scikit-learn will use the default of K=8 if the user doesn't provide it.
Because you're using the Iris dataset, you already know that it has three clusters. As described in Chapter 12, the Iris dataset has three classes of the Iris flower (Setosa, Versicolor, and Virginica). In general, when you're creating an unsupervised learning task with a clustering algorithm, you wouldn't know how many clusters to specify. Some algorithms try to determine the best number of clusters by iterating through a range of clusters and then selecting the number of clusters that best fits their mathematical criteria.
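You can run that kind of search by hand. The following is a minimal sketch that tries several values of K and scores each result with the silhouette coefficient from scikit-learn. The silhouette coefficient is only one possible criterion, chosen here for illustration, and the range of K values and the n_init setting are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

iris = load_iris()

scores = {}
for k in range(2, 7):
    # n_init=10 restarts K-means ten times and keeps the best run.
    model = KMeans(n_clusters=k, n_init=10, random_state=111)
    labels = model.fit_predict(iris.data)
    # Higher silhouette values mean tighter, better-separated clusters.
    scores[k] = silhouette_score(iris.data, labels)

best_k = max(scores, key=scores.get)
for k in sorted(scores):
    print("K=%d silhouette=%.3f" % (k, scores[k]))
print("Best K by silhouette: %d" % best_k)
```

A criterion like this measures how compact and well separated the clusters are; it has no access to the known classes, so its preferred K need not match them.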
The best way to get immediate results is to make an educated guess about the number of clusters to use — basing your estimate on features present in the data (whether one or multiple features), or on some other knowledge of the data you may have from the business domain expert.
This falling back on guesswork (even educated guesswork) is a major limitation of the K-means clustering algorithm. An upcoming section explores a couple of other clustering algorithms, DBSCAN and mean shift, that don't need the number of clusters in order to do their work.
To create an instance of the K-means clustering algorithm and run the data through it, type the following code in the interpreter.
>>> from sklearn.cluster import KMeans
>>> kmeans = KMeans(n_clusters=3, random_state=111)
>>> kmeans.fit(iris.data)
The first line of code imports the KMeans class into the session. The second line creates the model and stores it in a variable named kmeans. The model is created with the number of clusters set to 3. The third line fits the model to the Iris data. Fitting the model is the core part of the algorithm: it computes the three cluster centers for the given dataset and assigns each observation to the nearest one. To see the clusters that the algorithm produces, type the following code.
>>> kmeans.labels_
The output should look similar to this:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,
2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2,
0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2,
2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])
This is how the K-means algorithm labels the data as belonging to clusters, without input from the user about the target values. Here the only thing K-means knew was what we provided it: the number of clusters. This result shows how the algorithm viewed the data, and what it learned about the relationships of data items to each other — hence the term unsupervised learning.
You can see right away that some of the data points were mislabeled. You know, from the Iris dataset, what the target values should be:
The first 50 observations should be labeled the same (as 1s in this case).
This range is known as the Setosa class.
Observations 51 to 100 should be labeled the same (as 0s in this case).
This range is known as the Versicolor class.
Observations 101 to 150 should be labeled the same (as 2s in this case).
This range is known as the Virginica class.
It doesn't matter whether K-means labeled each set of 50 with a 0, 1, or 2. As long as each set of 50 has the same label, it accurately predicted the outcome. It's up to you to give each cluster a name and to find meaning in each cluster. If you run the K-means algorithm again, it may produce an entirely different number for each set of 50 — but the meaning would be the same for each set (class).
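To check that two runs really did produce the same grouping under different numbers, you can compare them while ignoring the label values. The following is a minimal sketch in plain Python; the function name same_partition and the short label lists are hypothetical.

```python
def same_partition(labels_a, labels_b):
    """Return True if two labelings group the observations identically,
    ignoring which number each cluster happens to be given."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b = {}
    b_to_a = {}
    for a, b in zip(labels_a, labels_b):
        # Each label in one run must map to exactly one label in the other.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True

run_1 = [1, 1, 1, 0, 0, 0, 2, 2, 2]
run_2 = [0, 0, 0, 2, 2, 2, 1, 1, 1]  # same groups, different numbers
run_3 = [0, 0, 1, 2, 2, 2, 1, 1, 1]  # third observation moved to another group

print(same_partition(run_1, run_2))
print(same_partition(run_1, run_3))
```

The first comparison succeeds because every label in one run corresponds to exactly one label in the other; the second fails because one observation changed groups.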
You can see that the current shape of the Iris dataset is 150 rows, with each row having 4 fields. You can find the shape by entering this line of code:
>>> iris.data.shape
(150, 4)
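A scatter plot can show only two dimensions, so the four features must first be reduced to two. The following code does the dimensionality reduction with principal component analysis (PCA):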
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=2).fit(iris.data)
>>> pca_2d = pca.transform(iris.data)
After you run the dimensionality reduction code, the transformed Iris data is stored in a new variable named pca_2d. You can verify that the shape has been transformed to two dimensions:
>>> pca_2d.shape
(150, 2)
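Reducing four features to two discards some information, and it is reasonable to ask how much. The following is a minimal sketch that reads the explained_variance_ratio_ attribute of the fitted PCA object; the variable names ratios and retained are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2).fit(iris.data)

# Fraction of the total variance captured by each retained component.
ratios = pca.explained_variance_ratio_
# Fraction of the total variance that survives the reduction to 2-D.
retained = float(ratios.sum())

print(ratios)
print("Variance retained: %.3f" % retained)
```

A retained fraction close to 1 suggests that the two-dimensional plot is a fair picture of the four-dimensional data.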
You can also type the pca_2d variable into the interpreter and it will output arrays (think of an array as a container that stores items in a list) with two fields instead of four. Now that you have the reduced feature set, you can plot the results with the following code:
>>> import pylab as pl
>>> pl.figure('Figure 13-1')
>>> # Plot each observation with the color and marker of its known class.
>>> for i in range(0, pca_2d.shape[0]):
...     if iris.target[i] == 0:
...         c1 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='r', marker='+')
...     elif iris.target[i] == 1:
...         c2 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='g', marker='o')
...     elif iris.target[i] == 2:
...         c3 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='b', marker='*')
...
>>> pl.legend([c1, c2, c3], ['Setosa', 'Versicolor', 'Virginica'])
>>> pl.title('Iris dataset with 3 clusters and known outcomes')
>>> pl.show()
The output of this code is a plot that should be similar to Figure 13-1. This plot represents what the known outcomes of the Iris dataset look like. It's what you would like the K-means clustering to achieve. The figure is a scatter plot (a graph in which each plotted point represents one observation) of all 150 observations. As indicated by the markers and the legend, Setosa observations appear as red plus signs, Versicolor observations as green circles, and Virginica observations as blue stars.
Figure 13-2 shows a visual representation of the data that we are asking K-means to cluster: a scatter plot with 150 data points that haven't been labeled (hence all the data points are the same color and shape). The K-means algorithm doesn't know any target outcomes. The actual data that we're going to run through the algorithm hasn't had its dimensionality reduced yet.
The following lines of code create this scatter plot, using the X and Y values of pca_2d and coloring all the data points black (c='black' sets the color to black).
>>> pl.figure('Figure 13-2')
>>> pl.scatter(pca_2d[:,0],pca_2d[:,1],c='black')
>>> pl.title('Iris dataset without labels as seen by K-means')
>>> pl.show()
After K-means has fitted the Iris data, you can make a scatter plot of the clusters that the algorithm produced; just run the following code:
>>> pl.figure('Figure 13-3')
>>> # Color each observation by the cluster that K-means assigned to it.
>>> for i in range(0, pca_2d.shape[0]):
...     if kmeans.labels_[i] == 1:
...         c1 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='r', marker='+')
...     elif kmeans.labels_[i] == 0:
...         c2 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='g', marker='o')
...     elif kmeans.labels_[i] == 2:
...         c3 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='b', marker='*')
...
>>> pl.legend([c1, c2, c3], ['Cluster 1', 'Cluster 0', 'Cluster 2'])
>>> pl.title('K-means clusters the Iris dataset into 3 clusters')
>>> pl.show()
Recall that K-means labeled the first 50 observations with the label 1, the second 50 with the label 0, and the last 50 with the label 2. In the preceding code, the lines with the if, elif, and legend statements reflect those labels. This change was made to make it easy to compare with the actual results.
The output of the scatter plot is shown in Figure 13-3.
Compare the K-means clustering output (shown in Figure 13-3) to the original scatter plot (refer to Figure 13-1) — which provides labels because the outcomes are known. You can see that the two plots resemble each other. The K-means algorithm did a pretty good job with the clustering. Although the predictions aren't perfect, they come close. That's a win for the algorithm.
In unsupervised learning, you rarely get an output that's 100 percent accurate because real-world data is rarely that simple. You won't know for sure how many clusters to choose (or other initialization parameter(s) for other clustering algorithms). You will have to handle outliers (data points that don't seem consistent with others) and complex datasets that are dense, highly dimensional, and not linearly separable.
Here's the full listing of the code that creates both scatter plots and color-codes the data points:
>>> from sklearn.decomposition import PCA
>>> from sklearn.cluster import KMeans
>>> from sklearn.datasets import load_iris
>>> import pylab as pl
>>> iris = load_iris()
>>> pca = PCA(n_components=2).fit(iris.data)
>>> pca_2d = pca.transform(iris.data)
>>> pl.figure('Reference Plot')
>>> pl.scatter(pca_2d[:, 0], pca_2d[:, 1], c=iris.target)
>>> kmeans = KMeans(n_clusters=3, random_state=111)
>>> kmeans.fit(iris.data)
>>> pl.figure('K-means with 3 clusters')
>>> pl.scatter(pca_2d[:, 0], pca_2d[:, 1], c=kmeans.labels_)
>>> pl.show()
A common outcome for clustering the Iris dataset is a two-cluster solution: one cluster contains the Setosa class and the other contains both the Versicolor and Virginica classes.
If you didn't have prior knowledge of how many clusters the Iris dataset has, you may have chosen to use two clusters with the K-means algorithm. With two clusters, K-means correctly clusters the Setosa class and combines the Virginica and the Versicolor classes into a single cluster.
The following code uses K-means to create two clusters, after which it displays a scatter plot of the results. Figure 13-4 shows the output of the K-means two-cluster solution.
>>> kmeans2 = KMeans(n_clusters=2, random_state=111)
>>> kmeans2.fit(iris.data)
>>> pl.figure('Figure 13-4')
>>> # Color each observation by the cluster that K-means assigned to it.
>>> for i in range(0, pca_2d.shape[0]):
...     if kmeans2.labels_[i] == 1:
...         c1 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='r', marker='+')
...     elif kmeans2.labels_[i] == 0:
...         c2 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='g', marker='o')
...
>>> pl.legend([c1, c2], ['Cluster 1', 'Cluster 2'])
>>> pl.title('K-means clusters the Iris dataset into 2 clusters')
>>> pl.show()
At first glance, the results seem to be within reason — and a potential candidate you might use to create your model. In fact, if you did use those results to make your predictive model, your success rate would be around 67 percent — not bad for a very basic model that uses unsupervised learning and a wrong guess for the number of clusters. You would have expected the accuracy to be around 67 percent because the algorithm is very accurate at clustering the linearly separable Setosa class (33.3 percent of the data). Clustering the remaining data into a single class would automatically give it an additional 33.3-percent accuracy because it only has two possibilities to choose from.
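The 67 percent figure can be worked out directly. The following is a minimal sketch of that arithmetic under an idealized assumption: one cluster holds exactly the 50 Setosa observations and the other holds all 100 remaining observations. The dictionary names are illustrative.

```python
# Idealized assumption: three classes of 50 observations each, as in Iris.
class_sizes = {"Setosa": 50, "Versicolor": 50, "Virginica": 50}
total = sum(class_sizes.values())

# Cluster A holds all of Setosa; cluster B merges the other two classes.
cluster_a = {"Setosa": 50}
cluster_b = {"Versicolor": 50, "Virginica": 50}

# Each cluster can be named after only one class, so only its largest
# class counts as correctly labeled.
correct = max(cluster_a.values()) + max(cluster_b.values())
accuracy = float(correct) / total

print("%d of %d correct (%.1f%%)" % (correct, total, 100 * accuracy))
```

Whichever class you name the merged cluster after, the other 50 observations in it count as errors, which is where the missing third goes.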
A four-cluster solution may yield a result that has one large cluster on the left (Setosa) and one on the right that's separated into three clusters (as shown in Figure 13-5). As you start increasing the value of K (the number of clusters), however, your results become less meaningful.
The following code creates a four-cluster model with K-means:
>>> kmeans4 = KMeans(n_clusters=4, random_state=111)
>>> kmeans4.fit(iris.data)
>>> pl.figure('Figure 13-5')
>>> # Color each observation by the cluster that K-means assigned to it.
>>> for i in range(0, pca_2d.shape[0]):
...     if kmeans4.labels_[i] == 1:
...         c1 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='r', marker='+')
...     elif kmeans4.labels_[i] == 0:
...         c2 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='g', marker='o')
...     elif kmeans4.labels_[i] == 2:
...         c3 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='b', marker='*')
...     elif kmeans4.labels_[i] == 3:
...         c4 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='c', marker='^')
...
>>> pl.legend([c1, c2, c3, c4], ['Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4'])
>>> pl.title('K-means clusters the Iris dataset into 4 clusters')
>>> pl.show()
When you've chosen your number of clusters and have set up the algorithm to populate the clusters, you have a predictive model. The following example uses the three-cluster model previously built in this chapter. You can make predictions based on new incoming data by calling the predict function of the K-means instance and passing in an array of observations. It looks like this:
>>> # to call the predict function with a single observation
>>> kmeans.predict([[5.1, 3.5, 1.4, 0.2]])
array([1])
When the predict function finds the cluster center that the observation is closest to, it outputs the index of that cluster center in the cluster_centers_ array. Python arrays are indexed from 0 (that is, the first item has index 0). Observations closest to a cluster center will be grouped into that cluster. In this example, the K-means algorithm predicts that the observation belongs to Cluster 1 (Setosa in this case), an easy prediction because the Setosa class is linearly separable and far away from the other two classes. Also, we just selected the very first observation from the dataset to make the prediction verifiable and easy to explain. You can see that the attributes of the observation we're trying to predict are very close to the second cluster center's attributes (kmeans.cluster_centers_[1]).
The new observation that we are trying to predict, [5.1, 3.5, 1.4, 0.2], is closest to the second cluster center, [5.006, 3.418, 1.464, 0.244].
To see all the cluster centers, type the following code:
>>> kmeans.cluster_centers_
array([[ 5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
[ 5.006 , 3.418 , 1.464 , 0.244 ],
[ 6.85 , 3.07368421, 5.74210526, 2.07105263]])
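You can confirm the nearest-center reasoning yourself by measuring the distances. The following is a minimal sketch that assumes Euclidean distance and copies the three cluster centers from the output shown above; the variable names are illustrative.

```python
import numpy as np

# Cluster centers copied from the kmeans.cluster_centers_ output.
centers = np.array([
    [5.9016129, 2.7483871, 4.39354839, 1.43387097],
    [5.006, 3.418, 1.464, 0.244],
    [6.85, 3.07368421, 5.74210526, 2.07105263],
])
observation = np.array([5.1, 3.5, 1.4, 0.2])

# Euclidean distance from the observation to each center.
distances = np.linalg.norm(centers - observation, axis=1)
# The predicted cluster is the index of the smallest distance.
nearest = int(distances.argmin())

print(distances)
print(nearest)
```

The smallest distance belongs to the center at index 1, which agrees with the prediction that K-means returned for this observation.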
You can also use the predict function to evaluate a set of observations, as shown here:
>>> # to call the predict method with a set of data points
>>> kmeans.predict([[5.1, 3.5, 1.4, 0.2],
...                 [5.9, 3.0, 5.1, 1.8]])
array([1, 0])
The result is an array with a list of predictions. The first observation is predicted to be Cluster 1, and the second is predicted to be Cluster 0.
To see the cluster labels that the K-means algorithm produces, type the following code:
>>> kmeans.labels_
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,
2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2,
0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2,
2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])
Although you know that the three-cluster solution is correct, don't be surprised if intuitively the two-cluster solution seems to look the best and the four-cluster solution also looks quite reasonable. If you increase the number of clusters beyond three, your predictions' success rate starts to break down. With a little bit of luck (and some educated guessing), you'll choose the best number of clusters. Consider the process as mixing a little bit of art with science. Even the algorithm itself uses randomness in its selection of the initial data points it uses to start each cluster. So even if you're guessing, you're in good company.
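For this toy dataset, we can measure how well K-means clustered the 150 observations, because we know the outcomes. We can look at the labels K-means created and see how the clusters were formed. We know that the first 50 observations should be the same cluster (Cluster 1), the second 50 should be the same cluster (Cluster 0), and the last 50 should be the same cluster (Cluster 2). Here are the actual results, tallied from the kmeans.labels_ output:
Observations 1 to 50: all 50 were labeled 1.
Observations 51 to 100: 48 were labeled 0 and 2 were labeled 2.
Observations 101 to 150: 36 were labeled 2 and 14 were labeled 0.
In total, 134 of the 150 observations (about 89 percent) landed in the expected cluster.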
Evaluating the performance of an algorithm requires a label that represents the expected value and a predicted value to compare it with. Remember that when you apply a clustering algorithm to an unsupervised learning model, you don't know what the expected values are — and you don't give labels to the clustering algorithm. The algorithm puts data points into clusters on the basis of which data points are most similar to one another by finding the closest cluster center for each observation. For the Iris dataset, K-means has no concept of Setosa, Versicolor, or Virginica classes; it only knows it's supposed to cluster the data into three clusters and name them randomly from 0 to 2.
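When the expected values do happen to be available, as they are for Iris, the comparison can be tallied in code instead of by eye. The following is a minimal sketch that cross-tabulates the known classes against the cluster labels; the n_init setting and the variable names counts and matched are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=111).fit(iris.data)

# counts[c, k] = number of observations of known class c placed in cluster k.
counts = np.zeros((3, 3), dtype=int)
for known, cluster in zip(iris.target, kmeans.labels_):
    counts[known, cluster] += 1
print(counts)

# Name each cluster after the class it contains most often, then count
# how many observations agree with that name.
matched = int(counts.max(axis=0).sum())
print("%d of %d observations match" % (matched, len(iris.target)))
```

Each row of the table is one known class and each column is one cluster, so a class that K-means recovered cleanly shows up as a row with a single nonzero entry.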
The purpose of unsupervised learning with clustering is data exploration: finding meaningful relationships in the data, preferably where you couldn't have seen them otherwise. Did the model form distinct and interpretable groups of clusters for a market segmentation project? It's up to you to decide whether those relationships are a good basis for an actionable insight.
As mentioned earlier, DBSCAN is a popular clustering algorithm used as an alternative to K-means. It doesn't require that you input the number of clusters in order to run. But in exchange, you have to tune two other parameters. The scikit-learn implementation provides a default for the two parameters, eps and min_samples, but you're generally expected to tune those. The eps parameter is the maximum distance between two data points to be considered in the same neighborhood. The min_samples parameter is the minimum number of data points in a neighborhood to be considered a cluster.
One advantage that DBSCAN has over K-means is that DBSCAN isn't restricted to a set number of clusters during initialization. The algorithm will determine a number of clusters based on the density of a region. Keep in mind, however, that the algorithm depends on the eps and min_samples parameters to figure out what the density of each cluster should be. The thinking is that these two parameters are much easier to choose for some clustering problems.
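To get a feel for how much the outcome depends on eps, you can fit DBSCAN several times and count what comes back. The following is a minimal sketch; the three eps values are arbitrary choices for illustration, and min_samples is held at 5.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

iris = load_iris()

results = {}
for eps in (0.3, 0.5, 0.9):
    labels = DBSCAN(eps=eps, min_samples=5).fit(iris.data).labels_
    # Label -1 marks noise; every other distinct label is a cluster.
    n_noise = int(np.sum(labels == -1))
    n_clusters = len(set(labels.tolist()) - {-1})
    results[eps] = (n_clusters, n_noise)
    print("eps=%.1f: %d clusters, %d noise points" % (eps, n_clusters, n_noise))
```

With min_samples fixed, a larger eps can only turn noise points into cluster members, never the reverse, so the noise count falls or stays the same as eps grows.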
Because the DBSCAN algorithm has a built-in concept of noise, it's commonly used to detect outliers in the data — for example, fraudulent activity in credit cards, e-commerce, or insurance claims.
You'll need to load the Iris dataset into your Python session. If you're continuing from the preceding section and already have it loaded, you can skip Steps 1 and 2. Here's the procedure:
1. Open a new Python interactive shell session.
Use a new Python session so that memory is clear and you have a clean slate to work with.
2. Load the dataset by entering the following code at the prompt:
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
After running those two statements, you shouldn't see any messages from the interpreter. The variable iris should contain all the data from the iris.csv file.
3. Create an instance of DBSCAN:
>>> from sklearn.cluster import DBSCAN
>>> dbscan = DBSCAN()
The first line of code imports the DBSCAN class into the session for you to use. The second line creates an instance of DBSCAN with default values for eps and min_samples.
4. Check which parameters were used:
>>> dbscan
DBSCAN(algorithm='auto', eps=0.5, leaf_size=30, metric='euclidean', min_samples=5, p=None, random_state=None)
5. Fit the DBSCAN model to the Iris data:
>>> dbscan.fit(iris.data)
6. Display the labels that DBSCAN assigned:
>>> dbscan.labels_
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., -1., 0., 0., 0., 0., 0., 0., 0., 0.,
1., 1., 1., 1., 1., 1., 1., -1., 1., 1.,
-1., 1., 1., 1., 1., 1., 1., 1., -1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., -1., 1., 1.,
1., 1., 1., -1., 1., 1., 1., 1., -1., 1.,
1., 1., 1., 1., 1., -1., -1., 1., -1., -1.,
1., 1., 1., 1., 1., 1., 1., -1., -1., 1.,
1., 1., -1., 1., 1., 1., 1., 1., 1., 1.,
1., -1., 1., 1., -1., -1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
If you look very closely, you'll see that DBSCAN produced three groups (-1, 0, and 1).
Let's get a scatter plot of the DBSCAN output. Type the following code:
>>> import pylab as pl
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=2).fit(iris.data)
>>> pca_2d = pca.transform(iris.data)
>>> # Color each observation by its DBSCAN label; -1 marks noise.
>>> for i in range(0, pca_2d.shape[0]):
...     if dbscan.labels_[i] == 0:
...         c1 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='r', marker='+')
...     elif dbscan.labels_[i] == 1:
...         c2 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='g', marker='o')
...     elif dbscan.labels_[i] == -1:
...         c3 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='b', marker='*')
...
>>> pl.legend([c1, c2, c3], ['Cluster 1', 'Cluster 2', 'Noise'])
>>> pl.title('DBSCAN finds 2 clusters and noise')
>>> pl.show()
The scatter plot output of this code is shown in Figure 13-6.
You can see that DBSCAN produced three groups. Note, however, that the figure closely resembles a two-cluster solution, because it is one: the third group (-1) is noise (outliers), and it has only 17 instances.
You can increase the distance parameter (eps) from the default setting of 0.5 to 0.9, and it will become a two-cluster solution with no noise. The distance parameter is the maximum distance between two observations for them to be considered neighbors. The greater the value for the distance parameter, the fewer clusters are found, because clusters eventually merge into other clusters. The -1 labels are scattered around Cluster 1 and Cluster 2 in a few locations:
Near the center of Cluster 2 (Versicolor and Virginica classes)
The graph only shows a two-dimensional representation of the data. The distance can also be measured in higher dimensions.
In this example, DBSCAN didn't produce the ideal outcome with the default parameters for the Iris dataset. Its performance was pretty consistent with other clustering algorithms that end up with a two-cluster solution. The Iris dataset doesn't take advantage of DBSCAN's most powerful features — noise detection and the capability to discover clusters of arbitrary shapes.
Another clustering algorithm offered in scikit-learn is the mean shift algorithm. This algorithm, like DBSCAN, doesn't require you to specify the number of clusters, or any other parameters, when you create the model. The primary tuning parameter for this algorithm is called the bandwidth parameter. You can think of bandwidth as the size of a round window that can encompass the data points in a cluster. Choosing a value for bandwidth isn't trivial, so we'll go with the default.
The steps to create a new model with a different algorithm are essentially the same each time. If you've been following along in this chapter or Chapter 12, you'll most likely have everything nearly set up. The steps are similar to the steps for creating the model with K-means:
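1. Open a new Python interactive shell session.
Use a new Python session so that memory is clear and you have a clean slate to work with.
2. Load the dataset by entering the following code at the prompt:
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
3. Create an instance of mean shift:
>>> from sklearn.cluster import MeanShift
>>> ms = MeanShift()
Mean shift is created with the default value for bandwidth.
4. Check which parameters were used:
>>> ms
MeanShift(bandwidth=None, bin_seeding=False, cluster_all=True, min_bin_freq=1, n_jobs=1, seeds=None)
5. Fit the mean shift model to the Iris data:
>>> ms.fit(iris.data)
6. Display the labels that mean shift assigned:
>>> ms.labels_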
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
Mean shift produced two clusters (0 and 1).
A scatter plot is a good way to visualize the relationship between a large number of data points. It's useful for visually identifying clusters of data and finding data points that are distant from formed clusters.
Let's produce a scatter plot of the mean shift output. Type the following code:
>>> import pylab as pl
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=2).fit(iris.data)
>>> pca_2d = pca.transform(iris.data)
>>> pl.figure('Figure 13-7')
>>> # Color each observation by the cluster that mean shift assigned to it.
>>> for i in range(0, pca_2d.shape[0]):
...     if ms.labels_[i] == 1:
...         c1 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='r', marker='+')
...     elif ms.labels_[i] == 0:
...         c2 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='g', marker='o')
...
>>> pl.legend([c1, c2], ['Cluster 1', 'Cluster 2'])
>>> pl.title('Mean shift finds 2 clusters')
>>> pl.show()
The scatter plot output of this code is shown in Figure 13-7.
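Figure 13-7 shows that mean shift found two clusters. You can try to tune the model with the bandwidth parameter to see if you can get a three-cluster solution. Mean shift is very sensitive to the bandwidth parameter: a value that is too small tends to produce many small clusters, and a value that is too large tends to merge the data into very few clusters.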
Mean shift didn't produce the ideal results with the default parameters for the Iris dataset, but a two-cluster solution is in line with other clustering algorithms. Each project has to be examined individually to see how well the number of clusters fits the business problem. The obvious benefit of using mean shift is that you don't have to predetermine the number of clusters. In fact, you can use mean shift as a tool to find the number of clusters for creating a K-means model. Mean shift is often used for computer vision applications because it's good at lower dimensions, accommodates clusters of any shape, and accommodates clusters of any size.
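One way to explore both ideas, tuning the bandwidth and handing the resulting cluster count to K-means, is sketched below. It uses the estimate_bandwidth helper from scikit-learn to derive candidate bandwidths; the quantile values, the choice of which count to reuse, and the n_init setting are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans, MeanShift, estimate_bandwidth
from sklearn.datasets import load_iris

iris = load_iris()

cluster_counts = {}
for quantile in (0.1, 0.2, 0.3):
    # estimate_bandwidth derives a bandwidth from nearest-neighbor
    # distances; a smaller quantile gives a smaller bandwidth.
    bandwidth = estimate_bandwidth(iris.data, quantile=quantile)
    ms = MeanShift(bandwidth=bandwidth).fit(iris.data)
    cluster_counts[quantile] = len(ms.cluster_centers_)
    print("quantile=%.1f bandwidth=%.3f clusters=%d"
          % (quantile, bandwidth, cluster_counts[quantile]))

# Hypothetical follow-up: reuse one of the counts as K for K-means.
k = cluster_counts[0.2]
kmeans = KMeans(n_clusters=k, n_init=10, random_state=111).fit(iris.data)
print("K-means fitted with K=%d" % len(kmeans.cluster_centers_))
```

Which bandwidth, and therefore which K, is the right one still depends on how well the resulting clusters fit the problem you're trying to solve.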