Overview
This chapter will introduce you to the fundamentals of clustering, an unsupervised learning approach in contrast with the supervised learning approaches seen in the previous chapters. You will be implementing different types of clustering, including flat clustering with the k-means algorithm and hierarchical clustering with the mean shift algorithm and the agglomerative hierarchical model. You will also learn how to evaluate the performance of your clustering model using intrinsic and extrinsic approaches. By the end of this chapter, you will be able to analyze data using clustering and apply this skill to solve challenges across a variety of fields.
In the previous chapter, you were introduced to decision trees and their applications in classification. You were also introduced to regression in Chapter 2, An Introduction to Regression. Both regression and classification are part of the supervised learning approach. However, in this chapter, we will be looking at the unsupervised learning approach; we will be dealing with datasets that don't have any labels (outputs). It is up to the machines to tell us what the labels will be based on a set of parameters that we define. In this chapter, we will be performing unsupervised learning by using clustering algorithms.
We will use clustering to analyze data to find certain patterns and create groups. Apart from that, clustering can be used for many purposes:
Regardless of whether we are applying clustering to genetics, videos, images, or social networks, if we analyze data using clustering, we may find similarities between data points that are worth treating uniformly.
For instance, consider a store manager, who is responsible for ensuring the profitability of their store. The products in the store are divided into different categories, and there are different customers who prefer different items. Each customer has their own preferences, but they have some similarities between them. You might have a customer who is interested in bio products, who tends to choose organic products, which are also of interest to a vegetarian customer. Even if they are different, they have similarities in their preferences or patterns as they both tend to buy organic vegetables. This can be treated as an example of clustering.
In Chapter 3, An Introduction to Classification, you learned about classification, which is a part of the supervised learning approach. In a classification problem, we use labels to train a model in order to be able to classify data points. With clustering, as we do not have labels for our features, we need to let the model figure out the clusters to which these features belong. This is usually based on the distance between each data point.
In this chapter, you will learn about the k-means algorithm, which is the most widely used algorithm for clustering, but first, we need to define what the clustering problem is.
We shall define the clustering problem so that we will be able to find similarities between our data points. For instance, suppose we have a dataset that consists of points. Clustering helps us understand this structure by describing how these points are distributed.
Let's look at an example of data points in a two-dimensional space in Figure 5.1:
Now, have a look, at Figure 5.2. It is evident that there are three clusters:
The three clusters were easy to detect because the points are close to one another. Here, you can see that clustering determines the data points that are close to each other. You may have also noticed that the data points M1, O1, and N1 do not belong to any cluster; these are the outlier points. The clustering algorithm you build should be prepared to treat these outlier points properly, without moving them into a cluster.
While it is easy to recognize clusters in a two-dimensional space, we normally have multidimensional data points, which is where we have more than two features. Therefore, it is important to know which data points are close to one other. Also, it is important to define the distance metrics that detect whether data points are close to each other. One well-known distance metric is Euclidean distance, which we learned about in Chapter 1, Introduction to Artificial Intelligence. In mathematics, we often use Euclidean distance to measure the distance between two points. Therefore, Euclidean distance is an intuitive choice when it comes to clustering algorithms so that we can determine the proximity of data points when locating clusters.
However, there is one drawback to most distance metrics, including Euclidean distance: the more we increase the dimensions, the more uniform these distances will become compared to each other. When we only have a few dimensions or features, it is easy to see which point is the closest to another one. However, when we add more features, the relevant features get embedded with all the other data and it becomes very hard to distinguish the relevant features from the others as they act as noise for our model. Therefore, getting rid of these noisy features may greatly increase the accuracy of our clustering model.
Note
Noise in a dataset can be irrelevant information or randomness that is unwanted.
In the next section, we will be looking at two different clustering approaches.
There are two types of clustering:
In flat clustering, we specify the number of clusters we would like the machine to find. One example of flat clustering is the k-means algorithm, where k specifies the number of clusters we would like the algorithm to use.
In hierarchical clustering, however, the machine learning algorithm itself finds out the number of clusters that are needed.
Hierarchical clustering also has two approaches:
Figure 5.3 gives you a much more accurate description of these two clustering approaches.
Now that we are familiar with the different clustering approaches, let's take a look at the different clustering algorithms supported by scikit-learn.
In this chapter, we will learn about two clustering algorithms supported by scikit-learn:
K-means is an example of flat clustering, where we must specify the number of clusters in advance. k-means is a general-purpose clustering algorithm that performs well if the number of clusters is not too high and the size of the clusters is uniform.
Mean shift is an example of hierarchical clustering, where the clustering algorithm determines the number of clusters. Mean shift is used when we do not know the number of clusters in advance. In contrast with k-means, mean shift supports use cases where there may be many clusters present, even if the size of the clusters greatly varies.
Scikit-learn contains many other algorithms, but we will be focusing on the k-means and mean shift algorithms in this chapter.
Note
For a complete description of clustering algorithms, including performance comparisons, visit the clustering page of scikit-learn at http://scikit-learn.org/stable/modules/clustering.html.
In the next section, we begin with the k-means algorithm.
The k-means algorithm is a flat clustering algorithm, as mentioned previously. It works as follows:
To ensure that the k-means algorithm terminates, we need the following:
Due to the nature of the k-means algorithm, it will have a hard time dealing with clusters that greatly vary in size.
The k-means algorithm has many use cases that are part of our everyday lives, such as:
In the next exercise, we will be implementing the k-means algorithm in scikit-learn.
In this exercise, we will be plotting a dataset in a two-dimensional plane and performing clustering on it using the k-means algorithm.
The following steps will help you complete this exercise:
import numpy as np
data_points = np.array([[1, 1], [1, 1.5], [2, 2],
[8, 1], [8, 0], [8.5, 1],
[6, 1], [1, 10], [1.5, 10],
[1.5, 9.5], [10, 10], [1.5, 8.5]])
import matplotlib.pyplot as plot
plot.scatter(data_points.transpose()[0],
data_points.transpose()[1])
The expected output is this:
Note
We used the transpose array method to get the values of the first feature and the second feature. We could also use proper array indexing to access these columns: dataPoints[:,0], which is equivalent to dataPoints.transpose()[0].
Now that we have the data points, it is time to execute the k-means algorithm on them.
from sklearn.cluster import KMeans
k_means_model = KMeans(n_clusters=3,random_state=8)
k_means_model.fit(data_points)
In the preceding code snippet, we have used the KMeans module from sklearn.cluster. As always with sklearn, we need to define a model with the parameter and then fit the model on the dataset.
The expected output is this:
KMeans(algorithm='auto', copy_x=True, init='k-means++',
max_iter=300, n_clusters=3, n_init=10, n_jobs=None,
precompute_distances='auto',
random_state=8, tol=0.0001, verbose=0)
The output shows all the parameters for our k-means models, but the important ones are:
max_iter: Represents the maximum number of times the k-means algorithm will iterate through.
n_clusters: Represents the number of clusters to be formed by the k-means algorithm.
n_init: Represents the number of times the k-means algorithm will initialize a random point.
tol: Represents the threshold for checking whether the k-means algorithm can terminate.
centers = k_means_model.cluster_centers_
centers
The output of centers will be as follows:
array([[7.625 , 0.75 ],
[3.1 , 9.6 ],
[1.33333333, 1.5 ]])
This output is showing the coordinates of the center of our three clusters. If you look back at Figure 5.4, you will see that the center points of the clusters appear to be in the bottom-left, (1.3, 1.5), the top-left (3.1, 9.6), and the bottom-right (7.265, 0.75) corners of the graph. The x coordinate of the top-left cluster is 3.1, most likely because it contains our outlier data point at [10, 10].
labels = k_means_model.labels_
labels
The output of labels will be as follows:
array([2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 1, 1])
The output array shows which data point belongs to which cluster. This is all we need to plot the data.
plot.scatter(centers[:,0], centers[:,1])
for i in range(len(data_points)):
plot.plot(data_points[i][0], data_points[i][1],
['k+','kx','k_'][k_means_model.labels_[i]])
plot.show()
In the preceding code snippets, we used the matplotlib library to plot the data points with the center of each coordinate. Each cluster has its marker (x, +, and -), and its center is represented by a filled circle.
The expected output is this:
Having a look at Figure 5.5, you can see that the center points are inside their clusters, which are represented by the x, +, and - marks.
k_means_model = KMeans(n_clusters=2,random_state=8)
k_means_model.fit(data_points)
centers2 = k_means_model.cluster_centers_
labels2 = k_means_model.labels_
plot.scatter(centers2[:,0], centers2[:,1])
for i in range(len(data_points)):
plot.plot(data_points[i][0], data_points[i][1],
['k+','kx'][labels2[i]])
plot.show()
The expected output is this:
This time, we only have x and + points, and we can clearly see a bottom cluster and a top cluster. Interestingly, the top cluster in the second try contains the same points as the top cluster in the first try. The bottom cluster of the second try consists of the data points joining the bottom-left and the bottom-right clusters of the first try.
predictions = k_means_model.predict([[5,5],[0,10]])
predictions
The output of predictions is as follows:
array([0, 1], dtype=int32)
This means that our first point belongs to the first cluster (at the bottom) and the second point belongs to the second cluster (at the top).
Note
To access the source code for this specific section, please refer to https://packt.live/2CpvMDo.
You can also run this example online at https://packt.live/2Nnv7F2. You must execute the entire Notebook in order to get the desired result.
By completing this exercise, you were able to use a simple k-means clustering model on sample data points.
Like the classification and regression models in Chapter 2, An Introduction to Regression, Chapter 3, An Introduction to Classification, and Chapter 4, An Introduction to Decision Trees, the k-means algorithm can also be parameterized. The complete list of parameters can be found at http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html.
Some examples are as follows:
We also used two attributes to retrieve the cluster center points and the clusters themselves:
In this exercise, you will be able to understand the usage of cluster_centers_ and labels_.
The following steps will help you complete the exercise:
import numpy as np
import matplotlib.pyplot as plot
from sklearn.cluster import KMeans
data_points = np.array([[1, 1], [1, 1.5], [2, 2],
[8, 1], [8, 0], [8.5, 1],
[6, 1], [1, 10], [1.5, 10],
[1.5, 9.5], [10, 10], [1.5, 8.5]])
k_means_model = KMeans(n_clusters=4,random_state=8)
k_means_model.fit(data_points)
centers = k_means_model.cluster_centers_
centers
The output of centers is as follows:
array([[ 7.625 , 0.75 ],
[ 1.375 , 9.5 ],
[ 1.33333333, 1.5 ],
[10. , 10. ]])
The output of the cluster_centers_ property shows the x and y coordinates of the center points.
From the output, we can see the 4 centers, which are bottom right (7.6, 0.75), top left (1.3, 9.5), bottom left (1.3, 1.5), and top right (10, 10). We can also note that the fourth cluster (the top-right cluster) is only made of a single data point. This data point can be assumed to be an outlier.
labels = k_means_model.labels_
labels
The output of labels is as follows:
array([2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 3, 1], dtype=int32)
The labels_ property is an array of length 12, showing the cluster of each of the 12 data points it belongs to. The first cluster is associated with the number 0, the second is associated with 1, the third is associated with 2, and so on (remember that Python indexes always start from 0 and not 1).
Note
To access the source code for this specific section, please refer to https://packt.live/3dmHsDX.
You can also run this example online at https://packt.live/2B0ebld. You must execute the entire Notebook in order to get the desired result.
By completing this exercise, you were able to retrieve the coordinates of a cluster's center. You were also able to see which label (cluster) each data point has been assigned to.
In the upcoming activity, we will be looking at sales data, and we will perform k-means clustering on that sales data.
In this activity, you will work on the Sales Transaction Dataset Weekly dataset, which contains the weekly sales data of 800 products over 1 year. Our dataset won't contain any information regarding the product except sales.
Your goal will be to identify products with similar sales trends using the k-means clustering algorithm. You will have to experiment with the number of clusters in order to find the optimal number of clusters.
Note
The dataset can be found at https://archive.ics.uci.edu/ml/datasets/Sales_Transactions_Dataset_Weekly.
The dataset file can also be found in our GitHub repository: https://packt.live/3hVH42v.
Citation: Tan, S., & San Lau, J. (2014). Time series clustering: A superior alternative for market basket analysis. In Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013) (pp. 241–248).
The following steps will help you complete this activity:
The expected output is this:
Note
The solution to this activity is available on page 363.
Now that you have seen the k-means algorithm in detail, we will move on to another type of clustering algorithm, the mean shift algorithm.
Mean shift is a hierarchical clustering algorithm that assigns data points to a cluster by calculating a cluster's center and moving it towards the mode at each iteration. The mode is the area with the most data points. At the first iteration, a random point will be chosen as the cluster's center and then the algorithm will calculate the mean of all nearby data points within a certain radius. The mean will be the new cluster's center. The second iteration will then begin with the calculation of the mean of all nearby data points and setting it as the new cluster's center. At each iteration, the cluster's center will move closer to where most of the data points are. The algorithm will stop when it is not possible for a new cluster's center to contain more data points. When the algorithm stops, each data point will be assigned to a cluster.
The mean shift algorithm will also determine the number of clusters needed, in contrast with the k-means algorithm. This is advantageous as we rarely know how many clusters we are looking for.
This algorithm also has many use cases. For instance, the Xbox Kinect device detects human body parts using the mean shift algorithm. Each main body part (head, arms, legs, hands, and so on) is a cluster of data points assigned by the mean shift algorithm.
In the next exercise, we will be implementing the mean shift algorithm.
In this exercise, we will implement clustering by using the mean shift algorithm.
We will use the scipy.spatial library in order to compute the Euclidean distance, seen in Chapter 1, Introduction to Artificial Intelligence. This library simplifies the calculation of distances (such as Euclidean or Manhattan) between a list of coordinates. More details about this library can be found at https://docs.scipy.org/doc/scipy/reference/spatial.distance.html#module-scipy.spatial.distance.
The following steps will help you complete the exercise:
import numpy as np
data_points = np.array([[1, 1], [1, 1.5], [2, 2],
[8, 1], [8, 0], [8.5, 1],
[6, 1], [1, 10], [1.5, 10],
[1.5, 9.5], [10, 10], [1.5, 8.5]])
import matplotlib.pyplot as plot
plot.scatter(data_points.transpose()[0],
data_points.transpose()[1])
Our task now is to find point P (x, y), for which the number of data points within radius R from point P is maximized. The points are distributed as follows:
P1 = [1, 1]
from scipy.spatial import distance
r = 2
points1 = np.array([p0 for p0 in data_points if
distance.euclidean(p0, P1) <= r])
points1
In the preceding code snippet, we used the Euclidean distance to find all the points that fall within the r radius of point P1.
The output of points1 will be as follows:
array([[1. , 1. ],
[1. , 1.5],
[2. , 2. ]])
From the output, we can see that we found three points that fall within the radius of P1. They are the three points at the bottom left of the graph we saw earlier, in Figure 5.8 of this chapter.
P2 = [np.mean( points1.transpose()[0] ),
np.mean(points1.transpose()[1] )]
P2
In the preceding code snippet, we have calculated the mean of the array containing the three data points in order to obtain the new coordinates of P2.
The output of P2 will be as follows:
[1.3333333333333333, 1.5]
points2 = np.array([p0 for p0 in data_points if
distance.euclidean( p0, P2) <= r])
points2
The output of points will be as follows:
array([[1. , 1. ],
[1. , 1.5],
[2. , 2. ]])
These are the same three points that we found in Step 4, so we can stop here. Three points have been found around the mean of [1.3333333333333333, 1.5]. The points around this center within a radius of 2 form a cluster.
P3 = [8, 1]
points3 = np.array( [p0 for p0 in data_points if
distance.euclidean(p0, P3) <= r])
points3
In the preceding code snippet, we used the same code as Step 4 but with a new P3.
The output of points3 will be as follows:
array([[8. , 1. ],
[8. , 0. ],
[8.5, 1. ],
[6. , 1. ]])
This time, we found four points inside the radius r of P4.
P4 = [np.mean(points3.transpose()[0]),
np.mean(points3.transpose()[1])]
P4
In the preceding code snippet, we calculated the mean of the array containing the four data points in order to obtain the new coordinates of P4, as in Step 5.
The output of P4 will be as follows:
[7.625, 0.75]
This mean will not change because in the next iteration, we will find the same data points.
P5 = [8, 0]
points4 = np.array([p0 for p0 in data_points if
distance.euclidean(p0, P5) <= r])
points4
In the preceding code snippet, we used the same code as in Step 4 but with a new P5.
The output of points4 will be as follows:
array([[8. , 1. ],
[8. , 0. ],
[8.5, 1. ]])
This time, we found three points inside the radius r of P5.
P6 = [np.mean(points4.transpose()[0]),
np.mean(points4.transpose()[1])]
P6
In the preceding code snippet, we calculated the mean of the array containing the three data points in order to obtain the new coordinates of P6.
The output of P6 will be as follows:
[8.166666666666666, 0.6666666666666666]
P7 = [8.5, 1]
points5 = np.array([p0 for p0 in data_points if
distance.euclidean(p0, P7) <= r])
points5
In the preceding code snippet, we used the same code as in Step 4 but with a new P7.
The output of points5 will be as follows:
array([[8. , 1. ],
[8. , 0. ],
[8.5, 1. ]])
This time, we found the same three points again inside the radius r of P. This means that starting from [8,1], we got a larger cluster than starting from [8, 0] or [8.5, 1]. Therefore, we must take the center point that contains the maximum number of data points.
P8 = [6, 1]
points6 = np.array([p0 for p0 in data_points if
distance.euclidean(p0, P8) <= r])
points6
In the preceding code snippet, we used the same code as in Step 4 but with a new P8.
The output of points6 will be as follows:
array([[8., 1.],
[6., 1.]])
This time, we found only two data points inside the radius r of P8. We successfully found the data point [8, 1].
P9 = [np.mean(points6.transpose()[0]),
np.mean(points6.transpose()[1]) ]
P9
In the preceding code snippet, we calculated the mean of the array containing the three data points in order to obtain the new coordinates of P9, as in Step 5.
The output of P9 will be as follows:
[7.0, 1.0]
points7 = np.array([p0 for p0 in data_points if
distance.euclidean(p0, P9) <= r])
points7
In the preceding code snippet, we used the same code as in Step 4 but with a new P9.
The output of points7 will be as follows:
array([[8. , 1. ],
[8. , 0. ],
[8.5, 1. ],
[6. , 1. ]])
We successfully found all four points. Therefore, we have successfully defined a cluster of size 4. The mean will be the same as before: [7.625, 0.75].
Note
To access the source code for this specific section, please refer to https://packt.live/3drUZtE.
You can also run this example online at https://packt.live/2YoSu78. You must execute the entire Notebook in order to get the desired result.
This was a simple clustering example that applied the mean shift algorithm. We only illustrated what the algorithm considers when finding clusters.
However, there is still one question, and that is what will the value of the radius be?
Note that if the radius of 2 was not set, we could simply start either with a huge radius that includes all data points and then reduce the radius, or we could start with a tiny radius, making sure that each data point is in its cluster, and then increase the radius until we get the desired result.
In the next section, we will be looking at the mean shift algorithm but using scikit-learn.
Let's use the same data points we used with the k-means algorithm:
import numpy as np
data_points = np.array([[1, 1], [1, 1.5], [2, 2],
[8, 1], [8, 0], [8.5, 1],
[6, 1], [1, 10], [1.5, 10],
[1.5, 9.5], [10, 10], [1.5, 8.5]])
The syntax of the mean shift clustering algorithm is like the syntax for the k-means clustering algorithm:
from sklearn.cluster import MeanShift
mean_shift_model = MeanShift()
mean_shift_model.fit(data_points)
Once the clustering is done, we can access the center point of each cluster:
mean_shift_model.cluster_centers_
The expected output is this:
array([[ 1.375 , 9.5 ],
[ 8.16666667, 0.66666667],
[ 1.33333333, 1.5 ],
[10. , 10. ],
[ 6. , 1. ]])
The mean shift model found five clusters with the centers shown in the preceding code.
Like k-means, we can also get the labels:
mean_shift_model.labels_
The expected output is this:
array([2, 2, 2, 1, 1, 1, 4, 0, 0, 0, 3, 0], dtype=int64)
The output array shows which data point belongs to which cluster. This is all we need to plot the data:
import matplotlib.pyplot as plot
plot.scatter(mean_shift_model.cluster_centers_[:,0],
mean_shift_model.cluster_centers_[:,1])
for i in range(len(data_points)):
plot.plot(data_points[i][0], data_points[i][1],
['k+','kx','kv', 'k_', 'k1']
[mean_shift_model.labels_[i]])
plot.show()
In the preceding code snippet, we made a plot of the data points and the centers of the five clusters. Each data point belonging to the same cluster will have the same marker. The cluster centers are marked as a dot.
The expected output is this:
We can see that three clusters contain more than a single dot (the top left, the bottom left, and the bottom right). The two single data points that are also their own cluster can be seen as outliers, as mentioned previously, as they are too far from the other clusters to be part of any of them.
Now that we have learned about the mean shift algorithm, we can have look at hierarchical clustering, and more specifically at agglomerative hierarchical clustering (the bottom-up approach).
Hierarchical clustering algorithms fall into two categories:
We will only talk about agglomerative hierarchical clustering in this chapter, as it is the most widely used and most efficient of the two approaches.
Agglomerative hierarchical clustering treats each data point as a single cluster in the beginning and then successively merges (or agglomerates) the closest clusters together in pairs. In order to find the closest data clusters, agglomerative hierarchical clustering uses a heuristic such as the Euclidean or Manhattan distance to define the distance between data points. A linkage function will also be required to aggregate the distance between data points in clusters in order to define a unique value of the closeness of clusters.
Examples of linkage functions include single linkage (simple distance), average linkage (average distance), maximum linkage (maximum distance), and Ward linkage (square difference). The pairs of clusters with the smallest value of linkage will be grouped together. The grouping is repeated until we reach a single cluster containing every data point. In the end, this algorithm terminates when there is only a single cluster left.
In order to visually represent the hierarchy of clusters, a dendrogram can be used. A dendrogram is a tree where the leaves at the bottom represent data points. Each intersection between two leaves is the grouping of these two leaves. The root (top) represents a unique cluster that contains all the data points. Have a look at Figure 5.10, which represents a dendrogram.
Have a look at the following example, where we use the same data points as we used with the k-means algorithm:
import numpy as np
data_points = np.array([[1, 1], [1, 1.5], [2, 2],
[8, 1], [8, 0], [8.5, 1],
[6, 1], [1, 10], [1.5, 10],
[1.5, 9.5], [10, 10], [1.5, 8.5]])
In order to plot a dendrogram, we need to first import the scipy library:
from scipy.cluster.hierarchy import dendrogram
import scipy.cluster.hierarchy as sch
Then we can plot a dendrogram using SciPy with the ward linkage function, as it is the most commonly used linkage function:
dendrogram = sch.dendrogram(sch.linkage(data_points,
method='ward'))
The output of the dendrogram will be as follows:
With the dendrogram, we can generally guess what will be a good number of clusters by simply drawing a horizontal line as shown in Figure 5.12, in the area with the highest vertical distance, and counting the number of intersections. In this case, it should be two clusters, but we will go to the next biggest area as two is too small a number.
The y axis represents the measure of closeness, and the x axis represents the index of each data point. So our first three data points (0,1,2) are parts of the same cluster, then another cluster is made of the next four points (3,4,5,6), data point 10 is a cluster on its own, and the remaining data points (7,8,9,11) form the last cluster.
The syntax of the agglomerative hierarchical clustering algorithm is similar to the k-means clustering algorithm except that we need to specify the number type of affinity (here, we choose the Euclidean distance) and the linkage (here, we choose the ward linkage):
from sklearn.cluster import AgglomerativeClustering
agglomerative_model = AgglomerativeClustering(n_clusters=4,
affinity='euclidean',
linkage='ward')
agglomerative_model.fit(data_points)
The output is:
AgglomerativeClustering(affinity='euclidean',
compute_full_tree='auto',
connectivity=None,
distance_threshold=None,
linkage='ward', memory=None,
n_clusters=4, pooling_func='deprecated')
Similar to k-means, we can also get the labels as shown in the following code snippet:
agglomerative_model.labels_
The expected output is this:
array([2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 3, 1], dtype=int64)
The output array shows which data point belongs to which cluster. This is all we need to plot the data:
import matplotlib.pyplot as plot
for i in range(len(data_points)):
plot.plot(data_points[i][0], data_points[i][1],
['k+','kx','kv', 'k_'][agglomerative_model.labels_[i]])
plot.show()
In the preceding code snippet, we made a plot of the data points and the four clusters' centers. Each data point belonging to the same cluster will have the same marker.
The expected output is this:
We can see that, in contrast with the result from the mean shift method, agglomerative clustering was able to properly group the data point at (6,1) with the bottom-right cluster instead of having his own cluster. In situations like this one, where we have a very small amount of data, agglomerative hierarchical clustering and mean shift will work better than k-means. However, they have very expensive computational time requirements, which will make them struggle on very large datasets. However, k-means is very fast and is a better choice for very large datasets.
Now that we have learned about a few different clustering algorithms, we need to start evaluating these models and comparing them in order to choose the best model for clustering.
Unlike supervised learning, where we always have the labels to evaluate our predictions with, unsupervised learning is a bit more complex as we do not usually have labels. In order to evaluate a clustering model, two approaches can be taken depending on whether the label data is available or not:
Note
We will skip the mathematical explanations as they are quite complicated.
You can find more mathematical details on the sklearn website at this URL: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
We will begin with the extrinsic approach (as it is the most widely used method) and define the following scores using sklearn on our k-means example:
Let's have a look at an example in which we first need to import the metrics module from sklearn.cluster:
from sklearn import metrics
We will be reusing the code from our k-means example in Exercise 5.01, Implementing K-Means in scikit-learn:
import numpy as np
import matplotlib.pyplot as plot
from sklearn.cluster import KMeans
data_points = np.array([[1, 1], [1, 1.5], [2, 2],
[8, 1], [8, 0], [8.5, 1],
[6, 1], [1, 10], [1.5, 10],
[1.5, 9.5], [10, 10], [1.5, 8.5]])
k_means_model = KMeans(n_clusters=3,random_state = 8)
k_means_model.fit(data_points)
k_means_model.labels_
The output of our predicted labels using k_means_model.labels_ was:
array([2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 1, 1])
Finally, define the true labels of this dataset, as shown in the following code snippet:
data_labels = np.array([0, 0, 0, 2, 2, 2, 2, 1, 1, 1, 3, 1])
The adjusted Rand index is a function that measures the similarity between the cluster predictions and the labels while ignoring permutations. The adjusted Rand index works quite well when the labels are large equal-sized clusters.
The adjusted Rand index has a range between [-1.1], where negative values are not desirable. A negative score means that our model is performing worse than if we were to randomly assign labels. If we were to randomly assign them, our score would be close to 0. However, the closer we are to 1, the better our clustering model is at predicting the right label.
With sklearn, we can easily compute the adjusted Rand index by using this code:
metrics.adjusted_rand_score(data_labels, k_means_model.labels_)
The expected output is this:
0.8422939068100358
In this case, the adjusted Rand index indicates that our k-means model is not far from our true labels.
The adjusted mutual information is a function that measures the entropy between the cluster predictions and the labels while ignoring permutations.
The adjusted mutual information has no defined range, but negative values are considered bad. The closer we are to 1, the better our clustering model is at predicting the right label.
With sklearn, we can easily compute it by using this code:
metrics.adjusted_mutual_info_score(data_labels,
k_means_model.labels_)
The expected output is this:
0.8769185235006342
In this case, the adjusted mutual information indicates that our k-means model is quite good and not far from our true labels.
The V-Measure is defined as the harmonic mean of homogeneity and completeness. The harmonic mean is a type of average (other types are the arithmetic mean and the geometric mean) using reciprocals (a reciprocal is the inverse of a number. For example the reciprocal of 2 is , and the reciprocal of 3 is ).
The formula of the harmonic mean is as follows:
is the number of values and is the value of each point.
In order to calculate the V-Measure, we first need to define homogeneity and completeness.
Perfect homogeneity refers to a situation where each cluster has data points belonging to the same label. The homogeneity score will reflect how well each of our clusters is grouping data from the same label.
Perfect completeness refers to the situation where all data points belonging to the same label are clustered into the same cluster. The homogeneity score will reflect how well, for each of our labels, its data points are all grouped inside the same cluster.
Hence, the formula of V-Measure is as follows:
has a default value of 1, but it can be changed to further emphasize either homogeneity or completeness.
These three scores have a range between [0,1], with 0 being the worst possible score and 1 being the perfect score.
With sklearn, we can easily compute these three scores by using this code:
metrics.homogeneity_score(data_labels, k_means_model.labels_)
metrics.completeness_score(data_labels, k_means_model.labels_)
metrics.v_measure_score(data_labels, k_means_model.labels_,
beta=1)
The output of homogeneity_score is as follows:
0.8378758055108827
In this case, the homogeneity score indicates that our k-means model has clusters containing different labels.
The output of completeness_score is as follows:
1.0
In this case, the completeness score indicates that our k-means model has successfully put every data point of each label inside the same cluster.
The output of v_measure_score is as follows:
0.9117871871412709
In this case, the V-Measure indicates that our k-means model, while not being perfect, has a good score in general.
The Fowlkes-Mallows score is a metric measuring the similarity within a label cluster and the prediction of the cluster, and this is defined as the geometric mean of the precision and recall (you learned about this in Chapter 4, An Introduction to Decision Trees).
The formula of the Fowlkes-Mallows score is as follows:
Let's break this down:
The Fowlkes-Mallows score has a range between [0, 1], with 0 being the worst possible score and 1 being the perfect score.
With sklearn, we can easily compute it by using this code:
metrics.fowlkes_mallows_score(data_labels, k_means_model.labels_)
The expected output is this:
0.8885233166386386
In this case, the Fowlkes-Mallows score indicates that our k-means model is quite good and not far from our true labels.
The contingency matrix is not a score, but it reports the intersection cardinality for every true/predicted cluster pair and the required label data. It is very similar to the Confusion Matrix seen in Chapter 4, An Introduction to Decision Trees. The matrix must be the same for the label and cluster name, so we need to be careful to give our cluster the same name as our label, which was not the case with the previously seen scores.
We will modify our labels from this:
data_labels = np.array([0, 0, 0, 2, 2, 2, 2, 1, 1, 1, 3, 1])
To this:
data_labels = np.array([2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 3, 0])
Then, with sklearn, we can easily compute the contingency matrix by using this code:
from sklearn.metrics.cluster import contingency_matrix
contingency_matrix(k_means_model.labels_,data_labels)
The output of contingency_matrix is as follows:
array([[0, 4, 0, 0],
[4, 0, 0, 1],
[0, 0, 3, 0]])
The first row of the contingency_matrix output indicates that there are 4 data points whose true cluster is the first cluster (0). The second row indicates that there are also four data points whose true cluster is the second cluster (1); however, an extra 1 was incorrectly predicted in this cluster, but it belongs to the fourth cluster (3). The third row indicates that there are three data points whose true cluster is the third cluster (2).
We will now look at the intrinsic approach, which is required when we do not have the label. We will define the following scores using sklearn on our k-means example:
The Silhouette Coefficient is an example of an intrinsic evaluation. It measures the similarity between a data point and its cluster when compared to other clusters.
It comprises two scores:
The Silhouette Coefficient formula is:
The Silhouette Coefficient has a range between [-1,1], with -1 meaning an incorrect clustering. A score close to zero indicates that our clusters are overlapping. A score close to 1 indicates that all the data points are assigned to the appropriate clusters.
Then, with sklearn, we can easily compute the silhouette coefficient by using this code:
metrics.silhouette_score(data_points, k_means_model.labels_)
The output of silhouette_score is as follows:
0.6753568188872228
In this case, the Silhouette Coefficient indicates that our k-means model has some overlapping clusters, and some improvements can be made by separating some of the data points from one of the clusters.
The Calinski-Harabasz index measures how the data points inside each cluster are spread. It is defined as the ratio of the variance between clusters and the variance inside each cluster. The Calinski-Harabasz index doesn't have a range and starts from 0. The higher the score is, the denser our clusters are. A dense cluster is an indication of a well-defined cluster.
With sklearn, we can easily compute it by using this code:
metrics.calinski_harabasz_score(data_points, k_means_model.labels_)
The output of calinski_harabasz_score is as follows:
19.52509172315154
In this case, the Calinski-Harabasz index indicates that our k-means model clusters are quite spread out and suggests that we might have overlapping clusters.
The Davies-Bouldin index measures the average similarity between clusters. The similarity is a ratio of the distance between a cluster and its closest cluster and the average distance between each data point of a cluster and it's cluster's center. The Davies-Bouldin index doesn't have a range and starts from 0. The closer the score is to 0 the better; it means the clusters are well separated, which is an indication of a good cluster.
With sklearn, we can easily compute the Davis-Bouldin index by using this code:
metrics.davies_bouldin_score(data_points, k_means_model.labels_)
The output of davies_bouldin_score is as follows:
0.404206621415983
In this case, the Calinski-Harabasz score indicates that our k-means model has some overlapping clusters and an improvement could be made by better separating some of the data points in one of the clusters.
In this activity, you will work on the Wine Quality dataset and, more specifically, on red wine data. This dataset contains data on the quality of 1,599 red wines and the results of their chemical tests.
Your goal will be to build two clustering models (using the mean shift algorithm and agglomerative hierarchical clustering) in order to identify whether wines of similar quality also have similar physicochemical properties. You will also have to evaluate and compare the two clustering models using extrinsic and intrinsic approaches.
Note
The dataset can be found at the following URL: https://archive.ics.uci.edu/ml/datasets/Wine+Quality.
The dataset file can be found on our GitHub repository at https://packt.live/2YYsxuu.
Citation: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
The following steps will help you complete the activity:
The adjusted Rand index
The adjusted mutual information
The V-Measure
The Fowlkes-Mallows score
The Silhouette Coefficient
The Calinski-Harabasz index
The Davies-Bouldin index
The expected output is this:
The values of each score for the mean shift clustering model will be as follows:
The values of each score for the agglomerative hierarchical clustering will be as follows:
Note
The solution to this activity is available on page 368.
By completing this activity, you performed mean shift and agglomerative hierarchical clustering on multiple columns for many products. You also learned how to evaluate a clustering model with an extrinsic and intrinsic approach. Finally, you used the results of your models and their evaluation to find an answer to a real-world problem.
In this chapter, we learned the basics of how clustering works. Clustering is a form of unsupervised learning where the features are given, but not the labels. It is the goal of the clustering algorithms to find the labels based on the similarity of the data points.
We also learned that there are two types of clustering, flat and hierarchical, with the first type requiring the number of clusters to find, whereas the second type finds the optimal number of clusters itself.
The k-means algorithm is an example of flat clustering, whereas mean shift and agglomerative hierarchical clustering are examples of a hierarchical clustering algorithm.
We also learned about the numerous scores to evaluate the performance of a clustering model, with the labels in the extrinsic approach or without the labels in the intrinsic approach.
In Chapter 6, Neural Networks and Deep Learning, you will be introduced to a field that has become popular in this decade due to the explosion of computation power and cheap, scalable online server capacity. This field is the science of neural networks and deep learning.