In this chapter, we will have a look at some advanced clustering methods and how to record clusters in a dendrogram. By the end of this chapter, you will be able to:
- Perform k-modes clustering on categorical data
- Identify naturally occurring clusters and outliers with DBSCAN
- Perform agglomerative and divisive hierarchical clustering and interpret the resulting dendrograms
So far, we've learned about two of the most basic algorithms of unsupervised learning: k-means clustering and k-medoids clustering. These are important not only for practical use but also for understanding clustering itself.
In this chapter, we're going to study some other, more advanced clustering algorithms. We aren't calling them advanced because they are difficult to understand, but because a data scientist should have insight into why to use them instead of the general clustering algorithms we studied in the last chapter. k-means is a general-purpose clustering algorithm that is sufficient for most cases, but in some special cases, depending on the type of data, advanced clustering algorithms can produce better results.
All the types of clustering that we have studied so far are based on a distance metric. But what if we get a dataset in which the distance between data points can't be measured in the traditional sense, as is the case with categorical variables? In such cases, we use k-modes clustering.
k-modes clustering is an extension of k-means clustering, dealing with modes instead of means. One of the major applications of k-modes clustering is analyzing categorical data such as survey results.
In statistics, the mode is defined as the most frequently occurring value. So, for k-modes clustering, we calculate the mode of the categorical values in each cluster to choose its center. The steps to perform k-modes clustering are as follows:
1. Select k random data points as the initial cluster centers.
2. Measure the dissimilarity between each data point and each center as the number of attributes on which they disagree (the simple-matching distance).
3. Assign each data point to the cluster with the least dissimilar center.
4. For each cluster, compute the mode of every attribute across its members and make these modes the new center.
5. Repeat steps 2 to 4 until the cluster assignments no longer change.
You might have noticed that these steps are very similar to those for k-means clustering. Only the type of distance metric is changed here. So, if you understand k-means clustering, it will be very easy for you to understand k-modes clustering as well.
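To make the distance metric concrete, here is a minimal sketch (not part of the exercise that follows) of the two ingredients of k-modes: the simple-matching distance between two categorical records, and the column-wise mode that plays the role of a cluster center. The records and values here are made up purely for illustration:

# Simple-matching distance: the number of attributes on which two records disagree
match_dist <- function(a, b) sum(a != b)
rec1 <- c("30-39", "premeno", "left")
rec2 <- c("40-49", "premeno", "right")
match_dist(rec1, rec2)   # 2, because two of the three attributes differ
# Column-wise mode of a small categorical cluster: the most frequent value per column
small_cluster <- data.frame(age  = c("30-39", "40-49", "40-49"),
                            side = c("left", "left", "right"))
sapply(small_cluster, function(col) names(which.max(table(col))))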
For all the exercises and activities where we import external CSV files or images, go to RStudio -> Session -> Set Working Directory -> To Source File Location. You will see in the console that the path is set automatically.
The data for this exercise can be downloaded from here: https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson02/Exercise10/breast_cancer.csv. This is a categorical dataset of nine variables, some nominal and some ordinal, describing different breast cancer cases. After saving this data in a file called breast_cancer.csv, we'll do the following:
This dataset is taken from the UCI Machine Learning Repository. You can find the dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data. This breast cancer domain was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. Thanks go to M. Zwitter and M. Soklic for providing the data. We have downloaded the file and saved it at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson02/Exercise10/breast_cancer.csv.
# Read the raw CSV; it has no header row, so R names the columns V1 to V10
bc_data <- read.csv('breast_cancer.csv', header = FALSE)
# Keep the nine attribute columns, dropping column 1, which holds the class label
k_bc_data <- bc_data[, 2:10]
# Inspect the first six rows
head(k_bc_data)
The output contains the first six rows, with values for different attributes describing the patient, their symptoms, and their treatment:
     V2      V3    V4  V5 V6 V7    V8        V9 V10
1 30-39 premeno 30-34 0-2 no  3  left  left_low  no
2 40-49 premeno 20-24 0-2 no  2 right  right_up  no
3 40-49 premeno 20-24 0-2 no  2  left  left_low  no
4 60-69    ge40 15-19 0-2 no  2 right   left_up  no
5 40-49 premeno   0-4 0-2 no  2 right right_low  no
6 60-69    ge40 15-19 0-2 no  2  left  left_low  no
install.packages("klaR")
library(klaR)
k.centers<-kmodes(k_bc_data,2,iter.max = 100)
k.centers
The output is as follows:
The clustering algorithm has grouped all of the breast cancer cases into two clusters, with each cluster containing cases that are similar to each other. There are two main components in the output: the cluster modes and the clustering vector. The cluster modes section gives the modes, or coordinates of the centers, for cluster 1 and cluster 2. Below that, the clustering vector contains the cluster number of each data point, in index order.
You could get different results every time you run the algorithm because of the random starting positions of the centers.
As it is a multi-dimensional categorical dataset, there's no easy way to visualize the results other than printing the data to the R console.
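One quick check we can still do in the console is to compare the clusters with the class label we set aside in column 1 (recurrence versus no recurrence of the cancer). This short sketch assumes the objects created in the preceding exercise; the kmodes() result stores the assignments in its cluster component and the cluster sizes in size:

k.centers$size                          # number of cases in each cluster
table(bc_data[, 1], k.centers$cluster)  # class label versus cluster number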
This dataset is taken from the UCI Machine Learning Repository. You can find the dataset at https://archive.ics.uci.edu/ml/datasets/Mushroom. We have downloaded the file, cleaned the data, and saved it at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson02/Activity05.
In this activity, we will perform k-modes clustering on the mushroom dataset. This dataset describes samples of 23 different species of gilled mushrooms. Each sample is classified as either edible (e) or poisonous (p). We will see how well unsupervised learning can separate poisonous and edible mushrooms by grouping the data into two clusters. These steps will help you complete the activity:
1. Download the dataset from the GitHub link in the preceding note and read it into R.
2. Set aside the first column, which holds the true edible/poisonous labels, and keep the attribute columns.
3. Perform k-modes clustering on the attribute columns with two clusters.
4. Check the results by building a table of the true labels against the cluster labels.
The solution for this activity can be found on page 212.
The output will be a table of true labels and cluster labels, as follows:
      1    2
e    80 4128
p  3052  864
Density-based spatial clustering of applications with noise, or DBSCAN, is one of the most intuitive forms of clustering. It makes it very easy to find naturally occurring clusters and outliers in data. Also, it doesn't require you to specify the number of clusters. For example, consider the following figure:
There are four natural clusters in this dataset and a few outliers. So, DBSCAN will segregate the clusters and outliers, as depicted in the following figure, without you having to tell it how many clusters to identify in the dataset:
So, DBSCAN can find regions of high density separated by regions of low density in a scatter plot.
As mentioned before, DBSCAN doesn't require you to choose the number of clusters, but you do have to choose two other parameters. The first, commonly denoted by ε (epsilon), is the maximum distance at which two points are considered neighbors. The second, usually denoted by minPts, is the minimum number of points that must lie within ε of a point for it to be considered a core point of a cluster. Now we'll look at the DBSCAN clustering algorithm step by step:
1. Pick a point from the dataset that hasn't been visited yet.
2. Find all the points that lie within ε distance of it.
3. If there are at least minPts such points, mark the point as a core point and start a new cluster with it; otherwise, label it as noise for now.
4. Grow the cluster by adding every point within ε of any of its core points; any added point that is itself a core point keeps expanding the cluster further.
5. Repeat from step 1 until every point has either been assigned to a cluster or labeled as noise. A point labeled as noise that later turns out to lie within ε of a core point becomes a border point of that cluster.
To illustrate the preceding algorithm, let's take an example with epsilon equal to x and the minimum number of points in a cluster equal to four. Look at the following figure:
Only three points lie within x distance of point R1, and our threshold for the minimum number of points within the x radius is four. So, these four points will be classified as outliers or noise. But if there were one more point, R5, somewhere between R1 and R4, all of these points would belong to a cluster, as in the following figure:
In the preceding figure, points R1 and R5 are core points, as they each have four points within x distance of them. Points R2, R3, and R4 are border points rather than core points, as illustrated in the following figure:
Any point outside all of these circles would be classified as a noise point, like R6 in the following figure:
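Before moving to a real dataset, here is a minimal, self-contained sketch of the core-point test at the heart of DBSCAN. The points and parameter values are made up for illustration: three points sit close together and one sits far away:

# Four 2-D points: three close together, one far away
pts <- matrix(c(1.0, 1.0,
                1.2, 1.1,
                1.1, 0.9,
                5.0, 5.0), ncol = 2, byrow = TRUE)
eps    <- 0.5   # the "x" of the preceding example
minPts <- 3     # minimum neighborhood size for a core point
d <- as.matrix(dist(pts))        # pairwise Euclidean distances
neighbors <- rowSums(d <= eps)   # each point counts itself as a neighbor
neighbors >= minPts              # TRUE for the three close points, FALSE for the outlier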
In this exercise, we will have a look at the implementation of DBSCAN using the Iris dataset. To perform DBSCAN, we will execute the following steps:
# Use only the first two columns of iris (Sepal.Length and Sepal.Width)
iris_data <- iris[, 1:2]
# Install and load the dbscan package
install.packages("dbscan")   # only needed the first time
library(dbscan)
# Run DBSCAN with a neighborhood radius of 0.3 and a minimum of 5 points
clus <- dbscan(iris_data, eps = 0.3, minPts = 5)
You can experiment with the values of epsilon. To get the desired output, we have set the value as 0.3.
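If you don't want to guess epsilon by trial and error, the dbscan package also provides a k-nearest-neighbor distance plot that is commonly used to pick it: sort each point's distance to its k-th nearest neighbor and look for the "elbow" where the curve bends sharply. A short sketch, assuming the iris_data object from the previous step:

# Plot each point's distance to its 5th nearest neighbor, sorted in ascending
# order; the distance at the elbow of the curve is a reasonable choice for eps
kNNdistplot(iris_data, k = 5)
abline(h = 0.3, lty = 2)   # the value used above, shown for reference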
install.packages("factoextra") #use it only the first time if library is not installed already
library(factoextra)
fviz_cluster(clus,data=iris_data,geom = "point",palette="set2",ggtheme=theme_minimal(),ellipse=FALSE)
Here is the output:
In Figure 2.8, there are three clusters in orange, blue, and green. The points in black are noise or outliers. Points in orange here belong to one species, but points in green belong to two different species.
You can make a quick comparison of this scatter plot with Figure 1.18 of Chapter 1, Introduction to Clustering Methods.
DBSCAN is different from all the clustering methods we've studied so far, and it is one of the most frequently cited clustering methods in the scientific literature. There are a few advantages that DBSCAN has over other clustering methods:
- It doesn't require you to specify the number of clusters in advance.
- It can find clusters of arbitrary shape, not just roughly spherical ones.
- It has a built-in notion of noise, so outliers are identified rather than forced into a cluster.
But there are also a few disadvantages of DBSCAN:
- The results are sensitive to the choice of epsilon and minPts, and finding good values can take some experimentation.
- It struggles when clusters in the same dataset have very different densities, because a single value of epsilon cannot suit them all.
- Like other distance-based methods, it becomes less effective on high-dimensional data, where distances become less meaningful.
As DBSCAN is one of the most frequently cited clustering methods, it has many practical uses. Some of the practical real-life uses of DBSCAN clustering are the following:
- Detecting anomalies, such as fraudulent transactions, which show up as noise points rather than being forced into a cluster.
- Finding natural groupings in spatial data, such as crime or disease hotspots on a map.
- Segmenting images, for example, grouping the pixels of satellite images into regions such as water, vegetation, and land.
In this activity, we will perform DBSCAN and compare the results with k-means clustering. To do this, we are going to use the multishapes dataset, which contains simulated data that represents different shapes.
These steps will help you complete the activity:
1. Load the multishapes dataset from the factoextra package and keep its x and y columns.
2. Perform k-means clustering on the data and plot the resulting clusters.
3. Perform DBSCAN on the same data with suitable values of eps and minPts.
4. Plot the DBSCAN results and compare them with the k-means plot.
The solution for this activity can be found on page 213.
The plot of DBSCAN clustering will look as follows:
Here, all the points in black are anomalies and are not assigned to any cluster, and the clusters formed by DBSCAN cannot be obtained with the other clustering methods we have studied. These clusters take on several shapes and sizes, whereas in k-means, all clusters are approximately spherical.
The last type of clustering that we're going to study is hierarchical clustering. A hierarchy is defined as "a system in which people or things are placed in a series of levels with different importance or status." Hierarchical clustering merges or splits clusters sequentially, and this sequence is called a hierarchy. We can see it clearly in the output of a hierarchical clustering algorithm, a graph called a dendrogram.
Hierarchical clustering comes in two types:
- Agglomerative clustering
- Divisive clustering
Since both types of hierarchical clustering are similar, it makes sense to study both of them together.
Agglomerative clustering is also known as the bottom-up approach to hierarchical clustering. In this method, each data point is assumed to be a single cluster at the outset. From there, we start merging the most similar clusters according to a similarity or distance metric until all the data points are merged in a single cluster.
In divisive clustering, we do exactly the opposite. It is a top-down approach to hierarchical clustering. In this method, all the data points are assumed to be in a single cluster initially. From there, we split the cluster into more and more clusters until each data point is a cluster of its own. The differences and similarities between the two types will become clearer in the following sections. But first, we should understand why we need this other type of clustering and what special purpose it serves that other types don't. Hierarchical clustering is used mainly for the following reasons:
- We don't have to decide on the number of clusters before running the algorithm.
- The output, a dendrogram, is a readable record of how the clusters were formed, which helps in interpreting the structure of the data.
- Once the hierarchy is built, we can extract any number of clusters without rerunning the algorithm.
The combination of all the preceding factors makes hierarchical clustering an important clustering method in unsupervised learning.
As described previously, agglomerative hierarchical clustering is a bottom-up approach to hierarchical clustering. We merge the most similar clusters one by one based on a similarity metric. This similarity metric can be chosen from one of several different types:
- Complete linkage: the distance between two clusters is the distance between their two farthest members.
- Single linkage: the distance between two clusters is the distance between their two closest members.
- Average linkage: the distance between two clusters is the average of the distances between every pair of members, one from each cluster.
- Centroid linkage: the distance between two clusters is the distance between their centroids.
The distance between members of the same cluster is not measured in these similarity metrics.
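To see how these metrics differ in practice, here is a small sketch that computes all four by hand for two tiny one-dimensional clusters. The values are made up purely for illustration:

cluster_a <- c(1.0, 1.5)
cluster_b <- c(4.0, 5.0)
# All pairwise distances between members of the two clusters
pair_d <- abs(outer(cluster_a, cluster_b, "-"))
min(pair_d)                              # single linkage: closest pair
max(pair_d)                              # complete linkage: farthest pair
mean(pair_d)                             # average linkage: mean over all pairs
abs(mean(cluster_a) - mean(cluster_b))   # centroid linkage: distance between centroids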
With the knowledge of these similarity metrics, we can now understand the algorithm to perform agglomerative hierarchical clustering:
1. Start by treating each data point as a cluster of its own.
2. Compute the distance between every pair of clusters using the chosen similarity metric.
3. Merge the two closest clusters into one, recording the merge and the distance at which it happened.
4. Repeat steps 2 and 3 until all the points belong to a single cluster.
This whole process will produce a graph called a dendrogram. This graph records the clusters formed at each step. A simple dendrogram with very few elements would look as follows:
In the preceding dendrogram, suppose that point A and point B are the closest among all points according to the similarity measure we are using. Their closeness determines the height of the line joining them, which is at L1 for point A and point B. So, point A and point B are clustered first. After that, at L2, point D and point E are clustered; then, at L3, points A, B, and C are clustered. At that point, we have two clusters, which are joined together to form one cluster at L4.
Now, to get clusters from this dendrogram, we make horizontal cuts. For example, if we make a cut between L4 and L3, we will get two clusters:
The clusters will look as follows:
Similarly, if we make a horizontal cut in the dendrogram between L3 and L2, we will get three clusters:
The clusters will look as follows:
So, to get a different number of clusters, we don't need to rerun the process. With this method, we can extract any possible number of clusters from the data without executing the whole process again.
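Here is a tiny self-contained sketch of this idea, using made-up one-dimensional data. hclust() is run once, and cutree() then extracts two or three clusters from the same tree without rerunning anything (we'll meet both functions properly in the next exercise):

x <- c(1, 1.2, 3, 3.1, 9)   # five illustrative points
h <- hclust(dist(x))        # build the hierarchy once
cutree(h, k = 2)            # cluster labels when we ask for two clusters
cutree(h, k = 3)            # labels for three clusters, from the same tree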
In this exercise, we're going to perform agglomerative hierarchical clustering with different similarity measures and compare the results:
# Keep the petal measurements plus the Species labels for evaluation later
iris_data <- iris[, 3:5]
install.packages('cluster')   # only needed the first time
library(cluster)
# hclust() and cutree() come from base R's stats package. Build the tree
# from the two numeric petal columns only; Species is a factor, not a measurement
h.clus <- hclust(dist(iris_data[, 1:2]), method = "complete")
plot(h.clus)
The output is as follows:
# Cut the dendrogram to obtain three clusters
clusterCut <- cutree(h.clus, 3)
# Cross-tabulate the cluster assignments against the true species
table(clusterCut, iris_data$Species)
The output table is as follows:
The setosa and virginica species get classified accurately with this clustering method.
# Rebuild the tree with single linkage
h.clus <- hclust(dist(iris_data[, 1:2]), method = "single")
plot(h.clus)
The output is as follows:
Notice how this dendrogram is different from the dendrogram created with the complete similarity metric.
clusterCut <- cutree(h.clus, 3)
table(clusterCut, iris_data$Species)
The output is as follows:
Here, our clustering method has only managed to separate one class cleanly from the other two.
# Rebuild the tree with average linkage
h.clus <- hclust(dist(iris_data[, 1:2]), method = "average")
plot(h.clus)
The output is as follows:
clusterCut <- cutree(h.clus, 3)
table(clusterCut, iris_data$Species)
The output is as follows:
Here, with the average similarity metric, we get almost completely correct classification results.
# Rebuild the tree with centroid linkage
# (note: strictly, hclust's "centroid" method is intended for squared
# Euclidean distances; we keep the plain call used throughout this exercise)
h.clus <- hclust(dist(iris_data[, 1:2]), method = "centroid")
plot(h.clus)
The output is as follows:
clusterCut <- cutree(h.clus, 3)
table(clusterCut, iris_data$Species)
The output is as follows:
Although dendrograms of clusters with the average and centroid similarity metrics look different, they have the same number of elements in each cluster when we cut the dendrogram at three clusters.
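Since the only thing that changed between these four runs was the method argument, we can also compare all of them in one short loop. This sketch assumes the iris_data object defined at the start of the exercise:

# Compute the distances once, then cut each method's tree at three clusters
d <- dist(iris_data[, 1:2])
for (m in c("complete", "single", "average", "centroid")) {
  cl <- cutree(hclust(d, method = m), k = 3)
  cat("\nmethod =", m, "\n")
  print(table(cl, iris_data$Species))
}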
Divisive clustering is the opposite of agglomerative clustering. In agglomerative clustering, we start with each point as its own cluster, whereas in divisive clustering, we start with the whole dataset as one cluster and then divide it into more and more clusters:
So, the divisive clustering process can be summarized in one figure as follows:
From step 1 to step 6, the divisive clustering process keeps dividing points into further clusters until each point is a cluster of its own. Figure 2.26 shows the first six steps of the divisive clustering process, but the complete process continues splitting until every point is its own cluster.
So, for divisive clustering, we will use the DIANA algorithm (DIANA stands for DIvisive ANAlysis). For this, we need to carry out the following steps:
1. Start with all the data points in a single cluster.
2. Find the point with the highest average dissimilarity to all the other points in the cluster; this point starts a new splinter group.
3. Move every point that is closer to the splinter group than to the rest of its cluster into the splinter group, so that the cluster splits in two.
4. Repeat steps 2 and 3 on the largest remaining cluster until every point is a cluster of its own.
We're going to use the cluster library in R to perform DIANA clustering.
In this exercise, we will perform DIANA clustering:
# Keep the petal measurements plus the Species labels for evaluation later
iris_data <- iris[, 3:5]
install.packages("cluster")   # only needed the first time
library("cluster")
# Run DIANA clustering with Euclidean distances.
# Note: iris_data still contains the Species factor, which diana() coerces to
# numeric codes; pass iris_data[, 1:2] instead to cluster on the measurements alone
h.clus <- diana(iris_data, metric = "euclidean")
# pltree() draws the dendrogram of a diana or agnes object
pltree(h.clus, cex = 0.6, main = "Dendrogram of divisive clustering")
The dendrogram of the divisive clustering results appears as follows:
clusterCut <- cutree(h.clus, 3)
table(clusterCut, iris_data$Species)
The output is as follows:
clusterCut setosa versicolor virginica
         1     50          1         0
         2      0         49         0
         3      0          0        50
In the preceding output, only one flower is misclassified into another category. This is the best performance of all the clustering algorithms we have encountered in this book when it comes to classifying flower species without knowing anything about them.
In this activity, we will perform hierarchical cluster analysis on the seeds dataset and see how well the resulting clusters match the three types of seeds.
This dataset is taken from the UCI Machine Learning Repository. You can find the dataset at http://archive.ics.uci.edu/ml/machine-learning-databases/00236/. We have downloaded the file, cleaned it, and saved it at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson02/Activity07/seeds_data.txt.
These steps will help you complete the activity:
1. Download seeds_data.txt from the GitHub link in the preceding note and read it into R.
2. Set aside the last column, which holds the true seed type, and keep the measurement columns.
3. Build a hierarchical clustering model on the measurements and plot the dendrogram.
4. Cut the dendrogram into three clusters and build a table of the cluster labels against the true seed types.
The solution for this activity can be found on page 215.
The output of this activity will be a table that shows how the results of the clustering have performed at classifying the three types of seeds. It will look as follows:
Congratulations on completing the second chapter on clustering techniques! With this, we've covered all the major clustering techniques, including k-modes, DBSCAN, and both types of hierarchical clustering, and we've also looked at what connects them. We can apply these techniques to any type of dataset we may encounter. These new methods, at times, also produced better results on the same dataset that we used in the first chapter. In the next chapter, we're going to study probability distributions and their uses in exploratory data analysis.