Chapter 15

Clustering

In This Chapter

arrow Exploring the potential of unsupervised clustering

arrow Making K-means work with small and big data

arrow Trying DBScan as an alternative option

One of the basic abilities that humans have exercised since primitive times is to divide the known world into separate classes where individual objects share common features deemed important by the classifier. Starting with primitive cave dwellers classifying the natural world they lived in, distinguishing plants and animals useful or dangerous for their survival, we arrive at modern times in which marketing departments classify consumers into target segments and then act with proper marketing plans.

Classifying is crucial to our process of building new knowledge because, by gathering similar objects, we can

  • Refer to all the items in a class by the same name
  • Summarize the relevant features of a class with a representative example
  • Automatically associate particular actions or recall specific knowledge

Dealing with big data streams today requires the same classificatory ability, but on a different scale. To spot unknown groups of signals present in the data, we need specialized algorithms that can both learn how to assign examples to known, given classes (the supervised approach) and spot interesting new classes that we weren’t aware of (unsupervised learning).

Even though your main routine as a data scientist will be to put into practice your predictive skills, you’ll also have to provide useful insight into possible structured information present in your data. For example, you’ll often need to locate new features in order to strengthen the predictive power of your models, find an easy way to make complex comparisons inside the data, and discover communities in social networks.

A data-driven approach to classification, called clustering, will prove to be of great help in achieving success for your data project when you need to provide new insights from scratch.

Clustering techniques are a set of unsupervised classification methods that can create meaningful classes by directly processing your data, without any previous knowledge or hypothesis about the groups that may be present. Whereas supervised algorithms need labeled examples (class labels), unsupervised ones can figure out by themselves what the most appropriate labels could be.

There are a few kinds of clustering techniques. You can distinguish between them using the guidelines in the following list:

  • Assigning every example to a unique group (partitioning) or to multiple ones (fuzzy clustering)
  • Determining the heuristic — that is, the rule of thumb — that they use to figure out whether an example is part of a group
  • Specifying how they quantify the difference between observations, that is, the so-called distance measure

Most of the time you use partition-clustering techniques (a data point can be part of only one group, so the groups don’t overlap; their membership is distinct) and, among partitioning methods, you use K-means the most. This chapter also covers other useful methods, based on agglomeration and on data density.

Agglomerative methods link data points into clusters based on their distance. Data density approaches take advantage of the idea that groups are very dense and continuous, so if you notice a decrease in density when exploring a part of a group of points, it could mean that you arrived at one of its borders.

tip Because you normally don’t know what you’re looking for, different methods can provide you with different solutions and points of view on the data. The secret of successful clustering is to try as many of the recipes as possible, compare the results, and try to find a reason why you can consider certain observations as a group with respect to others.

remember You don’t have to type the source code for this chapter manually. In fact, it’s a lot easier if you use the downloadable source (see the Introduction for download instructions). The source code for this chapter appears in the P4DS4D; 15; Clustering.ipynb source code file.

Clustering with K-means

K-means is an iterative algorithm that has become very popular in machine learning because of its simplicity, speed, and scalability to large numbers of data points. The K-means algorithm relies on the idea that the data contains a specific number of groups, called clusters. Each data group is scattered around a central point with which its members share some key characteristics.

You can imagine the central point of a cluster, called a centroid, as a sun. The data points distribute around the centroid like planets. Clusters are also expected to separate clearly from each other, so, as groups of points, they are both internally homogeneous and distinct from one another.

remember The K-means algorithm expects to find clusters in your data. Therefore, it will find them even when none exist! It’s important to check inside the groups to determine whether the group is a true gold nugget.

Given such assumptions, all you have to do is to specify the number of groups you expect (you can use a guess or try a number of possible desirable solutions), and the K-means algorithm will look for them, using a heuristic in order to recover the position of the central points.

The cluster centroids should be evident because of their distinct characteristics and positions. Even if you start by randomly guessing where they could be, in the end, after a few corrections, you always find them by using the many data points that gravitate around them.

Understanding centroid-based algorithms

The procedure for finding the centroids is straightforward:

  1. Guess a K number of clusters.

    K centroids are picked randomly from your data points or chosen so that they are placed in your data in very distant positions from each other.

  2. Form the initial clusters.

    All the other points are assigned to their nearest centroid based on the Euclidean distance.

  3. Reiterate the clusters until you notice that your solution doesn’t change anymore.

    You recalculate the centroids as the average of all the points present in each group. All the data points are then reassigned to the groups based on their distance from the new centroids.

The iterative process of assigning cases to the most plausible centroid and then averaging the assigned ones to find a new centroid will slowly shift the centroid position toward the areas where most data points gravitate. The result is that you end up with the true centroid position.
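As a rough sketch of that assign-and-average cycle (a minimal NumPy illustration, not the Scikit-learn implementation; the function name and defaults are invented for this example), the loop looks like this:

import numpy as np

def naive_kmeans(X, k, n_iter=10, seed=1):
    # Pick K initial centroids at random from the data points
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :],
                                   axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the average of its assigned points,
        # keeping the old centroid if a cluster ends up empty
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids

Scikit-learn’s KMeans adds smarter initialization, convergence checks, and multiple restarts, but the cycle it repeats is the same.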

The procedure has only two weak points that you need to consider. First, you choose the initial centroids randomly, which means that you could start from a poor set of positions. As a result, the iterative process will stop at some unlikely solution, such as a centroid ending up in the middle of two groups. To ensure that your solution is the most probable, you have to try the algorithm a few times and track the results. The more often you try, the more likely you are to confirm the right solution. The Python Scikit-learn implementation of K-means does that for you, so you just have to decide how many times you intend to try. (The trade-off is that more restarts produce better results, but each restart consumes valuable time.)

The second weak point is due to the distance that K-means uses, the Euclidean distance, which is the distance between two points on a plane (a concept that you likely studied at school). In a K-means application, each data point is a vector of features, so when comparing the distance of two points, you do the following:

  1. Create a list containing the differences of the elements in the two vectors.
  2. Square all the elements of the difference vector.
  3. Calculate the square root of the summed elements.

You can try a simple example in Python. Pretend that you have two points, A and B, and they have three numeric features. If A and B are the data representation of two persons, their distinguishing features could be measured in height (cm), weight (kg), and age (years), as shown in the following code:

import numpy as np
A = np.array([165, 55, 70])
B = np.array([185, 60, 30])

The following example shows how to calculate the differences between the three elements, square all the resulting elements, and determine the square root of the squared values:

D = (A - B)
D = D**2
D = np.sqrt(np.sum(D))
print(D)

45.0

In the end, the Euclidean distance is really just a big sum. When the variables making up the difference vector differ significantly in scale from each other (in our example, the height could have been expressed in meters), you end up with a distance dominated by the elements with the largest scale. It is very important to rescale the variables so that they use a similar scale before applying the K-means algorithm. You can use a fixed range or a statistical normalization with zero mean and unit variance to achieve this goal.
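For example (a minimal sketch with made-up data; MinMaxScaler and StandardScaler are the Scikit-learn utilities for the two options just mentioned):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features: height (cm), weight (kg), age (years)
people = np.array([[165., 55., 70.],
                   [185., 60., 30.],
                   [170., 90., 45.]])
# Fixed range: squeeze every feature into [0, 1]
ranged = MinMaxScaler().fit_transform(people)
# Statistical normalization: zero mean and unit variance per feature
standardized = StandardScaler().fit_transform(people)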

Another problem that may arise, apart from scale, is correlation between variables, which causes redundancy of information. If two variables are highly correlated, part of their information content is repeated. Replication implies counting the same information more than once in the summation used to calculate the distance. If you’re not aware of the correlation issue, some variables will dominate your distance measure calculation, a situation that may lead to not finding the useful clusters that you want. The solution is to remove the correlation by means of a dimensionality reduction algorithm such as Principal Component Analysis (PCA). Scikit-learn has a function in the preprocessing module that can correctly scale your variables, as well as a class for PCA, but it is up to you to remember to use them before employing K-means and other clustering techniques based on distance measures.

Creating an example with image data

An example with image data demonstrates how to apply the tool and how to get insight from clusters. An ideal example is clustering the handwritten digits dataset provided by the Scikit-learn package. Handwritten numbers naturally differ from each other because each person has a distinct writing style, so there are several ways to write the same digit. The following code shows how to import the image data.

from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
ground_truth = digits.target

The example begins by importing the digits dataset from Scikit-learn and assigning the data to a variable. It then stores the labels in another variable for later verification. The next step is to process the data using a PCA.

from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
pca = PCA(n_components=40)
Cx = pca.fit_transform(scale(X))
print('Explained variance %0.3f'
      % sum(pca.explained_variance_ratio_))

Explained variance 0.951

By applying a PCA on scaled data, the code addresses the problems of scale and correlation. Even though PCA can recreate the same number of variables as in the input data, the example code drops a few using the n_components parameter. The decision to use 40 components, as compared to the original 64 variables, allows the example to retain most of the original information (95 percent of the original variation in data) and simplify the dataset by removing correlation and some noise.

In this example, the PCA-transformed data appears in the Cx variable. After importing the KMeans class, the code defines its main parameters:

  • n_clusters is the K number of centroids to find
  • n_init is the number of times to try the K-means with different starting centroids. The code needs to test the procedure a sufficient number of times, such as 10, as shown here.

    from sklearn.cluster import KMeans
    clustering = KMeans(n_clusters=10, n_init=10,
         random_state=1)
    clustering.fit(Cx)

After setting the parameters, the clustering class is ready for use. You apply its fit() method to the Cx variable, the scaled and dimensionally reduced dataset.

Looking for optimal solutions

As mentioned in the previous section, the example is clustering ten different numbers. It’s time to start looking for a solution with K = 10. The following code compares the clustering result to the ground truth — the true labels — in order to determine whether there is any correspondence.

import numpy as np
import pandas as pd
ms = np.column_stack((ground_truth,clustering.labels_))
df = pd.DataFrame(ms,
     columns = ['Ground truth','Clusters'])
pd.crosstab(df['Ground truth'], df['Clusters'],
     margins=True)

Converting our solution, given by the labels_ attribute of the clustering class, into a pandas DataFrame allows us to apply a cross tabulation and compare the original labels with the labels derived from clustering. You can observe the results in Figure 15-1. Because rows represent ground truth, you can look for numbers whose observations are mostly split among different clusters. These observations are the handwritten examples that are more difficult for K-means to figure out.


Figure 15-1: Cross-tabulation of ground truth and K-means clusters.

Notice how numbers such as 7 or 0 are concentrated into their own cluster, but others, such as 3 and 9, tend to gather together into the same group, cluster 1. From such a discovery, you can deduce that certain handwritten numbers are easy to guess, while others aren’t.

Representing the centroids is also useful. You can use statistics to perform this task. However, because the data is made up of pixels, you can visualize the cases that are nearest to each centroid. The following code shows how to perform this task.

import matplotlib.pyplot as plt
# Distance of every case from every centroid (n_samples x n_clusters)
dist = clustering.transform(Cx)
# For each cluster, pick the case closest to its centroid
for k, img in enumerate(np.argmin(dist, axis=0)):
    cluster = clustering.labels_[img]
    plt.subplot(2, 5, cluster + 1)
    plt.imshow(digits.images[img], cmap='binary',
               interpolation='none')
    plt.title('cl ' + str(cluster))
plt.show()

Observing the depicted centroids can make clear why cluster 1 contains most of the numbers 3 and 9 and how a number 8 could be mistaken for a number 1 in cluster 0. In general, reasoning using the clusters’ centroids is easy because we have reduced thousands of cases to a few clusters to study and compare.

tip Clustering can help you summarize huge quantities of data. It is an effective technique for presenting data to a nontechnical audience and for feeding a supervised algorithm with group variables, thus providing it with concentrated, significant information.
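For instance (a hedged sketch, not part of the chapter’s example), you could one-hot encode the K-means labels found earlier and append them to the reduced dataset as extra features for a supervised model:

import numpy as np
import pandas as pd

# Turn the cluster labels into one-hot columns and append them to Cx
cluster_dummies = pd.get_dummies(clustering.labels_,
                                 prefix='cluster').astype(float)
X_augmented = np.hstack([Cx, cluster_dummies.values])
print(X_augmented.shape)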

Another observation you can make is that even though there are just ten numbers in this example, there are more types of handwritten forms of each, hence the necessity of finding more clusters. Of course, the problem is to determine just how many clusters you need.

You use inertia to measure the viability of a cluster solution. Inertia is the sum of the squared distances between every cluster member and its centroid. If the examples in a group are similar to the centroid, the distances are small and so is the inertia. Inertia as an individual measure reveals little. Moreover, when comparing the inertia of solutions with different numbers of clusters, you notice that the more groups you have, the lower the inertia. What you want to do instead of using inertia directly is to compare the inertia of each cluster solution with that of the previous solution. This comparison provides you with the rate of change, a more interpretable measure. To obtain the inertia rate of change in Python, you create a loop that tries progressive cluster solutions and records their inertia values. Here is a script for the handwritten digit example:

inertia = list()
delta_inertia = list()
for k in range(1,21):
   clustering = KMeans(n_clusters=k, n_init=10,
      random_state=1)
   clustering.fit(Cx)
   if inertia: # So we won't compare the solution k==1
       delta_inertia.append(
           inertia[-1] - clustering.inertia_)
   inertia.append(clustering.inertia_)

The script reads the inertia_ attribute of the clustering class after fitting each solution. The delta_inertia list contains the change in inertia between each solution and the previous one. Here is some code that prints a line graph of the rate of change, as depicted by Figure 15-2.

import matplotlib.pyplot as plt
plt.figure()
plt.plot([k for k in range(2,21)], delta_inertia, 'ko-')
plt.xlabel('Number of clusters')
plt.ylabel('Rate of change of inertia')
plt.show()


Figure 15-2: Rate of change of inertia for solutions up to K=20.

When examining the inertia’s rate of change, look for jumps in the rate itself. If the rate jumps up, it means that adding one more cluster than the previous solution brings much more benefit than expected; if it jumps down instead, you’re likely forcing one more cluster than necessary. All the cluster solutions before a jump down may be good candidates, according to the principle of parsimony (the jump signals added sophistication in the analysis, but the right solutions are usually the simplest). In the example, the first jump downward is at K=14, so the first solution to evaluate is K=13. You can see another interesting jump at K=18, so you should also evaluate K=17, which is a peak.

remember The rate of change in inertia provides you with just a few hints about where good cluster solutions may be. It is up to you to decide which to pick if you need extra insight into the data. If, instead, clustering is just a step in a complex data science project, you can simply pass the entire solution to the next machine-learning algorithm.

Clustering big data

K-means is a way to reduce the complexity of your data by summarizing the many examples in your dataset. To perform this task, you normally load all the data into your computer’s memory, which won’t always be feasible, especially when you are working with big data. Scikit-learn offers an alternative way to apply K-means: MiniBatchKMeans is a variant that can progressively cluster separate chunks of data. In fact, a batch learning procedure usually processes the data part by part. There are only two differences between the standard KMeans class and MiniBatchKMeans:

  • You cannot automatically test different starting centroids unless you try running the analysis again.
  • The analysis starts only when a batch contains at least a minimum number of cases. You set this value with the batch_size parameter; it is usually set to 100, but the more cases each batch contains, the better the result. (A short sketch of this parameter appears after this list.)
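If the whole dataset still fits in memory, you can simply let MiniBatchKMeans manage the batches itself (a minimal sketch; the chunk-by-chunk partial_fit approach demonstrated next is what you need when it doesn’t):

from sklearn.cluster import MiniBatchKMeans

# batch_size sets how many cases each mini-batch contains
mb = MiniBatchKMeans(n_clusters=10, batch_size=100,
                     random_state=1)
mb.fit(Cx)
print(mb.inertia_)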

A simple demonstration on the previous handwritten dataset shows how effective and easy it is to use the MiniBatchKMeans clustering class.

from sklearn.cluster import MiniBatchKMeans
batch_clustering = MiniBatchKMeans(n_clusters=10,
   random_state=1)
batch = 100
guessed_labels = list()
inertia = 0
for row in range(0,len(Cx),batch):
    if row+batch < len(Cx):
        feed = Cx[row:row+batch,:]
    else:
        feed = Cx[row:,:]
    batch_clustering.partial_fit(feed)
    # We have to stack results in a list because
    # MiniBatchKMeans does not keep track of all the
    # batches
    guessed_labels.append(batch_clustering.labels_)
    inertia += batch_clustering.inertia_
# NumPy hstack turns the list of arrays into a single array;
# inspecting guessed_labels shows the assigned clusters
guessed_labels = np.hstack(guessed_labels)
print("Kmeans inertia: %0.1f\n"
      "MiniBatchKmeans inertia: %0.1f"
      % (clustering.inertia_, inertia))

Kmeans inertia: 48591.7
MiniBatchKmeans inertia: 67027.5

This script iterates through the indexes of the previously scaled and PCA-simplified dataset (Cx), creating batches of 100 observations each. Using the partial_fit method, it fits a K-means clustering on each batch, starting from the centroids found by the previous call. The algorithm stops when it runs out of data. It then reports the inertia for a 10-cluster solution, comparing it with the inertia of the same solution from the standard KMeans class. MiniBatchKMeans usually results in a higher inertia than the standard algorithm, so reserve this solution for those times when you cannot work with in-memory datasets.

Performing Hierarchical Clustering

Whereas the K-means algorithm is concerned with centroids, hierarchical (also known as agglomerative) clustering links each data point, by a distance measure, to its nearest neighbor, creating a cluster. Reiterating the linkage process by using one of several possible linkage methods, the algorithm gathers all the available points into a rapidly diminishing number of clusters, until in the end all the points reunite into a single group.

The results, if visualized, closely resemble the biological classifications of living beings that you may have studied in school or seen on posters at the local natural history museum: an upside-down tree whose branches all converge into a trunk. Such a figurative tree is a dendrogram, and you see it used in medical and biological research. The Scikit-learn implementation of agglomerative clustering does not offer the possibility of depicting a dendrogram from your data because such a visualization technique works well with only a few cases, whereas you can expect to work on many examples.
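If you want to look at a dendrogram anyway, SciPy can draw one for a small sample of cases (a minimal sketch, assuming SciPy is installed; the 50-case sample size is arbitrary):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Ward linkage on the first 50 cases only; dendrograms become
# unreadable with more than a few dozen leaves
Z = linkage(Cx[:50], method='ward')
dendrogram(Z)
plt.show()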

Compared to K-means, agglomerative algorithms are more cumbersome and do not scale well to large datasets. Agglomerative algorithms are more suitable for statistical studies (they can be easily found in natural sciences, archeology, and sometimes psychology and economics). These algorithms do offer the advantage of creating a complete range of nested cluster solutions, so you just need to pick the right one for your purpose.

To use agglomerative clustering effectively, you have to know about the different linkage methods (the heuristics for clustering) and the distance metrics. There are three linkage methods:

  • Ward: Tends to look for spherical clusters, very cohesive inside and extremely differentiated from other groups. Another nice characteristic is that the method tends to find clusters of similar size. It works only with the Euclidean distance.
  • Complete: Links clusters using their furthest observations, that is, their most dissimilar data points. Consequently, clusters created using this method tend to be made up of highly similar observations, making the resulting groups quite compact.
  • Average: Links clusters using the average distance between all their pairs of observations, ignoring their boundaries. The method creates larger groups than the complete method. In addition, the clusters can have different sizes and shapes, contrary to Ward’s solutions. Consequently, this average, multipurpose approach sees successful use in the field of biological sciences.

There are also three distance metrics:

  • Euclidean (euclidean or l2): As seen in K-means
  • Manhattan (manhattan or l1): Similar to Euclidean, but the distance is calculated by summing the absolute value of the difference between the dimensions. In a map, if the Euclidean distance is the shortest route between two points, the Manhattan distance implies moving straight, first along one axis and then along the other — as a car in the city would, reaching a destination by driving along city blocks (the distance is also known as city block distance).
  • Cosine (cosine): A good choice when there are too many variables and you worry that some variable may not be significant (just noise). Cosine distance reduces noise by taking the shape of the variables, more than their values, into account. It tends to associate observations that have the same maximum and minimum variables, regardless of their effective value. (A short numeric comparison of the three metrics appears after this list.)
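Here is the promised numeric comparison (a sketch that reuses the two points A and B defined earlier in the chapter):

from sklearn.metrics.pairwise import (euclidean_distances,
                                      manhattan_distances,
                                      cosine_distances)

print(euclidean_distances([A], [B]))  # straight-line distance
print(manhattan_distances([A], [B]))  # sum of absolute differences
print(cosine_distances([A], [B]))     # 1 minus the cosine similarity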

If your dataset doesn’t contain too many observations, it’s worth trying agglomerative clustering with all the combinations of linkage and distance and then comparing the results carefully. In clustering, you rarely already know right answers, and agglomerative clustering can provide you with another useful potential solution. For example, you can recreate the previous analysis with K-means and handwritten digits, using the ward linkage and the Euclidean distance as follows:

from sklearn.cluster import AgglomerativeClustering
# Affinity = {"euclidean", "l1", "l2", "manhattan",
# "cosine"}
# Linkage = {"ward", "complete", "average"}
Hclustering = AgglomerativeClustering(n_clusters=10,
    affinity='euclidean', linkage='ward')
Hclustering.fit(Cx)
ms = np.column_stack((ground_truth,Hclustering.labels_))
df = pd.DataFrame(ms,
    columns = ['Ground truth','Clusters'])
pd.crosstab(df['Ground truth'], df['Clusters'],
margins=True)

The results, in this case, are comparable to K-means, although you may have noticed that completing the analysis using this approach takes longer than using K-means. When working with a large number of observations, the computations for a hierarchical cluster solution may take hours to complete, making this solution less feasible. You can get around the time issue by using two-phase clustering, which is faster and provides you with a hierarchical solution even when you are working with large datasets.

To implement the two-phase clustering solution, you process the original observations using K-means with a large number of clusters. A good rule of thumb is to take the square root of the number of observations and use that figure, but you always have to keep the number of clusters in the range of 100–200 for the second phase, based on hierarchical clustering, to work well. The following example uses 100 clusters.

from sklearn.cluster import KMeans
clustering = KMeans(n_clusters=100, n_init=10,
    random_state=1)
clustering.fit(Cx)

At this point, the tricky part is to keep track of what case has been assigned to what cluster derived from K-means. You can use a dictionary for such a purpose.

Kx = clustering.cluster_centers_
Kx_mapping = {case:cluster for case,
   cluster in enumerate(clustering.labels_)}

The new dataset is Kx, which is made up of the cluster centroids that the K-means algorithm has discovered. You can think of each cluster as a well-represented summary of the original data. If you cluster the summary now, it will be almost the same as clustering the original data.

from sklearn.cluster import AgglomerativeClustering
Hclustering = AgglomerativeClustering(n_clusters=10,
   affinity='cosine', linkage='complete')
Hclustering.fit(Kx)

You now map the results to the centroids you originally used so that you can easily determine whether a hierarchical cluster is made of certain K-means centroids. The result consists of the observations making up the K-means clusters having those centroids.

H_mapping = {case:cluster for case,
   cluster in enumerate(Hclustering.labels_)}
final_mapping = {case:H_mapping[Kx_mapping[case]]
   for case in Kx_mapping}

Now you can evaluate the solution you obtained using a cross tabulation similar to the ones you used before for K-means and hierarchical clustering.

ms = np.column_stack((ground_truth,
 [final_mapping[n] for n in range(max(final_mapping)+1)]))
df = pd.DataFrame(ms,
   columns = ['Ground truth','Clusters'])
pd.crosstab(df['Ground truth'], df['Clusters'],
   margins=True)

The solution you obtain is analogous to the previous solutions. The result proves that this approach is a viable method for handling large datasets, or even big data, reducing them to smaller representations and then operating with less scalable but more varied and precise clustering techniques. The two-phase approach also presents another advantage: it operates well with noisy or outlying data, because the initial K-means phase filters out such problems well and relegates them to separate small clusters.

Moving Beyond the Round-Shaped Clusters: DBScan

Both K-means and agglomerative clustering, especially when you use the Ward linkage criterion, produce cohesive groups, similar to bubbles, equally spread in all directions.

Reality can sometimes produce complex and unsettling results; groups may have strange forms far from the canonical bubble. The Scikit-learn datasets module offers a wide range of mind-teasing shapes that you can’t successfully crunch using either K-means or agglomerative clustering: large circles containing smaller ones, interleaved small circles, and spiraling Swiss roll datasets (named after the sponge cake roll because of how the data points are arranged).

DBScan is another clustering algorithm based on a smart intuition that can solve even the most difficult problems. DBScan relies on the idea that clusters are dense, so it explores the data space in every direction and marks a cluster boundary where the density decreases. Areas of the data space with an insufficient density of points are simply considered empty, and all the points there are noise or outliers, that is, points characterized by unusual or strange values.

DBScan is more complex and requires more running time than K-means (but it is faster than agglomerative clustering). It automatically guesses the number of clusters and points out strange data that doesn’t easily fit into any class. This makes DBScan different from the previous algorithms that try to force every observation into a class.
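To see the difference on one of those mind-teasing shapes (a self-contained sketch, separate from the digits example; the eps value here is chosen just for this illustration), you can compare DBScan and K-means on two interleaved circles:

from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN, KMeans

# Two concentric rings of points with a little noise
circles, _ = make_circles(n_samples=500, factor=0.5, noise=0.05,
                          random_state=1)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(circles)
km_labels = KMeans(n_clusters=2, n_init=10,
                   random_state=1).fit_predict(circles)
# DBScan recovers the two rings, whereas K-means cuts straight
# across both of them
print(set(db_labels), set(km_labels))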

Replicating the handwritten digit clustering requires just a few lines of Python code:

from sklearn.cluster import DBSCAN
DB = DBSCAN(eps=4.35, min_samples=25)
DB.fit(Cx)

Using DBScan, you won’t have to set a K number of expected clusters; the algorithm finds them by itself. The lack of a K number seems to simplify the use of DBScan; in reality, the algorithm requires you to fix two essential parameters, eps and min_samples, in order to work properly:

  • eps: The maximum distance between two observations that allows them to be part of the same neighborhood.
  • min_samples: The minimum number of observations in a neighborhood required for them to form a core point.

The algorithm works by walking around the data and building clusters by linking observations arranged into neighborhoods. A neighborhood is a small cluster of data points all within a distance value of eps. If the number of points in the neighborhood is less than min_samples, then DBScan doesn’t form the neighborhood.

No matter what the shape of the cluster, DBScan links all the neighborhoods together if they are near enough (within the distance value of eps). When no more neighborhoods are within reach, DBScan tries to aggregate even single data points to a group, if they are within eps distance. The data points that aren’t associated with any group are treated as noisy points (too particular to be part of a group).

remember Try many values of eps and min_samples. The resulting clusters may change drastically depending on the values you set for these two parameters.

tip Start with a low number of min_samples. Using a lower number allows many neighborhoods to cluster together. The default number 5 is fine. Then try different numbers for eps, starting from 0.1 upward. Don’t be disappointed if you can’t get a viable result initially — keep trying different combinations.
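A rough way to organize that search is a loop over candidate values (a sketch; the eps range shown is just a starting point for the PCA-reduced digits data, not a recommendation):

import numpy as np
from sklearn.cluster import DBSCAN

for eps in [1.0, 2.0, 3.0, 4.0, 5.0]:
    db = DBSCAN(eps=eps, min_samples=5).fit(Cx)
    labels = db.labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print('eps=%0.1f clusters=%d noise=%d'
          % (eps, n_clusters, n_noise))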

Getting back to the example, after this brief explanation of DBScan details, some data exploration allows you to observe the results from the right point of view. First, count the clusters:

from collections import Counter
print(Counter(DB.labels_))

Counter({-1: 913, 4: 222, 1: 176, 3: 162, 0: 134, 2: 104,
5: 86})

A large number of observations are assigned to the cluster labeled -1, which represents the noise (noise is defined as examples that are too unusual to group). Likely, given the high number of dimensions (40 uncorrelated variables from a PCA analysis) in our data and its high variability (after all, they are handwritten samples), many cases do not naturally fall together into the same group.

At this point, print a visual representation of some example characteristics of the six clusters (as shown in Figure 15-3):

import matplotlib.pyplot as plt
for k, cl in enumerate(np.unique(DB.labels_)):
    if cl >= 0:
        # Pick the first case assigned to this cluster
        example = np.min(np.where(DB.labels_ == cl))
        plt.subplot(2, 3, cl + 1)
        plt.imshow(digits.images[example],
                   cmap='binary', interpolation='none')
        plt.title('cl ' + str(cl))
plt.show()
ms = np.column_stack((ground_truth,DB.labels_))
df = pd.DataFrame(ms,
   columns = ['Ground truth','Clusters'])
pd.crosstab(df['Ground truth'], df['Clusters'],
   margins=True)


Figure 15-3: DBScan clusters’ representatives.

The six examples in Figure 15-3 show the numbers 1, 0, 7, 6, 3, and 4 quite clearly. Also, the cross tabulation of cluster ownership with the real labels indicates that DBScan succeeded in finding the numbers precisely and didn’t mix different numbers together.

remember The strength of DBScan is to provide reliable, consistent clusters. After all, DBScan isn’t forced, as are K-means and agglomerative clustering, to reach a solution with a certain number of clusters, even when such a solution does not exist.
