Clustering

Finally, we have our vectors, which we believe capture the posts sufficiently well. Not surprisingly, there are many ways to group them. One way to classify clustering algorithms is to distinguish between flat and hierarchical clustering.

Flat clustering divides the posts into a set of clusters without relating the clusters to each other. The goal is simply to come up with a partition such that all posts in one cluster are most similar to each other while being dissimilar from the posts in all other clusters. Many flat clustering algorithms require the number of clusters to be specified up front.
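As a minimal sketch of flat clustering, here is K-means on a hypothetical set of toy two-dimensional vectors (standing in for our post vectors); note that the number of clusters must be given up front via the n_clusters parameter:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two visually obvious groups of 2-D points
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Flat clustering: we must decide on the number of clusters up front
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(labels)  # one cluster label per point
```

Points in the same group receive the same label, while the two groups end up in different clusters; the partition itself carries no information about how the clusters relate to each other.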

In hierarchical clustering, the number of clusters does not have to be specified. Instead, hierarchical clustering creates a hierarchy of clusters. While similar posts are grouped into one cluster, similar clusters are again grouped into one uber-cluster. In the agglomerative clustering approach, for instance, this is done recursively until only one cluster is left that contains everything. In this hierarchy, one can then choose the desired number of clusters after the fact. However, this comes at the cost of lower efficiency.
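The agglomerative approach can be sketched with SciPy's hierarchical clustering routines on hypothetical toy data: the full merge hierarchy is built first, and only afterwards do we cut it into a desired number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: three small groups of 2-D points
X = np.array([[1.0, 1.0], [1.1, 0.9],
              [5.0, 5.0], [5.1, 4.9],
              [9.0, 1.0], [9.1, 0.9]])

# Agglomerative step: recursively merge the closest clusters
# until only one cluster containing everything is left
Z = linkage(X, method='ward')

# Choose the number of clusters after the fact by cutting the hierarchy
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```

Because the whole hierarchy is computed regardless of where we later cut it, this flexibility comes at the cost of doing more work than a flat method that targets one fixed partition.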

Scikit-learn provides a wide range of clustering approaches in the sklearn.cluster package. You can get a quick overview of the advantages and drawbacks of each of them at http://scikit-learn.org/stable/modules/clustering.html.

In the following sections, we will use K-means, a flat clustering method.
