K-means

K-means is the most widely used flat clustering algorithm. After initializing it with the desired number of clusters, num_clusters, it maintains that number of so-called cluster centroids. Initially, it will pick any num_clusters posts and set the centroids to their feature vector. Then it will go through all other posts and assign them the nearest centroid as their current cluster. Following this, it will move each centroid into the middle of all the vectors of that particular class. This changes, of course, the cluster assignment. Some posts are now nearer to another cluster. So it will update the assignments for those changed posts. This is done as long as the centroids move considerably. After some iterations, the movements will fall below a threshold and we consider clustering to be converged.

Let's play this out with a toy example of posts containing only two words. Each point in the following chart represents one document:

After running one iteration of K-means, that is, taking any two vectors as starting points, assigning the labels to the rest, and updating the cluster centers to be the center point of all points in that cluster, we get the following clustering:

Because the cluster centers moved, we have to reassign the cluster labels and recalculate the cluster centers. After iteration 2, we get the following clustering:

The arrows show the movements of the cluster centers. After ten iterations. as shown in the following screenshot of this example, the cluster centers don't move noticeably anymore (scikit's tolerance threshold is 0.0001 by default):

After the clustering has settled, we just need to note down the cluster centers and their cluster number. For each new document that comes in, we then have to vectorize and compare against all cluster centers. The cluster center with the smallest distance to our new post vector belongs to the cluster we will assign to the new post.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset