Clustering posts

We have already noticed one thing: real data is noisy. The newsgroup dataset is no exception. It even contains invalid characters that will result in a UnicodeDecodeError.

We have to tell the vectorizer to ignore them:

>>> vectorizer = StemmedTfidfVectorizer(min_df=10, max_df=0.5,
...              stop_words='english', decode_error='ignore')
>>> vectorized = vectorizer.fit_transform(train_data.data)
>>> num_samples, num_features = vectorized.shape
>>> print("#samples: %d, #features: %d" % (num_samples, num_features))
#samples: 3529, #features: 4712

We now have a pool of 3529 posts, each represented by a feature vector of 4712 dimensions. That is what K-means takes as input. We will fix the number of clusters to 50 for this chapter, and hope you are curious enough to try out different values as an exercise:

>>> num_clusters = 50
>>> from sklearn.cluster import KMeans
>>> km = KMeans(n_clusters=num_clusters, n_init=1, verbose=1, random_state=3)
>>> km.fit(vectorized)

That's it. We provide a random state only so that you can reproduce the same results; in real-world applications, you would not do this. After fitting, we can read the clustering information from the attributes of km. For every vectorized post that has been fit, there is a corresponding integer cluster label in km.labels_:

>>> print("km.labels_=%s" % km.labels_)
km.labels_=[48 23 31 ..., 6 2 22]

>>> print("km.labels_.shape=%s" % km.labels_.shape) km.labels_.shape=3529

The cluster centers can be accessed via km.cluster_centers_.
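It is a dense array with one row per cluster and one column per feature. As a small sketch (not in the original text), we can check its shape and peek at the highest-weighted terms of one center; note that get_feature_names() is named get_feature_names_out() in scikit-learn 1.0 and later:

>>> # one center per cluster, one weight per (stemmed) term
>>> print(km.cluster_centers_.shape)
(50, 4712)
>>> feature_names = vectorizer.get_feature_names()
>>> # the ten terms with the largest weight in the first cluster's center
>>> top_indices = km.cluster_centers_[0].argsort()[::-1][:10]
>>> print([feature_names[i] for i in top_indices])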

In the next section, we will see how we can assign a cluster to a newly arriving post, using km.predict.
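As a teaser (a minimal sketch with a made-up post), prediction boils down to transforming the new text with the very same vectorizer and handing the result to km.predict:

>>> new_post = "I have trouble with my hard disk"  # hypothetical new post
>>> new_post_vec = vectorizer.transform([new_post])
>>> # index of the cluster whose center is closest to the new post
>>> new_post_label = km.predict(new_post_vec)[0]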
