Another look at noise

We should not expect perfect clustering in the sense that posts from the same newsgroup (for example, comp.graphics) are also clustered together. An example will give us a quick impression of the noise that we have to expect. For the sake of simplicity, we will focus on one of the shorter posts:

>>> post_group = zip(train_data.data, train_data.target)
>>> # (length, text, newsgroup name) for every post
>>> all_posts = [(len(post[0]), post[0], train_data.target_names[post[1]])
...              for post in post_group]
>>> graphics = sorted([post for post in all_posts if post[2]=='comp.graphics'])
>>> print(graphics[5])
(245, 'From: [email protected]: test....(sorry)\nOrganization: The University of Birmingham, United Kingdom\nLines: 1\nNNTP-Posting-Host: ibm3090.bham.ac.uk<...snip...>',
'comp.graphics')

For this post, there is no real indication that it belongs to comp.graphics, considering only the wording that is left after the preprocessing step:

>>> noise_post = graphics[5][1]
>>> analyzer = vectorizer.build_analyzer()
>>> print(list(analyzer(noise_post)))
['situnaya', 'ibm3090', 'bham', 'ac', 'uk', 'subject', 'test', 
'sorri', 'organ', 'univers', 'birmingham', 'unit', 'kingdom', 'line',
'nntp', 'post', 'host', 'ibm3090', 'bham', 'ac', 'uk']

We received these words after applying tokenization, lowercasing, and stop word removal; stems such as sorri and univers show that stemming has been applied as well. If we also remove the words that min_df and max_df filter out during fit_transform, it gets even worse:

>>> useful = set(analyzer(noise_post)).intersection(
...     vectorizer.get_feature_names())
>>> print(sorted(useful))
['ac', 'birmingham', 'host', 'kingdom', 'nntp', 'sorri', 'test',
'uk', 'unit', 'univers']
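To make the effect of this filtering concrete, here is a minimal, self-contained sketch of document-frequency filtering. It is not the vectorizer's actual implementation: the toy corpus, the stop word list, the threshold values, and the helper names analyze and filter_by_df are all assumptions made for illustration, and stemming is left out. The only convention borrowed from scikit-learn is that an integer min_df is an absolute document count and a float max_df is a fraction of all documents.

```python
import re
from collections import Counter

# Hypothetical stop word list and toy corpus, for illustration only.
STOP_WORDS = {"this", "is", "the", "of", "in", "from", "only"}

DOCS = [
    "This is a test post from Birmingham",
    "Rendering a test image in 3D graphics",
    "The graphics card is a test of patience",
    "Graphics rendering of the test scene",
]


def analyze(text):
    """Lowercase, tokenize into words of two or more characters,
    and drop stop words (no stemming in this sketch)."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def filter_by_df(docs, min_df, max_df):
    """Return the terms that occur in at least min_df documents
    (absolute count) and in at most max_df of all documents (fraction)."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it occurs.
        doc_freq.update(set(analyze(doc)))
    max_count = max_df * len(docs)
    return {term for term, count in doc_freq.items()
            if min_df <= count <= max_count}


# Assumed thresholds: keep terms seen in 2 or more posts,
# but in no more than 75% of them.
vocabulary = filter_by_df(DOCS, min_df=2, max_df=0.75)

# A short, off-topic post in the spirit of the one above.
new_post = "Sorry, this is only a test from Birmingham"
useful_terms = set(analyze(new_post)).intersection(vocabulary)

print(sorted(vocabulary))
print(sorted(useful_terms))
```

In this toy corpus, test is dropped because it appears in every document, and words such as birmingham are dropped because they appear only once, so the short post shares no term at all with the surviving vocabulary.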

Most of the words occur frequently in other posts as well, as we can see from the IDF scores. Remember that the higher the TF-IDF, the more discriminative a term is for a given post. As IDF is a multiplicative factor here, a low IDF signals that the term is not of great value in general:

>>> for term in sorted(useful):
...     print('IDF(%-10s) = %.2f' % (term,                           
...           vectorizer._tfidf.idf_[vectorizer.vocabulary_[term]]))
    
IDF(ac        ) = 3.51
IDF(birmingham) = 6.77
IDF(host      ) = 1.74
IDF(kingdom   ) = 6.68
IDF(nntp      ) = 1.77
IDF(sorri     ) = 4.14
IDF(test      ) = 3.83
IDF(uk        ) = 3.70
IDF(unit      ) = 4.42
IDF(univers   ) = 1.91
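To get a feeling for what these numbers mean, the following sketch computes a smoothed IDF by hand. It assumes the vectorizer uses scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of posts and df(t) the number of posts containing term t. If the vectorizer was configured differently, the numbers will not match. The function names and the corpus size of 1,000 are hypothetical.

```python
import math


def smooth_idf(n_docs, doc_freq):
    """Smoothed IDF, assuming scikit-learn's default (smooth_idf=True):
    ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def approx_doc_fraction(idf):
    """Invert the formula above: the (smoothed) share of documents
    that contain a term with the given IDF."""
    return math.exp(1 - idf)


# Hypothetical corpus of 1,000 posts: a term found in every post
# gets the minimum IDF of 1.0, and rarer terms score higher.
for doc_freq in (1000, 500, 10, 1):
    print('df=%4d  IDF=%.2f' % (doc_freq, smooth_idf(1000, doc_freq)))

# Reading some of the scores listed above in the other direction.
for term, idf in (('host', 1.74), ('nntp', 1.77), ('birmingham', 6.77)):
    print('%-10s occurs in about %.1f%% of the posts'
          % (term, 100 * approx_doc_fraction(idf)))
```

Under that assumption, low-IDF terms such as host and nntp appear in a large share of all posts, whereas birmingham appears in only a tiny fraction of them.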

So the terms with the highest discriminative power, birmingham and kingdom, clearly have little to do with computer graphics, and the same is true of the terms with lower IDF scores. Understandably, posts from different newsgroups will be clustered together.

For our goal, however, this is no big deal, as we are only interested in cutting down the number of posts that we have to compare a new post to. After all, the particular newsgroup that our training data came from is of no special interest.
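The idea of cutting down the number of comparisons can be sketched as follows: first find the cluster whose center is closest to the new post, then compare the new post only to the posts in that cluster. This is a simplified illustration with made-up two-dimensional vectors, not the vectorized posts from the dataset. The centroids, labels, and function names are hypothetical stand-ins for whatever the clustering step produced.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_in_cluster(new_vec, centroids, post_vecs, labels):
    """Return (cluster, candidate indices, index of the closest post).

    Only the posts assigned to the nearest cluster are compared
    to new_vec; all other posts are skipped. Assumes the nearest
    cluster contains at least one post.
    """
    # Step 1: pick the cluster whose centroid is closest to the new post.
    cluster = min(range(len(centroids)),
                  key=lambda c: euclidean(new_vec, centroids[c]))
    # Step 2: restrict the search to the posts in that cluster.
    candidates = [i for i, label in enumerate(labels) if label == cluster]
    best = min(candidates, key=lambda i: euclidean(new_vec, post_vecs[i]))
    return cluster, candidates, best


# Hypothetical toy data: two clusters in a two-dimensional space.
centroids = [(0.0, 0.0), (10.0, 10.0)]
post_vecs = [(0.0, 1.0), (1.0, 0.0), (9.0, 10.0), (10.0, 9.0), (11.0, 11.0)]
labels = [0, 0, 1, 1, 1]

cluster, candidates, best = nearest_in_cluster(
    (9.5, 9.8), centroids, post_vecs, labels)
print(cluster, candidates, best)
```

Here the new vector is compared to three posts instead of five. Whether those posts originally came from the same newsgroup plays no role in the lookup.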
