Removing less important words

Let's have another look at Post 2. Of its words that are not in the new post, we have most, save, images, and permanently. They are quite different in the overall importance to the post. Words such as most appear very often in all sorts of different contexts and are called stop words. They do not carry as much information and thus should not be weighed as high as words such as images, which don't occur often in different contexts. The best option would be to remove all the words that are so frequent that they do not help us to distinguish between different texts. These words are called stop words.

As this is such a common step in text processing, there is a simple parameter in CountVectorizer to achieve that:

>>> vect_engl = CountVectorizer(min_df=1, stop_words='english')

If you have a clear picture of what kind of stop words you would want to remove, you can also pass a list of them. Setting stop_words to english will use a set of 318 English stop words. To find out which ones, you can use get_stop_words():

>>> sorted(vect_engl.get_stop_words())[0:20]
['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst'] 

The new word list is seven words lighter:

>>> X_train_engl = vect_engl.fit_transform(posts)
>>> num_samples_engl, num_features_engl = X_train_engl.shape
>>> print(vect_engl.get_feature_names())  
['actually', 'capabilities', 'contains', 'data', 'databases', 'images', 'imaging', 'interesting', 'learning', 'machine', 'permanently', 'post', 'provide', 'save', 'storage', 'store', 'stuff', 'toy']

After discarding stop words, we arrive at the following similarity measurement:

    >>> best_post(X_train_engl, new_post_vec_engl, dist_norm)
    
    === Post 0 with dist=1.41:
        'This is a toy post about machine learning. Actually, it contains not much interesting stuff.'
    === Post 1 with dist=0.86:
        'Imaging databases provide storage capabilities.'
    === Post 2 with dist=0.86:
        'Most imaging databases save images permanently.'
    === Post 3 with dist=0.77:
        'Imaging databases store data.'
    === Post 4 with dist=0.77:
        'Imaging databases store data. Imaging databases store data. Imaging databases store data.'
    
    ==> Best post is 3 with dist=0.77  

Post 2 is now on par with Post 1. It has, however, not changed much overall since our posts are kept short for demonstration purposes. It will become vital when we look at real-world data.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset