Stemming

One thing is still missing: we count variants of the same word as different words. Post 2, for instance, contains both imaging and images. It makes sense to count them together, since they refer to the same concept.

We need a function that reduces each word to its word stem. scikit-learn does not include a stemmer by default. The Natural Language Toolkit (NLTK), a free software toolkit that we can download, provides a stemmer that we can easily plug into CountVectorizer.
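
As a rough sketch of how such a stemmer could be plugged in, the snippet below wraps the analyzer of CountVectorizer so that every token is stemmed before it is counted. It assumes NLTK's SnowballStemmer for English; the class name StemmedCountVectorizer and the two sample posts are made up for illustration and are not taken from the text above.

```python
import nltk.stem
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: the English Snowball stemmer; other NLTK stemmers would plug in the same way.
english_stemmer = nltk.stem.SnowballStemmer("english")


class StemmedCountVectorizer(CountVectorizer):
    """Hypothetical helper: a CountVectorizer that stems each token before counting."""

    def build_analyzer(self):
        # Reuse the parent's analyzer (lowercasing, tokenizing, stop-word removal)
        # and apply the stemmer to every token it yields.
        analyzer = super().build_analyzer()
        return lambda doc: [english_stemmer.stem(word) for word in analyzer(doc)]


# Made-up sample posts, for illustration only.
posts = [
    "Imaging databases store data.",
    "Images are stored in imaging databases.",
]

vectorizer = StemmedCountVectorizer(min_df=1, stop_words="english")
X = vectorizer.fit_transform(posts)

# Both variants now map to the same vocabulary entry.
print(english_stemmer.stem("imaging"), english_stemmer.stem("images"))
print(sorted(vectorizer.vocabulary_))
```

Overriding build_analyzer keeps the preprocessing that CountVectorizer already performs and only adds the stemming step at the end, so imaging and images end up in a single feature column.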
