Stemming

Stemming is a text preprocessing task that reduces related variants of a word (such as walking) to a common base form (walk), since the variants share the same core meaning. One of the most basic stemming transformations is reducing a plural to its singular form: apples is reduced to apple, for example. This is a very simple transformation; more complex ones also exist. We will use the popular Porter stemmer, developed by Martin Porter, to illustrate this, as shown in the following code:

>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> stemmer.stem("enjoying")
'enjoy'
>>> stemmer.stem("enjoys")
'enjoy'
>>> stemmer.stem("enjoyable")
'enjoy'

In this case, stemming has reduced the verb forms (enjoying, enjoys) and the adjective form (enjoyable) of a word to a single base form, enjoy. The Porter algorithm used by the stemmer applies a series of language-specific rules (in this case, for English) to arrive at the stem. One of these rules removes suffixes such as ing from the word, as seen in the preceding example. Stemming does not always produce a stem that is a word in its own right, as shown in the following example:

>>> stemmer.stem("variation")
'variat'
>>> stemmer.stem("variate")
'variat'
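To give a flavour of what such suffix rules look like, the following is a minimal sketch of the plural-handling rules from step 1a of Porter's original algorithm description. It is not NLTK's implementation, which may apply additional refinements, and the names STEP_1A_RULES and porter_step_1a are our own, chosen purely for illustration:

```python
# Step 1a of the original Porter algorithm, as (suffix, replacement) pairs.
# Rules are tried in order and the first matching suffix wins.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni (not an English word)
    ("ss", "ss"),    # caress   -> caress (shields the word from the next rule)
    ("s", ""),       # cats     -> cat
]


def porter_step_1a(word):
    """Apply the first matching rule; return the word unchanged if none applies."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "caress", "cats", "apples"]:
    print(w, "->", porter_step_1a(w))
```

The ss rule is listed before the bare s rule so that a word such as caress does not lose its final s. The full algorithm chains several further steps of this kind, each with its own conditions.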

In the variation example, variat itself is not an English word. The nltk.stem.snowball module includes Snowball stemmers for other languages, such as French, Spanish, and German. Snowball is a small language for writing stemming algorithms, which makes it possible to define stemming rules for different languages in a standard way. Just as with tokenizers, we can create custom stemmers using regular expressions, as follows:

>>> from nltk.stem import RegexpStemmer
>>> regexp_stemmer = RegexpStemmer("able$|ing$", min=4)
>>> regexp_stemmer.stem("flyable")
'fly'
>>> regexp_stemmer.stem("flying")
'fly'

The regex pattern, able$|ing$, removes the suffixes able and ing if either is present at the end of a word. The min argument specifies the minimum length a word must have for the stemmer to act on it; shorter words are returned unchanged.
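The role of the length threshold is easiest to see in a small stand-alone sketch. The function below uses only Python's re module and is meant to mirror how we understand RegexpStemmer to behave; the name regex_stem and its parameters are hypothetical and not part of NLTK:

```python
import re


def regex_stem(word, pattern=r"able$|ing$", min_length=4):
    """Strip a suffix matching `pattern`, but only from words that are
    at least `min_length` characters long; shorter words pass through."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


print(regex_stem("flyable"))  # 'fly'
print(regex_stem("flying"))   # 'fly'
print(regex_stem("ing"))      # 'ing': only 3 characters, so left untouched
print(regex_stem("sing"))     # 's': 4 characters, so the suffix is stripped
```

The last call shows that, in this sketch, the threshold is checked against the input word rather than the result, so a very short stem can still be produced. Raising the threshold, or using a more restrictive pattern, is one way to guard against that.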

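Returning to the Snowball stemmers mentioned earlier, the following sketch shows one way they might be used for languages other than English. It assumes NLTK is installed; the sample words are our own, and we deliberately print the stems rather than state them, since the exact output depends on each language's rules:

```python
from nltk.stem.snowball import SnowballStemmer

# The languages attribute lists the names accepted by the constructor.
print(SnowballStemmer.languages)

# Illustrative sample words, one per language (chosen by us, not by NLTK).
samples = {
    "french": "marchant",
    "spanish": "caminando",
    "german": "laufend",
}

stems = {}
for language, word in samples.items():
    stemmer = SnowballStemmer(language)
    stems[language] = stemmer.stem(word)
    print(language, word, "->", stems[language])
```

Each stemmer exposes the same stem method as PorterStemmer, so the rest of a preprocessing pipeline does not need to change when the language does.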