Stemming the words

In this recipe, we will see how to stem words, that is, reduce them to their base forms.

Getting ready

Standardizing text is a different beast, and we need different tools to tame it. In this section, we will look into how we can convert words to their base forms in order to bring consistency to our processing. We will start with the traditional approaches, stemming and lemmatization. English grammar dictates how certain words are used in sentences. For example, perform, performing, and performs indicate the same action; they appear in different forms depending on the grammar rules.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze

Let's look at how we can perform word stemming using NLTK in Python. NLTK provides us with a rich set of stemmer classes that make stemming pretty easy:

>>> import nltk.stem
>>> dir(nltk.stem)
['ISRIStemmer', 'LancasterStemmer', 'PorterStemmer', 'RSLPStemmer', 'RegexpStemmer', 'SnowballStemmer', 'StemmerI', 'WordNetLemmatizer', '__builtins__', '__doc__', '__file__', '__name__', '__package__', '__path__', 'api', 'isri', 'lancaster', 'porter', 'regexp', 'rslp', 'snowball', 'wordnet']
>>>  

We can see the classes and submodules available in the module, and for our interest, we have the following stemmers:

  • PorterStemmer – the Porter stemmer
  • LancasterStemmer – the Lancaster stemmer
  • SnowballStemmer – the Snowball stemmer

Porter is the most commonly used stemmer. The algorithm is not very aggressive when reducing words to their root forms.

Snowball is an improvement over Porter, and it is also faster in terms of computation time.

Lancaster is the most aggressive stemmer of the three. With Porter and Snowball, the final word tokens are still readable by humans, but with Lancaster they often are not. It is the fastest of the trio.
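To make the difference in aggressiveness concrete, here is a minimal side-by-side sketch that runs a few words through all three stemmers (the full recipe below uses a larger word list):

```python
# Compare the three NLTK stemmers on a few sample words.
from nltk import stem

words = ['movies', 'flowers', 'running']

porter = stem.porter.PorterStemmer()
lancaster = stem.lancaster.LancasterStemmer()
snowball = stem.snowball.EnglishStemmer()

for word in words:
    # Note how Lancaster truncates more aggressively, e.g.
    # 'flowers' becomes 'flow' rather than Porter's 'flower'.
    print(f"{word:<10} porter={porter.stem(word):<8} "
          f"lancaster={lancaster.stem(word):<8} "
          f"snowball={snowball.stem(word):<8}")
```

All three agree on a simple inflection such as running (all produce run), but diverge on words such as movies and flowers, where Lancaster cuts deeper into the word.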

In this recipe, we will use some of them to see how the stemming of words can be performed.

How to do it…

To begin with, let's load the necessary libraries and declare the dataset against which we would want to demonstrate stemming:

# Load libraries
from nltk import stem

# 1. A small input list to see how the three stemmers perform
input_words = ['movies', 'dogs', 'planes', 'flowers', 'flies', 'fries',
               'fry', 'weeks', 'planted', 'running', 'throttle']

Let's jump into the different stemming algorithms, as follows:

# 2. Porter stemming
porter = stem.porter.PorterStemmer()
p_words = [porter.stem(w) for w in input_words]
print(p_words)

# 3. Lancaster stemming
lancaster = stem.lancaster.LancasterStemmer()
l_words = [lancaster.stem(w) for w in input_words]
print(l_words)

# 4. Snowball stemming
snowball = stem.snowball.EnglishStemmer()
s_words = [snowball.stem(w) for w in input_words]
print(s_words)

# 5. WordNet lemmatization, for comparison with the stemmers
wordnet_lemm = stem.WordNetLemmatizer()
wn_words = [wordnet_lemm.lemmatize(w) for w in input_words]
print(wn_words)

How it works…

In step 1, we import the stem module from nltk and create a list of words that we want to stem. If you observe carefully, the words have been chosen to cover different suffixes, including s, ies, ed, ing, and so on. Additionally, some words, such as throttle and fry, are already in their root form. The idea is to see how each stemming algorithm treats them.

Steps 2, 3, and 4 are very similar: we invoke the Porter, Lancaster, and Snowball stemmers on the input and print the output. In each case, we use a list comprehension to apply the stemmer to every input word. Let's look at the printed output to understand the effect of stemming:

['movi', 'dog', 'plane', 'flower', 'fli', 'fri', 'fri', 'week', 'plant', 'run', 'throttl']

This is the output from step 2, where Porter stemming was applied to our input words. We can see that the words with the suffixes s, ies, ed, and ing have been reduced to their root forms:

  • movies – movi
  • dogs – dog
  • planes – plane
  • running – run, and so on

It's interesting to note that throttle is changed to throttl: Porter strips the final e even though the word is already in its root form.

In step 3, we print the output of the Lancaster stemmer, which is as follows:

['movy', 'dog', 'plan', 'flow', 'fli', 'fri', 'fry', 'week', 'plant', 'run', 'throttle']

This time, the word throttle has been left as it is, but note what has happened to movies: it has become movy. Lancaster's aggressiveness also shows in planes and flowers, which are reduced to plan and flow and thus conflated with the unrelated words plan and flow.

Similarly, let's look at the output produced by the snowball stemmer in step 4:

['movi', 'dog', 'plane', 'flower', 'fli', 'fri', 'fri', 'week', 'plant', 'run', 'throttl']

For this input, the output is identical to that of the Porter stemmer.

There's more…

All three algorithms are fairly involved, and going into their details is beyond the scope of this book. I recommend looking to the web for more details on them.

For details of the porter and snowball stemmers, refer to the following link:

http://snowball.tartarus.org/algorithms/porter/stemmer.html

See also

  • List Comprehension recipe in Chapter 1, Using Python for Data Science