In this recipe, we will learn how to perform word lemmatization.
Stemming is a heuristic process that chops off word suffixes in order to get to the root form of a word. In the previous recipe, we saw that it may end up chopping even well-formed words, that is, removing derivational affixes.
See the following Wikipedia link for the derivational patterns:
http://en.wikipedia.org/wiki/Morphological_derivation#Derivational_patterns
On the other hand, lemmatization uses morphological analysis and a vocabulary to get the lemma of a word. It tries to change only the inflectional endings and to return the base, dictionary form of the word.
See Wikipedia for more information on inflection at http://en.wikipedia.org/wiki/Inflection.
In this recipe, we will use NLTK's WordNetLemmatizer.
To begin with, we will load the necessary libraries. Once again, as we did in the previous recipes, we will prepare a text input in order to demonstrate lemmatization. We will then implement lemmatization in the following way:
# Load libraries
from nltk import stem

#1. A small input to see how the lemmatizer performs.
input_words = ['movies', 'dogs', 'planes', 'flowers', 'flies', 'fries', 'fry',
               'weeks', 'planted', 'running', 'throttle']

#2. Perform lemmatization.
wordnet_lemm = stem.WordNetLemmatizer()
wn_words = [wordnet_lemm.lemmatize(w) for w in input_words]
print(wn_words)
Step 1 is very similar to our stemming recipe; we provide the input. In step 2, we perform the lemmatization. This lemmatizer uses WordNet's built-in morphy function:
https://wordnet.princeton.edu/man/morphy.7WN.html
Let's look at the output from the print statement:
[u'movie', u'dog', u'plane', u'flower', u'fly', u'fry', 'fry', u'week', 'planted', 'running', 'throttle']
The first thing that strikes us is the word movies. You can see that the lemmatizer has got this right, returning movie, whereas Porter and the other algorithms had chopped off the last letter e as well.
Let's look at a small example using the lemmatizer:
>>> wordnet_lemm.lemmatize('running')
'running'
>>> porter.stem('running')
u'run'
>>> lancaster.stem('running')
'run'
>>> snowball.stem('running')
u'run'
The word running should ideally be reduced to run, and our lemmatizer should have got this right. We can see that it has not made any change to running. However, our heuristic-based stemmers have got it right! So what has gone wrong with our lemmatizer?