Performing word lemmatization

In this recipe, we will learn how to perform word lemmatization.

Getting ready

Stemming is a heuristic process that chops off word suffixes in order to get to the root form of a word. In the previous recipe, we saw that it can chop too much: it may remove derivational affixes as well and so mangle perfectly valid words.

See the following Wikipedia link for the derivational patterns:

http://en.wikipedia.org/wiki/Morphological_derivation#Derivational_patterns

Lemmatization, on the other hand, uses morphological analysis and a vocabulary to get the lemma of a word. It tries to change only the inflectional endings and return the base word as it appears in a dictionary.

See Wikipedia for more information on inflection at http://en.wikipedia.org/wiki/Inflection.
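
The contrast between the two approaches can be sketched with a toy example. This is only an illustration, not how NLTK's stemmers or WordNetLemmatizer are implemented; the suffix list, the tiny vocabulary, and the function names toy_stem and toy_lemmatize are all hypothetical and chosen just for this sketch.

```python
# Toy illustration only: NOT how NLTK's stemmers or WordNetLemmatizer work.
# SUFFIXES and TOY_LEMMAS are hypothetical, hand-picked values.
SUFFIXES = ('ies', 'es', 's')

TOY_LEMMAS = {'movies': 'movie', 'flies': 'fly', 'dogs': 'dog'}


def toy_stem(word):
    """Chop the first matching suffix without checking that a real word is left."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a vocabulary; return it unchanged if it is unknown."""
    return TOY_LEMMAS.get(word, word)


for w in ['movies', 'flies', 'dogs']:
    print('%s -> stem: %s, lemma: %s' % (w, toy_stem(w), toy_lemmatize(w)))
```

The rule-based function happily returns fragments that are not words, while the lookup can only return entries that exist in its vocabulary. A real lemmatizer combines such a vocabulary with morphological rules instead of a fixed table.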

In this recipe, we will use NLTK's WordNetLemmatizer.

How to do it…

To begin with, we will load the necessary libraries. As in the previous recipes, we will prepare a small text input in order to demonstrate lemmatization. We will then implement lemmatization in the following way:

# Load Libraries
# The WordNet corpus must be available locally, for example via nltk.download('wordnet').
from nltk import stem

#1. Small input to figure out how the lemmatizer performs.
input_words = ['movies','dogs','planes','flowers','flies','fries','fry','weeks', 'planted','running','throttle']

#2. Perform lemmatization.
wordnet_lemm = stem.WordNetLemmatizer()
wn_words = [wordnet_lemm.lemmatize(w) for w in input_words]
print(wn_words)

How it works…

Step 1 is very similar to our stemming recipe: we provide the input. In step 2, we perform the lemmatization. This lemmatizer uses WordNet's built-in morphy function:

https://wordnet.princeton.edu/man/morphy.7WN.html

Let's look at the output from the print statement:

[u'movie', u'dog', u'plane', u'flower', u'fly', u'fry', 'fry', u'week', 'planted', 'running', 'throttle']

The first thing that stands out is the word movie. The lemmatizer has got this right, whereas Porter and the other algorithms chopped off the last letter, e.

There's more…

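Let's look at a small example that compares the lemmatizer with the stemmers (porter, lancaster, and snowball are assumed to be the stemmer objects from the stemming recipe):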

>>> wordnet_lemm.lemmatize('running')
'running'
>>> porter.stem('running')
u'run'
>>> lancaster.stem('running')
'run'
>>> snowball.stem('running')
u'run'

The word running should ideally be reduced to run, and we would expect our lemmatizer to get it right. Instead, it has left running unchanged, while our heuristic-based stemmers have all produced run. So, what has gone wrong with our lemmatizer?

Tip

By default, the lemmatizer assumes that the input is a noun; this can be rectified by passing the POS tag of the word to our lemmatizer, as follows:

>>> wordnet_lemm.lemmatize('running','v')
u'run'
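
In practice, the POS tags often come from a tagger rather than being typed by hand. The following is a minimal sketch, assuming the tags are Penn Treebank tags (the kind NLTK's pos_tag function produces); the helpers to_wordnet_pos and lemmatize_tagged are hypothetical names introduced here and are not part of NLTK.

```python
# Hypothetical helpers (not part of NLTK). They assume Penn Treebank tags
# such as 'NNS', 'VBG', 'JJ', or 'RB' as input.
def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to the single-letter POS used by the lemmatizer."""
    if treebank_tag.startswith('J'):
        return 'a'   # adjective
    if treebank_tag.startswith('V'):
        return 'v'   # verb
    if treebank_tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # noun, which is also the lemmatizer's default


def lemmatize_tagged(lemmatizer, tagged_words):
    """Lemmatize (word, tag) pairs using any object with a lemmatize(word, pos) method."""
    return [lemmatizer.lemmatize(word, to_wordnet_pos(tag))
            for word, tag in tagged_words]
```

With the wordnet_lemm object from this recipe, a call such as lemmatize_tagged(wordnet_lemm, [('running', 'VBG'), ('movies', 'NNS')]) passes 'v' for running and 'n' for movies, so each word is looked up under the part of speech its tag indicates.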

See also

  • Performing Tokenization recipe in Chapter 3, Analyzing Data - Explore & Wrangle