Performing word lemmatization

In this recipe, we will learn how to perform word lemmatization.

Getting ready

Stemming is a heuristic process that chops off word suffixes in order to get to the root form of a word. In the previous recipe, we saw that it may end up chopping too much, that is, removing derivational affixes as well.

See the following Wikipedia link for the derivational patterns:

http://en.wikipedia.org/wiki/Morphological_derivation#Derivational_patterns

Lemmatization, on the other hand, uses morphological analysis and a vocabulary to get the lemma of a word. It tries to remove only the inflectional endings and return the base word as found in a dictionary.

See Wikipedia for more information on inflection at http://en.wikipedia.org/wiki/Inflection.

In this recipe, we will use NLTK's WordNetLemmatizer.

How to do it…

To begin with, we will load the necessary libraries. As in the previous recipes, we will prepare a small text input in order to demonstrate lemmatization. We will then implement lemmatization in the following way:

# Load libraries
# Note: the WordNet corpus must be installed, for example via nltk.download('wordnet').
from nltk import stem

# 1. A small input to see how the lemmatizer performs.
input_words = ['movies','dogs','planes','flowers','flies','fries','fry','weeks', 'planted','running','throttle']

# 2. Perform lemmatization.
wordnet_lemm = stem.WordNetLemmatizer()
wn_words = [wordnet_lemm.lemmatize(w) for w in input_words]
print(wn_words)

How it works…

Step 1 is very similar to our stemming recipe: we provide the input. In step 2, we perform the lemmatization. This lemmatizer uses WordNet's built-in morphy function:

https://wordnet.princeton.edu/man/morphy.7WN.html
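
To make the idea behind morphy more concrete, here is a minimal sketch of a rule-plus-dictionary lemmatizer. It is a toy illustration, not NLTK's or WordNet's actual implementation: the names toy_lemmatize, TOY_VOCAB, and TOY_EXCEPTIONS are hypothetical, and the tiny hand-made vocabulary stands in for WordNet. The suffix rules follow the detachment rules listed on the morphy man page.

```python
# Toy illustration only: a tiny hand-made vocabulary stands in for WordNet.
TOY_VOCAB = {
    'n': {'movie', 'dog', 'fly', 'fry', 'week', 'running', 'throttle'},
    'v': {'run', 'plant', 'fly', 'fry', 'throttle'},
}

# Suffix detachment rules as (ending, replacement) pairs per part of speech.
DETACHMENT_RULES = {
    'n': [('s', ''), ('ses', 's'), ('xes', 'x'), ('zes', 'z'),
          ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y')],
    'v': [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''),
          ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')],
}

# Toy exception list: irregular forms are looked up before any rule is tried.
TOY_EXCEPTIONS = {'v': {'running': 'run', 'ran': 'run'}}


def toy_lemmatize(word, pos='n'):
    """Return a lemma for word, or word itself if nothing in the vocabulary matches."""
    exceptions = TOY_EXCEPTIONS.get(pos, {})
    if word in exceptions:
        return exceptions[word]
    vocab = TOY_VOCAB[pos]
    if word in vocab:
        return word
    for ending, replacement in DETACHMENT_RULES[pos]:
        if word.endswith(ending):
            candidate = word[:len(word) - len(ending)] + replacement
            # Unlike a stemmer, accept a candidate only if it is a known word.
            if candidate in vocab:
                return candidate
    return word


print(toy_lemmatize('movies'))         # treated as a noun by default
print(toy_lemmatize('running'))        # noun lookup leaves it unchanged
print(toy_lemmatize('running', 'v'))   # verb lookup finds the base form
```

The key difference from a stemmer is the vocabulary check: a candidate produced by stripping a suffix is returned only if it is a known word for the given part of speech, which is why the result depends on the pos argument.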

Let's look at the output from the print statement:

[u'movie', u'dog', u'plane', u'flower', u'fly', u'fry', 'fry', u'week', 'planted', 'running', 'throttle']

The first thing that stands out is the word movie: the lemmatizer gets it right, whereas Porter and the other stemming algorithms chopped off the final letter e.

There's more…

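Let's look at a small example using the lemmatizer (assuming porter, lancaster, and snowball are the stemmer objects created in the stemming recipe):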

>>> wordnet_lemm.lemmatize('running')
'running'
>>> porter.stem('running')
u'run'
>>> lancaster.stem('running')
'run'
>>> snowball.stem('running')
u'run'

Ideally, running should be reduced to run, and we would expect our lemmatizer to get this right. Instead, it leaves running unchanged, while our heuristic-based stemmers get it right! So, what has gone wrong with our lemmatizer?

Tip

By default, the lemmatizer assumes that the input is a noun; this can be rectified by passing the part-of-speech (POS) tag of the word to our lemmatizer, as follows:

>>> wordnet_lemm.lemmatize('running','v')
u'run'
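
In practice, POS tags often come from a tagger that emits Penn Treebank tags such as VBG or NNS, while the lemmatizer expects a WordNet letter ('n', 'v', 'a', or 'r'). The following is a rough sketch of one way to bridge the two; the helper name penn_to_wordnet_pos is hypothetical, and mapping on the first letter of the tag is a simplifying assumption rather than anything NLTK prescribes.

```python
def penn_to_wordnet_pos(penn_tag, default='n'):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter.

    Rough heuristic: only the first letter of the tag is inspected, and
    anything unrecognized falls back to the noun default.
    """
    first_letter_map = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}
    return first_letter_map.get(penn_tag[:1].upper(), default)


print(penn_to_wordnet_pos('VBG'))   # a verb tag
print(penn_to_wordnet_pos('NNS'))   # a plural noun tag
print(penn_to_wordnet_pos('DT'))    # no WordNet counterpart, so the default is used
```

The returned letter could then be passed as the second argument to the lemmatizer, for example wordnet_lemm.lemmatize('running', penn_to_wordnet_pos('VBG')).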
