Understanding stemmer

Stemming is the process of obtaining a stem from a word by removing its affixes. For example, given the word raining, a stemmer returns the stem rain by removing the suffix -ing. To improve the accuracy of information retrieval, search engines often apply stemming and store the stems as index terms. Search engines treat words with the same stem as synonyms, which is a kind of query expansion known as conflation. Martin Porter designed a well-known stemming algorithm known as the Porter stemming algorithm, which replaces or removes a number of common suffixes found in English words. To perform stemming in NLTK, we instantiate the PorterStemmer class and then call its stem() method.

Let's see the code for stemming using the PorterStemmer class in NLTK:

>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> stemmerporter = PorterStemmer()
>>> stemmerporter.stem('working')
'work'
>>> stemmerporter.stem('happiness')
'happi'
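
To make the indexing idea concrete, here is a minimal sketch of an index keyed by stems, so that a query for one word form also finds documents containing other forms. The names toy_stem, build_index, and search are hypothetical, and toy_stem is a deliberately crude stand-in rather than the Porter algorithm; a real stemmer's stem() method could be passed in its place.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(documents, stem=toy_stem):
    """Map each stem to the set of document ids that contain a matching word."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[stem(word)].add(doc_id)
    return index


def search(index, query, stem=toy_stem):
    """Look up a single-word query by its stem."""
    return index.get(stem(query.lower()), set())


documents = {
    1: "it was raining all day",
    2: "rain is expected tomorrow",
    3: "the sun is shining",
}
index = build_index(documents)
print(sorted(search(index, "rains")))  # [1, 2]
```

Because raining, rain, and rains all reduce to the same stem, the query matches documents 1 and 2 even though neither contains the exact word rains.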

The PorterStemmer class encodes rules for many regular English word forms and suffixes. Stemming takes place in a series of steps that transform the word into a shorter form; the resulting stem shares the root meaning of the word but is not always a dictionary word (for example, happi for happiness). The StemmerI interface defines the stem() method, and all the stemmers inherit from the StemmerI interface. The inheritance diagram is depicted here:

[Figure: inheritance diagram of the stemmer classes that implement StemmerI]

Another stemming algorithm, the Lancaster stemming algorithm, was developed at Lancaster University. As with the PorterStemmer class, NLTK provides the LancasterStemmer class to implement Lancaster stemming. One of the major differences between the two algorithms is that Lancaster stemming is more aggressive than Porter stemming, so it tends to produce shorter stems.

Let's consider the following code that depicts Lancaster stemming in NLTK:

>>> import nltk
>>> from nltk.stem import LancasterStemmer
>>> stemmerlan=LancasterStemmer()
>>> stemmerlan.stem('working')
'work'
>>> stemmerlan.stem('happiness')
'happy'
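
Since the two stemmers apply different rule sets, it can help to run them side by side on the same words. The following is a small sketch that does so; compare_stems is a hypothetical helper name, the word list is arbitrary, and the exact stems printed may vary with the NLTK version installed.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

# One instance of each stemmer, keyed by a readable label.
stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
}


def compare_stems(words):
    """Return {word: {stemmer_label: stem}} for every word."""
    return {
        word: {label: stemmer.stem(word) for label, stemmer in stemmers.items()}
        for word in words
    }


# The word list is arbitrary; substitute your own vocabulary.
for word, stems in compare_stems(["working", "happiness", "maximum"]).items():
    print(word, stems)
```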

We can also build our own stemmer in NLTK using RegexpStemmer. It accepts a regular expression and removes the part of a word that matches it, typically a prefix or a suffix.

Let's consider an example of stemming using RegexpStemmer in NLTK:

>>> import nltk
>>> from nltk.stem import RegexpStemmer
>>> stemmerregexp=RegexpStemmer('ing')
>>> stemmerregexp.stem('working')
'work'
>>> stemmerregexp.stem('happiness')
'happiness'
>>> stemmerregexp.stem('pairing')
'pair'

We can use RegexpStemmer in the cases in which stemming cannot be performed using PorterStemmer and LancasterStemmer.
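
The idea behind regular-expression stemming can be sketched with the standard re module alone. This is an illustration under assumptions, not NLTK's implementation: regexp_stem and min_length are hypothetical names, and the sketch assumes the pattern is removed wherever it matches, which is how re.sub behaves. It shows why anchoring the pattern with $ matters when only a suffix should be stripped.

```python
import re


def regexp_stem(word, pattern, min_length=0):
    """Remove every match of `pattern` from `word`, skipping short words."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


print(regexp_stem("working", "ing"))              # work
print(regexp_stem("singing", "ing"))              # s (both matches removed)
print(regexp_stem("singing", "ing$"))             # sing (suffix only)
print(regexp_stem("sing", "ing$", min_length=5))  # sing (too short to stem)
```

An unanchored pattern such as ing strips every occurrence, so a pattern like ing$ is usually the safer choice for suffix removal.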

SnowballStemmer supports stemming in 13 languages other than English. To use it, we first create an instance for the language in which stemming is to be performed and then call its stem() method.

Consider the following example of performing stemming in Spanish and French in NLTK using SnowballStemmer:

>>> import nltk
>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer.languages
('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
>>> spanishstemmer=SnowballStemmer('spanish')
>>> spanishstemmer.stem('comiendo')
'com'
>>> frenchstemmer=SnowballStemmer('french')
>>> frenchstemmer.stem('manger')
'mang'

The nltk.stem.api module contains the StemmerI class, which declares the stem() method.

Consider the following code from NLTK, which defines the interface that every stemmer implements:

class StemmerI(object):
    """
    An interface for removing morphological affixes from tokens, a process known as stemming.
    """

    def stem(self, token):
        """
        Remove affixes from token and return the stem.
        """
        raise NotImplementedError()
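
To show how a stemmer plugs into this interface, here is a minimal sketch of a custom subclass. The StemmerI class below is a local stand-in so that the example runs without NLTK, and SuffixStemmer and stem_all are hypothetical names introduced only for illustration.

```python
class StemmerI:
    """Stand-in for the interface shown above, so the sketch runs without NLTK."""

    def stem(self, token):
        raise NotImplementedError()


class SuffixStemmer(StemmerI):
    """Hypothetical stemmer that strips the first matching suffix from a list."""

    def __init__(self, suffixes):
        # Try longer suffixes first so 'ness' wins over 's'.
        self.suffixes = sorted(suffixes, key=len, reverse=True)

    def stem(self, token):
        for suffix in self.suffixes:
            # Never reduce a token to the empty string.
            if token.endswith(suffix) and len(token) > len(suffix):
                return token[: -len(suffix)]
        return token


def stem_all(stemmer, tokens):
    """Works with any object that implements stem()."""
    return [stemmer.stem(token) for token in tokens]


stemmer = SuffixStemmer(["s", "ing", "ness"])
print(stem_all(stemmer, ["working", "happiness", "cats", "run"]))
# ['work', 'happi', 'cat', 'run']
```

Because stem_all relies only on the stem() method, any stemmer that follows the interface could be passed to it unchanged.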

Let's see the code used to perform stemming using multiple stemmers:

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer

def obtain_tokens():
    # Read the sample file and split its text into word tokens.
    with open('/home/p/NLTK/sample1.txt') as f:
        tok = nltk.word_tokenize(f.read())
    return tok

def stemming(filtered):
    # Stem every token with the Porter stemmer.
    stemmer = PorterStemmer()
    stem = []
    for x in filtered:
        stem.append(stemmer.stem(x))
    return stem

if __name__ == "__main__":
    tok = obtain_tokens()
    print("tokens is %s" % tok)
    stem_tokens = stemming(tok)
    print("After stemming is %s" % stem_tokens)
    # Pair each original token with its stem.
    res = dict(zip(tok, stem_tokens))
    print("{tok:stemmed}=%s" % res)