Developing a stemmer for non-English languages

Polyglot provides trained Morfessor models that split tokens into morphemes. Morfessor comes out of the Morpho project, whose goal is to develop unsupervised, data-driven methods for discovering morphemes, the smallest meaning-bearing units of a language. Morphemes play an important role in natural language processing, particularly in the automatic recognition and generation of language. The Morfessor models in Polyglot were trained on the 50,000 most frequent tokens of each language, taken from Polyglot's vocabulary dictionaries.

Let's see the code for obtaining the language table using polyglot:

from polyglot.downloader import downloader
print(downloader.supported_languages_table("morph2"))

The preceding code prints a table of the languages for which morph2 models are available.

The necessary models can be downloaded using the following code:

%%bash
polyglot download morph2.en morph2.ar

[polyglot_data] Downloading package morph2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.en is already up-to-date!
[polyglot_data] Downloading package morph2.ar to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.ar is already up-to-date!

The following example prints the morphemes that polyglot finds for a few English tokens:

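from polyglot.text import Text, Word

tokens = ["unconditional", "precooked", "impossible", "painful", "entered"]
for s in tokens:
    # Wrap each token in a Word so that polyglot segments it as English
    s = Word(s, language="en")
    # Print the token left-aligned in 20 columns, followed by its morphemes
    print("{:<20}{}".format(s, s.morphemes))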

unconditional       ['un', 'conditional']
precooked           ['pre', 'cook', 'ed']
impossible          ['im', 'possible']
painful             ['pain', 'ful']
entered             ['enter', 'ed']
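
Morpheme lists like these can serve as the starting point for a crude stemmer. The following is a minimal sketch, not polyglot's own method: the helper stem_from_morphemes is hypothetical, and it assumes that the longest morpheme of a word is its stem. The segmentations are copied from the output above so that the example runs without polyglot installed.

```python
# Hypothetical helper for illustration; it is not part of polyglot.
def stem_from_morphemes(morphemes):
    """Return the longest morpheme as a rough stand-in for the stem."""
    if not morphemes:
        return ""
    # max() keeps the first of several equally long morphemes.
    return max(morphemes, key=len)


# Segmentations copied from the polyglot output shown above.
segmentations = {
    "unconditional": ["un", "conditional"],
    "precooked": ["pre", "cook", "ed"],
    "impossible": ["im", "possible"],
    "painful": ["pain", "ful"],
    "entered": ["enter", "ed"],
}

# Assumption: the longest morpheme is the stem.
stems = {word: stem_from_morphemes(parts) for word, parts in segmentations.items()}
for word, stem in stems.items():
    print("{:<20}{}".format(word, stem))
```

With polyglot installed, the same helper could be given Word(token, language="en").morphemes directly. The longest-morpheme assumption breaks down whenever an affix is longer than the root, so treat the result as a rough approximation rather than a linguistic stem.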

If the text has not been tokenized properly, morphological analysis can be used to split it back into its original constituents:

sent = "Ihopeyoufindthebookinteresting"
para = Text(sent)
para.language = "en"
para.morphemes

WordList(['I', 'hope', 'you', 'find', 'the', 'book', 'interesting'])
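
To get a feel for how unspaced text can be split back into units, here is a toy sketch. It is not how Morfessor works: Morfessor learns its segmentation statistically from data, whereas this example assumes a fixed, hand-written vocabulary and uses dynamic programming to find a split with the fewest pieces. The function name segment and the vocabulary are assumptions made for illustration.

```python
def segment(text, vocabulary):
    """Split unspaced text into vocabulary entries, preferring fewer pieces.

    Returns None if the text cannot be fully covered by vocabulary entries.
    """
    # best[i] holds the shortest known segmentation of text[:i], or None.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


# Toy vocabulary, assumed for illustration only.
vocabulary = {"I", "hope", "you", "find", "the", "book", "interesting"}

result = segment("Ihopeyoufindthebookinteresting", vocabulary)
print(result)
```

A fixed vocabulary only handles words it already contains, which is why a model trained on frequent tokens is the more practical choice for real text.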