Part-of-speech (POS) tagging is one of the many tasks in NLP. It is the process of assigning a part-of-speech tag to each word in a sentence. The tag identifies whether a word is a noun, verb, adjective, and so on. Part-of-speech tagging has numerous applications, such as information retrieval, machine translation, named entity recognition (NER), and language analysis.
This chapter will include the following topics:
Part-of-speech tagging is the process of assigning a category tag (for example, noun, verb, or adjective) to individual tokens in a sentence. In NLTK, taggers are found in the nltk.tag package, and they inherit from the TaggerI base class.
Consider an example to implement POS tagging for a given sentence in NLTK:
>>> import nltk
>>> text1 = nltk.word_tokenize("It is a pleasant day today")
>>> nltk.pos_tag(text1)
[('It', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('pleasant', 'JJ'), ('day', 'NN'), ('today', 'NN')]
Every subclass of TaggerI must implement the tag() method. To evaluate a tagger, TaggerI provides the evaluate() method. Taggers can also be combined into a back-off chain, so that if one tagger cannot tag a token, the next tagger in the chain is tried.
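A back-off chain and its evaluation can be sketched as follows. This is a minimal illustration, not the chapter's own example: the regular-expression patterns and the tiny gold standard are made up for the demonstration, and the sketch assumes that recent NLTK releases expose the evaluation as accuracy() while older ones only offer evaluate().

```python
from nltk.tag import DefaultTagger, RegexpTagger

# Last resort in the chain: tag everything as a noun.
fallback = DefaultTagger('NN')

# Illustrative patterns (an assumption, not an exhaustive rule set).
patterns = [
    (r'^\d+$', 'CD'),    # digits only: cardinal number
    (r'.*ing$', 'VBG'),  # ends in -ing: gerund
    (r'.*ly$', 'RB'),    # ends in -ly: adverb
]

# RegexpTagger tags what it can and hands the rest to the fallback.
tagger = RegexpTagger(patterns, backoff=fallback)
tagged = tagger.tag(['running', 'quickly', '42', 'dog'])
print(tagged)

# A hand-made gold standard, used only to show the evaluation call.
gold = [[('running', 'VBG'), ('dog', 'NN')],
        [('42', 'CD'), ('quickly', 'RB')]]

# Prefer accuracy() where it exists, otherwise fall back to evaluate().
score_fn = getattr(tagger, 'accuracy', None) or tagger.evaluate
score = score_fn(gold)
print(score)
```

Here 'dog' matches none of the patterns, so the RegexpTagger returns no tag for it and the DefaultTagger supplies NN.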
Let's see the list of available tags provided by Penn Treebank (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html):
CC - Coordinating conjunction
CD - Cardinal number
DT - Determiner
EX - Existential there
FW - Foreign word
IN - Preposition or subordinating conjunction
JJ - Adjective
JJR - Adjective, comparative
JJS - Adjective, superlative
LS - List item marker
MD - Modal
NN - Noun, singular or mass
NNS - Noun, plural
NNP - Proper noun, singular
NNPS - Proper noun, plural
PDT - Predeterminer
POS - Possessive ending
PRP - Personal pronoun
PRP$ - Possessive pronoun (prolog version PRP-S)
RB - Adverb
RBR - Adverb, comparative
RBS - Adverb, superlative
RP - Particle
SYM - Symbol
TO - to
UH - Interjection
VB - Verb, base form
VBD - Verb, past tense
VBG - Verb, gerund or present participle
VBN - Verb, past participle
VBP - Verb, non-3rd person singular present
VBZ - Verb, 3rd person singular present
WDT - Wh-determiner
WP - Wh-pronoun
WP$ - Possessive wh-pronoun (prolog version WP-S)
WRB - Wh-adverb
NLTK can display information about a tag. The following code provides information about the NNS tag:
>>> nltk.help.upenn_tagset('NNS')
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
Let's see another example, in which the tag is specified as a regular expression:
>>> nltk.help.upenn_tagset('VB.*')
VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...
VBD: verb, past tense
    dipped pleaded swiped regummed soaked tidied convened halted registered
    cushioned exacted snubbed strode aimed adopted belied figgered
    speculated wore appreciated contemplated ...
VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...
VBN: verb, past participle
    multihulled dilapidated aerosolized chaired languished panelized used
    experimented flourished imitated reunifed factored condensed sheared
    unsettled primed dubbed desired ...
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
    appear tend stray glisten obtain comprise detest tease attract
    emphasize mold postpone sever return wag ...
VBZ: verb, present tense, 3rd person singular
    bases reconstructs marks mixes displeases seals carps weaves snatches
    slumps stretches authorizes smolders pictures emerges stockpiles
    seduces fizzes uses bolsters slaps speaks pleads ...
The preceding code gives information about all the verb tags, that is, every tag that begins with VB.
Let's look at an example that shows how POS tagging helps with word sense disambiguation:
>>> import nltk
>>> text = nltk.word_tokenize("I cannot bear the pain of bear")
>>> nltk.pos_tag(text)
[('I', 'PRP'), ('can', 'MD'), ('not', 'RB'), ('bear', 'VB'), ('the', 'DT'), ('pain', 'NN'), ('of', 'IN'), ('bear', 'NN')]
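In the preceding sentence, the first occurrence of bear is a verb meaning to tolerate, and the second occurrence is a noun referring to the animal. The tagger distinguishes the two uses by assigning VB to the first and NN to the second.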
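One way to spot such ambiguous words programmatically is to group the tags by word. The sketch below reuses the tagged output shown above as a literal list, so it does not call the tagger itself; using ConditionalFreqDist for the grouping is a choice made for this illustration.

```python
from nltk import ConditionalFreqDist

# Tagged output copied from the example above.
tagged = [('I', 'PRP'), ('can', 'MD'), ('not', 'RB'), ('bear', 'VB'),
          ('the', 'DT'), ('pain', 'NN'), ('of', 'IN'), ('bear', 'NN')]

# Condition on the word, count each tag it received.
cfd = ConditionalFreqDist(tagged)

# Keep only the words that were given more than one distinct tag.
ambiguous = {word: sorted(cfd[word])
             for word in cfd.conditions()
             if len(cfd[word]) > 1}
print(ambiguous)
```

Only bear appears with more than one tag, so it is the only entry in the result.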
In NLTK, a tagged token is represented as a tuple consisting of a token and its tag. We can create this tuple in NLTK using the str2tuple() function:
>>> import nltk
>>> taggedword = nltk.tag.str2tuple('bear/NN')
>>> taggedword
('bear', 'NN')
>>> taggedword[0]
'bear'
>>> taggedword[1]
'NN'
Let's consider an example in which sequences of tuples can be generated from the given text:
>>> import nltk
>>> sentence = '''The/DT sacred/VBN Ganga/NNP flows/VBZ in/IN this/DT region/NN ./.
... This/DT is/VBZ a/DT pilgrimage/NN ./.
... People/NNP from/IN all/DT over/IN the/DT country/NN visit/NN this/DT place/NN ./. '''
>>> [nltk.tag.str2tuple(t) for t in sentence.split()]
[('The', 'DT'), ('sacred', 'VBN'), ('Ganga', 'NNP'), ('flows', 'VBZ'), ('in', 'IN'), ('this', 'DT'), ('region', 'NN'), ('.', '.'), ('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('pilgrimage', 'NN'), ('.', '.'), ('People', 'NNP'), ('from', 'IN'), ('all', 'DT'), ('over', 'IN'), ('the', 'DT'), ('country', 'NN'), ('visit', 'NN'), ('this', 'DT'), ('place', 'NN'), ('.', '.')]
Now, consider the following code, which converts a (word, POS tag) tuple back into a single word/tag string:
>>> import nltk
>>> taggedtok = ('bear', 'NN')
>>> from nltk.tag.util import tuple2str
>>> tuple2str(taggedtok)
'bear/NN'
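The two helpers are inverses of each other for well-formed input, which the following sketch checks with a round trip. The short tagged sentence is made up for the illustration.

```python
from nltk.tag.util import str2tuple, tuple2str

# A small hand-tagged sentence (illustrative data).
tagged = [('The', 'DT'), ('Ganga', 'NNP'), ('flows', 'VBZ'), ('.', '.')]

# Serialize each (word, tag) tuple as word/tag and join with spaces.
line = ' '.join(tuple2str(pair) for pair in tagged)
print(line)

# Parse the string back into tuples.
restored = [str2tuple(chunk) for chunk in line.split()]
print(restored == tagged)

# A chunk without a separator yields a tag of None.
untagged = str2tuple('hello')
print(untagged)
```

Note that str2tuple() upper-cases the tag it parses, so the round trip is only exact when the tags are already in upper case.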
Let's count how often each tag occurs in the Treebank corpus, using the universal tagset:
>>> import nltk
>>> from nltk.corpus import treebank
>>> treebank_tagged = treebank.tagged_words(tagset='universal')
>>> tag = nltk.FreqDist(tag for (word, tag) in treebank_tagged)
>>> tag.most_common()
[('NOUN', 28867), ('VERB', 13564), ('.', 11715), ('ADP', 9857), ('DET', 8725), ('X', 6613), ('ADJ', 6397), ('NUM', 3546), ('PRT', 3219), ('ADV', 3171), ('PRON', 2737), ('CONJ', 2265)]
Consider the following code, which ranks the tags by how often they occur immediately before a noun tag:
>>> import nltk
>>> from nltk.corpus import treebank
>>> treebank_tagged = treebank.tagged_words(tagset='universal')
>>> tagpairs = nltk.bigrams(treebank_tagged)
>>> preceders_noun = [x[1] for (x, y) in tagpairs if y[1] == 'NOUN']
>>> freqdist = nltk.FreqDist(preceders_noun)
>>> [tag for (tag, _) in freqdist.most_common()]
['NOUN', 'DET', 'ADJ', 'ADP', '.', 'VERB', 'NUM', 'PRT', 'CONJ', 'PRON', 'X', 'ADV']
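The same bigram technique can be traced by hand on a small input. The sketch below replaces the Treebank corpus with a short, hand-tagged sentence (an assumption made so that the example runs without downloading any corpus data).

```python
from nltk import FreqDist, bigrams

# "the dog saw a big cat and a bird ." with universal-style tags
# (hand-tagged for this illustration).
tagged = [('the', 'DET'), ('dog', 'NOUN'), ('saw', 'VERB'),
          ('a', 'DET'), ('big', 'ADJ'), ('cat', 'NOUN'),
          ('and', 'CONJ'), ('a', 'DET'), ('bird', 'NOUN'), ('.', '.')]

# Pair every token with its successor and keep the tag of the first
# element whenever the second element is a noun.
preceders = [first[1] for (first, second) in bigrams(tagged)
             if second[1] == 'NOUN']

# Count how often each tag precedes a noun.
counts = FreqDist(preceders).most_common()
print(counts)
```

In this sentence, a determiner precedes a noun twice (the dog, a bird) and an adjective once (big cat).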
We can also assign POS tags to tokens using Python dictionaries. The following code illustrates the creation of a mapping from a word to its POS tag using a dictionary:
>>> tag = {}
>>> tag
{}
>>> tag['beautiful'] = 'ADJ'
>>> tag
{'beautiful': 'ADJ'}
>>> tag['boy'] = 'N'
>>> tag['read'] = 'V'
>>> tag['generously'] = 'ADV'
>>> tag
{'beautiful': 'ADJ', 'boy': 'N', 'read': 'V', 'generously': 'ADV'}
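Such a dictionary can also drive an NLTK tagger directly. The following sketch passes it to UnigramTagger through its model argument and adds a DefaultTagger for words that are missing from the dictionary; the choice of 'N' as the fallback tag and the extra word 'books' are assumptions made for the illustration.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# The word-to-tag dictionary built above.
lexicon = {'beautiful': 'ADJ', 'boy': 'N', 'read': 'V', 'generously': 'ADV'}

# Look each word up in the dictionary; unknown words fall back to 'N'.
tagger = UnigramTagger(model=lexicon, backoff=DefaultTagger('N'))

result = tagger.tag(['beautiful', 'boy', 'read', 'books'])
print(result)
```

The word 'books' is not in the dictionary, so its tag comes from the DefaultTagger.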
Default tagging assigns the same part-of-speech tag to every token. DefaultTagger is a subclass of SequentialBackoffTagger. Every subclass of SequentialBackoffTagger must implement the choose_tag() method. This method takes the following arguments: tokens (the list of tokens), index (the position of the token to be tagged), and history (the tags already assigned to the preceding tokens).
The hierarchy of tagger is depicted as follows:
Let's now see the following code, which depicts the working of DefaultTagger:
>>> import nltk
>>> from nltk.tag import DefaultTagger
>>> tag = DefaultTagger('NN')
>>> tag.tag(['Beautiful', 'morning'])
[('Beautiful', 'NN'), ('morning', 'NN')]
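To illustrate how choose_tag() fits into SequentialBackoffTagger, here is a sketch of a custom subclass. CapitalizedTagger is a hypothetical class invented for this example, not part of NLTK, and its capitalization rule is deliberately naive.

```python
from nltk.tag import DefaultTagger
from nltk.tag.sequential import SequentialBackoffTagger


class CapitalizedTagger(SequentialBackoffTagger):
    """Hypothetical tagger: capitalized words become NNP."""

    def choose_tag(self, tokens, index, history):
        # tokens: all tokens in the sentence
        # index: position of the token to tag
        # history: tags already chosen for tokens[:index] (unused here)
        if tokens[index][:1].isupper():
            return 'NNP'
        # Returning None hands the token to the back-off tagger.
        return None


tagger = CapitalizedTagger(backoff=DefaultTagger('NN'))
result = tagger.tag(['Ganga', 'flows', 'north'])
print(result)
```

DefaultTagger follows the same contract: its choose_tag() ignores all three arguments and always returns the tag it was constructed with.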
We can convert a tagged sentence into an untagged sentence with the help of nltk.tag.untag(). This function strips the tags from the individual tokens and returns only the words.
Let's see the code for untagging a sentence:
>>> from nltk.tag import untag
>>> untag([('beautiful', 'NN'), ('morning', 'NN')])
['beautiful', 'morning']