A unigram is a single word. A unigram tagger uses a single token to determine the part-of-speech tag.
Training of UnigramTagger can be performed by providing it with a list of tagged sentences at initialization. Let's see the following NLTK code, which performs UnigramTagger training:
>>> import nltk
>>> from nltk.tag import UnigramTagger
>>> from nltk.corpus import treebank
>>> training = treebank.tagged_sents()[:7000]
>>> unitagger = UnigramTagger(training)
>>> treebank.sents()[0]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
>>> unitagger.tag(treebank.sents()[0])
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
In the preceding code, we have performed training using the first 7000 sentences of the Treebank corpus.
UnigramTagger follows this inheritance hierarchy: SequentialBackoffTagger → ContextTagger → NgramTagger → UnigramTagger.
To evaluate UnigramTagger, let's see the following code, which calculates the accuracy:
>>> import nltk
>>> from nltk.corpus import treebank
>>> from nltk.tag import UnigramTagger
>>> training = treebank.tagged_sents()[:7000]
>>> unitagger = UnigramTagger(training)
>>> testing = treebank.tagged_sents()[2000:]
>>> unitagger.evaluate(testing)
0.963400866227395
So, it is about 96% accurate at POS tagging. (Note that the testing slice [2000:] overlaps the training slice here, so this figure is optimistic.)
Since UnigramTagger inherits from ContextTagger, we can map a context key to a specific tag. Consider the following example of tagging using UnigramTagger:
>>> import nltk
>>> from nltk.corpus import treebank
>>> from nltk.tag import UnigramTagger
>>> unitag = UnigramTagger(model={'Vinken': 'NN'})
>>> unitag.tag(treebank.sents()[0])
[('Pierre', None), ('Vinken', 'NN'), (',', None), ('61', None), ('years', None), ('old', None), (',', None), ('will', None), ('join', None), ('the', None), ('board', None), ('as', None), ('a', None), ('nonexecutive', None), ('director', None), ('Nov.', None), ('29', None), ('.', None)]
Here, in the preceding code, UnigramTagger tags only 'Vinken' with the 'NN' tag; every other word receives None, because the context model we provided contains an entry only for 'Vinken' and no other words are included in it.
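Since the context model is an ordinary dictionary, it can be combined with a backoff tagger so that out-of-model words are not left as None. The following is a minimal sketch; the short token list is invented for illustration:

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Words missing from the hand-written context model fall through
# to the backoff tagger instead of being tagged None.
backoff = DefaultTagger('NN')
unitag = UnigramTagger(model={'Vinken': 'NNP'}, backoff=backoff)

print(unitag.tag(['Pierre', 'Vinken', 'will', 'join']))
# [('Pierre', 'NN'), ('Vinken', 'NNP'), ('will', 'NN'), ('join', 'NN')]
```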
In a given context, ContextTagger uses the frequency of each tag to decide on the most probable one. To impose a minimum frequency threshold, we can pass a specific cutoff value. Let's see the code that evaluates UnigramTagger with a cutoff:
>>> unitagger = UnigramTagger(training, cutoff=5)
>>> unitagger.evaluate(testing)
0.7974218445306567
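The effect of the cutoff can be seen on a tiny hand-made training set (the sentences below are invented for illustration): a context is kept only if its most frequent tag occurs more than cutoff times.

```python
from nltk.tag import UnigramTagger

# 'the' occurs three times; every other word occurs once.
train = [[('the', 'DT'), ('cat', 'NN')],
         [('the', 'DT'), ('dog', 'NN')],
         [('the', 'DT'), ('ran', 'VBD')]]

# With cutoff=1, a context is kept only if its best tag was seen
# more than once, so 'cat' is left untagged.
tagger = UnigramTagger(train, cutoff=1)
print(tagger.tag(['the', 'cat']))
# [('the', 'DT'), ('cat', None)]
```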
Backoff tagging may be defined as a feature of SequentialBackoffTagger. Taggers are chained together so that if one tagger is unable to tag a token, the token is passed on to the next tagger in the chain.
Let's see the following code, which uses backoff tagging. Here, DefaultTagger and UnigramTagger are used to tag a token; if one of them is unable to tag a word, the next tagger is used:
>>> import nltk
>>> from nltk.tag import UnigramTagger
>>> from nltk.tag import DefaultTagger
>>> from nltk.corpus import treebank
>>> testing = treebank.tagged_sents()[2000:]
>>> training = treebank.tagged_sents()[:7000]
>>> tag1 = DefaultTagger('NN')
>>> tag2 = UnigramTagger(training, backoff=tag1)
>>> tag2.evaluate(testing)
0.963400866227395
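The chain can also be extended to more than two taggers. Here is a minimal sketch with an invented two-sentence training set standing in for the treebank slice:

```python
from nltk.tag import BigramTagger, DefaultTagger, UnigramTagger

# Tiny made-up training set standing in for the treebank slice.
train = [[('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')],
         [('the', 'DT'), ('dog', 'NN'), ('sat', 'VBD')]]

t0 = DefaultTagger('NN')               # last resort: tag everything 'NN'
t1 = UnigramTagger(train, backoff=t0)  # unseen words fall through to t0
t2 = BigramTagger(train, backoff=t1)   # unseen bigram contexts fall through to t1

# 'aardvark' is unknown, so it is handled by the backoff chain.
print(t2.tag(['the', 'aardvark', 'sat']))
# [('the', 'DT'), ('aardvark', 'NN'), ('sat', 'VBD')]
```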
The subclasses of NgramTagger are UnigramTagger, BigramTagger, and TrigramTagger. BigramTagger makes use of the previous tag as contextual information; TrigramTagger uses the previous two tags as contextual information.
Consider the following code, which illustrates the implementation of BigramTagger:
>>> import nltk
>>> from nltk.tag import BigramTagger
>>> from nltk.corpus import treebank
>>> training_1 = treebank.tagged_sents()[:7000]
>>> bigramtagger = BigramTagger(training_1)
>>> treebank.sents()[0]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
>>> bigramtagger.tag(treebank.sents()[0])
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
>>> testing_1 = treebank.tagged_sents()[2000:]
>>> bigramtagger.evaluate(testing_1)
0.922942709936983
Let's see another piece of code for BigramTagger and TrigramTagger:
>>> import nltk
>>> from nltk.tag import BigramTagger, TrigramTagger
>>> from nltk.corpus import treebank
>>> testing = treebank.tagged_sents()[2000:]
>>> training = treebank.tagged_sents()[:7000]
>>> bigramtag = BigramTagger(training)
>>> bigramtag.evaluate(testing)
0.9190426339881356
>>> trigramtag = TrigramTagger(training)
>>> trigramtag.evaluate(testing)
0.9101956195989079
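On their own, BigramTagger and TrigramTagger score lower than the unigram backoff combination above, because many bigram and trigram contexts never occur in training. The usual remedy is to chain them, most specific first; a sketch on an invented mini-corpus:

```python
from nltk.tag import BigramTagger, TrigramTagger, UnigramTagger

# Made-up training data standing in for the treebank slice.
train = [[('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')],
         [('the', 'DT'), ('dog', 'NN'), ('sat', 'VBD')]]

t1 = UnigramTagger(train)
t2 = BigramTagger(train, backoff=t1)
t3 = TrigramTagger(train, backoff=t2)

# A token in an unseen trigram context falls back to the
# shorter-context models rather than being tagged None.
print(t3.tag(['the', 'dog', 'sat']))
# [('the', 'DT'), ('dog', 'NN'), ('sat', 'VBD')]
```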
NgramTagger can also be used to generate a tagger for n greater than three. Let's see the following NLTK code, which builds a quadgram tagger:
>>> import nltk
>>> from nltk.corpus import treebank
>>> from nltk import NgramTagger
>>> testing = treebank.tagged_sents()[2000:]
>>> training = treebank.tagged_sents()[:7000]
>>> quadgramtag = NgramTagger(4, training)
>>> quadgramtag.evaluate(testing)
0.9429767842847466
AffixTagger is also a ContextTagger, one that makes use of a prefix or suffix of the word as the contextual information. Let's see the following code, which uses AffixTagger with its default settings (a three-character suffix):
>>> import nltk
>>> from nltk.corpus import treebank
>>> from nltk.tag import AffixTagger
>>> testing = treebank.tagged_sents()[2000:]
>>> training = treebank.tagged_sents()[:7000]
>>> affixtag = AffixTagger(training)
>>> affixtag.evaluate(testing)
0.29043249789601167
Let's see the following code, which learns from four-character prefixes:
>>> import nltk
>>> from nltk.tag import AffixTagger
>>> from nltk.corpus import treebank
>>> testing = treebank.tagged_sents()[2000:]
>>> training = treebank.tagged_sents()[:7000]
>>> prefixtag = AffixTagger(training, affix_length=4)
>>> prefixtag.evaluate(testing)
0.21103516226368618
Consider the following code, which learns from three-character suffixes:
>>> import nltk
>>> from nltk.tag import AffixTagger
>>> from nltk.corpus import treebank
>>> testing = treebank.tagged_sents()[2000:]
>>> training = treebank.tagged_sents()[:7000]
>>> suffixtag = AffixTagger(training, affix_length=-3)
>>> suffixtag.evaluate(testing)
0.29043249789601167
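The sign of affix_length selects between prefixes and suffixes: a positive value takes the first n characters of each word as context, a negative value takes the last n. This lets the tagger generalize to unseen words with a familiar ending, as a sketch on an invented training set shows:

```python
from nltk.tag import AffixTagger

# Made-up training data: '-ing' forms tagged VBG, '-ly' adverbs RB.
train = [[('running', 'VBG'), ('quickly', 'RB')],
         [('jumping', 'VBG'), ('slowly', 'RB')]]

# affix_length=-3: the context is the last three characters of the word.
suffix_tagger = AffixTagger(train, affix_length=-3)

# 'walking' never occurs in training, but its suffix 'ing' does.
print(suffix_tagger.tag(['walking']))
# [('walking', 'VBG')]
```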
Consider the following NLTK code, which combines several affix taggers in a backoff chain:
>>> import nltk
>>> from nltk.tag import AffixTagger
>>> from nltk.corpus import treebank
>>> testing = treebank.tagged_sents()[2000:]
>>> training = treebank.tagged_sents()[:7000]
>>> prefixtagger = AffixTagger(training, affix_length=4)
>>> prefixtagger.evaluate(testing)
0.21103516226368618
>>> prefixtagger3 = AffixTagger(training, affix_length=3, backoff=prefixtagger)
>>> prefixtagger3.evaluate(testing)
0.25906767658107027
>>> suffixtagger3 = AffixTagger(training, affix_length=-3, backoff=prefixtagger3)
>>> suffixtagger3.evaluate(testing)
0.2939630929654946
>>> suffixtagger4 = AffixTagger(training, affix_length=-4, backoff=suffixtagger3)
>>> suffixtagger4.evaluate(testing)
0.3316090892296324
TnT stands for Trigrams'n'Tags. TnT is a statistical tagger based on second-order Markov models. Let's see the NLTK code for TnT:
>>> import nltk
>>> from nltk.tag import tnt
>>> from nltk.corpus import treebank
>>> testing = treebank.tagged_sents()[2000:]
>>> training = treebank.tagged_sents()[:7000]
>>> tnt_tagger = tnt.TnT()
>>> tnt_tagger.train(training)
>>> tnt_tagger.evaluate(testing)
0.9882176652913768
TnT computes a ConditionalFreqDist and an internal FreqDist from the training text. These instances are used to compute unigram, bigram, and trigram counts. To choose the best tag, TnT uses this n-gram model.
Consider the following code, which supplies a DefaultTagger as the tagger for unknown words; since a DefaultTagger needs no training, Trained is set to True:
>>> import nltk
>>> from nltk.tag import DefaultTagger
>>> from nltk.tag import tnt
>>> from nltk.corpus import treebank
>>> testing = treebank.tagged_sents()[2000:]
>>> training = treebank.tagged_sents()[:7000]
>>> unknown = DefaultTagger('NN')
>>> tagger_tnt = tnt.TnT(unk=unknown, Trained=True)
>>> tagger_tnt.train(training)
>>> tagger_tnt.evaluate(testing)
0.988238192006897