Training the TnT tagger

TnT stands for Trigrams'n'Tags. It is a statistical tagger based on second-order Markov models. The details of the model are beyond the scope of this book, but you can read more about the original implementation at http://www.coli.uni-saarland.de/~thorsten/tnt/.

How to do it...

The TnT tagger has a slightly different API than the previous taggers we've encountered: you must explicitly call the train() method after creating it. The examples in this recipe reuse the train_sents and test_sents variables from the earlier recipes in this chapter.
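If you're jumping straight into this recipe, here's a minimal setup sketch. It assumes the same kind of treebank split used in the earlier recipes in this chapter; adjust the corpus and split point to suit your own data.

>>> from nltk.corpus import treebank
>>> train_sents = treebank.tagged_sents()[:3000]
>>> test_sents = treebank.tagged_sents()[3000:]

With that in place, here's a basic example.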

>>> from nltk.tag import tnt
>>> tnt_tagger = tnt.TnT()
>>> tnt_tagger.train(train_sents)
>>> tnt_tagger.evaluate(test_sents)
0.8756313403842003

It's quite a good tagger all by itself, only slightly less accurate than the BrillTagger class from the previous recipe. But if you do not call train() before evaluate(), you'll get an accuracy of 0%.

How it works...

The TnT tagger maintains a number of internal FreqDist and ConditionalFreqDist instances based on the training data. These frequency distributions count unigrams, bigrams, and trigrams. Then, during tagging, the frequencies are used to calculate the probabilities of possible tags for each word. So, instead of constructing a backoff chain of NgramTagger subclasses, the TnT tagger uses all the ngram models together to choose the best tag. It also considers the whole sentence at once, using beam search to choose the most likely tag sequence for the entire sentence based on the probabilities of each possible tag.
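If you're curious about these internals, you can inspect the distributions after training. The following sketch reuses the tnt_tagger trained earlier. Note that the underscore-prefixed attributes are private implementation details of nltk.tag.tnt, so they may change between NLTK versions, and the exact counts depend on your training data, so outputs are omitted here.

>>> tnt_tagger._uni.N()   # total number of (tag, capitalization) unigram observations
>>> tnt_tagger._wd['the'] # FreqDist of tags observed for the word 'the'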

Note

Training is fairly quick, but tagging is significantly slower than the other taggers we've covered. This is due to all the floating point math that must be done to calculate the tag probabilities of each word.
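If you want to measure this on your own setup, the standard timeit module works well. The following is just a sketch; the sentence is an arbitrary example and the timing you get will depend on your machine, so no output is shown.

>>> import timeit
>>> sent = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
>>> timeit.timeit(lambda: tnt_tagger.tag(sent), number=10)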

There's more...

The TnT tagger accepts a few optional keyword arguments. You can pass in a tagger for unknown words as unk. If this tagger is already trained, then you must also pass in Trained=True. Otherwise, it will call unk.train(data) with the same data you pass into the train() method. Since none of the previous taggers have a public train() method, I recommend always passing Trained=True if you also pass an unk tagger. Here's an example using a DefaultTagger class, which does not require any training.

>>> from nltk.tag import DefaultTagger
>>> unk = DefaultTagger('NN')
>>> tnt_tagger = tnt.TnT(unk=unk, Trained=True)
>>> tnt_tagger.train(train_sents)
>>> tnt_tagger.evaluate(test_sents)
0.892467083962875

So, we got an almost 2% increase in accuracy! Note that the unknown-word tagger must be able to tag a single word without having seen it before, because TnT only ever calls the unknown tagger's tag() method with a single-word sentence. Other good candidates for an unknown-word tagger are RegexpTagger and AffixTagger. Passing in a UnigramTagger class that's been trained on the same data is pretty much useless, as it will have seen the exact same words and, therefore, have the same unknown word blind spots.
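For example, here's a sketch that uses RegexpTagger as the unknown-word tagger. The suffix patterns are illustrative rather than tuned, and because RegexpTagger needs no training, Trained=True is passed just as it was for DefaultTagger.

>>> from nltk.tag import RegexpTagger
>>> patterns = [(r'.*ing$', 'VBG'), (r'.*ed$', 'VBD'), (r'.*s$', 'NNS'), (r'.*', 'NN')]
>>> unk_tagger = RegexpTagger(patterns)
>>> tnt_tagger = tnt.TnT(unk=unk_tagger, Trained=True)
>>> tnt_tagger.train(train_sents)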

Controlling the beam search

Another parameter you can modify for TnT is N, which controls the number of possible solutions the tagger maintains while trying to guess the tags for a sentence. N defaults to 1000. Increasing it will greatly increase the amount of memory used during tagging, without necessarily increasing the accuracy. Decreasing N will decrease memory usage, but could also decrease accuracy. Here's what happens when the value is changed to N=100.

>>> tnt_tagger = tnt.TnT(N=100)
>>> tnt_tagger.train(train_sents)
>>> tnt_tagger.evaluate(test_sents)
0.8756313403842003

So, the accuracy is exactly the same, but we use significantly less memory to achieve it. However, don't assume that accuracy will not change if you decrease N; experiment with your own data to be sure.
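One quick way to run that experiment is a loop over a few values of N. This is just a sketch; your accuracy numbers will depend on your corpus, so no output is shown.

>>> for n in (10, 100, 1000):
...     tagger = tnt.TnT(N=n)
...     tagger.train(train_sents)
...     print(n, tagger.evaluate(test_sents))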

Significance of capitalization

You can pass C=True to the TnT constructor if you want the capitalization of words to be significant. The default is C=False, which means capitalization is ignored and all words are treated as if they were lowercase. The documentation on C says that treating capitalization as significant probably will not increase accuracy. In my own testing, there was a very slight (< 0.01%) increase in accuracy with C=True, probably because case sensitivity can help identify proper nouns.
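Trying it is a one-line change to the constructor. The accuracy is assigned to a variable here rather than printed because, as noted above, the difference from C=False is likely to be negligible and will vary with your data.

>>> tnt_tagger = tnt.TnT(C=True)
>>> tnt_tagger.train(train_sents)
>>> accuracy = tnt_tagger.evaluate(test_sents)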

See also

We have covered the DefaultTagger class in the Default tagging recipe, backoff tagging in the Combining taggers with backoff tagging recipe, NgramTagger subclasses in the Training a unigram part-of-speech tagger and Training and combining ngram taggers recipes, RegexpTagger in the Tagging with regular expressions recipe, and the AffixTagger class in the Affix tagging recipe.
