Training the TnT tagger

TnT stands for Trigrams'n'Tags. It is a statistical tagger based on second-order Markov models. The details of the model are beyond the scope of this book, but you can read more about the original implementation at http://www.coli.uni-saarland.de/~thorsten/tnt/.

How to do it...

The TnT tagger has a slightly different API than the previous taggers we've encountered: you must explicitly call the train() method after creating it. The examples in this recipe reuse the train_sents and test_sents variables from the earlier recipes in this chapter.
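If you're jumping straight into this recipe, here's a minimal setup sketch. It assumes the same kind of treebank split used in the earlier recipes in this chapter; adjust the corpus and split point to suit your own data.

>>> from nltk.corpus import treebank
>>> train_sents = treebank.tagged_sents()[:3000]
>>> test_sents = treebank.tagged_sents()[3000:]

With that in place, here's a basic example.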

>>> from nltk.tag import tnt
>>> tnt_tagger = tnt.TnT()
>>> tnt_tagger.train(train_sents)
>>> tnt_tagger.evaluate(test_sents)
0.8756313403842003

It's quite a good tagger all by itself, only slightly less accurate than the BrillTagger class from the previous recipe. But if you do not call train() before evaluate(), you'll get an accuracy of 0%.

How it works...

The TnT tagger maintains a number of internal FreqDist and ConditionalFreqDist instances based on the training data. These frequency distributions count unigrams, bigrams, and trigrams. Then, during tagging, the frequencies are used to calculate the probabilities of possible tags for each word. So, instead of constructing a backoff chain of NgramTagger subclasses, the TnT tagger uses all the ngram models together to choose the best tag. It also considers the whole sentence at once, using beam search to choose the most likely tag sequence for the entire sentence based on the probabilities of each possible tag.
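If you're curious about these internals, you can inspect the distributions after training. The following sketch reuses the tnt_tagger trained earlier. Note that the underscore-prefixed attributes are private implementation details of nltk.tag.tnt, so they may change between NLTK versions, and the exact counts depend on your training data, so outputs are omitted here.

>>> tnt_tagger._uni.N()   # total number of (tag, capitalization) unigram observations
>>> tnt_tagger._wd['the'] # FreqDist of tags observed for the word 'the'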

Note

Training is fairly quick, but tagging is significantly slower than the other taggers we've covered. This is due to all the floating point math that must be done to calculate the tag probabilities of each word.
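If you want to measure this on your own setup, the standard timeit module works well. The following is just a sketch; the sentence is an arbitrary example and the timing you get will depend on your machine, so no output is shown.

>>> import timeit
>>> sent = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
>>> timeit.timeit(lambda: tnt_tagger.tag(sent), number=10)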

There's more...

The TnT tagger accepts a few optional keyword arguments. You can pass in a tagger for unknown words as unk. If this tagger is already trained, then you must also pass in Trained=True. Otherwise, it will call unk.train(data) with the same data you pass into the train() method. Since none of the previous taggers have a public train() method, I recommend always passing Trained=True if you also pass an unk tagger. Here's an example using a DefaultTagger class, which does not require any training.

>>> from nltk.tag import DefaultTagger
>>> unk = DefaultTagger('NN')
>>> tnt_tagger = tnt.TnT(unk=unk, Trained=True)
>>> tnt_tagger.train(train_sents)
>>> tnt_tagger.evaluate(test_sents)
0.892467083962875

So, we got an almost 2% increase in accuracy! Note that the unknown-word tagger must be able to tag a single word without having seen it before, because TnT only ever calls the unknown tagger's tag() method with a single-word sentence. Other good candidates for an unknown-word tagger are RegexpTagger and AffixTagger. Passing in a UnigramTagger class that's been trained on the same data is pretty much useless, as it will have seen the exact same words and, therefore, have the same unknown word blind spots.
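For example, here's a sketch that uses RegexpTagger as the unknown-word tagger. The suffix patterns are illustrative rather than tuned, and because RegexpTagger needs no training, Trained=True is passed just as it was for DefaultTagger.

>>> from nltk.tag import RegexpTagger
>>> patterns = [(r'.*ing$', 'VBG'), (r'.*ed$', 'VBD'), (r'.*s$', 'NNS'), (r'.*', 'NN')]
>>> unk_tagger = RegexpTagger(patterns)
>>> tnt_tagger = tnt.TnT(unk=unk_tagger, Trained=True)
>>> tnt_tagger.train(train_sents)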

Controlling the beam search

Another parameter you can modify for TnT is N, which controls the number of possible solutions the tagger maintains while trying to guess the tags for a sentence. N defaults to 1000. Increasing it will greatly increase the amount of memory used during tagging, without necessarily increasing the accuracy. Decreasing N will decrease memory usage, but could also decrease accuracy. Here's what happens when the value is changed to N=100.

>>> tnt_tagger = tnt.TnT(N=100)
>>> tnt_tagger.train(train_sents)
>>> tnt_tagger.evaluate(test_sents)
0.8756313403842003

So, the accuracy is exactly the same, but we use significantly less memory to achieve it. However, don't assume that accuracy will not change if you decrease N; experiment with your own data to be sure.
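One quick way to run that experiment is a loop over a few values of N. This is just a sketch; your accuracy numbers will depend on your corpus, so no output is shown.

>>> for n in (10, 100, 1000):
...     tagger = tnt.TnT(N=n)
...     tagger.train(train_sents)
...     print(n, tagger.evaluate(test_sents))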

Significance of capitalization

You can pass C=True to the TnT constructor if you want the capitalization of words to be significant. The default is C=False, which means capitalization is ignored and all words are treated as if they were lowercase. The documentation on C says that treating capitalization as significant probably will not increase accuracy. In my own testing, there was a very slight (< 0.01%) increase in accuracy with C=True, probably because case sensitivity can help identify proper nouns.
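Trying it is a one-line change to the constructor. The accuracy is assigned to a variable here rather than printed because, as noted above, the difference from C=False is likely to be negligible and will vary with your data.

>>> tnt_tagger = tnt.TnT(C=True)
>>> tnt_tagger.train(train_sents)
>>> accuracy = tnt_tagger.evaluate(test_sents)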

See also

We have covered the DefaultTagger class in the Default tagging recipe, backoff tagging in the Combining taggers with backoff tagging recipe, NgramTagger subclasses in the Training a unigram part-of-speech tagger and Training and combining ngram taggers recipes, RegexpTagger in the Tagging with regular expressions recipe, and the AffixTagger class in the Affix tagging recipe.
