TnT stands for Trigrams'n'Tags. It is a statistical tagger based on second order Markov models. The details of this are beyond the scope of this book, but you can read more about the original implementation at http://www.coli.uni-saarland.de/~thorsten/tnt/.
The TnT tagger has a slightly different API than the previous taggers we've encountered. You must explicitly call the train() method after you've created it. Here's a basic example.
>>> from nltk.tag import tnt
>>> tnt_tagger = tnt.TnT()
>>> tnt_tagger.train(train_sents)
>>> tnt_tagger.evaluate(test_sents)
0.8756313403842003
It's quite a good tagger all by itself, only slightly less accurate than the BrillTagger class from the previous recipe. But if you do not call train() before evaluate(), you'll get an accuracy of 0%.
The TnT tagger maintains a number of internal FreqDist and ConditionalFreqDist instances based on the training data. These frequency distributions count unigrams, bigrams, and trigrams. Then, during tagging, the frequencies are used to calculate the probabilities of possible tags for each word. So, instead of constructing a backoff chain of NgramTagger subclasses, the TnT tagger uses all the ngram models together to choose the best tag. It also tries to guess the tags for the whole sentence at once by choosing the most likely model for the entire sentence, based on the probabilities of each possible tag.
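To make the idea concrete, here is a minimal, self-contained sketch (not NLTK's actual implementation) of how unigram, bigram, and trigram tag frequencies can be combined by linear interpolation to score a candidate tag. The toy corpus, function names, and interpolation weights are all invented for illustration:

```python
# Conceptual sketch of TnT-style ngram interpolation (hypothetical, not NLTK code)
from collections import Counter

# Tag sequences from a toy training corpus
tag_seqs = [
    ['DT', 'NN', 'VB', 'DT', 'NN'],
    ['DT', 'JJ', 'NN', 'VB', 'NN'],
]

# Count unigram, bigram, and trigram tag frequencies
uni, bi, tri = Counter(), Counter(), Counter()
for tags in tag_seqs:
    uni.update(tags)
    bi.update(zip(tags, tags[1:]))
    tri.update(zip(tags, tags[1:], tags[2:]))

def tag_prob(t, prev2, prev1, l1=0.2, l2=0.3, l3=0.5):
    """Linearly interpolate unigram, bigram, and trigram estimates
    of P(t | prev2, prev1). The lambda weights are arbitrary here."""
    total = sum(uni.values())
    p1 = uni[t] / total
    p2 = bi[(prev1, t)] / uni[prev1] if uni[prev1] else 0.0
    p3 = tri[(prev2, prev1, t)] / bi[(prev2, prev1)] if bi[(prev2, prev1)] else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3

# After the context 'DT NN', 'VB' scores higher than the unseen 'JJ'
print(tag_prob('VB', 'DT', 'NN') > tag_prob('JJ', 'DT', 'NN'))  # → True
```

Because the trigram and bigram estimates fall back to zero for unseen contexts, the unigram term keeps every known tag from scoring zero, which is the intuition behind using all the ngram models together instead of a backoff chain.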
The TnT tagger accepts a few optional keyword arguments. You can pass in a tagger for unknown words as unk. If this tagger is already trained, then you must also pass in Trained=True. Otherwise, it will call unk.train(data) with the same data you pass into the train() method. Since none of the previous taggers have a public train() method, I recommend always passing Trained=True if you also pass an unk tagger. Here's an example using a DefaultTagger class, which does not require any training.
>>> from nltk.tag import DefaultTagger
>>> unk = DefaultTagger('NN')
>>> tnt_tagger = tnt.TnT(unk=unk, Trained=True)
>>> tnt_tagger.train(train_sents)
>>> tnt_tagger.evaluate(test_sents)
0.892467083962875
So, we got an almost 2% increase in accuracy! You must use a tagger that can tag a single word without having seen that word before, because the unknown tagger's tag() method is only called with a single-word sentence. Other good candidates for an unknown tagger are RegexpTagger and AffixTagger. Passing in a UnigramTagger class that's been trained on the same data is pretty much useless, as it will have seen the exact same words and will therefore have the same unknown word blind spots.
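The fallback behaviour described above can be sketched in plain Python; the class names and the toy word/tag data here are hypothetical stand-ins, not NLTK internals:

```python
# Hypothetical sketch of unknown-word fallback (not NLTK's implementation)

class DefaultLikeTagger:
    """Assigns the same tag to every token, like nltk's DefaultTagger."""
    def __init__(self, tag):
        self._tag = tag

    def tag(self, tokens):
        return [(tok, self._tag) for tok in tokens]

class ToyTagger:
    def __init__(self, known, unk):
        self.known = known  # word -> tag mapping learned from training data
        self.unk = unk      # fallback tagger for unseen words

    def tag(self, tokens):
        tagged = []
        for tok in tokens:
            if tok in self.known:
                tagged.append((tok, self.known[tok]))
            else:
                # The unknown tagger only ever sees a single-word "sentence",
                # so it must be able to tag a word with no sentence context.
                tagged.append(self.unk.tag([tok])[0])
        return tagged

tagger = ToyTagger({'the': 'DT', 'cat': 'NN'}, DefaultLikeTagger('NN'))
print(tagger.tag(['the', 'quokka']))  # → [('the', 'DT'), ('quokka', 'NN')]
```

This also shows why a unigram tagger trained on the same data makes a poor unk tagger: its known-word lookup would fail on exactly the same words as the main tagger's.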
Another parameter you can modify for TnT is N, which controls the number of possible solutions the tagger maintains while trying to guess the tags for a sentence. N defaults to 1000. Increasing it will greatly increase the amount of memory used during tagging, without necessarily increasing the accuracy. Decreasing N will decrease memory usage, but could also decrease accuracy. Here's what happens when the value is changed to N=100.
>>> tnt_tagger = tnt.TnT(N=100)
>>> tnt_tagger.train(train_sents)
>>> tnt_tagger.evaluate(test_sents)
0.8756313403842003
So, the accuracy is exactly the same, but we used significantly less memory to achieve it. However, don't assume that accuracy will not change if you decrease N; experiment with your own data to be sure.
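To illustrate what N controls, here is a toy beam search that keeps only the N highest-scoring partial tag sequences at each word. The tagset and scoring function are invented for illustration and are far simpler than TnT's probability model:

```python
# Conceptual sketch of beam pruning with N hypotheses (hypothetical, not NLTK code)

TAGS = ['DT', 'NN', 'VB']

def score(seq):
    # Toy score: reward tag sequences that alternate (purely illustrative)
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)

def beam_tag(n_words, N):
    beams = [()]  # partial tag sequences under consideration
    for _ in range(n_words):
        # Extend every surviving hypothesis with every possible tag...
        candidates = [seq + (t,) for seq in beams for t in TAGS]
        # ...then prune back to the N best, bounding memory use
        beams = sorted(candidates, key=score, reverse=True)[:N]
    return beams[0]

# A small beam explores far fewer hypotheses, yet here both settings
# reach a maximally alternating (score 3) sequence for 4 words.
print(score(beam_tag(4, N=100)), score(beam_tag(4, N=2)))
```

With N=100 nothing is ever pruned for a 4-word sentence (3^4 = 81 hypotheses), while N=2 keeps at most two partial sequences alive; that is the memory/accuracy trade-off the text describes, and on harder inputs the smaller beam can discard the hypothesis that would have won.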
You can pass C=True to the TnT constructor if you want the capitalization of words to be significant. The default is C=False, which means capitalization is ignored and all words are treated as lowercase. The documentation on C says that treating capitalization as significant probably will not increase accuracy. In my own testing, there was a very slight (< 0.01%) increase in accuracy with C=True, probably because case-sensitivity can help identify proper nouns.
We have covered the DefaultTagger class in the Default tagging recipe, backoff tagging in the Combining taggers with backoff tagging recipe, NgramTagger subclasses in the Training a unigram part-of-speech tagger and Training and combining ngram taggers recipes, RegexpTagger in the Tagging with regular expressions recipe, and the AffixTagger class in the Affix tagging recipe.