A unigram generally refers to a single token. Therefore, a unigram tagger only uses a single word as its context for determining the part-of-speech tag.
UnigramTagger inherits from NgramTagger, which is a subclass of ContextTagger, which in turn inherits from SequentialBackoffTagger. In other words, UnigramTagger is a context-based tagger whose context is a single word, or unigram.

UnigramTagger can be trained by giving it a list of tagged sentences at initialization.
>>> from nltk.tag import UnigramTagger
>>> from nltk.corpus import treebank
>>> train_sents = treebank.tagged_sents()[:3000]
>>> tagger = UnigramTagger(train_sents)
>>> treebank.sents()[0]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
>>> tagger.tag(treebank.sents()[0])
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
We use the first 3000 tagged sentences of the treebank corpus as the training set to initialize the UnigramTagger class. Then we take the first sentence as a list of words, and see how the tag() method transforms it into a list of tagged tokens.
UnigramTagger builds a context model from the list of tagged sentences. Because UnigramTagger inherits from ContextTagger, instead of providing a choose_tag() method, it must implement a context() method, which takes the same three arguments as choose_tag(). The result of context() is, in this case, the word token. The context token is used to create the model, and also to look up the best tag once the model is created. Here's an inheritance diagram showing each class, starting at SequentialBackoffTagger:
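To make the relationship between context() and the model concrete, here is a simplified imitation in plain Python. This is a sketch, not NLTK's actual implementation; the class names SimpleContextTagger and SimpleUnigramTagger are invented for illustration.

```python
# A minimal sketch of how a context-based tagger works: context() maps
# a token position to a context key, and tagging is just a lookup of
# that key in a model dict.

class SimpleContextTagger:
    """Tag each token by looking up its context key in a model."""

    def __init__(self, model):
        self._context_to_tag = model

    def context(self, tokens, index, history):
        raise NotImplementedError

    def tag(self, tokens):
        tags = []
        for index in range(len(tokens)):
            key = self.context(tokens, index, tags)
            tags.append(self._context_to_tag.get(key))
        return list(zip(tokens, tags))


class SimpleUnigramTagger(SimpleContextTagger):
    """The context is just the word itself, as in UnigramTagger."""

    def context(self, tokens, index, history):
        return tokens[index]


tagger = SimpleUnigramTagger({'the': 'DT', 'cat': 'NN'})
print(tagger.tag(['the', 'cat', 'sat']))
# [('the', 'DT'), ('cat', 'NN'), ('sat', None)]
```

Note how a word missing from the model gets None as its tag, exactly as we'll see with the real UnigramTagger later in this recipe.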
Let's see how accurate the UnigramTagger class is on the test sentences (see the previous recipe for how test_sents is created).
>>> tagger.evaluate(test_sents)
0.8588819339520829
It has almost 86% accuracy for a tagger that only uses single-word lookup to determine the part-of-speech tag. All accuracy gains from here on will be much smaller.
Actual accuracy values may change each time you run the code. This is because Python 3 randomizes string hashing by default, which affects iteration order over dictionaries and sets. To get consistent accuracy values, run Python with the PYTHONHASHSEED environment variable set to 0 or any positive integer. For example:
$ PYTHONHASHSEED=0 python chapter4.py
All accuracy values in this book were calculated with PYTHONHASHSEED=0.
The model building is actually implemented in ContextTagger. Given the list of tagged sentences, it calculates how frequently each tag occurs for each context. The tag with the highest frequency for a context is stored in the model.
All taggers that inherit from ContextTagger can take a pre-built model instead of training their own. This model is simply a Python dict mapping a context key to a tag. The context keys depend on what the ContextTagger subclass returns from its context() method. For UnigramTagger, context keys are individual words, but for other NgramTagger subclasses, the context keys will be tuples.
Here's an example where we pass a very simple model to the UnigramTagger class instead of a training set:
>>> tagger = UnigramTagger(model={'Pierre': 'NN'})
>>> tagger.tag(treebank.sents()[0])
[('Pierre', 'NN'), ('Vinken', None), (',', None), ('61', None), ('years', None), ('old', None), (',', None), ('will', None), ('join', None), ('the', None), ('board', None), ('as', None), ('a', None), ('nonexecutive', None), ('director', None), ('Nov.', None), ('29', None), ('.', None)]
Since the model only contained the context key Pierre, only the first word got a tag. Every other word got None as its tag, since the context word was not in the model. So, unless you know exactly what you are doing, let the tagger train its own model instead of passing in your own.
One good case for passing a self-created model to the UnigramTagger class is when you have a dictionary of words and tags, and you know that every word should always map to its tag. Then you can put this UnigramTagger as your first backoff tagger (covered in the next recipe) to look up tags for unambiguous words.
The ContextTagger class uses frequency of occurrence to decide which tag is most likely for a given context. By default, it will do this even if a context word and tag occur only once. If you'd like to set a minimum frequency threshold, you can pass a cutoff value to the UnigramTagger class.
>>> tagger = UnigramTagger(train_sents, cutoff=3)
>>> tagger.evaluate(test_sents)
0.7757392618173969
In this case, using cutoff=3 has decreased the accuracy, but there may be times when a cutoff is a good idea.
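The effect of the cutoff can be sketched with simplified counting (this is an illustration of the idea, not NLTK's actual code; the function name build_cutoff_model is invented):

```python
# Contexts whose winning tag was seen no more than `cutoff` times are
# dropped from the model, so those words get None (or the backoff tag).
from collections import defaultdict, Counter

def build_cutoff_model(tagged_sents, cutoff=0):
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    model = {}
    for word, tags in counts.items():
        tag, count = tags.most_common(1)[0]
        if count > cutoff:  # keep only sufficiently frequent entries
            model[word] = tag
    return model

train = [[('the', 'DT'), ('cat', 'NN')], [('the', 'DT')]]
print(build_cutoff_model(train, cutoff=1))
# {'the': 'DT'} -- 'cat' was seen only once, so it is dropped
```

A higher cutoff trades coverage for confidence: rare words fall out of the model, which can help when their single observed tag was noise, but hurts when it was correct.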