Training a unigram part-of-speech tagger

A unigram generally refers to a single token. Therefore, a unigram tagger only uses a single word as its context for determining the part-of-speech tag.

UnigramTagger inherits from NgramTagger, which is a subclass of ContextTagger, which inherits from SequentialBackoffTagger. In other words, UnigramTagger is a context-based tagger whose context is a single word, or unigram.

How to do it...

UnigramTagger can be trained by giving it a list of tagged sentences at initialization.

>>> from nltk.tag import UnigramTagger
>>> from nltk.corpus import treebank
>>> train_sents = treebank.tagged_sents()[:3000]
>>> tagger = UnigramTagger(train_sents)
>>> treebank.sents()[0]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
>>> tagger.tag(treebank.sents()[0])
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]

We use the first 3000 tagged sentences of the treebank corpus as the training set to initialize the UnigramTagger class. Then we look at the first sentence as a list of words, and see how the tag() method transforms it into a list of tagged tokens.

How it works...

UnigramTagger builds a context model from the list of tagged sentences. Because UnigramTagger inherits from ContextTagger, instead of providing a choose_tag() method, it must implement a context() method, which takes the same three arguments as choose_tag(). The result of context() is, in this case, the word token. The context token is used to create the model, and also to look up the best tag once the model is created. Here's the inheritance chain, starting at SequentialBackoffTagger, where each class is a subclass of the one above it:
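
SequentialBackoffTagger
    ContextTagger
        NgramTagger
            UnigramTagger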

Let's see how accurate the UnigramTagger class is on the test sentences (see the previous recipe for how test_sents is created).

>>> tagger.evaluate(test_sents)
0.8588819339520829

It has almost 86% accuracy for a tagger that only uses single-word lookup to determine the part-of-speech tag. All accuracy gains from here on will be much smaller.

Note

Actual accuracy values may change each time you run the code. This is because Python 3 randomizes string hashing by default, which can change the iteration order of dictionaries and sets from one run to the next. To get consistent accuracy values, run Python with the PYTHONHASHSEED environment variable set to 0 or any positive integer. For example:

$ PYTHONHASHSEED=0 python chapter4.py

All accuracy values in this book were calculated with PYTHONHASHSEED=0.

There's more...

The model building is actually implemented in ContextTagger. Given the list of tagged sentences, it counts how often each tag occurs for each context, and stores the most frequent tag for each context in the model.
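
To make this concrete, here's a minimal sketch of the same idea using nltk.probability.ConditionalFreqDist. This is not the actual ContextTagger training code, just an equivalent way of building a word-to-tag model from train_sents:

>>> from nltk.probability import ConditionalFreqDist
>>> cfd = ConditionalFreqDist()
>>> for sent in train_sents:
...     for word, tag in sent:
...         cfd[word][tag] += 1
>>> # keep the most frequent tag for each context (word), as ContextTagger does
>>> model = {word: cfd[word].max() for word in cfd.conditions()}
>>> model['the']
'DT'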

Overriding the context model

All taggers that inherit from ContextTagger can take a pre-built model instead of training their own. This model is simply a Python dict mapping a context key to a tag. The context keys will depend on what the ContextTagger subclass returns from its context() method. For UnigramTagger, context keys are individual words. But for other NgramTagger subclasses, the context keys will be tuples.
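
As a hedged illustration of those tuple keys, a pre-built model for a BigramTagger might look like the following. The key shape, a tuple of the preceding tag(s) paired with the current word, is an assumption based on reading NgramTagger's context() method in NLTK 3, so check your version's source before relying on it.

>>> from nltk.tag import BigramTagger
>>> # assumed key shape: ((previous_tag,), current_word) -> tag
>>> bi_tagger = BigramTagger(model={(('DT',), 'board'): 'NN'})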

Here's an example where we pass a very simple model to the UnigramTagger class instead of a training set.

>>> tagger = UnigramTagger(model={'Pierre': 'NN'})
>>> tagger.tag(treebank.sents()[0])
[('Pierre', 'NN'), ('Vinken', None), (',', None), ('61', None), ('years', None), ('old', None), (',', None), ('will', None), ('join', None), ('the', None), ('board', None), ('as', None), ('a', None), ('nonexecutive', None), ('director', None), ('Nov.', None), ('29', None), ('.', None)]

Since the model only contained the context key Pierre, only the first word got a tag. Every other word got None as the tag since the context word was not in the model. So, unless you know exactly what you are doing, let the tagger train its own model instead of passing in your own.

One good case for passing a self-created model to the UnigramTagger class is when you have a dictionary of words and tags, and you know that every word should always map to its tag. Then, you can put this UnigramTagger as your first backoff tagger (covered in the next recipe) to look up tags for unambiguous words.
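
Here's a minimal sketch of that setup. The sure_tags dictionary is hypothetical, and the backoff keyword argument is covered in detail in the next recipe:

>>> sure_tags = {'Nov.': 'NNP', 'nonexecutive': 'JJ'}  # hypothetical unambiguous words
>>> trained_tagger = UnigramTagger(train_sents)
>>> lookup_tagger = UnigramTagger(model=sure_tags, backoff=trained_tagger)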

Minimum frequency cutoff

The ContextTagger class uses frequency of occurrence to decide which tag is most likely for a given context. By default, it will do this even if a context word and tag occur only once. If you'd like to set a minimum frequency threshold, you can pass a cutoff value to the UnigramTagger class.

>>> tagger = UnigramTagger(train_sents, cutoff=3)
>>> tagger.evaluate(test_sents)
0.7757392618173969

In this case, using cutoff=3 has decreased accuracy from almost 86% to about 78%, but there may be times when a minimum frequency cutoff is a good idea.
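
If you want to choose a cutoff empirically, a quick sweep over a few values (a sketch reusing the same train_sents and test_sents) shows the trade-off between coverage and confidence:

>>> for cutoff in [0, 1, 2, 3]:
...     tagger = UnigramTagger(train_sents, cutoff=cutoff)
...     print(cutoff, tagger.evaluate(test_sents))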

See also

In the next recipe, we'll cover backoff tagging to combine taggers, and in the Creating a model of likely word tags recipe, we'll learn how to statistically determine tags for very common words.
