As you can tell from all the previous recipes in this chapter, there are many different ways to train taggers, and it's impossible to know which methods and parameters will work best without doing training experiments. But training experiments can be tedious, since they often involve many small code changes (and lots of cut and paste) before you converge on an optimal tagger. In an effort to simplify the process, and to make my own work easier, I created a project called NLTK-Trainer.
NLTK-Trainer is a collection of scripts that give you the ability to run training experiments without writing a single line of code. The project is available on GitHub at https://github.com/japerk/nltk-trainer, and its documentation, including download and installation instructions, is hosted at http://nltk-trainer.readthedocs.org/. This recipe will introduce the tagging-related scripts and show you how to combine many of the previous recipes into a single training command.
The simplest way to run train_tagger.py is with the name of an NLTK corpus. If we use the treebank corpus, the command and output should look something like this:
$ python train_tagger.py treebank
loading treebank
3914 tagged sents, training on 3914
training AffixTagger with affix -3 and backoff <DefaultTagger: tag=-None->
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=2536>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=4933>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=2325>
evaluating TrigramTagger
accuracy: 0.992372
dumping TrigramTagger to /Users/jacob/nltk_data/taggers/treebank_aubt.pickle
That's all it takes to train a tagger on treebank and have it dumped to a pickle file at ~/nltk_data/taggers/treebank_aubt.pickle. "Wow, and it's over 99% accurate!" I hear you saying. But look closely at the second line of output: 3914 tagged sents, training on 3914. This means that the tagger was trained on the entire treebank corpus and then tested against those same training sentences. This is a very misleading way to evaluate any trained model. In the previous recipes, we used the first 3000 sentences for training and the remaining 914 sentences for testing, roughly a 75% split. Here's how to do that with train_tagger.py, and also skip dumping a pickle file:
$ python train_tagger.py treebank --fraction 0.75 --no-pickle
loading treebank
3914 tagged sents, training on 2936
training AffixTagger with affix -3 and backoff <DefaultTagger: tag=-None->
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=2287>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=4176>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=1836>
evaluating TrigramTagger
accuracy: 0.906082
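The --fraction split itself is simple arithmetic. Here's a minimal pure-Python sketch of how the cutoff might be computed; the use of round() is my assumption, inferred from the numbers above, not taken from the script's source:

```python
# Sketch of a --fraction style train/test split. The rounding behavior
# is an assumption inferred from the output above (3914 * 0.75 = 2935.5,
# and the script reports training on 2936).

def split_tagged_sents(tagged_sents, fraction=0.75):
    """Split a list of tagged sentences into train and test portions."""
    cutoff = int(round(len(tagged_sents) * fraction))
    return tagged_sents[:cutoff], tagged_sents[cutoff:]

# Stand-in corpus: 3914 dummy "sentences", matching treebank's count.
sents = [[("word", "NN")] for _ in range(3914)]
train, test = split_tagged_sents(sents, 0.75)
print(len(train), len(test))  # 2936 978
```

The remaining 978 sentences serve as the held-out test set, which is why the accuracy drops to a more honest 0.906082.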
The train_tagger.py script roughly performs the following steps: load the corpus, split it into training and testing portions, train a tagger (or a sequence of backoff taggers), evaluate it against the test sentences, and dump the result to a pickle file.
The first argument to the script is corpus. This could be the name of an NLTK corpus that can be found in the nltk.corpus module, such as treebank or brown. It could also be the path to a custom corpus directory. If it's a path to a custom corpus, then you'll also need to use the --reader argument to specify the corpus reader class, such as nltk.corpus.reader.tagged.TaggedCorpusReader.
The default training algorithm is aubt, which is shorthand for a sequential backoff tagger composed of AffixTagger + UnigramTagger + BigramTagger + TrigramTagger. It's probably easiest to understand by replicating many of the previous recipes using train_tagger.py. Let's start with a default tagger:
$ python train_tagger.py treebank --no-pickle --default NN --sequential ''
loading treebank
3914 tagged sents, training on 3914
evaluating DefaultTagger
accuracy: 0.130776
Using --default NN lets us assign a default tag of NN, while --sequential '' disables the default aubt sequential backoff algorithm. The --fraction argument is omitted in this case because no actual training happens with a default tagger.
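To see why the accuracy is so low, consider what a default tagger actually does. Here's a minimal stand-alone sketch of the idea; it is not NLTK's implementation, just an illustration of the concept:

```python
# Minimal sketch of a default tagger: it assigns the same tag to every
# token, ignoring context entirely. NLTK's DefaultTagger is conceptually
# the same; this stand-alone class is only for illustration.

class SimpleDefaultTagger:
    def __init__(self, tag):
        self.tag = tag

    def tag_tokens(self, tokens):
        # Every token receives the same fixed tag.
        return [(token, self.tag) for token in tokens]

tagger = SimpleDefaultTagger("NN")
print(tagger.tag_tokens(["The", "cat", "sat"]))
```

Since every token gets NN, accuracy equals the frequency of NN in the corpus, which is roughly 13% on treebank, matching the 0.130776 shown above.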
Now let's try a unigram tagger:
$ python train_tagger.py treebank --no-pickle --fraction 0.75 --sequential u
loading treebank
3914 tagged sents, training on 2936
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <DefaultTagger: tag=-None->
evaluating UnigramTagger
accuracy: 0.855603
Specifying --sequential u tells train_tagger.py to train with a unigram tagger. As we did earlier, we can boost the accuracy a bit by using a default tagger:
$ python train_tagger.py treebank --no-pickle --default NN --fraction 0.75 --sequential u
loading treebank
3914 tagged sents, training on 2936
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <DefaultTagger: tag=NN>
evaluating UnigramTagger
accuracy: 0.873462
Now, let's try adding a bigram tagger and trigram tagger:
$ python train_tagger.py treebank --no-pickle --default NN --fraction 0.75 --sequential ubt
loading treebank
3914 tagged sents, training on 2936
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <DefaultTagger: tag=NN>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=8709>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=1836>
evaluating TrigramTagger
accuracy: 0.879012
The default training algorithm is --sequential aubt, and the default affix is -3, but you can modify this with one or more -a arguments. So, if you want to use an affix of -2 as well as an affix of -3, you can do the following:
$ python train_tagger.py treebank --no-pickle --default NN --fraction 0.75 -a -3 -a -2
loading treebank
3914 tagged sents, training on 2936
training AffixTagger with affix -3 and backoff <DefaultTagger: tag=NN>
training AffixTagger with affix -2 and backoff <AffixTagger: size=2143>
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=248>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=5204>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=1838>
evaluating TrigramTagger
accuracy: 0.908696
The order of multiple -a arguments matters; if you switch the order, the results and accuracy will change, because the backoff order changes:
$ python train_tagger.py treebank --no-pickle --default NN --fraction 0.75 -a -2 -a -3
loading treebank
3914 tagged sents, training on 2936
training AffixTagger with affix -2 and backoff <DefaultTagger: tag=NN>
training AffixTagger with affix -3 and backoff <AffixTagger: size=606>
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=1313>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=4169>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=1829>
evaluating TrigramTagger
accuracy: 0.914367
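The reason order matters is the backoff chain: as the output shows, each tagger is trained with the previously trained tagger as its backoff, and a tagger only consults its backoff when it can't tag a token itself. Here's a minimal hypothetical sketch of that lookup logic (not NLTK's code, just the general mechanism):

```python
# Sketch of sequential backoff lookup: each tagger knows some words;
# unknown words fall through to the backoff tagger. Because each newly
# trained tagger wraps the previous one, reversing the training order
# changes which tagger gets first crack at each token.

class LookupTagger:
    def __init__(self, known, backoff=None):
        self.known = known      # dict of word -> tag
        self.backoff = backoff  # tagger to fall back on, or None

    def choose_tag(self, word):
        if word in self.known:
            return self.known[word]
        if self.backoff is not None:
            return self.backoff.choose_tag(word)
        return None

default = LookupTagger({}, None)  # knows nothing; always falls through
# Two taggers that disagree about "running"; the outermost one,
# trained last, is consulted first and wins.
chain = LookupTagger({"running": "VBG"},
                     LookupTagger({"running": "NN"}, default))
print(chain.choose_tag("running"))  # VBG
print(chain.choose_tag("xyzzy"))    # None: nothing in the chain knows it
```

Swapping which dictionary sits at the front of the chain would flip the answer for "running", which is the same effect you see when swapping the -a arguments.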
You can also train a Brill tagger using the --brill argument. The template bounds default to (1, 1) but can be customized with the --template_bounds argument.
$ python train_tagger.py treebank --no-pickle --default NN --fraction 0.75 --brill
loading treebank
3914 tagged sents, training on 2936
training AffixTagger with affix -3 and backoff <DefaultTagger: tag=NN>
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=2143>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=4179>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=1824>
Training Brill tagger on 2936 sentences...
Finding initial useful rules...
    Found 1304 useful rules.
Selecting rules...
evaluating BrillTagger
accuracy: 0.909138
Finally, you can train a classifier-based tagger with the --classifier argument, which specifies the name of a classifier. Be sure to also pass in --sequential '' because, as we learned previously, training a sequential backoff tagger in addition to a classifier-based tagger is useless. The --default argument is likewise unnecessary, because the classifier will always guess something.
$ python train_tagger.py treebank --no-pickle --fraction 0.75 --sequential '' --classifier NaiveBayes
loading treebank
3914 tagged sents, training on 2936
training ['NaiveBayes'] ClassifierBasedPOSTagger
Constructing training corpus for classifier.
Training classifier (75814 instances)
training NaiveBayes classifier
evaluating ClassifierBasedPOSTagger
accuracy: 0.928646
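A classifier-based tagger treats tagging each token as a classification problem over features of the word and its context, which is why the output above reports one training instance per token (75814 instances). As a rough sketch of the kind of feature dictionary such a tagger might extract per token (the feature names here are illustrative, not NLTK's exact ones):

```python
# Sketch of per-token feature extraction for a classifier-based tagger.
# The feature names are illustrative assumptions; NLTK's
# ClassifierBasedPOSTagger uses its own internal feature detector.

def word_features(tokens, index):
    """Build a feature dict for the token at the given index."""
    word = tokens[index]
    return {
        "word": word.lower(),
        "suffix3": word[-3:].lower(),
        "prev_word": tokens[index - 1].lower() if index > 0 else "<START>",
        "is_capitalized": word[0].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
    }

tokens = ["The", "cat", "sat"]
print(word_features(tokens, 1))
```

The classifier (here NaiveBayes) then learns to predict a tag from feature dicts like this one, so it can always produce a guess even for unseen words, which is why --default is unnecessary.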
There are a few other classifier algorithms available besides NaiveBayes, and even more if you have NumPy and SciPy installed.
The train_tagger.py script supports many other arguments not shown here, all of which you can see by running the script with --help. A few additional arguments are presented next, followed by an introduction to two other tagging-related scripts available in NLTK-Trainer.
Without the --no-pickle argument, train_tagger.py will save a pickled tagger at ~/nltk_data/taggers/NAME.pickle, where NAME is a combination of the corpus name and training algorithm. You can specify a custom filename for your tagger using the --filename argument like this:
$ python train_tagger.py treebank --filename path/to/tagger.pickle
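Once saved, the tagger can be loaded back in your own code with Python's standard pickle module. A minimal sketch of the round trip (a plain dict stands in for a trained tagger so the example runs without NLTK installed; the path is whatever you passed to --filename):

```python
# Sketch of saving and loading a tagger with the stdlib pickle module.
# Any picklable tagger object works; a plain dict stands in here for a
# real trained tagger so this runs without NLTK.

import os
import pickle
import tempfile

tagger = {"the": "DT", "cat": "NN"}  # stand-in for a trained tagger

path = os.path.join(tempfile.mkdtemp(), "tagger.pickle")
with open(path, "wb") as f:
    pickle.dump(tagger, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded == tagger)  # True
```

This is the same mechanism the script uses when it reports "dumping TrigramTagger to ..." in its output.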
If you have a custom corpus that you want to use for training a tagger, you can do that by passing in the path to the corpus and the class name of a corpus reader in the --reader argument. The corpus path can be either absolute or relative to an nltk_data directory. The corpus reader class must provide a tagged_sents() method. Here's an example using a relative path to the treebank tagged corpus:
$ python train_tagger.py corpora/treebank/tagged --reader nltk.corpus.reader.ChunkedCorpusReader --no-pickle --fraction 0.75
loading corpora/treebank/tagged
51002 tagged sents, training on 38252
training AffixTagger with affix -3 and backoff <DefaultTagger: tag=-None->
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=2092>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=4121>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=1627>
evaluating TrigramTagger
accuracy: 0.883175
You can train a tagger with the universal tagset using the --tagset argument as follows:
$ python train_tagger.py treebank --no-pickle --fraction 0.75 --tagset universal
loading treebank
using universal tagset
3914 tagged sents, training on 2936
training AffixTagger with affix -3 and backoff <DefaultTagger: tag=-None->
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=2287>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=2889>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=1014>
evaluating TrigramTagger
accuracy: 0.934800
Because the universal tagset has fewer tags, these taggers tend to be more accurate. Note that this will only work on a corpus that has universal tagset mappings. The universal tagset was covered in the Creating a part-of-speech tagged word corpus recipe in Chapter 3, Creating Custom Corpora.
Every previous example in this chapter has been about training and evaluating a tagger on a single corpus. But how do you know how well that tagger will perform on a different corpus? The analyze_tagger_coverage.py script gives you a simple way to test the performance of a tagger against another tagged corpus. Here's how to test NLTK's built-in tagger against the treebank corpus:
$ python analyze_tagger_coverage.py treebank --metrics
The output has been omitted for brevity, but I encourage you to run it yourself to see the results. It's especially useful for evaluating a tagger's performance on a corpus it was not trained on, such as conll2000 or brown.
If you only provide a corpus argument, this script will use NLTK's built-in tagger. To evaluate your own tagger, use the --tagger argument, which takes a path to a pickled tagger. The path can be absolute or relative to an nltk_data directory. For example:
$ python analyze_tagger_coverage.py treebank --metrics --tagger path/to/tagger.pickle
You can also use a custom corpus just like we did earlier with train_tagger.py, but if your corpus is not tagged, then you must omit the --metrics argument. In that case, you will only get tag counts, with no notion of accuracy, because there are no reference tags to compare against.
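Counting tags without reference tags is straightforward; here's a pure-Python sketch of what such tag-count output amounts to (the data is a toy stand-in, not real script output):

```python
# Sketch of tag counting over tagged sentences, the kind of statistic
# you get when reference tags (and therefore --metrics) are unavailable.

from collections import Counter

tagged_sents = [
    [("The", "DT"), ("cat", "NN"), ("sat", "VBD")],
    [("A", "DT"), ("dog", "NN"), ("barked", "VBD"), ("loudly", "RB")],
]

tag_counts = Counter(tag for sent in tagged_sents for _, tag in sent)
for tag, count in tag_counts.most_common():
    print(tag, count)
```

Tag counts like these are still useful on an untagged corpus: if your tagger suddenly assigns NN to nearly everything, you'll see it in the distribution even without an accuracy score.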
Finally, there is a script called analyze_tagged_corpus.py, which, as the name implies, will read in a tagged corpus and print out stats about the number of words and tags. You can run it as follows:
$ python analyze_tagged_corpus.py treebank
The results are available in Appendix A, Penn Treebank Part-of-speech Tags. As with the other commands, you can pass in a custom corpus path and reader to analyze your own tagged corpus.
The previous recipes in this chapter cover the details of the classes and methods that power the functionality of train_tagger.py. The Training a chunker with NLTK-Trainer recipe at the end of Chapter 5, Extracting Chunks, will introduce NLTK-Trainer's chunking-related scripts, and classification-related scripts will be covered in the Training a classifier with NLTK-Trainer recipe at the end of Chapter 7, Text Classification.