Training a tagger with NLTK-Trainer

As you can tell from all the previous recipes in this chapter, there are many different ways to train taggers, and it's impossible to know which methods and parameters will work best without doing training experiments. But training experiments can be tedious, since they often involve many small code changes (and lots of cut and paste) before you converge on an optimal tagger. In an effort to simplify the process, and make my own work easier, I created a project called NLTK-Trainer.

NLTK-Trainer is a collection of scripts that give you the ability to run training experiments without writing a single line of code. The project is available on GitHub at https://github.com/japerk/nltk-trainer and has documentation at http://nltk-trainer.readthedocs.org/. This recipe will introduce the tagging related scripts, and will show you how to combine many of the previous recipes into a single training command. For download and installation instructions, please go to http://nltk-trainer.readthedocs.org/.

How to do it...

The simplest way to run train_tagger.py is with the name of an NLTK corpus. If we use the treebank corpus, the command and output should look something like this:

$ python train_tagger.py treebank
loading treebank
3914 tagged sents, training on 3914
training AffixTagger with affix -3 and backoff <DefaultTagger: tag=-None->
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=2536>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=4933>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=2325>
evaluating TrigramTagger
accuracy: 0.992372
dumping TrigramTagger to /Users/jacob/nltk_data/taggers/treebank_aubt.pickle

That's all it takes to train a tagger on treebank and have it dumped to a pickle file at ~/nltk_data/taggers/treebank_aubt.pickle. "Wow, and it's over 99% accurate!" I hear you saying. But look closely at the second line of output: 3914 tagged sents, training on 3914. This means that the tagger was trained on the entire treebank corpus, and then tested against those same training sentences. This is a very misleading way to evaluate any trained model. In the previous recipes, we used the first 3000 sentences for training and the remaining 914 sentences for testing, or about a 75% split. Here's how to do that with train_tagger.py, and also skip dumping a pickle file:

$ python train_tagger.py treebank --fraction 0.75 --no-pickle
loading treebank
3914 tagged sents, training on 2936
training AffixTagger with affix -3 and backoff <DefaultTagger: tag=-None->
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=2287>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=4176>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=1836>
evaluating TrigramTagger
accuracy: 0.906082

How it works...

The train_tagger.py script roughly performs the following steps:

  1. Construct training and testing sentences from corpus arguments.
  2. Build tagger training function from tagger arguments.
  3. Train a tagger on the training sentences using the training function.
  4. Evaluate and/or save the tagger.
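The steps above can be sketched in plain NLTK. The snippet below uses a tiny inline corpus as a stand-in for the corpus the script would load by name; the names and data are illustrative, not the script's actual internals:

```python
from nltk.tag import DefaultTagger, UnigramTagger

# A tiny inline corpus standing in for a real NLTK corpus like treebank.
tagged_sents = [
    [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')],
    [('a', 'DT'), ('dog', 'NN'), ('ran', 'VBD')],
]

# 1. Construct training and testing sentences (here, a 0.75 fraction).
cutoff = int(len(tagged_sents) * 0.75)
train_sents, test_sents = tagged_sents[:cutoff], tagged_sents[cutoff:]

# 2-3. Build and run a tagger training function on the training sentences.
tagger = UnigramTagger(train_sents, backoff=DefaultTagger('NN'))

# 4. Evaluate: unseen words fall back to the default NN tag.
print(tagger.tag(['the', 'cat', 'ran']))
# → [('the', 'DT'), ('cat', 'NN'), ('ran', 'NN')]
```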

The first argument to the script is corpus. This could be the name of an NLTK corpus that can be found in the nltk.corpus module, such as treebank or brown. It could also be the path to a custom corpus directory. If it's a path to a custom corpus, then you'll also need to use the --reader argument to specify the corpus reader class, such as nltk.corpus.reader.tagged.TaggedCorpusReader.

The default training algorithm is aubt, which is shorthand for a sequential backoff tagger composed of AffixTagger + UnigramTagger + BigramTagger + TrigramTagger. It's probably easiest to understand by replicating many of the previous recipes using train_tagger.py. Let's start with a default tagger.
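The aubt chain can be replicated directly with NLTK's sequential taggers, where each class wraps the previous one as its backoff. This is an illustrative sketch with a one-sentence corpus, not the script's code:

```python
from nltk.tag import (AffixTagger, BigramTagger, DefaultTagger,
                      TrigramTagger, UnigramTagger)

train_sents = [[('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')]]

# Build the aubt chain: each tagger backs off to the one before it,
# starting from a DefaultTagger, just as the script's output shows.
backoff = DefaultTagger('-None-')
for cls in (AffixTagger, UnigramTagger, BigramTagger, TrigramTagger):
    backoff = cls(train_sents, backoff=backoff)

tagger = backoff  # the TrigramTagger sits at the front of the chain
print(tagger.tag(['the', 'cat', 'sat']))
```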

$ python train_tagger.py treebank --no-pickle --default NN --sequential ''
loading treebank
3914 tagged sents, training on 3914
evaluating DefaultTagger
accuracy: 0.130776

Using --default NN lets us assign a default tag of NN, while --sequential '' disables the default aubt sequential backoff algorithm. The --fraction argument is omitted in this case because there's not actually any training happening.
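The same baseline can be reproduced in a couple of lines of NLTK; every token simply receives the NN tag:

```python
from nltk.tag import DefaultTagger

# Equivalent of --default NN with no sequential backoff taggers.
tagger = DefaultTagger('NN')
print(tagger.tag(['Hello', 'world']))  # → [('Hello', 'NN'), ('world', 'NN')]
```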

Now let's try a unigram tagger:

$ python train_tagger.py treebank --no-pickle --fraction 0.75 --sequential u
loading treebank
3914 tagged sents, training on 2936
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <DefaultTagger: tag=-None->
evaluating UnigramTagger
accuracy: 0.855603

Specifying --sequential u tells train_tagger.py to train with a unigram tagger. As we did earlier, we can boost the accuracy a bit by using a default tagger:

$ python train_tagger.py treebank --no-pickle --default NN --fraction 0.75 --sequential u
loading treebank
3914 tagged sents, training on 2936
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <DefaultTagger: tag=NN>
evaluating UnigramTagger
accuracy: 0.873462

Now, let's try adding a bigram tagger and trigram tagger:

$ python train_tagger.py treebank --no-pickle --default NN --fraction 0.75 --sequential ubt
loading treebank
3914 tagged sents, training on 2936
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <DefaultTagger: tag=NN>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=8709>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=1836>
evaluating TrigramTagger
accuracy: 0.879012

Note

The PYTHONHASHSEED environment variable has been omitted for clarity. This means that when you run train_tagger.py, your output and accuracy may vary. To get consistent accuracy values, run train_tagger.py like this:

$ PYTHONHASHSEED=0 python train_tagger.py treebank …

The default training algorithm is --sequential aubt, and the default affix is -3, but you can change the affixes with one or more -a arguments. So, if you want to use an affix of -2 as well as an affix of -3, you can do the following:

$ python train_tagger.py treebank --no-pickle --default NN --fraction 0.75 -a -3 -a -2
loading treebank
3914 tagged sents, training on 2936
training AffixTagger with affix -3 and backoff <DefaultTagger: tag=NN>
training AffixTagger with affix -2 and backoff <AffixTagger: size=2143>
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=248>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=5204>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=1838>
evaluating TrigramTagger
accuracy: 0.908696

The order of multiple -a arguments matters, and if you switch the order, the results and accuracy will change, because the backoff order changes:

$ python train_tagger.py treebank --no-pickle --default NN --fraction 0.75 -a -2 -a -3
loading treebank
3914 tagged sents, training on 2936
training AffixTagger with affix -2 and backoff <DefaultTagger: tag=NN>
training AffixTagger with affix -3 and backoff <AffixTagger: size=606>
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=1313>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=4169>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=1829>
evaluating TrigramTagger
accuracy: 0.914367
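One reason the sizes differ between the two orderings is that NLTK's context taggers only record contexts where their backoff tagger gets the word wrong. The following sketch, with made-up training data, mirrors -a -3 -a -2: the -3 tagger is built first, and the -2 tagger wraps it, so the -2 affix is consulted first at tag time:

```python
from nltk.tag import AffixTagger, DefaultTagger

train = [[('running', 'VBG'), ('jumped', 'VBD'), ('quickly', 'RB')]]

# Equivalent of -a -3 -a -2: each AffixTagger wraps the previous one.
chain = DefaultTagger('NN')
for affix in (-3, -2):
    chain = AffixTagger(train, affix_length=affix, backoff=chain)

# The -3 tagger learned 'ing' -> VBG; the -2 tagger learned nothing new,
# because its backoff already tagged every training word correctly.
print(chain.tag(['singing']))  # → [('singing', 'VBG')]
```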

You can also train a Brill tagger using the --brill argument. The default template bounds are (1, 1), but they can be customized with the --template_bounds argument.

$ python train_tagger.py treebank --no-pickle --default NN --fraction 0.75 --brill
loading treebank
3914 tagged sents, training on 2936
training AffixTagger with affix -3 and backoff <DefaultTagger: tag=NN>
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=2143>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=4179>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=1824>
Training Brill tagger on 2936 sentences...
Finding initial useful rules...
    Found 1304 useful rules.
Selecting rules...
evaluating BrillTagger
accuracy: 0.909138
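Under the hood, this pairs an initial tagger with NLTK's BrillTaggerTrainer. Here is a minimal sketch using a toy corpus and a single hand-picked template (the script builds its template set from the --template_bounds setting, not this one template):

```python
from nltk.tag import DefaultTagger, UnigramTagger
from nltk.tag.brill import Pos, Template
from nltk.tag.brill_trainer import BrillTaggerTrainer

train = [
    [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')],
    [('a', 'DT'), ('dog', 'NN'), ('ran', 'VBD')],
]

# Initial tagger whose mistakes the Brill rules will try to correct.
baseline = UnigramTagger(train, backoff=DefaultTagger('NN'))

# One transformation template: change a tag based on the previous tag.
templates = [Template(Pos([-1]))]
trainer = BrillTaggerTrainer(baseline, templates, trace=0)
brill = trainer.train(train, max_rules=10)
print(brill.tag(['the', 'cat', 'sat']))
```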

Finally, you can train a classifier-based tagger with the --classifier argument, which specifies the name of a classifier. Be sure to also pass in --sequential '' because, as we learned previously, training a sequential backoff tagger in addition to a classifier-based tagger is useless. The --default argument is unnecessary as well, because the classifier will always guess something.

$ python train_tagger.py treebank --no-pickle --fraction 0.75 --sequential '' --classifier NaiveBayes
loading treebank
3914 tagged sents, training on 2936
training ['NaiveBayes'] ClassifierBasedPOSTagger
Constructing training corpus for classifier.
Training classifier (75814 instances)
training NaiveBayes classifier
evaluating ClassifierBasedPOSTagger
accuracy: 0.928646
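The equivalent NLTK class is ClassifierBasedPOSTagger, which trains a NaiveBayes classifier by default. A toy-sized sketch follows; real accuracy numbers require a real corpus:

```python
from nltk.tag.sequential import ClassifierBasedPOSTagger

train = [
    [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')],
    [('a', 'DT'), ('dog', 'NN'), ('ran', 'VBD')],
]

# Trains a NaiveBayes classifier over word, affix, and context features;
# no default tag is needed, since the classifier always guesses something.
tagger = ClassifierBasedPOSTagger(train=train)
tagged = tagger.tag(['the', 'dog', 'sat'])
print(tagged)
```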

There are a few other classifier algorithms available besides NaiveBayes, and even more if you have NumPy and SciPy installed.

Note

While classifier-based taggers tend to be more accurate, they are also slower to train, and much slower at tagging. If speed is important to you, I recommend sticking with sequential taggers.

There's more...

The train_tagger.py script supports many other arguments not shown here, all of which you can see by running the script with --help. A few additional arguments are presented next, followed by an introduction to two other tagging-related scripts available in NLTK-Trainer.

Saving a pickled tagger

Without the --no-pickle argument, train_tagger.py will save a pickled tagger at ~/nltk_data/taggers/NAME.pickle, where NAME is a combination of the corpus name and training algorithm. You can specify a custom filename for your tagger using the --filename argument like this:
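Loading a saved tagger later is just a pickle.load() call. Here is a self-contained round-trip sketch using a temporary file; train_tagger.py's own files live under ~/nltk_data/taggers/ by default:

```python
import os
import pickle
import tempfile

from nltk.tag import DefaultTagger

tagger = DefaultTagger('NN')

# Round-trip the tagger through a pickle file, the same format
# train_tagger.py writes.
path = os.path.join(tempfile.mkdtemp(), 'example_tagger.pickle')
with open(path, 'wb') as f:
    pickle.dump(tagger, f)
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded.tag(['dog']))  # → [('dog', 'NN')]
```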

$ python train_tagger.py treebank --filename path/to/tagger.pickle

Training on a custom corpus

If you have a custom corpus that you want to use for training a tagger, you can do that by passing in the path to the corpus and the classname of a corpus reader in the --reader argument. The corpus path can either be absolute or relative to an nltk_data directory. The corpus reader class must provide a tagged_sents() method. Here's an example using a relative path to the treebank tagged corpus:

$ python train_tagger.py corpora/treebank/tagged --reader nltk.corpus.reader.ChunkedCorpusReader --no-pickle --fraction 0.75
loading corpora/treebank/tagged
51002 tagged sents, training on 38252
training AffixTagger with affix -3 and backoff <DefaultTagger: tag=-None->
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=2092>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=4121>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=1627>
evaluating TrigramTagger
accuracy: 0.883175
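In code, the same idea amounts to constructing the reader yourself and calling tagged_sents(). The sketch below writes a tiny word/TAG file to a temporary directory and reads it back with TaggedCorpusReader, which by default treats each line as one sentence:

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

# Create a one-file corpus in word/TAG format.
root = tempfile.mkdtemp()
with open(os.path.join(root, 'demo.pos'), 'w') as f:
    f.write('The/DT dog/NN barked/VBD\n')

reader = TaggedCorpusReader(root, r'.*\.pos')
print(reader.tagged_sents())
```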

Training with universal tags

You can train a tagger with the universal tagset using the --tagset argument as follows:

$ python train_tagger.py treebank --no-pickle --fraction 0.75 --tagset universal
loading treebank
using universal tagset
3914 tagged sents, training on 2936
training AffixTagger with affix -3 and backoff <DefaultTagger: tag=-None->
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=2287>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=2889>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=1014>
evaluating TrigramTagger
accuracy: 0.934800

Because the universal tagset has fewer tags, these taggers tend to be more accurate. However, this will only work on a corpus that has universal tagset mappings. The universal tagset was covered in the Creating a part-of-speech tagged word corpus recipe in Chapter 3, Creating Custom Corpora.

Analyzing a tagger against a tagged corpus

Every previous example in this chapter has been about training and evaluating a tagger on a single corpus. But how do you know how well that tagger will perform on a different corpus? The analyze_tagger_coverage.py script gives you a simple way to test the performance of a tagger against another tagged corpus. Here's how to test NLTK's built-in tagger against the treebank corpus:

$ python analyze_tagger_coverage.py treebank --metrics

The output has been omitted for brevity, but I encourage you to run it yourself to see the results. It's especially useful for evaluating a tagger's performance on a corpus that it was not trained on, such as conll2000 or brown.

If you only provide a corpus argument, this script will use NLTK's built-in tagger. To evaluate your own tagger, you can use the --tagger argument, which takes a path to a pickled tagger. The path can be absolute or relative to an nltk_data directory. For example:

$ python analyze_tagger_coverage.py treebank --metrics --tagger path/to/tagger.pickle

You can also use a custom corpus just like we did earlier with train_tagger.py, but if your corpus is not tagged, then you must omit the --metrics argument. In that case, you will only get tag counts, with no notion of accuracy, because there are no tags to compare to.

Analyzing a tagged corpus

Finally, there is a script called analyze_tagged_corpus.py, which, as the name implies, will read in a tagged corpus and print out stats about the number of words and tags. You can run it as follows:

$ python analyze_tagged_corpus.py treebank

The results are available in Appendix A, Penn Treebank Part-of-speech Tags. As with the other commands, you can pass in a custom corpus path and reader to analyze your own tagged corpus.

See also

The previous recipes in this chapter cover the details of the classes and methods that power the functionality of train_tagger.py. The Training a chunker with NLTK-Trainer recipe at the end of Chapter 5, Extracting Chunks, will introduce NLTK-Trainer's chunking-related scripts, and classification-related scripts will be covered in the Training a classifier with NLTK-Trainer recipe at the end of Chapter 7, Text Classification.
