Training a chunker with NLTK-Trainer

At the end of the previous chapter, Chapter 4, Part-of-speech Tagging, we introduced NLTK-Trainer and the train_tagger.py script. In this recipe, we will cover the script for training chunkers: train_chunker.py.

Note

You can find NLTK-Trainer at https://github.com/japerk/nltk-trainer and the online documentation at http://nltk-trainer.readthedocs.org/.

How to do it...

As with train_tagger.py, the only required argument to train_chunker.py is the name of a corpus. In this case, we need a corpus that provides a chunked_sents() method, such as treebank_chunk. Here's an example of running train_chunker.py on treebank_chunk:

$ python train_chunker.py treebank_chunk
loading treebank_chunk
4009 chunks, training on 4009
training ub TagChunker
evaluating TagChunker
ChunkParse score:
    IOB Accuracy:   97.0%
    Precision:      90.8%
    Recall:         93.9%
    F-Measure:      92.3%
dumping TagChunker to /Users/jacob/nltk_data/chunkers/treebank_chunk_ub.pickle

Just like with train_tagger.py, we can use the --no-pickle argument to skip saving a pickled chunker, and the --fraction argument to limit the training set and evaluate the chunker against a test set:

$ python train_chunker.py treebank_chunk --no-pickle --fraction 0.75
loading treebank_chunk
4009 chunks, training on 3007
training ub TagChunker
evaluating TagChunker
ChunkParse score:
    IOB Accuracy:   97.3%
    Precision:      91.6%
    Recall:         94.6%
    F-Measure:      93.1%

The score output is what you get when you print a ChunkScore object. This ChunkScore is the result of calling the chunker's evaluate() method, and was explained in more detail earlier in this chapter in the Partial parsing with regular expressions recipe. Surprisingly, the chunker's scores increase slightly with the smaller training set. This may indicate that the chunker training algorithm is susceptible to overfitting, meaning that too many training examples can cause the chunker to over-value incorrect or noisy data.
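
For reference, the metrics in a ChunkScore can be approximated with a short pure-Python sketch. This is not NLTK's implementation; it simply shows how IOB accuracy differs from the chunk-level metrics: accuracy is computed per IOB tag, while precision and recall compare whole chunk spans.

```python
def chunk_spans(iob_tags):
    """Extract (start, end, type) chunk spans from a sequence of IOB tags."""
    spans, start = [], None
    for i, tag in enumerate(iob_tags + ["O"]):  # sentinel to flush last chunk
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((start, i, iob_tags[start].split("-")[1]))
                start = None
        if tag.startswith("B-"):
            start = i
        elif tag.startswith("I-") and start is None:
            start = i  # tolerate chunks that begin with I-
    return spans

def chunk_score(gold, predicted):
    """IOB accuracy plus chunk-level precision, recall, and F-measure."""
    accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
    gold_spans, pred_spans = set(chunk_spans(gold)), set(chunk_spans(predicted))
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f
```

A predicted chunk only counts toward precision and recall if its span and type match a gold chunk exactly, which is why those scores are usually lower than IOB accuracy.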

Note

The PYTHONHASHSEED environment variable has been omitted for clarity. This means that when you run train_chunker.py, your score values may vary. To get consistent score values, run train_chunker.py like this:

$ PYTHONHASHSEED=0 python train_chunker.py treebank_chunk …

How it works...

The default training algorithm for train_chunker.py is a tagger-based chunker composed of the UnigramTagger and BigramTagger classes. This is what is meant by the output line training ub TagChunker. The details of how to train a tagger-based chunker were covered earlier in this chapter in the Training a tagger-based chunker recipe. You can modify this algorithm using the --sequential argument. Here's how to train a UnigramTagger-based chunker:

$ python train_chunker.py treebank_chunk --no-pickle --fraction 0.75 --sequential u
loading treebank_chunk
4009 chunks, training on 3007
training u TagChunker
evaluating TagChunker
ChunkParse score:
    IOB Accuracy:   96.7%
    Precision:      89.7%
    Recall:         93.1%
    F-Measure:      91.3%

And here's how to train with additional BigramTagger and TrigramTagger classes:

$ python train_chunker.py treebank_chunk --no-pickle --fraction 0.75 --sequential ubt
loading treebank_chunk
4009 chunks, training on 3007
training ubt TagChunker
evaluating TagChunker
ChunkParse score:
    IOB Accuracy:   97.2%
    Precision:      91.6%
    Recall:         94.4%
    F-Measure:      93.0%
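
Conceptually, a "ub" or "ubt" TagChunker labels each part-of-speech tag with an IOB tag using higher-order n-gram lookups that back off to lower-order ones. The following pure-Python sketch (not NLTK-Trainer's actual code) illustrates the unigram-plus-bigram ("ub") case:

```python
from collections import Counter, defaultdict

def train_iob_tagger(sentences):
    """sentences: lists of (pos, iob) pairs. Returns unigram and bigram tables."""
    uni, bi = defaultdict(Counter), defaultdict(Counter)
    for sent in sentences:
        prev = "<s>"
        for pos, iob in sent:
            uni[pos][iob] += 1
            bi[(prev, pos)][iob] += 1
            prev = pos
    return ({p: c.most_common(1)[0][0] for p, c in uni.items()},
            {ctx: c.most_common(1)[0][0] for ctx, c in bi.items()})

def tag_iob(pos_tags, uni, bi):
    """Try the bigram table first, backing off to unigram, then to 'O'."""
    iobs, prev = [], "<s>"
    for pos in pos_tags:
        iobs.append(bi.get((prev, pos), uni.get(pos, "O")))
        prev = pos
    return iobs
```

Adding "t" to --sequential extends the same idea with a trigram table consulted before the bigram one.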

You can also train a classifier-based chunker, which was covered in the previous recipe, Classification-based chunking:

$ python train_chunker.py treebank_chunk --no-pickle --fraction 0.75 --sequential '' --classifier NaiveBayes
loading treebank_chunk
4009 chunks, training on 3007
training ClassifierChunker with ['NaiveBayes'] classifier
Constructing training corpus for classifier.
Training classifier (71088 instances)
training NaiveBayes classifier
evaluating ClassifierChunker
ChunkParse score:
    IOB Accuracy:   97.2%
    Precision:      92.6%
    Recall:         93.6%
    F-Measure:      93.1%

There's more...

The train_chunker.py script supports many other arguments not shown here, all of which you can see by running the script with --help. A few additional arguments are presented next, followed by an introduction to two other chunking-related scripts available in nltk-trainer.

Saving a pickled chunker

Without the --no-pickle argument, train_chunker.py will save a pickled chunker at ~/nltk_data/chunkers/NAME.pickle, where NAME is a combination of the corpus name and training algorithm. You can specify a custom filename for your chunker using the --filename argument like this:

$ python train_chunker.py treebank_chunk --filename path/to/chunker.pickle
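
A pickled chunker can then be loaded and used directly from Python. The sketch below uses a stand-in class and a temporary directory so that it is self-contained; with a real pickle you would open the file under ~/nltk_data/chunkers/ instead:

```python
import os
import pickle
import tempfile

class DummyChunker:
    """Stand-in for a trained chunker (illustrative only)."""
    def parse(self, tagged_sent):
        # A real TagChunker's parse() returns an nltk.Tree of chunks
        return [("NP", tagged_sent)]

# Save a chunker, as train_chunker.py does under ~/nltk_data/chunkers/
path = os.path.join(tempfile.mkdtemp(), "treebank_chunk_ub.pickle")
with open(path, "wb") as f:
    pickle.dump(DummyChunker(), f)

# Later: load the pickle and chunk a POS-tagged sentence
with open(path, "rb") as f:
    chunker = pickle.load(f)
result = chunker.parse([("the", "DT"), ("cat", "NN")])
```

The input to parse() is a POS-tagged sentence, so in practice you would run a tagger over your tokens first.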

Training a named entity chunker

We can use train_chunker.py to replicate the chunker we trained on the ieer corpus in the Training a named entity chunker recipe. This is possible because the special handling required for training on ieer is built into NLTK-Trainer.

$ python train_chunker.py ieer --no-pickle --fraction 0.85 --sequential '' --classifier NaiveBayes
loading ieer
converting ieer parsed docs to chunked sentences
94 chunks, training on 80
training ClassifierChunker with ['NaiveBayes'] classifier
Constructing training corpus for classifier.
Training classifier (47000 instances)
training NaiveBayes classifier
evaluating ClassifierChunker
ChunkParse score:
    IOB Accuracy:   88.3%
    Precision:      40.9%
    Recall:         50.5%
    F-Measure:      45.2%

Training on a custom corpus

If you have a custom corpus that you want to use for training a chunker, you can do that by passing in the path to the corpus and the classname of a corpus reader in the --reader argument. The corpus path can either be absolute or relative to an nltk_data directory. The corpus reader class must provide a chunked_sents() method. Here's an example using a relative path to the treebank chunked corpus:

$ python train_chunker.py corpora/treebank/tagged --reader nltk.corpus.reader.ChunkedCorpusReader --no-pickle --fraction 0.75
loading corpora/treebank/tagged
51002 chunks, training on 38252
training ub TagChunker
evaluating TagChunker
ChunkParse score:
    IOB Accuracy:   98.4%
    Precision:      97.7%
    Recall:         98.9%
    F-Measure:      98.3%
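
The files that ChunkedCorpusReader reads mark chunks with square brackets around word/tag tokens, for example: [ the/DT dog/NN ] barked/VBD. As a rough illustration of that format (this is a sketch, not NLTK's parser), a single line can be converted to chunks like this:

```python
def parse_chunked_line(line):
    """Parse '[ the/DT dog/NN ] barked/VBD' into chunks and loose words.

    Bracketed groups become ('NP', [(word, tag), ...]) chunks, matching the
    reader's default chunk label; unchunked words stay as (word, tag) pairs.
    """
    result, current = [], None
    for token in line.split():
        if token == "[":
            current = []
        elif token == "]":
            result.append(("NP", current))
            current = None
        else:
            word, _, tag = token.rpartition("/")
            pair = (word, tag)
            (current if current is not None else result).append(pair)
    return result
```

Any reader class works as long as it exposes chunked_sents(); the bracket format is just what ChunkedCorpusReader handles by default.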

Training on parse trees

The train_chunker.py script supports two arguments that allow it to train on full parse trees from a corpus reader's parsed_sents() method instead of using chunked sentences. A parse tree differs from a chunk tree in that it can be much deeper, with subphrases and even subphrases of those subphrases. But the chunking algorithms we've covered so far cannot learn from deep parse trees, so we need to flatten them somehow. The first argument is --flatten-deep-tree, which trains chunks from the leaf labels of a parse tree.

$ python train_chunker.py treebank --no-pickle --fraction 0.75 --flatten-deep-tree
loading treebank
flattening deep trees from treebank
3914 chunks, training on 2936
training ub TagChunker
evaluating TagChunker
ChunkParse score:
    IOB Accuracy:  72.4%
    Precision:     51.6%
    Recall:        52.2%
    F-Measure:     51.9%
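
One plausible way to flatten a deep parse tree is to keep only the lowest subtrees that directly contain tagged words, discarding the structure above them. The sketch below uses plain (label, children) tuples rather than nltk.Tree, and illustrates the idea rather than nltk-trainer's implementation:

```python
def is_leaf(node):
    """A leaf is a (word, tag) pair; a subtree is (label, children)."""
    return isinstance(node[1], str)

def flatten_deep(tree, chunks=None):
    """Collect the lowest subtrees whose children are all word/tag leaves."""
    if chunks is None:
        chunks = []
    label, children = tree
    if all(is_leaf(child) for child in children):
        chunks.append((label, children))
    else:
        for child in children:
            if is_leaf(child):
                chunks.append(child)  # word outside any chunk
            else:
                flatten_deep(child, chunks)
    return chunks
```

Flattening necessarily throws away the higher phrase structure, which helps explain the much lower scores above compared to training on treebank_chunk.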

We use the treebank corpus instead of treebank_chunk, because it has full parse trees accessible with the parsed_sents() method. The other parse tree argument is --shallow-tree, which trains chunks from the top-level labels of a parse tree.

$ python train_chunker.py treebank --no-pickle --fraction 0.75 --shallow-tree
loading treebank
creating shallow trees from treebank
3914 chunks, training on 2936
training ub TagChunker
evaluating TagChunker
ChunkParse score:
    IOB Accuracy:  73.1%
    Precision:     60.0%
    Recall:        56.2%
    F-Measure:     58.0%

These options are more useful for corpora that don't provide chunked sentences, such as cess_cat and cess_esp.

Analyzing a chunker against a chunked corpus

So how do you know how well a chunker will perform on a different corpus that you didn't train it on? The analyze_chunker_coverage.py script gives you a simple way to test the performance of a chunker against another chunked corpus. Here's how to test NLTK's built-in chunker against the treebank_chunk corpus:

$ python analyze_chunker_coverage.py treebank_chunk --score
loading tagger taggers/maxent_treebank_pos_tagger/english.pickle
loading chunker chunkers/maxent_ne_chunker/english_ace_multiclass.pickle
evaluating chunker score

ChunkParse score:
    IOB Accuracy:  45.4%
    Precision:     0.0%
    Recall:        0.0%
    F-Measure:     0.0%

analyzing chunker coverage of treebank_chunk with NEChunkParser


IOB              Found
============  =========
FACILITY          56
GPE               1874
GSP               38
LOCATION          34
ORGANIZATION      1572
PERSON            2108
============  =========

As you can see, NLTK's default chunker does not do well against the treebank_chunk corpus. This is because the default chunker looks for named entities, not noun phrases, as shown by the coverage analysis of the IOB tags that were found. These results do not necessarily mean that the default chunker is bad, just that it was not trained for finding noun phrases, and thus cannot be accurately evaluated against the treebank_chunk corpus.

While the analyze_chunker_coverage.py script defaults to using NLTK's built-in tagger and chunker, you can evaluate on your own tagger and/or chunker using the --tagger and/or --chunker arguments, both of which accept a path to a pickled tagger or chunker. Consider the following code:

$ python analyze_chunker_coverage.py treebank_chunk --tagger path/to/tagger.pickle --chunker path/to/chunker.pickle

You can also use a custom corpus, just like we did earlier with train_chunker.py; however, if your corpus is not chunked, you must omit the --score argument, since there are no gold chunks to compare against. In that case, you will only get counts of the IOB tags that were found.

Analyzing a chunked corpus

Finally, there is a script called analyze_chunked_corpus.py, which, as the name implies, reads in a chunked corpus and prints out statistics about the number of words and tags. You can run it like this:

$ python analyze_chunked_corpus.py treebank_chunk

The results are very similar to those of analyze_tagged_corpus.py, with additional columns for each IOB tag. Each IOB tag column shows the counts for each part-of-speech tag that occurred in chunks of that IOB tag. For example, NN words (nouns) might occur 300 times in total, with 280 of those occurrences inside NP chunks, meaning that most nouns occur within noun phrases.
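
The per-IOB columns amount to a nested tally of part-of-speech counts per chunk type. Here is a simplified sketch over (word, pos, iob) triples with hypothetical data; it is not the script's actual code:

```python
from collections import Counter, defaultdict

def pos_by_iob(rows):
    """Count part-of-speech tags per chunk type from (word, pos, iob) rows."""
    counts = defaultdict(Counter)
    for word, pos, iob in rows:
        # B-NP and I-NP both count toward the NP column
        chunk_type = "O" if iob == "O" else iob.split("-", 1)[1]
        counts[chunk_type][pos] += 1
    return counts

rows = [("the", "DT", "B-NP"), ("dog", "NN", "I-NP"), ("barked", "VBD", "O")]
counts = pos_by_iob(rows)
```

Printing counts per chunk type, with one column per IOB tag, gives a table of the same shape as the script's output.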

As with the other commands, you can pass in a custom corpus path and reader to analyze your own chunked corpus.

See also

  • The Training a tagger-based chunker, the Classification-based chunking, and the Training a named entity chunker recipes cover many of the ideas that went into the train_chunker.py script
  • In Chapter 4, Part-of-speech Tagging, we showed how to use NLTK-Trainer for training a tagger in the Training a tagger with NLTK-Trainer recipe