At the end of the previous chapter, Chapter 4, Part-of-speech Tagging, we introduced NLTK-Trainer and the train_tagger.py script. In this recipe, we will cover the script for training chunkers: train_chunker.py.
You can find NLTK-Trainer at https://github.com/japerk/nltk-trainer and the online documentation at http://nltk-trainer.readthedocs.org/.
As with train_tagger.py, the only required argument to train_chunker.py is the name of a corpus. In this case, we need a corpus that provides a chunked_sents() method, such as treebank_chunk. Here's an example of running train_chunker.py on treebank_chunk:
$ python train_chunker.py treebank_chunk
loading treebank_chunk
4009 chunks, training on 4009
training ub TagChunker
evaluating TagChunker
ChunkParse score:
    IOB Accuracy:  97.0%
    Precision:     90.8%
    Recall:        93.9%
    F-Measure:     92.3%
dumping TagChunker to /Users/jacob/nltk_data/chunkers/treebank_chunk_ub.pickle
Just like with train_tagger.py, we can use the --no-pickle argument to skip saving a pickled chunker, and the --fraction argument to limit the training set and evaluate the chunker against a test set:
$ python train_chunker.py treebank_chunk --no-pickle --fraction 0.75
loading treebank_chunk
4009 chunks, training on 3007
training ub TagChunker
evaluating TagChunker
ChunkParse score:
    IOB Accuracy:  97.3%
    Precision:     91.6%
    Recall:        94.6%
    F-Measure:     93.1%
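The training count in this output (3007 of 4009 chunked sentences) is consistent with rounding the fraction up to a whole sentence. The following is a minimal sketch of such a split, assuming the cutoff is computed with a ceiling. The function name split_by_fraction is hypothetical and is not part of NLTK-Trainer.

```python
import math

def split_by_fraction(sents, fraction):
    """Split a list of sentences into (train, test) at ceil(len * fraction)."""
    cutoff = int(math.ceil(len(sents) * fraction))
    return sents[:cutoff], sents[cutoff:]

# Stand-in data: integers in place of the 4009 chunked sentences.
train, test = split_by_fraction(list(range(4009)), 0.75)
print(len(train), len(test))
```

The first slice plays the role of the training set and the remainder is held out for evaluation.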
The score output you see is what you get when you print a ChunkScore object. This ChunkScore is the result of calling the chunker's evaluate() method, and has been explained in more detail earlier in this chapter in the Partial parsing with regular expressions recipe. Surprisingly, the chunker's scores actually increase slightly when using a smaller training set. This may indicate that the chunker training algorithm is susceptible to over-fitting, meaning that too many training examples can cause the chunker to over-value incorrect or noisy data.
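To relate the numbers in the score output to each other, it helps to recompute the F-Measure from the printed precision and recall. This is a small sketch, assuming precision and recall are weighted equally (a harmonic mean). The function below is our own illustration, not the ChunkScore implementation.

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall; alpha=0.5 weighs both equally."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

# Precision and recall taken from the two runs shown earlier.
print("%.1f%%" % (100 * f_measure(0.908, 0.939)))
print("%.1f%%" % (100 * f_measure(0.916, 0.946)))
```

Because the printed scores are already rounded, a value recomputed this way can occasionally differ from the reported F-Measure in the last digit.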
The default training algorithm for train_chunker.py is to use a tagger-based chunker composed of a BigramTagger and UnigramTagger class. This is what is meant by the output line training ub TagChunker. The details for how to train a tag chunker have been covered earlier in this chapter in the Training a tagger-based chunker recipe. You can modify this algorithm using the --sequential argument. Here's how to train a UnigramTagger-based chunker:
$ python train_chunker.py treebank_chunk --no-pickle --fraction 0.75 --sequential u
loading treebank_chunk
4009 chunks, training on 3007
training u TagChunker
evaluating TagChunker
ChunkParse score:
    IOB Accuracy:  96.7%
    Precision:     89.7%
    Recall:        93.1%
    F-Measure:     91.3%
And here's how to train with additional BigramTagger and TrigramTagger classes:
$ python train_chunker.py treebank_chunk --no-pickle --fraction 0.75 --sequential ubt
loading treebank_chunk
4009 chunks, training on 3007
training ubt TagChunker
evaluating TagChunker
ChunkParse score:
    IOB Accuracy:  97.2%
    Precision:     91.6%
    Recall:        94.4%
    F-Measure:     93.0%
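The letters passed to --sequential in these examples each stand for one n-gram tagger class. As a reading aid, here is a small sketch of that correspondence. The lookup table and the describe_sequential function are hypothetical and only cover the three letters used in this recipe. They are not taken from the script itself.

```python
# Hypothetical lookup table: one letter per n-gram tagger class name.
TAGGER_NAMES = {
    "u": "UnigramTagger",
    "b": "BigramTagger",
    "t": "TrigramTagger",
}

def describe_sequential(letters):
    """Return the tagger class names selected by a --sequential string, in order."""
    return [TAGGER_NAMES[letter] for letter in letters]

for value in ("u", "ub", "ubt"):
    print(value, "->", describe_sequential(value))
```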
You can also train a classifier-based chunker, which was covered in the previous recipe, Classification-based chunking.
$ python train_chunker.py treebank_chunk --no-pickle --fraction 0.75 --sequential '' --classifier NaiveBayes
loading treebank_chunk
4009 chunks, training on 3007
training ClassifierChunker with ['NaiveBayes'] classifier
Constructing training corpus for classifier.
Training classifier (71088 instances)
training NaiveBayes classifier
evaluating ClassifierChunker
ChunkParse score:
    IOB Accuracy:  97.2%
    Precision:     92.6%
    Recall:        93.6%
    F-Measure:     93.1%
The train_chunker.py script supports many other arguments not shown here, all of which you can see by running the script with --help. A few additional arguments are presented next, followed by an introduction to two other chunking-related scripts available in nltk-trainer.
Without the --no-pickle argument, train_chunker.py will save a pickled chunker at ~/nltk_data/chunkers/NAME.pickle, where NAME is a combination of the corpus name and training algorithm. You can specify a custom filename for your chunker using the --filename argument like this:
$ python train_chunker.py treebank_chunk --filename path/to/chunker.pickle
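To make the default naming concrete, here is a rough sketch that builds the default pickle path from a corpus name and a --sequential value. The pattern is inferred from the treebank_chunk_ub.pickle path in the first example of this recipe. The function default_chunker_path is hypothetical, and the naming used for classifier-based chunkers is not covered.

```python
import posixpath

def default_chunker_path(corpus, algorithm, home="~"):
    """Build HOME/nltk_data/chunkers/CORPUS_ALGORITHM.pickle (naming inferred from the first example)."""
    name = "%s_%s.pickle" % (corpus, algorithm)
    return posixpath.join(home, "nltk_data", "chunkers", name)

print(default_chunker_path("treebank_chunk", "ub"))
```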
We can use train_chunker.py to replicate the chunker we trained on the ieer corpus in the Training a named entity chunker recipe. This is possible because the special handling required for training on ieer is built into NLTK-Trainer.
$ python train_chunker.py ieer --no-pickle --fraction 0.85 --sequential '' --classifier NaiveBayes
loading ieer
converting ieer parsed docs to chunked sentences
94 chunks, training on 80
training ClassifierChunker with ['NaiveBayes'] classifier
Constructing training corpus for classifier.
Training classifier (47000 instances)
training NaiveBayes classifier
evaluating ClassifierChunker
ChunkParse score:
    IOB Accuracy:  88.3%
    Precision:     40.9%
    Recall:        50.5%
    F-Measure:     45.2%
If you have a custom corpus that you want to use for training a chunker, you can do that by passing in the path to the corpus and the class name of a corpus reader in the --reader argument. The corpus path can either be absolute or relative to an nltk_data directory. The corpus reader class must provide a chunked_sents() method. Here's an example using a relative path to the treebank chunked corpus:
$ python train_chunker.py corpora/treebank/tagged --reader nltk.corpus.reader.ChunkedCorpusReader --no-pickle --fraction 0.75
loading corpora/treebank/tagged
51002 chunks, training on 38252
training ub TagChunker
evaluating TagChunker
ChunkParse score:
    IOB Accuracy:  98.4%
    Precision:     97.7%
    Recall:        98.9%
    F-Measure:     98.3%
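The --reader value is a dotted path to a class. One common way to turn such a string into a class object is sketched here. This is a generic illustration under the assumption that the last dotted component is the class name. The load_class helper is hypothetical and may differ from what the script does internally.

```python
import importlib

def load_class(dotted_path):
    """Import 'package.module.ClassName' and return the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

# Stand-in class from the standard library so the sketch runs without NLTK.
# With NLTK installed, the --reader value from the command above could be passed instead.
reader_class = load_class("collections.OrderedDict")
print(reader_class.__name__)
```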
The train_chunker.py script supports two arguments that allow it to train on full parse trees from a corpus reader's parsed_sents() method instead of using chunked sentences. A parse tree differs from a chunk tree in that it can be much deeper, with subphrases and even subphrases of those subphrases. But the chunking algorithms we've covered so far cannot learn from deep parse trees, so we need to flatten them somehow. The first argument is --flatten-deep-tree, which trains chunks from the leaf labels of a parse tree.
$ python train_chunker.py treebank --no-pickle --fraction 0.75 --flatten-deep-tree
loading treebank
flattening deep trees from treebank
3914 chunks, training on 2936
training ub TagChunker
evaluating TagChunker
ChunkParse score:
    IOB Accuracy:  72.4%
    Precision:     51.6%
    Recall:        52.2%
    F-Measure:     51.9%
We use the treebank corpus instead of treebank_chunk, because it has full parse trees accessible with the parsed_sents() method. The other parse tree argument is --shallow-tree, which trains chunks from the top-level labels of a parse tree.
$ python train_chunker.py treebank --no-pickle --fraction 0.75 --shallow-tree
loading treebank
creating shallow trees from treebank
3914 chunks, training on 2936
training ub TagChunker
evaluating TagChunker
ChunkParse score:
    IOB Accuracy:  73.1%
    Precision:     60.0%
    Recall:        56.2%
    F-Measure:     58.0%
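To illustrate the difference between the two parse tree arguments, here is a simplified sketch that uses nested tuples in place of NLTK Tree objects. It assumes that flattening keeps only the lowest-level phrases (those whose children are all tagged words) as chunks, and that the shallow variant turns each top-level phrase into one chunk. The data structure and the functions flatten_deep and shallow are hypothetical illustrations, not the NLTK-Trainer code.

```python
# A node is (label, children); a tagged word is (pos_tag, [word]).
tree = ("S", [
    ("NP-SBJ", [
        ("NP", [("NNP", ["Pierre"]), ("NNP", ["Vinken"])]),
        (",", [","]),
        ("ADJP", [("NP", [("CD", ["61"]), ("NNS", ["years"])]), ("JJ", ["old"])]),
    ]),
    ("VP", [("MD", ["will"]),
            ("VP", [("VB", ["join"]), ("NP", [("DT", ["the"]), ("NN", ["board"])])])]),
    (".", ["."]),
])

def is_tagged_word(node):
    """True for a (pos_tag, [word]) node."""
    return len(node[1]) == 1 and isinstance(node[1][0], str)

def tagged_leaves(node):
    """Collect every (word, pos_tag) pair below a node."""
    if is_tagged_word(node):
        return [(node[1][0], node[0])]
    return [pair for child in node[1] for pair in tagged_leaves(child)]

def flatten_deep(node):
    """Keep only the lowest-level phrases as chunks; label None means outside a chunk."""
    flat = []
    for child in node[1]:
        if is_tagged_word(child):
            flat.append((None, tagged_leaves(child)))
        elif all(is_tagged_word(c) for c in child[1]):
            flat.append((child[0], tagged_leaves(child)))
        else:
            flat.extend(flatten_deep(child))
    return flat

def shallow(node):
    """Turn each top-level phrase into a single chunk holding all of its words."""
    return [(None if is_tagged_word(child) else child[0], tagged_leaves(child))
            for child in node[1]]

print([label for label, _ in flatten_deep(tree)])
print([label for label, _ in shallow(tree)])
```

In this sketch the flattened version yields several small NP chunks, while the shallow version yields one NP-SBJ chunk and one VP chunk that span most of the sentence.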
These options are more useful for corpora that don't provide chunked sentences, such as cess_cat and cess_esp.
So how do you know how well a chunker will perform on a different corpus that you didn't train it on? The analyze_chunker_coverage.py script gives you a simple way to test the performance of a chunker against another chunked corpus. Here's how to test NLTK's built-in chunker against the treebank_chunk corpus:
$ python analyze_chunker_coverage.py treebank_chunk --score
loading tagger taggers/maxent_treebank_pos_tagger/english.pickle
loading chunker chunkers/maxent_ne_chunker/english_ace_multiclass.pickle
evaluating chunker score
ChunkParse score:
    IOB Accuracy:  45.4%
    Precision:      0.0%
    Recall:         0.0%
    F-Measure:      0.0%
analyzing chunker coverage of treebank_chunk with NEChunkParser
IOB           Found
============  =========
FACILITY      56
GPE           1874
GSP           38
LOCATION      34
ORGANIZATION  1572
PERSON        2108
============  =========
As you can see, NLTK's default chunker does not do well against the treebank_chunk corpus. This is because the default chunker is looking for named entities, not NP phrases. This is shown by the coverage analysis of IOB tags that were found. These results do not necessarily mean that the default chunker is bad, just that it was not trained for finding noun phrases, and thus cannot be accurately evaluated against the treebank_chunk corpus.
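One way to see how IOB accuracy can be 45.4% while precision and recall are both 0.0% is that accuracy is counted per token, and tokens that both sides leave outside any chunk agree, whereas precision and recall compare whole chunks including their labels. Here is a small sketch with made-up tags, assuming chunks are compared as (label, start, end) spans. The iob_to_chunks helper is hypothetical, and this is not the ChunkScore code.

```python
def iob_to_chunks(tags):
    """Convert a list of IOB tags into a set of (label, start, end) spans."""
    chunks = set()
    start = label = None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" closes any open chunk
        continues = tag.startswith("I-") and tag[2:] == label
        if label is not None and not continues:
            chunks.add((label, start, i))
            start = label = None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]
    return chunks

# Made-up example: the gold corpus marks noun phrases, the chunker marks named entities.
gold = ["B-NP", "I-NP", "O", "B-NP", "I-NP", "O"]
guess = ["B-PERSON", "I-PERSON", "O", "O", "O", "O"]

accuracy = sum(g == p for g, p in zip(gold, guess)) / len(gold)
correct = iob_to_chunks(gold) & iob_to_chunks(guess)
precision = len(correct) / len(iob_to_chunks(guess))
recall = len(correct) / len(iob_to_chunks(gold))
print(accuracy, precision, recall)
```

Only the two tokens tagged O on both sides count toward accuracy here. No chunk matches, because a PERSON span is never equal to an NP span, even when both cover the same words.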
While the analyze_chunker_coverage.py script defaults to using NLTK's built-in tagger and chunker, you can evaluate on your own tagger and/or chunker using the --tagger and/or --chunker arguments, both of which accept a path to a pickled tagger or chunker. Consider the following code:
$ python analyze_chunker_coverage.py treebank_chunk --tagger path/to/tagger.pickle --chunker path/to/chunker.pickle
You can also use a custom corpus just like we did earlier with train_chunker.py; however, if your corpus is not chunked, then you must omit the --score argument, because there are no chunks to compare the results to. In that case, you will only get IOB tag counts with no scores.
Finally, there is a script called analyze_chunked_corpus.py, which, as the name implies, will read in a chunked corpus and print out stats about the number of words and tags. You can run it like this:
$ python analyze_chunked_corpus.py treebank_chunk
The results are very similar to analyze_tagged_corpus.py, with additional columns for each IOB tag. Each IOB tag column shows the counts for each part-of-speech tag that was present in chunks for that IOB tag. For example, NN words (nouns) may occur 300 times in total, and for 280 of those times, the NN words occurred with an NP IOB tag, meaning that most nouns occur within noun phrases.
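The kind of counting described here can be sketched in a few lines. The token list below is made up for illustration, and the variable names are our own. The actual script reads a chunked corpus and prints a formatted table, which this sketch does not attempt to reproduce.

```python
from collections import Counter, defaultdict

# Made-up (word, part-of-speech tag, chunk label) triples; None means outside any chunk.
tokens = [
    ("the", "DT", "NP"), ("board", "NN", "NP"), ("will", "MD", None),
    ("meet", "VB", "VP"), ("the", "DT", "NP"), ("director", "NN", "NP"),
    ("in", "IN", None), ("person", "NN", None),
]

tag_totals = Counter()               # how often each part-of-speech tag occurs
tag_by_chunk = defaultdict(Counter)  # per tag, how often it occurs inside each chunk label
for word, pos, chunk in tokens:
    tag_totals[pos] += 1
    if chunk is not None:
        tag_by_chunk[pos][chunk] += 1

for pos in sorted(tag_totals):
    print(pos, tag_totals[pos], dict(tag_by_chunk[pos]))
```

In this toy data, NN occurs three times in total and twice inside an NP chunk, which mirrors the per-tag columns described above.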
As with the other commands, you can pass in a custom corpus path and reader to analyze your own chunked corpus.