Training a classifier with NLTK-Trainer

In this recipe, we'll cover the classifier training script from NLTK-Trainer, which lets you train NLTK classifiers from the command line. NLTK-Trainer was previously introduced at the end of Chapter 4, Part-of-speech Tagging, and again at the end of Chapter 5, Extracting Chunks.


NLTK-Trainer's source code and documentation are both available online.

How to do it...

As with the tagger and chunker training scripts from the earlier chapters, the only required argument for the classifier training script is the name of a corpus. The corpus must have a categories() method, because text classification is all about learning to assign categories. Here's an example of running the script on the movie_reviews corpus:

$ python movie_reviews
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
2000 training feats, 2000 testing feats
training NaiveBayes classifier
accuracy: 0.967000
neg precision: 1.000000
neg recall: 0.934000
neg f-measure: 0.965874
pos precision: 0.938086
pos recall: 1.000000
pos f-measure: 0.968054
dumping NaiveBayesClassifier to ~/nltk_data/classifiers/movie_reviews_NaiveBayes.pickle

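We can use the --no-pickle argument to skip saving the classifier, and the --fraction argument to train on only a fraction of the corpus and evaluate the classifier against the held-out remainder. Without --fraction, the classifier is tested on the same 2,000 instances it was trained on, which explains the high accuracy in the previous run. This example replicates what we did earlier in the Training a Naive Bayes classifier recipe.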

$ python movie_reviews --no-pickle --fraction 0.75
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
1500 training feats, 500 testing feats
training NaiveBayes classifier
accuracy: 0.726000
neg precision: 0.952000
neg recall: 0.476000
neg f-measure: 0.634667
pos precision: 0.650667
pos recall: 0.976000
pos f-measure: 0.780800

You can see that we get not only the accuracy, but also the precision, recall, and F-measure of each class, as covered earlier in the Measuring precision and recall of a classifier recipe.
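As a worked check of how these figures relate, the following sketch recomputes them from a confusion table. The four counts are not printed by the script; they are back-calculated from the output above under the assumption that the 500 test instances are 250 neg and 250 pos files. The function name label_metrics is illustrative only.

```python
# Confusion counts keyed by (actual label, predicted label).
# Assumption: back-calculated from the reported metrics, with 250 test
# files per label; the script itself does not print these counts.
counts = {
    ('neg', 'neg'): 119,
    ('neg', 'pos'): 131,
    ('pos', 'neg'): 6,
    ('pos', 'pos'): 244,
}

def label_metrics(counts, label):
    """Return (precision, recall, f_measure) for one label."""
    true_pos = counts[(label, label)]
    predicted = sum(n for (_, pred), n in counts.items() if pred == label)
    actual = sum(n for (act, _), n in counts.items() if act == label)
    precision = true_pos / predicted
    recall = true_pos / actual
    # F-measure here is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

correct = sum(n for (act, pred), n in counts.items() if act == pred)
accuracy = correct / sum(counts.values())
print('accuracy: %f' % accuracy)
for label in ('neg', 'pos'):
    precision, recall, f_measure = label_metrics(counts, label)
    print('%s precision: %f' % (label, precision))
    print('%s recall: %f' % (label, recall))
    print('%s f-measure: %f' % (label, f_measure))
```

The low neg recall and high neg precision come from the same fact: under these counts, the classifier rarely predicts neg, but is almost always right when it does.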


The PYTHONHASHSEED environment variable has been omitted for clarity. This means that when you run the script, your accuracy, precision, and recall values may vary. To get consistent values, run the script like this:

$ PYTHONHASHSEED=0 python movie_reviews
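To see what PYTHONHASHSEED controls, here is a small sketch that starts fresh Python interpreters with a fixed seed and compares the hash of a string. The helper name string_hash is hypothetical. Exactly where the training script depends on string hashes is not stated in this recipe; a plausible cause (an assumption) is iteration order over sets of words.

```python
import os
import subprocess
import sys

def string_hash(seed):
    """Run a fresh interpreter and return hash('neg') under the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, '-c', "print(hash('neg'))"],
        env=env, capture_output=True, text=True, check=True)
    return int(result.stdout)

# With a fixed seed, two separate processes agree on the hash value.
print(string_hash('0') == string_hash('0'))
```

Without a fixed seed, Python 3 randomizes string hashes per process, so two runs of the same program can see different values.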

How it works...

The script goes through a series of steps to train a classifier:

  1. Loads the categorized corpus.
  2. Extracts features.
  3. Trains the classifier.

Depending on the arguments used, there may be further steps, such as evaluating the classifier and/or saving the classifier.
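The feature extraction and train/test split steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the script's actual code: the toy corpus, the function names, and the per-label split are all illustrative.

```python
def bag_of_words(words):
    """Map every word to True, the simplest bag of words feature set."""
    return {word: True for word in words}

def split_by_fraction(label_feats, fraction=0.75):
    """Split each label's feature sets into training and testing lists."""
    train, test = [], []
    for label, feats in label_feats.items():
        cutoff = int(len(feats) * fraction)
        train.extend((feat, label) for feat in feats[:cutoff])
        test.extend((feat, label) for feat in feats[cutoff:])
    return train, test

# Toy stand-in for a categorized corpus: label -> list of tokenized files.
toy_corpus = {
    'neg': [['dull', 'plot'], ['bad', 'acting'], ['boring'], ['awful', 'mess']],
    'pos': [['great', 'film'], ['fine', 'cast'], ['superb'], ['a', 'delight']],
}

# Step 2: extract one feature set per file, grouped by label.
label_feats = {label: [bag_of_words(words) for words in files]
               for label, files in toy_corpus.items()}

# Optional evaluation step: hold out a quarter of each label for testing.
train_feats, test_feats = split_by_fraction(label_feats, fraction=0.75)
print('%d training feats, %d testing feats' % (len(train_feats), len(test_feats)))
```

With NLTK installed, a list shaped like train_feats is what NaiveBayesClassifier.train() expects: pairs of a feature dictionary and a label.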

The default feature extraction is a bag of words, which we covered in the first recipe of this chapter, Bag of words feature extraction. And the default classifier is the NaiveBayesClassifier class, which we covered earlier in the Training a Naive Bayes classifier recipe. You can choose a different classifier using the --classifier argument. Here's an example with DecisionTreeClassifier, replicating the same arguments we used in the Training a decision tree classifier recipe:

$ python movie_reviews --no-pickle --fraction 0.75 --classifier DecisionTree --trace 0 --entropy_cutoff 0.8 --depth_cutoff 5 --support_cutoff 30 --binary
accuracy: 0.672000
neg precision: 0.683761
neg recall: 0.640000
neg f-measure: 0.661157
pos precision: 0.661654
pos recall: 0.704000
pos f-measure: 0.682171

There's more...

The script supports many other arguments not shown here, all of which you can see by running the script with --help. Some additional arguments are presented next along with examples for other classification algorithms, followed by an introduction to another classification-related script available in nltk-trainer.

Saving a pickled classifier

Without the --no-pickle argument, the script saves a pickled classifier at ~/nltk_data/classifiers/NAME.pickle, where NAME is a combination of the corpus name and training algorithm. You can specify a custom filename for your classifier using the --filename argument like this:

$ python movie_reviews --filename path/to/classifier.pickle
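Pickling itself is plain Python serialization. The sketch below round-trips a stand-in object through a pickle file; the dictionary is a placeholder for a trained classifier (an assumption made so the example runs without NLTK), and the temporary path is illustrative.

```python
import os
import pickle
import tempfile

# Stand-in for a trained classifier; any picklable object behaves the same way.
classifier = {'labels': ['neg', 'pos']}

# Write the object to disk, then read it back.
path = os.path.join(tempfile.mkdtemp(), 'classifier.pickle')
with open(path, 'wb') as f:
    pickle.dump(classifier, f)
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == classifier)
```

A pickled NLTK classifier is loaded the same way, provided NLTK is importable when the file is unpickled.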

Using different training instances

By default, the script uses individual files as training instances. That means a single categorized file will be used as one instance. But you can instead use paragraphs or sentences as training instances. Here's an example using sentences from the movie_reviews corpus:

$ python movie_reviews --no-pickle --fraction 0.75 --instances sents
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
50820 training feats, 16938 testing feats
training NaiveBayes classifier
accuracy: 0.638623
neg precision: 0.694942
neg recall: 0.470786
neg f-measure: 0.561313
pos precision: 0.610546
pos recall: 0.800580
pos f-measure: 0.692767

To use paragraphs instead of files or sentences, pass --instances paras.

The most informative features

In the earlier recipe, Training a Naive Bayes classifier, we covered how to see the most informative features. This can also be done with the --show-most-informative argument:

$ python movie_reviews --no-pickle --fraction 0.75 --show-most-informative 5
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
1500 training feats, 500 testing feats
training NaiveBayes classifier
accuracy: 0.726000
neg precision: 0.952000
neg recall: 0.476000
neg f-measure: 0.634667
pos precision: 0.650667
pos recall: 0.976000
pos f-measure: 0.780800
5 most informative features

Most Informative Features
             finest = True              pos : neg    =     13.4 : 1.0
         astounding = True              pos : neg    =     11.0 : 1.0
             avoids = True              pos : neg    =     11.0 : 1.0
             inject = True              neg : pos    =     10.3 : 1.0
          strongest = True              pos : neg    =     10.3 : 1.0

The Maxent and LogisticRegression classifiers

In the Training a maximum entropy classifier recipe, we covered the MaxentClassifier class with the GIS algorithm. Here's how to do this with the training script:

$ python movie_reviews --no-pickle --fraction 0.75 --classifier GIS --max_iter 10 --min_lldelta 0.5
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
1500 training feats, 500 testing feats
training GIS classifier
  ==> Training (10 iterations)
accuracy: 0.712000
neg precision: 0.964912
neg recall: 0.440000
neg f-measure: 0.604396
pos precision: 0.637306
pos recall: 0.984000
pos f-measure: 0.773585

If you have scikit-learn installed, then you can use many different sklearn algorithms for classification. In the Training scikit-learn classifiers recipe, we covered the LogisticRegression classifier, so here's how to do it with the training script:

$ python movie_reviews --no-pickle --fraction 0.75 --classifier sklearn.LogisticRegression
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
1500 training feats, 500 testing feats
training sklearn.LogisticRegression with {'penalty': 'l2', 'C': 1.0}
using dtype bool
training sklearn.LogisticRegression classifier
accuracy: 0.856000
neg precision: 0.847656
neg recall: 0.868000
neg f-measure: 0.857708
pos precision: 0.864754
pos recall: 0.844000
pos f-measure: 0.854251


SVM classifiers were introduced in the Training scikit-learn classifiers recipe, and can also be used with the training script. Here are the parameters for LinearSVC:

$ python movie_reviews --no-pickle --fraction 0.75 --classifier sklearn.LinearSVC
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
1500 training feats, 500 testing feats
training sklearn.LinearSVC with {'penalty': 'l2', 'loss': 'l2', 'C': 1.0}
using dtype bool
training sklearn.LinearSVC classifier
accuracy: 0.860000
neg precision: 0.851562
neg recall: 0.872000
neg f-measure: 0.861660
pos precision: 0.868852
pos recall: 0.848000
pos f-measure: 0.858300

And here are the parameters for NuSVC:

$ python movie_reviews --no-pickle --fraction 0.75 --classifier sklearn.NuSVC
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
1500 training feats, 500 testing feats
training sklearn.NuSVC with {'kernel': 'rbf', 'nu': 0.5}
using dtype bool
training sklearn.NuSVC classifier
accuracy: 0.850000
neg precision: 0.827715
neg recall: 0.884000
neg f-measure: 0.854932
pos precision: 0.875536
pos recall: 0.816000
pos f-measure: 0.844720

Combining classifiers

In the Combining classifiers with voting recipe, we covered how to combine multiple classifiers into a single classifier using a max vote method. The script can also combine classifiers, but it uses a slightly different algorithm. Instead of counting votes, it sums probabilities together to produce a final probability distribution, which is then used to classify each instance. Here's an example with three sklearn classifiers:

$ python movie_reviews --no-pickle --fraction 0.75 --classifier sklearn.LogisticRegression sklearn.MultinomialNB sklearn.NuSVC
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
1500 training feats, 500 testing feats
training sklearn.LogisticRegression with {'penalty': 'l2', 'C': 1.0}
using dtype bool
training sklearn.MultinomialNB with {'alpha': 1.0}
using dtype bool
training sklearn.NuSVC with {'kernel': 'rbf', 'nu': 0.5}
using dtype bool
training sklearn.LogisticRegression classifier
training sklearn.MultinomialNB classifier
training sklearn.NuSVC classifier
accuracy: 0.856000
neg precision: 0.839695
neg recall: 0.880000
neg f-measure: 0.859375
pos precision: 0.873950
pos recall: 0.832000
pos f-measure: 0.852459
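The difference between counting votes and summing probabilities can be shown with a small sketch. The three probability distributions are made up for illustration, and the function names are hypothetical; this is not the script's own implementation.

```python
from collections import Counter

def max_vote(distributions):
    """Each classifier votes for its most probable label; the majority wins."""
    votes = Counter(max(d, key=d.get) for d in distributions)
    return votes.most_common(1)[0][0]

def sum_probabilities(distributions):
    """Add up the per-label probabilities and pick the largest total."""
    totals = Counter()
    for d in distributions:
        totals.update(d)  # adds each label's probability to its running total
    return max(totals, key=totals.get)

# Hypothetical output of three classifiers for a single instance.
distributions = [
    {'neg': 0.55, 'pos': 0.45},
    {'neg': 0.60, 'pos': 0.40},
    {'neg': 0.05, 'pos': 0.95},
]

print(max_vote(distributions))           # two of three classifiers lean neg
print(sum_probabilities(distributions))  # pos has the larger summed probability
```

Summing lets one very confident classifier outweigh two that are barely over the line, which a plain vote cannot do.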

High information words and bigrams

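In the Calculating high information words recipe, we calculated the information gain of words, and then used only words with high information gain as features. The script can do this too, using the --min_score argument. The --ngrams 1 2 argument in this example includes bigrams alongside single words: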

$ python movie_reviews --no-pickle --fraction 0.75 --classifier NaiveBayes --min_score 5 --ngrams 1 2
loading movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
using bag of words from known set feature extraction
9989 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training NaiveBayes classifier
accuracy: 0.860000
neg precision: 0.901786
neg recall: 0.808000
neg f-measure: 0.852321
pos precision: 0.826087
pos recall: 0.912000
pos f-measure: 0.866920
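The "bag of words from known set" idea in the output above can be sketched as follows. This is an illustration under assumptions: representing every n-gram as a tuple, the function names, and the contents of the high-information set are all hypothetical, and the scoring step that would produce that set is omitted.

```python
def ngram_tokens(words, ngrams=(1, 2)):
    """Return all n-grams of the requested sizes as tuples."""
    tokens = []
    for n in ngrams:
        tokens.extend(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return tokens

def bag_of_words_in_set(tokens, known):
    """Keep only the tokens that appear in the known high-information set."""
    return {token: True for token in tokens if token in known}

words = ['one', 'of', 'the', 'finest', 'films']

# Hypothetical set of n-grams whose score met the minimum.
high_info = {('finest',), ('finest', 'films')}

feats = bag_of_words_in_set(ngram_tokens(words), high_info)
print(feats)
```

Filtering this way keeps the feature sets small, since low-information tokens such as "of" and "the" never become features.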

Cross-fold validation

Cross-fold validation is a method for evaluating a classification algorithm. The typical way to do it is using 10 folds, leaving one fold out for testing. What this means is that the training corpus is first split into 10 parts (or folds). Then, the classifier is trained on nine of the folds and tested against the remaining fold. This is repeated nine more times, choosing a different fold to leave out for testing each time. By using a different set of training and testing examples each time, you reduce the effect of any bias that might be present in a single train/test split. Here's how to do this with the training script:

$ python movie_reviews --classifier sklearn.LogisticRegression --cross-fold 10

mean and variance across folds
accuracy mean: 0.870000
accuracy variance: 0.000365
neg precision mean: 0.866884
neg precision variance: 0.000795
pos precision mean: 0.873236
pos precision variance: 0.001157
neg recall mean: 0.875482
neg recall variance: 0.000706
pos recall mean: 0.864537
pos recall variance: 0.001091
neg f_measure mean: 0.870630
neg f_measure variance: 0.000290
pos f_measure mean: 0.868246
pos f_measure variance: 0.000610

Most of the output has been omitted for clarity. What really matters is the final evaluation, which is the mean and variance of the results across all folds.
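The fold rotation and the final summary can be sketched in a few lines. This is a minimal sketch under assumptions: the interleaved fold assignment is one of several reasonable choices, the per-fold accuracies are made-up numbers, and the use of population variance is a guess, since the recipe does not say which variance the script reports.

```python
from statistics import mean, pvariance

def fold_splits(instances, folds=10):
    """Yield (train, test) pairs, leaving a different fold out each time."""
    for k in range(folds):
        test = instances[k::folds]
        train = [x for i, x in enumerate(instances) if i % folds != k]
        yield train, test

# Twenty placeholder instances split into 10 folds of two instances each.
instances = list(range(20))
splits = list(fold_splits(instances, folds=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))

# Hypothetical per-fold accuracies, summarized the way the script reports them.
accuracies = [0.85, 0.87, 0.90, 0.86]
print('accuracy mean: %f' % mean(accuracies))
print('accuracy variance: %f' % pvariance(accuracies))
```

A low variance across folds suggests the result does not depend much on which instances happened to land in the test set.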

Analyzing a classifier

Also included in NLTK-Trainer is a script for analyzing a classifier. You can use it to see how a classifier categorizes a given corpus. It expects the name of a corpus and a path to a pickled classifier to run on the corpus. If the corpus is categorized, you can also use the --metrics argument to get the accuracy, precision, and recall. The script supports many of the same corpus-related arguments as the training script, and also has an optional --speed argument, so you can see how fast the classifier is. Here's an example of analyzing a pickled NaiveBayesClassifier class against the movie_reviews corpus:

$ python movie_reviews --classifier classifiers/movie_reviews_NaiveBayes.pickle --metrics --speed
loading time: 0secs
accuracy: 0.967
neg precision: 1.000000
neg recall: 0.934000
neg f-measure: 0.965874
pos precision: 0.938086
pos recall: 1.000000
pos f-measure: 0.968054
neg 934
pos 1066
average time per classify: 3secs / 2000 feats = 1.905661 ms/feat

See also

NLTK-Trainer was introduced at the end of Chapter 4, Part-of-speech Tagging, in the Training a tagger with NLTK-Trainer recipe. It was also covered at the end of Chapter 5, Extracting Chunks, in the Training a chunker with NLTK-Trainer recipe. All the previous recipes in the chapter explain various aspects of how the script works.

