In this recipe, we'll cover the train_classifier.py script from NLTK-Trainer, which lets you train NLTK classifiers from the command line. NLTK-Trainer was previously introduced at the end of Chapter 4, Part-of-speech Tagging, and again at the end of Chapter 5, Extracting Chunks.
You can find NLTK-Trainer at https://github.com/japerk/nltk-trainer and the online documentation at http://nltk-trainer.readthedocs.org/.
Like train_tagger.py and train_chunker.py, the only required argument for train_classifier.py is the name of a corpus. The corpus must have a categories() method, because text classification is all about learning to classify text into categories. Here's an example of running train_classifier.py on the movie_reviews corpus:
$ python train_classifier.py movie_reviews
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
2000 training feats, 2000 testing feats
training NaiveBayes classifier
accuracy: 0.967000
neg precision: 1.000000
neg recall: 0.934000
neg f-measure: 0.965874
pos precision: 0.938086
pos recall: 1.000000
pos f-measure: 0.968054
dumping NaiveBayesClassifier to ~/nltk_data/classifiers/movie_reviews_NaiveBayes.pickle
We can use the --no-pickle argument to skip saving the classifier, and the --fraction argument to limit the size of the training set and evaluate the classifier against the remaining test set. This example replicates what we did earlier in the Training a Naive Bayes classifier recipe:
$ python train_classifier.py movie_reviews --no-pickle --fraction 0.75
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
1500 training feats, 500 testing feats
training NaiveBayes classifier
accuracy: 0.726000
neg precision: 0.952000
neg recall: 0.476000
neg f-measure: 0.634667
pos precision: 0.650667
pos recall: 0.976000
pos f-measure: 0.780800
You can see that not only do we get accuracy, but also the precision and recall of each class, as covered earlier in the Measuring precision and recall of a classifier recipe.
The train_classifier.py script goes through a series of steps to train a classifier:
1. Load the categorized corpus.
2. Extract features from the corpus instances.
3. Train the classifier on the extracted features.
Depending on the arguments used, there may be further steps, such as evaluating the classifier and/or saving the classifier.
The default feature extraction is bag of words, which we covered in the first recipe of this chapter, Bag of words feature extraction (a minimal sketch follows the next example). The default classifier is the NaiveBayesClassifier class, which we covered earlier in the Training a Naive Bayes classifier recipe. You can choose a different classifier using the --classifier argument. Here's an example with DecisionTreeClassifier, replicating the same arguments we used in the Training a decision tree classifier recipe:
$ python train_classifier.py movie_reviews --no-pickle --fraction 0.75 --classifier DecisionTree --trace 0 --entropy_cutoff 0.8 --depth_cutoff 5 --support_cutoff 30 --binary
accuracy: 0.672000
neg precision: 0.683761
neg recall: 0.640000
neg f-measure: 0.661157
pos precision: 0.661654
pos recall: 0.704000
pos f-measure: 0.682171
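For reference, the default bag of words feature extraction amounts to just a few lines of Python. A minimal sketch of the function from the Bag of words feature extraction recipe:

def bag_of_words(words):
    # Every word becomes a boolean feature set to True.
    return dict((word, True) for word in words)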
The train_classifier.py script supports many other arguments not shown here, all of which you can see by running the script with --help. Some additional arguments are presented next, along with examples for other classification algorithms, followed by an introduction to another classification-related script available in nltk-trainer.
Without the --no-pickle argument, train_classifier.py will save a pickled classifier at ~/nltk_data/classifiers/NAME.pickle, where NAME is a combination of the corpus name and the training algorithm. You can specify a custom filename for your classifier using the --filename argument like this:
$ python train_classifier.py movie_reviews --filename path/to/classifier.pickle
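Once you have a pickled classifier, you can load it in Python and classify new feature sets. Since nltk.data.load() searches ~/nltk_data, a classifier saved to the default location can be loaded with a relative path; the feature dictionary below is just a made-up illustration:

import nltk.data

# Load the classifier saved by the first example in this recipe.
classifier = nltk.data.load('classifiers/movie_reviews_NaiveBayes.pickle')
# Classify a bag of words feature set.
print(classifier.classify({'finest': True, 'astounding': True}))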
By default, train_classifier.py uses individual files as training instances, which means a single categorized file is treated as one instance. But you can instead use paragraphs or sentences as training instances. Here's an example using sentences from the movie_reviews corpus:
$ python train_classifier.py movie_reviews --no-pickle --fraction 0.75 --instances sents
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
50820 training feats, 16938 testing feats
training NaiveBayes classifier
accuracy: 0.638623
neg precision: 0.694942
neg recall: 0.470786
neg f-measure: 0.561313
pos precision: 0.610546
pos recall: 0.800580
pos f-measure: 0.692767
To use paragraphs instead of files or sentences, pass --instances paras.
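To get a feel for what sentence-level instances look like, here's a rough Python equivalent, pairing each sentence's bag of words with its file's category (assuming the default feature extraction):

from nltk.corpus import movie_reviews

# One (featureset, label) pair per sentence, labeled by its category.
labeled_feats = [(dict((word, True) for word in sent), label)
                 for label in movie_reviews.categories()
                 for sent in movie_reviews.sents(categories=[label])]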
In the earlier Training a Naive Bayes classifier recipe, we covered how to see the most informative features. This can also be done with an argument to train_classifier.py:
$ python train_classifier.py movie_reviews --no-pickle --fraction 0.75 --show-most-informative 5
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
1500 training feats, 500 testing feats
training NaiveBayes classifier
accuracy: 0.726000
neg precision: 0.952000
neg recall: 0.476000
neg f-measure: 0.634667
pos precision: 0.650667
pos recall: 0.976000
pos f-measure: 0.780800
5 most informative features
Most Informative Features
      finest = True    pos : neg = 13.4 : 1.0
  astounding = True    pos : neg = 11.0 : 1.0
      avoids = True    pos : neg = 11.0 : 1.0
      inject = True    neg : pos = 10.3 : 1.0
   strongest = True    pos : neg = 10.3 : 1.0
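The same output is available in Python on any trained NaiveBayesClassifier, as shown in that earlier recipe:

# classifier is a trained NaiveBayesClassifier from the earlier recipe.
classifier.show_most_informative_features(5)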
In the Training a maximum entropy classifier recipe, we covered the MaxentClassifier class with the GIS algorithm. Here's how to use train_classifier.py to do this:
$ python train_classifier.py movie_reviews --no-pickle --fraction 0.75 --classifier GIS --max_iter 10 --min_lldelta 0.5
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
1500 training feats, 500 testing feats
training GIS classifier
  ==> Training (10 iterations)
accuracy: 0.712000
neg precision: 0.964912
neg recall: 0.440000
neg f-measure: 0.604396
pos precision: 0.637306
pos recall: 0.984000
pos f-measure: 0.773585
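The equivalent Python call passes the algorithm name and cutoffs directly to MaxentClassifier.train(); here, train_feats is assumed to be a list of (featureset, label) pairs built as in the earlier recipes:

from nltk.classify import MaxentClassifier

# GIS with the same iteration and log-likelihood cutoffs as above.
me_classifier = MaxentClassifier.train(train_feats, algorithm='gis',
                                       max_iter=10, min_lldelta=0.5)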
If you have scikit-learn installed, then you can use many different sklearn algorithms for classification. In the Training scikit-learn classifiers recipe, we covered the LogisticRegression classifier, so here's how to do it with train_classifier.py:
$ python train_classifier.py movie_reviews --no-pickle --fraction 0.75 --classifier sklearn.LogisticRegression
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
1500 training feats, 500 testing feats
training sklearn.LogisticRegression with {'penalty': 'l2', 'C': 1.0}
using dtype bool
training sklearn.LogisticRegression classifier
accuracy: 0.856000
neg precision: 0.847656
neg recall: 0.868000
neg f-measure: 0.857708
pos precision: 0.864754
pos recall: 0.844000
pos f-measure: 0.854251
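In Python, this corresponds to wrapping the sklearn estimator in NLTK's SklearnClassifier, as in the Training scikit-learn classifiers recipe (train_feats assumed as before):

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.linear_model import LogisticRegression

# The wrapper converts NLTK feature dictionaries into sklearn vectors.
sk_classifier = SklearnClassifier(LogisticRegression()).train(train_feats)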
SVM classifiers were also introduced in the Training scikit-learn classifiers recipe, and can be used with train_classifier.py as well. Here are the parameters and results for LinearSVC:
$ python train_classifier.py movie_reviews --no-pickle --fraction 0.75 --classifier sklearn.LinearSVC
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
1500 training feats, 500 testing feats
training sklearn.LinearSVC with {'penalty': 'l2', 'loss': 'l2', 'C': 1.0}
using dtype bool
training sklearn.LinearSVC classifier
accuracy: 0.860000
neg precision: 0.851562
neg recall: 0.872000
neg f-measure: 0.861660
pos precision: 0.868852
pos recall: 0.848000
pos f-measure: 0.858300
And here are the parameters and results for NuSVC:
$ python train_classifier.py movie_reviews --no-pickle --fraction 0.75 --classifier sklearn.NuSVC
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
1500 training feats, 500 testing feats
training sklearn.NuSVC with {'kernel': 'rbf', 'nu': 0.5}
using dtype bool
training sklearn.NuSVC classifier
accuracy: 0.850000
neg precision: 0.827715
neg recall: 0.884000
neg f-measure: 0.854932
pos precision: 0.875536
pos recall: 0.816000
pos f-measure: 0.844720
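Both SVM variants go through the same SklearnClassifier wrapper in Python; a short sketch with the parameters shown above (train_feats assumed as before):

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import LinearSVC, NuSVC

linear_classifier = SklearnClassifier(LinearSVC()).train(train_feats)
nu_classifier = SklearnClassifier(NuSVC(nu=0.5)).train(train_feats)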
In the Combining classifiers with voting recipe, we covered how to combine multiple classifiers into a single classifier using a max vote method. The train_classifier.py script can also combine classifiers, but it uses a slightly different algorithm: instead of counting votes, it sums the probabilities each classifier assigns to every label, producing a final probability distribution that is then used to classify each instance. Here's an example with three sklearn classifiers:
$ python train_classifier.py movie_reviews --no-pickle --fraction 0.75 --classifier sklearn.LogisticRegression sklearn.MultinomialNB sklearn.NuSVC
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
1500 training feats, 500 testing feats
training sklearn.LogisticRegression with {'penalty': 'l2', 'C': 1.0}
using dtype bool
training sklearn.MultinomialNB with {'alpha': 1.0}
using dtype bool
training sklearn.NuSVC with {'kernel': 'rbf', 'nu': 0.5}
using dtype bool
training sklearn.LogisticRegression classifier
training sklearn.MultinomialNB classifier
training sklearn.NuSVC classifier
accuracy: 0.856000
neg precision: 0.839695
neg recall: 0.880000
neg f-measure: 0.859375
pos precision: 0.873950
pos recall: 0.832000
pos f-measure: 0.852459
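To illustrate the probability-summing idea, here is a minimal sketch (not nltk-trainer's actual implementation) written as a small ClassifierI subclass:

from nltk.classify import ClassifierI
from nltk.probability import DictionaryProbDist

class SumProbClassifier(ClassifierI):
    def __init__(self, classifiers):
        self._classifiers = classifiers
        self._labels = sorted(set(l for c in classifiers for l in c.labels()))

    def labels(self):
        return self._labels

    def prob_classify(self, feats):
        # Sum each sub-classifier's probability for every label, then
        # renormalize the totals into a single distribution.
        totals = dict.fromkeys(self._labels, 0.0)
        for classifier in self._classifiers:
            dist = classifier.prob_classify(feats)
            for label in self._labels:
                totals[label] += dist.prob(label)
        return DictionaryProbDist(totals, normalize=True)

    def classify(self, feats):
        return self.prob_classify(feats).max()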
In the Calculating high information words recipe, we calculated the information gain of words, and then used only the words with high information gain as features. The train_classifier.py script can do this too:
$ python train_classifier.py movie_reviews --no-pickle --fraction 0.75 --classifier NaiveBayes --min_score 5 --ngrams 1 2
loading movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
using bag of words from known set feature extraction
9989 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training NaiveBayes classifier
accuracy: 0.860000
neg precision: 0.901786
neg recall: 0.808000
neg f-measure: 0.852321
pos precision: 0.826087
pos recall: 0.912000
pos f-measure: 0.866920
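For reference, the word-scoring step works along these lines. A condensed sketch of the approach from the Calculating high information words recipe, using chi-square as the score function:

from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

def high_information_words(labelled_words, min_score=5):
    # labelled_words is a list of (label, words) pairs.
    word_fd = FreqDist()
    label_word_fd = ConditionalFreqDist()
    for label, words in labelled_words:
        for word in words:
            word_fd[word] += 1
            label_word_fd[label][word] += 1
    n_xx = label_word_fd.N()
    high_info = set()
    for label in label_word_fd.conditions():
        n_xi = label_word_fd[label].N()
        for word, n_ii in label_word_fd[label].items():
            # Chi-square score of the word's association with this label.
            score = BigramAssocMeasures.chi_sq(n_ii, (word_fd[word], n_xi), n_xx)
            if score >= min_score:
                high_info.add(word)
    return high_info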
Cross-fold validation is a method for evaluating a classification algorithm. The typical way to do it is to use 10 folds, leaving one fold out for testing each time. This means the training corpus is first split into 10 parts (or folds); the classifier is then trained on nine of the folds and tested against the remaining fold. This is repeated nine more times, choosing a different fold to leave out for testing each time. By using a different set of training and testing examples each round, you can avoid any bias that might be present in a single training split. Here's how to do this with train_classifier.py:
$ python train_classifier.py movie_reviews --classifier sklearn.LogisticRegression --cross-fold 10
…
mean and variance across folds
------------------------------
accuracy mean: 0.870000
accuracy variance: 0.000365
neg precision mean: 0.866884
neg precision variance: 0.000795
pos precision mean: 0.873236
pos precision variance: 0.001157
neg recall mean: 0.875482
neg recall variance: 0.000706
pos recall mean: 0.864537
pos recall variance: 0.001091
neg f_measure mean: 0.870630
neg f_measure variance: 0.000290
pos f_measure mean: 0.868246
pos f_measure variance: 0.000610
Most of the output has been omitted for clarity. What really matters is the final evaluation, which is the mean and variance of the results across all folds.
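Done by hand in Python, the procedure looks roughly like this; a sketch assuming labeled_feats is a shuffled list of (featureset, label) pairs:

from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

def cross_validate(labeled_feats, folds=10):
    fold_size = len(labeled_feats) // folds
    scores = []
    for i in range(folds):
        # Hold out one fold for testing; train on the other nine.
        test_feats = labeled_feats[i * fold_size:(i + 1) * fold_size]
        train_feats = (labeled_feats[:i * fold_size] +
                       labeled_feats[(i + 1) * fold_size:])
        classifier = NaiveBayesClassifier.train(train_feats)
        scores.append(accuracy(classifier, test_feats))
    return sum(scores) / len(scores)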
Also included in NLTK-Trainer is a script called analyze_classifier_coverage.py. As the name implies, you can use it to see how a classifier categorizes a given corpus. It expects the name of a corpus and the path to a pickled classifier to run over that corpus. If the corpus is categorized, you can also use the --metrics argument to get the accuracy, precision, and recall. The script supports many of the same corpus-related arguments as train_classifier.py, and also has an optional --speed argument, so you can see how fast the classifier is. Here's an example of analyzing a pickled NaiveBayesClassifier class against the movie_reviews corpus:
$ python analyze_classifier_coverage.py movie_reviews --classifier classifiers/movie_reviews_NaiveBayes.pickle --metrics --speed
loading time: 0secs
accuracy: 0.967
neg precision: 1.000000
neg recall: 0.934000
neg f-measure: 0.965874
pos precision: 0.938086
pos recall: 1.000000
pos f-measure: 0.968054
neg 934
pos 1066
average time per classify: 3secs / 2000 feats = 1.905661 ms/feat
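A rough Python equivalent of the --metrics and --speed measurement times file-level classification over the whole corpus (the pickle path and bag of words features are as in the earlier examples):

import time

import nltk.data
from nltk.corpus import movie_reviews

classifier = nltk.data.load('classifiers/movie_reviews_NaiveBayes.pickle')
start = time.time()
# Classify every file in the corpus using bag of words features.
labels = [classifier.classify(dict((w, True) for w in movie_reviews.words(fileid)))
          for fileid in movie_reviews.fileids()]
elapsed = time.time() - start
print('average time per classify: %f ms' % (1000 * elapsed / len(labels)))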
NLTK-Trainer was introduced at the end of Chapter 4, Part-of-speech Tagging, in the Training a tagger with NLTK-Trainer recipe, and again at the end of Chapter 5, Extracting Chunks, in the Training a chunker with NLTK-Trainer recipe. All the previous recipes in this chapter explain various aspects of how the train_classifier.py script works.