Now that we can extract features from text, we can train a classifier. The easiest classifier to get started with is the NaiveBayesClassifier class. It uses Bayes' theorem to predict the probability that a given feature set belongs to a particular label. The formula is:

P(label | features) = P(label) * P(features | label) / P(features)
The following list describes the various parameters from the previous formula:

P(label): This is the prior probability of the label occurring, which is the likelihood that a random feature set will have the label. It is based on the number of training instances with the label compared to the total number of training instances. For example, if 60 out of 100 training instances have the label, the prior probability of the label is 60%.

P(features | label): This is the likelihood of a given feature set occurring with that label. It is based on which features have occurred with each label in the training data.

P(features): This is the prior probability of a given feature set occurring. This is the likelihood of a random feature set being the same as the given feature set, and is based on the observed feature sets in the training data. For example, if the given feature set occurs twice in 100 training instances, the prior probability is 2%.

P(label | features): This tells us the probability that the given features should have that label. If this value is high, then we can be reasonably confident that the label is correct for the given features.

We are going to be using the movie_reviews corpus for our initial classification examples. This corpus contains two categories of text: pos and neg. These categories are exclusive, which makes a classifier trained on them a binary classifier. Binary classifiers have only two classification labels, and will always choose one or the other.

Each file in the movie_reviews corpus is composed of either positive or negative movie reviews. We will be using each file as a single instance for both training and testing the classifier. Because of the nature of the text and its categories, the classification we will be doing is a form of sentiment analysis. If the classifier returns pos, then the text expresses a positive sentiment, whereas if we get neg, then the text expresses a negative sentiment.
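Before moving on to the corpus, the Bayes formula shown earlier can be sanity checked with a small worked example. All of the counts here are made up for illustration; they do not come from the movie_reviews corpus:

```python
# Hypothetical counts: 60 of 100 training instances have the label,
# the feature set occurs in 30 of those 60 labeled instances,
# and the feature set occurs 40 times across all 100 instances.
p_label = 60 / 100             # P(label) = 0.6
p_feats_given_label = 30 / 60  # P(features | label) = 0.5
p_feats = 40 / 100             # P(features) = 0.4

# Bayes' theorem: P(label | features)
p_label_given_feats = p_label * p_feats_given_label / p_feats
print(p_label_given_feats)  # 0.6 * 0.5 / 0.4 = 0.75
```

So under these made-up counts, seeing the feature set raises the probability of the label from the 60% prior to a 75% posterior.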
For training, we first need to create a list of labeled feature sets. This list should be of the form [(featureset, label)], where featureset is a dict and label is the known class label for the featureset. The label_feats_from_corpus() function in featx.py takes a corpus, such as movie_reviews, and a feature_detector function, which defaults to bag_of_words. It then constructs and returns a mapping of the form {label: [featureset]}. We can use this mapping to create a list of labeled training instances and testing instances. The reason to do it this way is to get a fair sample from each label, because parts of the corpus may be (unintentionally) biased towards one label or the other. Getting a fair sample should eliminate this possible bias:
import collections

def label_feats_from_corpus(corp, feature_detector=bag_of_words):
    label_feats = collections.defaultdict(list)
    for label in corp.categories():
        for fileid in corp.fileids(categories=[label]):
            feats = feature_detector(corp.words(fileids=[fileid]))
            label_feats[label].append(feats)
    return label_feats
Once we have a mapping of labels to feature sets, we want to construct a list of labeled training instances and testing instances. The split_label_feats() function in featx.py takes a mapping returned from label_feats_from_corpus() and splits each list of feature sets into labeled training and testing instances:
def split_label_feats(lfeats, split=0.75):
    train_feats = []
    test_feats = []
    for label, feats in lfeats.items():
        cutoff = int(len(feats) * split)
        train_feats.extend([(feat, label) for feat in feats[:cutoff]])
        test_feats.extend([(feat, label) for feat in feats[cutoff:]])
    return train_feats, test_feats
Using these functions with the movie_reviews corpus gives us the lists of labeled feature sets we need to train and test a classifier:
>>> from nltk.corpus import movie_reviews
>>> from featx import label_feats_from_corpus, split_label_feats
>>> movie_reviews.categories()
['neg', 'pos']
>>> lfeats = label_feats_from_corpus(movie_reviews)
>>> lfeats.keys()
dict_keys(['neg', 'pos'])
>>> train_feats, test_feats = split_label_feats(lfeats, split=0.75)
>>> len(train_feats)
1500
>>> len(test_feats)
500
So there are 1000 pos files and 1000 neg files, and we end up with 1500 labeled training instances and 500 labeled testing instances, each composed of equal parts pos and neg. If we were using a different dataset, where the classes were not balanced, our training and testing data would have the same imbalance.
Now we can train a NaiveBayesClassifier class using its train() class method:
>>> from nltk.classify import NaiveBayesClassifier
>>> nb_classifier = NaiveBayesClassifier.train(train_feats)
>>> nb_classifier.labels()
['neg', 'pos']
Let's test the classifier on a couple of made-up reviews. The classify() method takes a single argument, which should be a feature set. We can use the same bag_of_words() feature detector on a list of words to get our feature set:
>>> from featx import bag_of_words
>>> negfeat = bag_of_words(['the', 'plot', 'was', 'ludicrous'])
>>> nb_classifier.classify(negfeat)
'neg'
>>> posfeat = bag_of_words(['kate', 'winslet', 'is', 'accessible'])
>>> nb_classifier.classify(posfeat)
'pos'
The label_feats_from_corpus() function assumes that the corpus is categorized, and that a single file represents a single instance for feature extraction. It iterates over each category label, and extracts features from each file in that category using the feature_detector() function, which defaults to bag_of_words(). It returns a dict whose keys are the category labels, and whose values are lists of instances for that category.
If we had label_feats_from_corpus() return a list of labeled feature sets instead of a dict, it would be much harder to get balanced training data. The list would be ordered by label, and if you took a slice of it, you would almost certainly be getting far more of one label than another. By returning a dict, you can take slices from the feature sets of each label, in the same proportion that exists in the data.
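This is easy to see with a tiny pure-Python illustration, using empty dicts as stand-in feature sets: slicing a label-ordered list over-samples one label, while slicing per label (as split_label_feats() does) preserves the balance.

```python
# Toy data: 4 'neg' instances followed by 4 'pos' instances,
# as a label-ordered list of (featureset, label) pairs.
ordered = [({}, 'neg')] * 4 + [({}, 'pos')] * 4

# Taking the first 75% of the ordered list over-samples 'neg':
head = ordered[:6]
print([label for _, label in head])
# ['neg', 'neg', 'neg', 'neg', 'pos', 'pos']

# Slicing per label keeps the 50/50 balance:
by_label = {'neg': [{}] * 4, 'pos': [{}] * 4}
train = []
for label, feats in by_label.items():
    cutoff = int(len(feats) * 0.75)
    train.extend((feat, label) for feat in feats[:cutoff])
print(sorted(label for _, label in train))
# ['neg', 'neg', 'neg', 'pos', 'pos', 'pos']
```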
Now we need to split the labeled feature sets into training and testing instances using split_label_feats(). This function allows us to take a fair sample of labeled feature sets from each label, using the split keyword argument to determine the size of the sample. The split argument defaults to 0.75, which means the first 75% of the labeled feature sets for each label will be used for training, and the remaining 25% will be used for testing.
Once we have our training and testing feats split up, we train a classifier using the NaiveBayesClassifier.train() class method. This method builds two probability distributions for calculating prior probabilities, which are passed into the NaiveBayesClassifier constructor. The label_probdist argument contains the prior probability for each label, or P(label). The feature_probdist argument contains P(feature name = feature value | label). In our case, it will store P(word=True | label). Both are calculated based on the frequency of occurrence of each label, and of each feature name and value, in the training data.
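The prior P(label) part of this is just label counting. As a rough sketch of the idea in plain Python (using a toy list in the same [(featureset, label)] form as train_feats, not NLTK's actual implementation):

```python
from collections import Counter

# Toy training data in the same [(featureset, label)] form as train_feats.
toy_train = [({'good': True}, 'pos'), ({'bad': True}, 'neg'),
             ({'great': True}, 'pos'), ({'fine': True}, 'pos')]

# Count how often each label occurs, then normalize to get P(label).
label_counts = Counter(label for _, label in toy_train)
total = sum(label_counts.values())
priors = {label: count / total for label, count in label_counts.items()}
print(priors)  # {'pos': 0.75, 'neg': 0.25}
```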
The NaiveBayesClassifier class inherits from ClassifierI, which requires subclasses to provide a labels() method, and at least one of the classify() or prob_classify() methods. The following diagram shows other methods, which will be covered shortly:
We can test the accuracy of the classifier using nltk.classify.util.accuracy() and the test_feats variable created previously:
>>> from nltk.classify.util import accuracy
>>> accuracy(nb_classifier, test_feats)
0.728
This tells us that the classifier correctly guessed the label of nearly 73% of the test feature sets.
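Accuracy is simply the fraction of test instances whose predicted label matches the known label. A minimal pure-Python equivalent, with a hypothetical stub classifier standing in for nb_classifier, might look like this:

```python
def simple_accuracy(classify, labeled_feats):
    # Fraction of (featureset, label) pairs the classifier gets right.
    correct = sum(1 for feats, label in labeled_feats
                  if classify(feats) == label)
    return correct / len(labeled_feats)

# Stub classifier: predicts 'pos' whenever the word 'good' is present.
stub = lambda feats: 'pos' if feats.get('good') else 'neg'
tests = [({'good': True}, 'pos'), ({'bad': True}, 'neg'),
         ({'good': True}, 'neg'), ({'dull': True}, 'neg')]
print(simple_accuracy(stub, tests))  # 3 of 4 correct: 0.75
```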
While the classify() method returns only a single label, you can use the prob_classify() method to get the classification probability of each label. This can be useful if you want to use probability thresholds for classification:
>>> probs = nb_classifier.prob_classify(test_feats[0][0])
>>> probs.samples()
dict_keys(['neg', 'pos'])
>>> probs.max()
'pos'
>>> probs.prob('pos')
0.9999999646430913
>>> probs.prob('neg')
3.535688969240647e-08
In this case, the classifier says that the first test instance is nearly 100% likely to be pos. Other instances may have more mixed probabilities. For example, if the classifier says an instance is 60% pos and 40% neg, that means the classifier is 60% sure the instance is pos, but there is a 40% chance that it is neg. It can be useful to know this for situations where you only want to use strongly classified instances, with a threshold of 80% or greater.
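Such a thresholded decision could be sketched like this. The helper name and the 80% cutoff are illustrative, not part of NLTK; the input is a plain dict mapping labels to probabilities, like the values you would read off a prob_classify() result:

```python
def classify_if_confident(prob_dist, threshold=0.8):
    # prob_dist maps labels to probabilities. Return the best label
    # only when it clears the threshold; otherwise return None so the
    # caller can skip weakly classified instances.
    best = max(prob_dist, key=prob_dist.get)
    return best if prob_dist[best] >= threshold else None

print(classify_if_confident({'pos': 0.95, 'neg': 0.05}))  # 'pos'
print(classify_if_confident({'pos': 0.6, 'neg': 0.4}))    # None
```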
The NaiveBayesClassifier class has two methods that are quite useful for learning about your data. Both methods take a keyword argument n to control how many results to show. The most_informative_features() method returns a list of the form [(feature name, feature value)] ordered from most informative to least informative. In our case, the feature value will always be True:
>>> nb_classifier.most_informative_features(n=5)
[('magnificent', True), ('outstanding', True), ('insulting', True), ('vulnerable', True), ('ludicrous', True)]
The show_most_informative_features() method will print out the results from most_informative_features() and will also include the probability of a feature pair belonging to each label:
>>> nb_classifier.show_most_informative_features(n=5)
Most Informative Features
    magnificent = True    pos : neg = 15.0 : 1.0
    outstanding = True    pos : neg = 13.6 : 1.0
      insulting = True    neg : pos = 13.0 : 1.0
     vulnerable = True    pos : neg = 12.3 : 1.0
      ludicrous = True    neg : pos = 11.8 : 1.0
The informativeness, or information gain, of each feature pair is based on the prior probability of the feature pair occurring for each label. More informative features are those that occur primarily in one label and not in the other. The less informative features are those that occur frequently with both labels. Another way to state this is that the entropy of the classifier decreases more when using a more informative feature. See https://en.wikipedia.org/wiki/Information_gain_in_decision_trees for more on information gain and entropy (while it specifically mentions decision trees, the same concepts are applicable to all classifiers).
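Entropy itself is straightforward to compute. Here is a minimal sketch of the Shannon entropy of a label distribution, which shows why skewed distributions (the ones produced by informative features) have lower entropy:

```python
import math

def entropy(probs):
    # Shannon entropy, in bits, of a probability distribution.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A 50/50 label split is maximally uncertain (1 bit); a 90/10 split
# is much less so. Features that push the label distribution toward
# one side like this reduce entropy more, so they are more informative.
print(entropy([0.5, 0.5]))            # 1.0
print(round(entropy([0.9, 0.1]), 3))  # 0.469
```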
During training, the NaiveBayesClassifier class constructs probability distributions for each feature using an estimator parameter, which defaults to nltk.probability.ELEProbDist. The estimator is used to calculate the probability of a label given a specific feature. In ELEProbDist, ELE stands for Expected Likelihood Estimate, and the formula for calculating the label probabilities for a given feature is (c + 0.5) / (N + B/2). Here, c is the count of times a single feature occurs, N is the total number of feature outcomes observed, and B is the number of bins, or unique features, in the feature set. In cases where the feature values are all True, N == B. In other cases, where the number of times a feature occurs is recorded, N >= B.
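The ELE formula can be written out directly; the counts below are made up for illustration:

```python
def ele_probability(c, N, B):
    # Expected Likelihood Estimate: (c + 0.5) / (N + B / 2).
    # c: count of the outcome, N: total outcomes observed,
    # B: number of bins (unique outcomes).
    return (c + 0.5) / (N + B / 2)

# With 3 occurrences out of 10 observations over 4 bins:
print(ele_probability(3, 10, 4))  # 3.5 / 12.0
# The 0.5 added to every count means unseen outcomes
# still get a small nonzero probability:
print(ele_probability(0, 10, 4))  # 0.5 / 12.0
```

This built-in smoothing is why a Naive Bayes classifier does not assign zero probability to a label just because one feature never occurred with it in training.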
You can use any estimator you want, and there are quite a few to choose from. The only constraints are that it must inherit from nltk.probability.ProbDistI and its constructor must take a bins keyword argument. Here's an example using the LaplaceProbDist class, which uses the formula (c + 1) / (N + B):
>>> from nltk.probability import LaplaceProbDist
>>> nb_classifier = NaiveBayesClassifier.train(train_feats, estimator=LaplaceProbDist)
>>> accuracy(nb_classifier, test_feats)
0.716
As you can see, accuracy is slightly lower, so choose your estimator carefully.
You cannot use nltk.probability.MLEProbDist as the estimator, nor any ProbDistI subclass that does not take the bins keyword argument. Training will fail with TypeError: __init__() got an unexpected keyword argument 'bins'.
You don't have to use the train() class method to construct a NaiveBayesClassifier. You can instead create the label_probdist and feature_probdist variables manually. The label_probdist variable should be an instance of ProbDistI, and should contain the prior probabilities for each label. The feature_probdist variable should be a dict whose keys are tuples of the form (label, feature name) and whose values are instances of ProbDistI that give the probabilities for each feature value. In our case, each ProbDistI should have only one value, True=1. Here's a very simple example using a manually constructed DictionaryProbDist class:
>>> from nltk.probability import DictionaryProbDist
>>> label_probdist = DictionaryProbDist({'pos': 0.5, 'neg': 0.5})
>>> true_probdist = DictionaryProbDist({True: 1})
>>> feature_probdist = {('pos', 'yes'): true_probdist, ('neg', 'no'): true_probdist}
>>> classifier = NaiveBayesClassifier(label_probdist, feature_probdist)
>>> classifier.classify({'yes': True})
'pos'
>>> classifier.classify({'no': True})
'neg'
In the next recipes, we will train two more classifiers, DecisionTreeClassifier and MaxentClassifier. In the Measuring precision and recall of a classifier recipe in this chapter, we will use precision and recall instead of accuracy to evaluate the classifiers. And then in the Calculating high information words recipe, we will see how using only the most informative features can improve classifier performance.
The movie_reviews corpus is an instance of CategorizedPlaintextCorpusReader, which is covered in the Creating a categorized text corpus recipe in Chapter 3, Creating Custom Corpora.