Training a Naive Bayes classifier

Now that we can extract features from text, we can train a classifier. The easiest classifier to get started with is the NaiveBayesClassifier class. It uses Bayes' theorem to estimate the probability that a given feature set belongs to a particular label. The formula is:

P(label | features) = P(label) * P(features | label) / P(features)

The following list describes the various parameters from the previous formula:

  • P(label): This is the prior probability of the label occurring, which is the likelihood that a random feature set will have the label. This is based on the number of training instances with the label compared to the total number of training instances. For example, if 60/100 training instances have the label, the prior probability of the label is 60%.
  • P(features | label): This is the likelihood of a given feature set occurring in an instance that has that label. This is based on which features have occurred with each label in the training data.
  • P(features): This is the probability of a given feature set occurring, regardless of label. This is the likelihood of a random feature set being the same as the given feature set, and is based on the observed feature sets in the training data. For example, if the given feature set occurs twice in 100 training instances, the probability is 2%.
  • P(label | features): This tells us the probability that the given features should have that label. If this value is high, then we can be reasonably confident that the label is correct for the given features.
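To make the formula concrete, here is a minimal worked sketch. Every count in it is a made-up assumption for illustration, not a value taken from the movie_reviews corpus. It simply plugs the three quantities described above into the formula:

```python
# Hypothetical counts, chosen only to illustrate the arithmetic.
total_instances = 100        # all training instances
pos_instances = 60           # instances labeled 'pos'
featureset_instances = 4     # instances with the feature set of interest
featureset_and_pos = 3       # of those, how many are labeled 'pos'

p_label = pos_instances / total_instances                     # P(label)
p_features_given_label = featureset_and_pos / pos_instances   # P(features | label)
p_features = featureset_instances / total_instances           # P(features)

# Bayes' theorem: P(label | features)
p_label_given_features = p_label * p_features_given_label / p_features
print(round(p_label_given_features, 3))
```

With these assumed counts, three of the four instances containing the feature set are pos, so the result matches the intuitive answer of 75%.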

Getting ready

We are going to be using the movie_reviews corpus for our initial classification examples. This corpus contains two categories of text: pos and neg. These categories are exclusive, which makes a classifier trained on them a binary classifier. Binary classifiers have only two classification labels, and will always choose one or the other.

Each file in the movie_reviews corpus is composed of either positive or negative movie reviews. We will be using each file as a single instance for both training and testing the classifier. Because of the nature of the text and its categories, the classification we will be doing is a form of sentiment analysis. If the classifier returns pos, then the text expresses a positive sentiment, whereas if we get neg, then the text expresses a negative sentiment.

How to do it...

For training, we first need to create a list of labeled feature sets. This list should be of the form [(featureset, label)], where featureset is a dict and label is the known class label for that featureset. The label_feats_from_corpus() function in the featx module takes a corpus, such as movie_reviews, and a feature_detector function, which defaults to bag_of_words. It then constructs and returns a mapping of the form {label: [featureset]}. We can use this mapping to create a list of labeled training instances and testing instances. The reason to do it this way is to get a fair sample from each label. This matters because parts of the corpus may be (unintentionally) biased towards one label or the other, and sampling from each label separately avoids carrying that bias into the training and testing data:

import collections

def label_feats_from_corpus(corp, feature_detector=bag_of_words):
  # Map each category label to the list of feature sets from its files
  label_feats = collections.defaultdict(list)
  for label in corp.categories():
    for fileid in corp.fileids(categories=[label]):
      # One file is one instance
      feats = feature_detector(corp.words(fileids=[fileid]))
      label_feats[label].append(feats)
  return label_feats

Once we have a mapping of labels to feature sets, we want to construct a list of labeled training instances and testing instances. The split_label_feats() function in the featx module takes a mapping returned from label_feats_from_corpus() and splits each list of feature sets into labeled training and testing instances:

def split_label_feats(lfeats, split=0.75):
  train_feats = []
  test_feats = []
  for label, feats in lfeats.items():
    cutoff = int(len(feats) * split)
    train_feats.extend([(feat, label) for feat in feats[:cutoff]])
    test_feats.extend([(feat, label) for feat in feats[cutoff:]])
  return train_feats, test_feats

Using these functions with the movie_reviews corpus gives us the lists of labeled feature sets we need to train and test a classifier:

>>> from nltk.corpus import movie_reviews
>>> from featx import label_feats_from_corpus, split_label_feats
>>> movie_reviews.categories()
['neg', 'pos']
>>> lfeats = label_feats_from_corpus(movie_reviews)
>>> lfeats.keys()
dict_keys(['neg', 'pos'])
>>> train_feats, test_feats = split_label_feats(lfeats, split=0.75)
>>> len(train_feats)
1500
>>> len(test_feats)
500

So there are 1000 pos files, 1000 neg files, and we end up with 1500 labeled training instances and 500 labeled testing instances, each composed of equal parts of pos and neg. If we were using a different dataset, where the classes were not balanced, our training and testing data would have the same imbalance.
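The point about imbalance can be checked on a small scale. The following sketch repeats the split_label_feats() logic shown earlier and applies it to a hypothetical toy mapping with twice as many pos feature sets as neg ones. The toy feature sets and the count_labels() helper are assumptions made up for this illustration:

```python
from collections import Counter

def split_label_feats(lfeats, split=0.75):
    # Same logic as the split_label_feats() function shown earlier.
    train_feats, test_feats = [], []
    for label, feats in lfeats.items():
        cutoff = int(len(feats) * split)
        train_feats.extend((feat, label) for feat in feats[:cutoff])
        test_feats.extend((feat, label) for feat in feats[cutoff:])
    return train_feats, test_feats

# Hypothetical imbalanced data: 8 'pos' feature sets and 4 'neg' feature sets.
toy_lfeats = {
    'pos': [{'word%d' % i: True} for i in range(8)],
    'neg': [{'word%d' % i: True} for i in range(4)],
}

def count_labels(labeled_feats):
    # Hypothetical helper: count how many instances carry each label.
    return Counter(label for _, label in labeled_feats)

toy_train, toy_test = split_label_feats(toy_lfeats, split=0.75)
print(count_labels(toy_train))  # 6 pos and 3 neg
print(count_labels(toy_test))   # 2 pos and 1 neg
```

Both halves keep the original 2:1 ratio, which is what a per-label split is meant to guarantee.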

Now we can train a NaiveBayesClassifier class using its train() class method:

>>> from nltk.classify import NaiveBayesClassifier
>>> nb_classifier = NaiveBayesClassifier.train(train_feats)
>>> nb_classifier.labels()
['neg', 'pos']

Let's test the classifier on a couple of made-up reviews. The classify() method takes a single argument, which should be a feature set. We can use the same bag_of_words() feature detector on a list of words to get our feature set:

>>> from featx import bag_of_words
>>> negfeat = bag_of_words(['the', 'plot', 'was', 'ludicrous'])
>>> nb_classifier.classify(negfeat)
>>> posfeat = bag_of_words(['kate', 'winslet', 'is', 'accessible'])
>>> nb_classifier.classify(posfeat)

How it works...

The label_feats_from_corpus() function assumes that the corpus is categorized, and that a single file represents a single instance for feature extraction. It iterates over each category label, and extracts features from each file in that category using the feature_detector() function, which defaults to bag_of_words(). It returns a dict whose keys are the category labels, and the values are lists of instances for that category.

If we had label_feats_from_corpus() return a list of labeled feature sets instead of a dict, it would be much harder to get balanced training data. The list would be ordered by label, and if you took a slice of it, you would almost certainly be getting far more of one label than another. By returning a dict, you can take slices from the feature sets of each label, in the same proportion that exists in the data.
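To see why a flat, label-ordered list is awkward to split, consider this small sketch. The list contents are hypothetical and only stand in for what a list-returning version of the function might produce:

```python
# Hypothetical flat list, ordered by label: 4 'neg' instances, then 4 'pos'.
flat_feats = (
    [({'w%d' % i: True}, 'neg') for i in range(4)] +
    [({'w%d' % i: True}, 'pos') for i in range(4)]
)

# A naive 75/25 slice over the whole list, ignoring labels.
cutoff = int(len(flat_feats) * 0.75)
naive_train = flat_feats[:cutoff]
naive_test = flat_feats[cutoff:]

train_labels = [label for _, label in naive_train]
test_labels = [label for _, label in naive_test]
print(train_labels)  # all 4 'neg' but only 2 'pos'
print(test_labels)   # 'pos' only
```

The training slice over-represents neg and the testing slice contains no neg instances at all, which is exactly the skew the dict-based approach avoids.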

Now we need to split the labeled feature sets into training and testing instances using split_label_feats(). This function allows us to take a fair sample of labeled feature sets from each label, using the split keyword argument to determine the size of the sample. The split argument defaults to 0.75, which means the first 75% of the labeled feature sets for each label will be used for training, and the remaining 25% will be used for testing.

Once we have split the feature sets into training and testing instances, we train a classifier using the NaiveBayesClassifier.train() method. This class method builds two probability distributions from the training data and passes them into the NaiveBayesClassifier constructor. The label_probdist argument contains the prior probability for each label, or P(label). The feature_probdist argument contains P(feature name = feature value | label). In our case, it will store P(word=True | label). Both are calculated based on the frequency of occurrence of each label and each feature name and value in the training data.
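The frequency counting behind those two distributions can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the toy training data is made up, and the sketch ignores smoothing and any special handling of features that are absent from an instance:

```python
from collections import Counter, defaultdict

# Hypothetical labeled feature sets in the [(featureset, label)] form.
toy_train_feats = [
    ({'great': True, 'plot': True}, 'pos'),
    ({'great': True}, 'pos'),
    ({'dull': True, 'plot': True}, 'neg'),
]

label_counts = Counter()               # how often each label occurs
feature_counts = defaultdict(Counter)  # (label, feature name) -> value counts

for featureset, label in toy_train_feats:
    label_counts[label] += 1
    for fname, fval in featureset.items():
        feature_counts[(label, fname)][fval] += 1

# Unsmoothed relative frequencies as a stand-in for P(label).
total = sum(label_counts.values())
p_label = {label: count / total for label, count in label_counts.items()}

print(label_counts['pos'], label_counts['neg'])
print(feature_counts[('pos', 'great')][True])
```

The (label, feature name) keys mirror the shape of the feature distribution described above; an estimator would then turn each set of counts into probabilities.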

The NaiveBayesClassifier class inherits from ClassifierI, which requires subclasses to provide a labels() method, and at least one of the classify() or prob_classify() methods. The following diagram shows other methods, which will be covered shortly:

There's more...

We can test the accuracy of the classifier using nltk.classify.util.accuracy() and the test_feats variable created previously:

>>> from nltk.classify.util import accuracy
>>> accuracy(nb_classifier, test_feats)

This tells us that the classifier correctly guessed the label of nearly 73% of the test feature sets.


The code in this chapter is run with the PYTHONHASHSEED=0 environment variable so that accuracy calculations are consistent. If you run the code with a different value for PYTHONHASHSEED, or without setting this environment variable, your accuracy values may differ.

Classification probability

While the classify() method returns only a single label, you can use the prob_classify() method to get the classification probability of each label. This can be useful if you want to use probability thresholds for classification:

>>> probs = nb_classifier.prob_classify(test_feats[0][0])
>>> probs.samples()
dict_keys(['neg', 'pos'])
>>> probs.max()
>>> probs.prob('pos')
>>> probs.prob('neg')

In this case, the classifier says that the first test instance is nearly 100% likely to be pos. Other instances may have more mixed probabilities. For example, if the classifier says an instance is 60% pos and 40% neg, that means the classifier is 60% sure the instance is pos, but there is a 40% chance that it is neg. It can be useful to know this for situations where you only want to use strongly classified instances, with a threshold of 80% or greater.
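A threshold rule like this could be sketched as follows. The classify_with_threshold() helper is hypothetical (it is not part of NLTK) and works on a plain dict of label probabilities so that it stays self-contained:

```python
def classify_with_threshold(label_probs, threshold=0.8):
    """Return the most probable label, or None if it is below the threshold.

    label_probs is a dict mapping each label to its probability.
    This is a hypothetical helper, not an NLTK function.
    """
    best_label = max(label_probs, key=label_probs.get)
    if label_probs[best_label] >= threshold:
        return best_label
    return None

# Made-up probabilities for two instances.
strong = classify_with_threshold({'pos': 0.95, 'neg': 0.05})
weak = classify_with_threshold({'pos': 0.6, 'neg': 0.4})
print(strong)  # 'pos'
print(weak)    # None, because 0.6 is below the 0.8 threshold
```

With a real classifier, the dict could be built from the prob_classify() result using the samples() and prob() methods shown above, for example {label: probs.prob(label) for label in probs.samples()}.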

Most informative features

The NaiveBayesClassifier class has two methods that are quite useful for learning about your data. Both methods take a keyword argument n to control how many results to show. The most_informative_features() method returns a list of the form [(feature name, feature value)] ordered by most informative to least informative. In our case, the feature value will always be True:

>>> nb_classifier.most_informative_features(n=5)
[('magnificent', True), ('outstanding', True), ('insulting', True), ('vulnerable', True), ('ludicrous', True)]

The show_most_informative_features() method will print out the results from most_informative_features() and will also show, for each feature pair, how much more likely it is to occur with one label than with the other:

>>> nb_classifier.show_most_informative_features(n=5)
Most Informative Features

    magnificent = True    pos : neg = 15.0 : 1.0
    outstanding = True    pos : neg = 13.6 : 1.0
    insulting = True      neg : pos = 13.0 : 1.0
    vulnerable = True     pos : neg = 12.3 : 1.0
    ludicrous = True      neg : pos = 11.8 : 1.0

The informativeness, or information gain, of each feature pair is based on the probability of the feature pair occurring with each label. More informative features are those that occur primarily with one label and not with the other. The less informative features are those that occur frequently with both labels. Another way to state this is that the entropy of the classifier decreases more when using a more informative feature. See for more on information gain and entropy (while it specifically mentions decision trees, the same concepts are applicable to all classifiers).
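The ratios in the output above can be read as comparisons of how often a feature occurs with each label. The sketch below shows that kind of comparison using made-up document counts. The likelihood_ratio() helper is hypothetical, and the counts are assumptions rather than values from the corpus:

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of the larger per-label feature frequency to the smaller one.

    Hypothetical helper. Assumes both counts are non-zero; a smoothed
    estimate would avoid division by zero for unseen features.
    """
    p_a = count_a / total_a
    p_b = count_b / total_b
    return max(p_a, p_b) / min(p_a, p_b)

# Assumed counts: a word seen in 30 of 750 'pos' files but 2 of 750 'neg' files.
skewed = likelihood_ratio(30, 750, 2, 750)
# Assumed counts: a word seen about equally often with both labels.
even = likelihood_ratio(400, 750, 380, 750)

print(round(skewed, 1))  # 15.0, strongly tied to one label
print(round(even, 2))    # 1.05, close to uninformative
```

A ratio far above 1 marks a feature that strongly favors one label, while a ratio near 1 marks a feature that tells the classifier very little.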

Training estimator

During training, the NaiveBayesClassifier class constructs probability distributions for each feature using an estimator parameter, which defaults to nltk.probability.ELEProbDist. The estimator is used to calculate the probability of a feature value given a label. In ELEProbDist, ELE stands for Expected Likelihood Estimate, and the formula for calculating these probabilities is (c+0.5)/(N+B/2). Here, c is the count of times a single feature occurs, N is the total number of feature outcomes observed, and B is the number of bins or unique features in the feature set. In cases where the feature values are all True, N == B. In other cases, where the number of times a feature occurs is recorded, then N >= B.

You can use any estimator you want, and there are quite a few to choose from. The only constraints are that it must inherit from nltk.probability.ProbDistI and its constructor must take a bins keyword argument. Here's an example using the LaplaceProbDist class, which uses the formula (c+1)/(N+B):

>>> from nltk.probability import LaplaceProbDist
>>> nb_classifier = NaiveBayesClassifier.train(train_feats, estimator=LaplaceProbDist)
>>> accuracy(nb_classifier, test_feats)

As you can see, accuracy is slightly lower, so choose your estimator parameter carefully.
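The two formulas differ only in how much is added to each count. The sketch below writes both as one hypothetical additive_estimate() function with an adjustable gamma value, using made-up counts to show the effect:

```python
def additive_estimate(c, n, bins, gamma):
    """Smoothed probability (c + gamma) / (n + bins * gamma).

    Hypothetical helper: gamma=0.5 gives the ELE formula (c+0.5)/(N+B/2),
    and gamma=1 gives the Laplace formula (c+1)/(N+B).
    """
    return (c + gamma) / (n + bins * gamma)

# Assumed counts: the outcome was seen 3 times out of 10, with 2 bins.
ele = additive_estimate(3, 10, 2, gamma=0.5)      # 3.5 / 11
laplace = additive_estimate(3, 10, 2, gamma=1.0)  # 4 / 12

# An outcome that was never seen still gets a small non-zero probability.
unseen_ele = additive_estimate(0, 10, 2, gamma=0.5)      # 0.5 / 11
unseen_laplace = additive_estimate(0, 10, 2, gamma=1.0)  # 1 / 12

print(round(ele, 4), round(laplace, 4))
print(round(unseen_ele, 4), round(unseen_laplace, 4))
```

The larger gamma shifts more probability toward unseen outcomes, which is one reason different estimators can produce different accuracy on the same data.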

You cannot use nltk.probability.MLEProbDist as the estimator, or any ProbDistI subclass that does not take the bins keyword argument. Training will fail with TypeError: __init__() got an unexpected keyword argument 'bins'.

Manual training

You don't have to use the train() class method to construct a NaiveBayesClassifier. You can instead create the label_probdist and feature_probdist variables manually. The label_probdist variable should be an instance of ProbDistI, and should contain the prior probabilities for each label. The feature_probdist variable should be a dict whose keys are tuples of the form (label, feature name) and whose values are instances of ProbDistI that have the probabilities for each feature value. In our case, each ProbDistI should have only one value, True=1. Here's a very simple example using a manually constructed DictionaryProbDist class:

>>> from nltk.probability import DictionaryProbDist
>>> label_probdist = DictionaryProbDist({'pos': 0.5, 'neg': 0.5})
>>> true_probdist = DictionaryProbDist({True: 1})
>>> feature_probdist = {('pos', 'yes'): true_probdist, ('neg', 'no'): true_probdist}
>>> classifier = NaiveBayesClassifier(label_probdist, feature_probdist)
>>> classifier.classify({'yes': True})
>>> classifier.classify({'no': True})

See also

In the next recipes, we will train two more classifiers, DecisionTreeClassifier and MaxentClassifier. In the Measuring precision and recall of a classifier recipe in this chapter, we will use precision and recall instead of accuracy to evaluate the classifiers. And then in the Calculating high information words recipe, we will see how using only the most informative features can improve classifier performance.

The movie_reviews corpus is an instance of CategorizedPlaintextCorpusReader, which is covered in the Creating a categorized text corpus recipe in Chapter 3, Creating Custom Corpora.

