A high information word is a word that is strongly biased towards a single classification label. These are the kinds of words we saw when we called the show_most_informative_features() method on both the NaiveBayesClassifier class and the MaxentClassifier class. Somewhat surprisingly, the top words differ between the two classifiers. This discrepancy arises because each classifier calculates the significance of a feature differently, and having these different methods is actually beneficial, as they can be combined to improve accuracy, as we will see in the next recipe, Combining classifiers with voting.

Low information words are words that are common to all labels. It may be counter-intuitive, but eliminating these words from the training data can actually improve accuracy, precision, and recall. This works because using only high information words reduces the noise and confusion in a classifier's internal model. If all the words/features are highly biased one way or the other, it's much easier for the classifier to make a correct guess.

How to do it...

First, we need to calculate the high information words in the movie_reviews corpus. We can do this using the high_information_words() function in the featx module:

import collections
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

def high_information_words(labelled_words, score_fn=BigramAssocMeasures.chi_sq, min_score=5):
  word_fd = FreqDist()
  label_word_fd = ConditionalFreqDist()

  # Count each word overall and within each label.
  for label, words in labelled_words:
    for word in words:
      word_fd[word] += 1
      label_word_fd[label][word] += 1

  n_xx = label_word_fd.N()  # total word occurrences across all labels
  high_info_words = set()

  for label in label_word_fd.conditions():
    n_xi = label_word_fd[label].N()  # total word occurrences under this label
    word_scores = collections.defaultdict(int)

    for word, n_ii in label_word_fd[label].items():
      n_ix = word_fd[word]  # occurrences of this word across all labels
      score = score_fn(n_ii, (n_ix, n_xi), n_xx)
      word_scores[word] = score

    # Keep the words that meet or exceed the threshold for this label.
    bestwords = [word for word, score in word_scores.items() if score >= min_score]
    high_info_words |= set(bestwords)

  return high_info_words

It takes one required argument, a list of 2-tuples of the form [(label, words)], where label is the classification label and words is a list of words that occur under that label. It returns a set of the high information words.

Once we have the high information words, we use the feature detector function bag_of_words_in_set(), also found in the featx module, which lets us filter out all low information words.

def bag_of_words_in_set(words, goodwords):
  return bag_of_words(set(words) & set(goodwords))
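To see the filtering in isolation, here is a minimal sketch on a toy word list. The bag_of_words() stand-in is an assumption about the helper from the earlier recipe (each word becomes a feature with the value True), and the review words and high_info set are invented for illustration.

```python
def bag_of_words(words):
    # Assumed stand-in for the bag_of_words() helper from the earlier recipe:
    # every word present becomes a feature with the value True.
    return {word: True for word in words}

def bag_of_words_in_set(words, goodwords):
    # Keep only the words that also appear in goodwords.
    return bag_of_words(set(words) & set(goodwords))

# Hypothetical review text and hypothetical high information words.
review = ['the', 'plot', 'was', 'outstanding', 'and', 'the', 'cast', 'was', 'fun']
high_info = {'outstanding', 'fun', 'boring'}

feats = bag_of_words_in_set(review, high_info)
print(sorted(feats))  # ['fun', 'outstanding']
```

Common words such as "the" and "was" never reach the classifier, and "boring" is absent because it does not occur in this review.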

With this new feature detector, we can call label_feats_from_corpus() and get new train_feats and test_feats feature sets using split_label_feats(). These two functions were covered in the Training a Naive Bayes classifier recipe earlier in this chapter.

>>> from nltk.corpus import movie_reviews
>>> from featx import high_information_words, bag_of_words_in_set
>>> labels = movie_reviews.categories()
>>> labeled_words = [(l, movie_reviews.words(categories=[l])) for l in labels]
>>> high_info_words = set(high_information_words(labeled_words))
>>> feat_det = lambda words: bag_of_words_in_set(words, high_info_words)
>>> lfeats = label_feats_from_corpus(movie_reviews, feature_detector=feat_det)
>>> train_feats, test_feats = split_label_feats(lfeats)

Now that we have new training and testing feature sets, let's train and evaluate a NaiveBayesClassifier class:

>>> nb_classifier = NaiveBayesClassifier.train(train_feats)
>>> accuracy(nb_classifier, test_feats)
>>> nb_precisions, nb_recalls = precision_recall(nb_classifier, test_feats)
>>> nb_precisions['pos']
>>> nb_precisions['neg']
>>> nb_recalls['pos']
>>> nb_recalls['neg']

While the neg precision and pos recall have both decreased somewhat, neg recall and pos precision have increased drastically. Accuracy is now a little higher than that of the MaxentClassifier class.

How it works...

The high_information_words() function starts by counting the frequency of every word, as well as the conditional frequency for each word within each label. This is why we need the words to be labeled, so we know how often each word occurs for each label.
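The counting step can be sketched without NLTK. In the following illustration, collections.Counter and a defaultdict of Counters stand in for FreqDist and ConditionalFreqDist; the variable names and the toy data are assumptions made for this example only.

```python
from collections import Counter, defaultdict

# Toy labelled data in the same [(label, words)] shape the recipe uses.
labelled_words = [
    ('pos', ['great', 'fun', 'the', 'great']),
    ('neg', ['boring', 'the', 'bad']),
]

word_counts = Counter()                    # plays the role of word_fd
label_word_counts = defaultdict(Counter)   # plays the role of label_word_fd

for label, words in labelled_words:
    for word in words:
        word_counts[word] += 1
        label_word_counts[label][word] += 1

print(word_counts['the'])                  # 2: once under each label
print(label_word_counts['pos']['great'])   # 2: both occurrences are under 'pos'
print(label_word_counts['neg']['great'])   # 0: never seen under 'neg'
```

A word like "the" is spread evenly across labels, while "great" is concentrated under one label, which is exactly the difference the scoring step measures.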

Once we have the FreqDist and ConditionalFreqDist variables, we can score each word on a per-label basis.

The default score_fn is nltk.metrics.BigramAssocMeasures.chi_sq(), which calculates the chi-square score for each word using the following parameters:

  • n_ii: This is the frequency of the word for the label
  • n_ix: This is the total frequency of the word across all labels
  • n_xi: This is the total frequency of all words that occurred for the label
  • n_xx: This is the total frequency for all words in all labels

The formula is n_xx * nltk.metrics.BigramAssocMeasures.phi_sq. The phi_sq() function is the squared Pearson correlation coefficient, which you can read more about at

The simplest way to think about these numbers is that the closer n_ii is to n_ix, the higher the score. Or, the more often a word occurs in a label, relative to its overall occurrence, the higher the score.
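To make the arithmetic concrete, the following is a dependency-free sketch of the chi-square score, written from the standard 2x2 contingency-table definition rather than copied from NLTK. The function takes the same arguments as score_fn; the counts passed to it are invented for illustration.

```python
def chi_sq(n_ii, marginals, n_xx):
    n_ix, n_xi = marginals
    # Fill in the rest of the 2x2 contingency table from the marginals.
    n_io = n_ix - n_ii                  # this word, under other labels
    n_oi = n_xi - n_ii                  # other words, under this label
    n_oo = n_xx - n_ii - n_io - n_oi    # other words, under other labels
    numerator = (n_ii * n_oo - n_io * n_oi) ** 2
    denominator = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
    phi_sq = numerator / denominator
    return n_xx * phi_sq

# A word seen 100 times in a corpus of 2000 words, half of them under this label.
biased = chi_sq(90, (100, 1000), 2000)    # 90 of its 100 occurrences are here
neutral = chi_sq(50, (100, 1000), 2000)   # evenly split between the labels
print(round(biased, 2), neutral)          # 67.37 0.0
```

With the default min_score of 5, the first word would be kept and the second discarded. Note that the score is symmetric: chi_sq(10, (100, 1000), 2000) gives the same value as the biased case, because a word that is unusually rare under one label is unusually common under the other.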

Once we have the scores for each word in each label, we can filter out all words whose score is below the min_score threshold. We keep the words that meet or exceed the threshold and return all high scoring words in each label.


It is recommended to experiment with different values of min_score to see what happens. In some cases, fewer words may improve the metrics even more, while in other cases more words are better.
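One way to get a feel for the threshold is to see how many words survive at each value. The sketch below uses a hypothetical table of scores (the words and numbers are invented) and a hypothetical helper, words_above(), that applies the same score >= min_score test as the recipe.

```python
# Hypothetical per-word scores, invented purely for illustration.
word_scores = {
    'outstanding': 41.2,
    'wasted': 28.7,
    'plot': 6.1,
    'film': 1.3,
    'the': 0.2,
}

def words_above(scores, min_score):
    # Same test the recipe applies: keep words that meet or exceed the threshold.
    return {word for word, score in scores.items() if score >= min_score}

for threshold in (1, 5, 25):
    kept = words_above(word_scores, threshold)
    print(threshold, len(kept), sorted(kept))
```

In a real experiment, each candidate threshold would be followed by retraining and re-evaluating the classifier, since the number of surviving words says nothing by itself about accuracy.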

There's more...

There are a number of other scoring functions available in the BigramAssocMeasures class, such as phi_sq() for phi-square, pmi() for pointwise mutual information, and jaccard() for using the Jaccard index. They all take the same arguments, and so can be used interchangeably with chi_sq(). These functions are all documented in with links to the source code of the formulas.
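As an illustration of why these functions are interchangeable, here is a sketch of two of them written from their standard definitions, not copied from NLTK. Both accept the same (n_ii, (n_ix, n_xi), n_xx) arguments that high_information_words() passes to score_fn; the counts used are invented.

```python
import math

def jaccard(n_ii, marginals, n_xx):
    n_ix, n_xi = marginals
    n_io = n_ix - n_ii   # this word, under other labels
    n_oi = n_xi - n_ii   # other words, under this label
    # Overlap between "is this word" and "is under this label".
    return n_ii / (n_ii + n_io + n_oi)

def pmi(n_ii, marginals, n_xx):
    n_ix, n_xi = marginals
    # log2 of the observed joint count relative to what independence predicts.
    return math.log2(n_ii * n_xx) - math.log2(n_ix * n_xi)

# Same toy counts for both scorers: a word seen 100 times, 90 of them under this label.
for score_fn in (jaccard, pmi):
    print(score_fn.__name__, round(score_fn(90, (100, 1000), 2000), 3))
```

The scales differ considerably (a Jaccard index never exceeds 1, for example), so a min_score chosen for chi_sq() would need to be retuned when swapping in another scoring function.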

The MaxentClassifier class with high information words

Let's evaluate the MaxentClassifier class using the high information words feature sets:

>>> me_classifier = MaxentClassifier.train(train_feats, algorithm='gis', trace=0, max_iter=10, min_lldelta=0.5)
>>> accuracy(me_classifier, test_feats)
>>> me_precisions, me_recalls = precision_recall(me_classifier, test_feats)
>>> me_precisions['pos']
>>> me_precisions['neg']
>>> me_recalls['pos']
>>> me_recalls['neg']

This also led to significant improvements for MaxentClassifier. But as we'll see, not all algorithms will benefit from high information word filtering, and in some cases, accuracy will decrease.

The DecisionTreeClassifier class with high information words

Now, let's evaluate the DecisionTreeClassifier class:

>>> dt_classifier = DecisionTreeClassifier.train(train_feats, binary=True, depth_cutoff=20, support_cutoff=20, entropy_cutoff=0.01)
>>> accuracy(dt_classifier, test_feats)
>>> dt_precisions, dt_recalls = precision_recall(dt_classifier, test_feats)
>>> dt_precisions['pos']
>>> dt_precisions['neg']
>>> dt_recalls['pos']
>>> dt_recalls['neg']

The accuracy is about the same, even with a larger depth_cutoff, and smaller support_cutoff and entropy_cutoff. These results lead me to believe that the DecisionTreeClassifier class was already putting the high information features at the top of the tree, and it will only improve if we increase the depth significantly. But that could make training time prohibitively long and risk over-fitting the tree.

The SklearnClassifier class with high information words

Let's evaluate the LinearSVC SklearnClassifier with the same train_feats feature set:

>>> sk_classifier = SklearnClassifier(LinearSVC()).train(train_feats)
>>> accuracy(sk_classifier, test_feats)
>>> sk_precisions, sk_recalls = precision_recall(sk_classifier, test_feats)
>>> sk_precisions['pos']
>>> sk_precisions['neg']
>>> sk_recalls['pos']
>>> sk_recalls['neg']

Its accuracy before was 86.4%, so we actually got a very slight decrease. In general, support vector machine and logistic regression-based algorithms will benefit less, or perhaps even be harmed, by pre-filtering the training features. This is because these algorithms are able to learn feature weights that correspond to the significance of each feature, whereas Naive Bayes algorithms do not.

See also

We started this chapter with the Bag of words feature extraction recipe. The NaiveBayesClassifier class was originally trained in the Training a Naive Bayes classifier recipe, and the MaxentClassifier class was trained in the Training a maximum entropy classifier recipe. Details on precision and recall can be found in the Measuring precision and recall of a classifier recipe. We will be using only high information words in the next two recipes, where we combine classifiers.

