A high information word is a word that is strongly biased towards a single classification label. These are the kinds of words we saw when we called the show_most_informative_features() method on both the NaiveBayesClassifier class and the MaxentClassifier class. Somewhat surprisingly, the top words are different for the two classifiers. This discrepancy is due to how each classifier calculates the significance of each feature, and it's actually beneficial to have these different methods, as they can be combined to improve accuracy, as we will see in the next recipe, Combining classifiers with voting.
The low information words are words that are common to all labels. It may be counter-intuitive, but eliminating these words from the training data can actually improve accuracy, precision, and recall. The reason this works is that using only high information words reduces the noise and confusion of a classifier's internal model. If all the words/features are highly biased one way or the other, it's much easier for the classifier to make a correct guess.
First, we need to calculate the high information words in the movie_reviews corpus. We can do this using the high_information_words() function in featx.py:
import collections

from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

def high_information_words(labelled_words, score_fn=BigramAssocMeasures.chi_sq, min_score=5):
    # Count overall word frequencies and per-label word frequencies.
    word_fd = FreqDist()
    label_word_fd = ConditionalFreqDist()

    for label, words in labelled_words:
        for word in words:
            word_fd[word] += 1
            label_word_fd[label][word] += 1

    n_xx = label_word_fd.N()
    high_info_words = set()

    for label in label_word_fd.conditions():
        n_xi = label_word_fd[label].N()
        word_scores = collections.defaultdict(int)

        # Score each word that occurs within this label.
        for word, n_ii in label_word_fd[label].items():
            n_ix = word_fd[word]
            score = score_fn(n_ii, (n_ix, n_xi), n_xx)
            word_scores[word] = score

        # Keep only the words whose score meets the minimum threshold.
        bestwords = [word for word, score in word_scores.items() if score >= min_score]
        high_info_words |= set(bestwords)

    return high_info_words
It takes one argument, which is a list of 2-tuples of the form [(label, words)], where label is the classification label and words is a list of words that occur under that label. It returns a set of the high information words.
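As a quick illustration, here is a toy sketch with made-up labelled words (not the movie review data): a word that occurs under only one label is kept, while a word common to both labels scores 0 and is dropped.

from featx import high_information_words

# Toy labelled words: 'film' occurs under both labels, so it is low information.
labelled = [('pos', ['great', 'great', 'film']),
            ('neg', ['awful', 'awful', 'film'])]

# With this tiny sample the scores are small, so we lower min_score;
# 'great' and 'awful' are kept, while 'film' scores 0 and is dropped.
print(high_information_words(labelled, min_score=1))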
Once we have the high information words, we use the feature detector function bag_of_words_in_set(), also found in featx.py, which lets us filter out all low information words.
def bag_of_words_in_set(words, goodwords):
    return bag_of_words(set(words) & set(goodwords))
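The bag_of_words() function it calls comes from the Bag of words feature extraction recipe at the start of this chapter; a minimal version simply maps every word to True:

def bag_of_words(words):
    return dict([(word, True) for word in words])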
With this new feature detector, we can call label_feats_from_corpus() and get new train_feats and test_feats feature sets using split_label_feats(). These two functions were covered in the Training a Naive Bayes classifier recipe earlier in this chapter.
>>> from featx import high_information_words, bag_of_words_in_set
>>> labels = movie_reviews.categories()
>>> labeled_words = [(l, movie_reviews.words(categories=[l])) for l in labels]
>>> high_info_words = set(high_information_words(labeled_words))
>>> feat_det = lambda words: bag_of_words_in_set(words, high_info_words)
>>> lfeats = label_feats_from_corpus(movie_reviews, feature_detector=feat_det)
>>> train_feats, test_feats = split_label_feats(lfeats)
Now that we have new training and testing feature sets, let's train and evaluate a NaiveBayesClassifier class:
>>> nb_classifier = NaiveBayesClassifier.train(train_feats)
>>> accuracy(nb_classifier, test_feats)
0.91
>>> nb_precisions, nb_recalls = precision_recall(nb_classifier, test_feats)
>>> nb_precisions['pos']
0.8988326848249028
>>> nb_precisions['neg']
0.9218106995884774
>>> nb_recalls['pos']
0.924
>>> nb_recalls['neg']
0.896
While the neg precision and pos recall have both decreased somewhat, neg recall and pos precision have increased drastically. Accuracy is now a little higher than that of the MaxentClassifier class.
The high_information_words() function starts by counting the frequency of every word, as well as the conditional frequency for each word within each label. This is why we need the words to be labeled: so we know how often each word occurs for each label.
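To see what this counting produces, here is a small sketch using made-up labelled words (not the movie review data):

from nltk.probability import FreqDist, ConditionalFreqDist

labelled_words = [('pos', ['great', 'fun']), ('neg', ['awful', 'fun'])]
word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

for label, words in labelled_words:
    for word in words:
        word_fd[word] += 1
        label_word_fd[label][word] += 1

print(word_fd['fun'])               # 2: total occurrences across all labels
print(label_word_fd['pos']['fun'])  # 1: occurrences within the pos label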
Once we have the FreqDist and ConditionalFreqDist variables, we can score each word on a per-label basis.
The default score_fn is nltk.metrics.BigramAssocMeasures.chi_sq(), which calculates the chi-square score for each word using the following parameters:

- n_ii: the frequency of the word within the label
- n_ix: the total frequency of the word across all labels
- n_xi: the total frequency of all words that occurred within the label
- n_xx: the total frequency of all words across all labels

The formula is n_xx * nltk.metrics.BigramAssocMeasures.phi_sq, where phi_sq() is the squared Pearson correlation coefficient, which you can read more about at https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient.
The simplest way to think about these numbers is that the closer n_ii is to n_ix, the higher the score. In other words, the more often a word occurs within a single label, relative to its overall occurrence, the higher the score.
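To make this concrete, here is a small sketch using hypothetical counts (a label containing 10,000 of a corpus's 20,000 words): a word whose occurrences are concentrated in one label scores far higher than a word that is spread more evenly.

from nltk.metrics import BigramAssocMeasures

# Hypothetical counts: the label contains 10,000 of the corpus's 20,000 words.
n_xi, n_xx = 10000, 20000

# A word that occurs 30 of its 50 total occurrences within the label:
print(BigramAssocMeasures.chi_sq(30, (50, n_xi), n_xx))  # about 2.0, below min_score=5

# A word that occurs 45 of its 50 total occurrences within the label:
print(BigramAssocMeasures.chi_sq(45, (50, n_xi), n_xx))  # about 32.1, well above min_score=5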
Once we have the scores for each word in each label, we can filter out all words whose score is below the min_score threshold. We keep the words that meet or exceed the threshold, and return all high scoring words across the labels.
There are a number of other scoring functions available in the BigramAssocMeasures class, such as phi_sq() for phi-square, pmi() for pointwise mutual information, and jaccard() for using the Jaccard index. They all take the same arguments, and so can be used interchangeably with chi_sq(). These functions are all documented at http://www.nltk.org/_modules/nltk/metrics/association.html, with links to the source code of the formulas.
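Swapping in another measure is just a matter of passing it as score_fn. A brief sketch, with the caveat that each measure's scores are on a different scale, so the min_score value below is only a placeholder, not a tuned threshold:

from nltk.metrics import BigramAssocMeasures
from featx import high_information_words

# pmi scores are not on the same scale as chi_sq scores, so min_score=3
# here is a hypothetical starting point that would need re-tuning.
pmi_words = high_information_words(labeled_words,
                                   score_fn=BigramAssocMeasures.pmi,
                                   min_score=3)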
Let's evaluate the MaxentClassifier class using the high information words feature sets:
>>> me_classifier = MaxentClassifier.train(train_feats, algorithm='gis', trace=0, max_iter=10, min_lldelta=0.5)
>>> accuracy(me_classifier, test_feats)
0.912
>>> me_precisions, me_recalls = precision_recall(me_classifier, test_feats)
>>> me_precisions['pos']
0.8992248062015504
>>> me_precisions['neg']
0.9256198347107438
>>> me_recalls['pos']
0.928
>>> me_recalls['neg']
0.896
This also led to significant improvements for the MaxentClassifier class. But as we'll see, not all algorithms benefit from high information word filtering, and in some cases, accuracy will decrease.
Now, let's evaluate the DecisionTreeClassifier class:
>>> dt_classifier = DecisionTreeClassifier.train(train_feats, binary=True, depth_cutoff=20, support_cutoff=20, entropy_cutoff=0.01)
>>> accuracy(dt_classifier, test_feats)
0.68600000000000005
>>> dt_precisions, dt_recalls = precision_recall(dt_classifier, test_feats)
>>> dt_precisions['pos']
0.6741573033707865
>>> dt_precisions['neg']
0.69957081545064381
>>> dt_recalls['pos']
0.71999999999999997
>>> dt_recalls['neg']
0.65200000000000002
The accuracy is about the same, even with a larger depth_cutoff, and smaller support_cutoff and entropy_cutoff. These results lead me to believe that the DecisionTreeClassifier class was already putting the high information features at the top of the tree, and that it will only improve if we increase the depth significantly. But that could make training time prohibitively long and risks over-fitting the tree.
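A hedged sketch of that experiment, using hypothetical, untuned settings; expect a much longer training time and a greater risk of over-fitting:

# Let the tree grow much deeper so it can use more of the features.
deep_dt = DecisionTreeClassifier.train(train_feats, binary=True,
                                       depth_cutoff=100,
                                       support_cutoff=10,
                                       entropy_cutoff=0.01)
print(accuracy(deep_dt, test_feats))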
Let's evaluate the LinearSVC SklearnClassifier with the same train_feats feature set:
>>> sk_classifier = SklearnClassifier(LinearSVC()).train(train_feats)
>>> accuracy(sk_classifier, test_feats)
0.86
>>> sk_precisions, sk_recalls = precision_recall(sk_classifier, test_feats)
>>> sk_precisions['pos']
0.871900826446281
>>> sk_precisions['neg']
0.8488372093023255
>>> sk_recalls['pos']
0.844
>>> sk_recalls['neg']
0.876
Its accuracy before was 86.4%, so we actually got a very slight decrease. In general, support vector machine and logistic regression-based algorithms will benefit less, or perhaps even be harmed, by pre-filtering the training features. This is because these algorithms are able to learn feature weights that correspond to the significance of each feature, whereas Naive Bayes algorithms do not.
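As a minimal illustration of that difference, here is a toy sketch (made-up features, using scikit-learn directly rather than the NLTK wrapper) showing that LinearSVC learns one signed weight per feature, so a low information word shared by both labels simply ends up with a weight near zero:

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy training data: 'movie' appears under both labels (low information),
# while 'great' and 'awful' are each biased towards one label.
train = [({'great': True, 'movie': True}, 'pos'),
         ({'awful': True, 'movie': True}, 'neg')]

vec = DictVectorizer()
X = vec.fit_transform([feats for feats, label in train])
y = [label for feats, label in train]
clf = LinearSVC().fit(X, y)

# One signed weight per feature; 'movie' gets a weight near zero because
# it does not help separate the labels.
for word, weight in zip(vec.feature_names_, clf.coef_[0]):
    print(word, weight)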
We started this chapter with the Bag of words feature extraction recipe. The NaiveBayesClassifier class was originally trained in the Training a Naive Bayes classifier recipe, and the MaxentClassifier class was trained in the Training a maximum entropy classifier recipe. Details on precision and recall can be found in the Measuring precision and recall of a classifier recipe. We will be using only high information words in the next two recipes, where we combine classifiers.