Classifying with multiple binary classifiers

So far we have focused on binary classifiers, which assign one of two possible labels. The same techniques used to train a binary classifier can also be used to create a multi-class classifier, which is a classifier that assigns one of many possible labels. But there are also cases where you need to assign multiple labels at once. A classifier that can return more than one label is a multi-label classifier.

A common technique for creating a multi-label classifier is to combine many binary classifiers, one for each label. You train each binary classifier so that it either returns a known label or returns something else to signal that the label does not apply. Then, you can run all the binary classifiers on your feature set to collect all the applicable labels.

Getting ready

The reuters corpus contains multi-labeled text that we can use for training and evaluation:

>>> from nltk.corpus import reuters
>>> len(reuters.categories())
90

We will train one binary classifier per label, which means we will end up with 90 binary classifiers.
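
To see the multi-labeled nature of the corpus, you can look up the categories of an individual file. For example (a quick check session; the exact file IDs and categories depend on your corpus version):

>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']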

How to do it...

First, we should calculate the high information words in the reuters corpus. This is done with the reuters_high_info_words() function in featx.py:

from nltk.corpus import reuters
from nltk.metrics import BigramAssocMeasures

# high_information_words() is defined earlier in featx.py, in the
# Calculating high information words recipe.
def reuters_high_info_words(score_fn=BigramAssocMeasures.chi_sq):
  labeled_words = []

  for label in reuters.categories():
    labeled_words.append((label, reuters.words(categories=[label])))

  return high_information_words(labeled_words, score_fn=score_fn)

Then, we need to get training and test feature sets based on those high information words. This is done with the reuters_train_test_feats() function, also found in featx.py. It defaults to using bag_of_words() as its feature_detector, but we will be overriding this using bag_of_words_in_set() to use only the high information words:

def reuters_train_test_feats(feature_detector=bag_of_words):
  train_feats = []
  test_feats = []
  for fileid in reuters.fileids():
    if fileid.startswith('training'):
      featlist = train_feats
    else: # fileid.startswith('test')
      featlist = test_feats
    feats = feature_detector(reuters.words(fileid))
    labels = reuters.categories(fileid)
    featlist.append((feats, labels))
  return train_feats, test_feats
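
The fileid prefixes encode the corpus's standard split, so every file lands in exactly one of the two lists. As a quick sanity check (the counts below assume the ApteMod version of the corpus distributed with NLTK):

>>> from nltk.corpus import reuters
>>> len([f for f in reuters.fileids() if f.startswith('training')])
7769
>>> len([f for f in reuters.fileids() if f.startswith('test')])
3019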

We can use these two functions to get a list of multi-labeled training and testing feature sets:

>>> from featx import reuters_high_info_words, reuters_train_test_feats
>>> from featx import bag_of_words_in_set
>>> rwords = reuters_high_info_words()
>>> featdet = lambda words: bag_of_words_in_set(words, rwords)
>>> multi_train_feats, multi_test_feats = reuters_train_test_feats(featdet)

The multi_train_feats and multi_test_feats variables hold multi-labeled feature sets. That means each feature set is paired with a list of labels instead of a single label, so the data looks like [(featureset, [label1, label2, ...])], as each feature set can have one or more labels. With this training data, we can train multiple binary classifiers. The train_binary_classifiers() function in classification.py takes a training function, a list of multi-label feature sets, and a set of possible labels, and returns a dict of label : binary classifier:

import collections

def train_binary_classifiers(trainf, labelled_feats, labelset):
  pos_feats = collections.defaultdict(list)
  neg_feats = collections.defaultdict(list)
  classifiers = {}

  for feat, labels in labelled_feats:
    for label in labels:
      pos_feats[label].append(feat)

    for label in labelset - set(labels):
      neg_feats[label].append(feat)

  for label in labelset:
    postrain = [(feat, label) for feat in pos_feats[label]]
    negtrain = [(feat, '!%s' % label) for feat in neg_feats[label]]
    classifiers[label] = trainf(postrain + negtrain)

  return classifiers
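
To make the label : !label scheme concrete, here is a toy run on two hypothetical feature sets using NLTK's NaiveBayesClassifier (this illustrative data is not part of the recipe):

>>> from nltk.classify import NaiveBayesClassifier
>>> from classification import train_binary_classifiers
>>> toy_feats = [({'ore': True}, ['zinc']), ({'oil': True}, ['crude'])]
>>> toy = train_binary_classifiers(NaiveBayesClassifier.train, toy_feats, set(['zinc', 'crude']))
>>> sorted(toy.keys())
['crude', 'zinc']
>>> sorted(toy['zinc'].labels())
['!zinc', 'zinc']

Each binary classifier knows only its own label and the negated form of that label, which is what lets us detect an applicable label later by checking whether the classifier returns it.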

To use this function, we need to provide a training function that takes a single argument, which is the training data. This will be a simple lambda wrapper around the SklearnClassifier class, using scikit-learn's LogisticRegression algorithm:

>>> from classification import train_binary_classifiers
>>> from nltk.classify.scikitlearn import SklearnClassifier
>>> from sklearn.linear_model import LogisticRegression
>>> trainf = lambda train_feats: SklearnClassifier(LogisticRegression()).train(train_feats)
>>> labelset = set(reuters.categories())
>>> classifiers = train_binary_classifiers(trainf, multi_train_feats, labelset)
>>> len(classifiers)
90

Also in classification.py, we can define a MultiBinaryClassifier class, which is constructed from (label, classifier) pairs, where each classifier is assumed to be a binary classifier that either returns its label, or something else if the label doesn't apply:

from nltk.classify import MultiClassifierI

class MultiBinaryClassifier(MultiClassifierI):
  def __init__(self, *label_classifiers):
    self._label_classifiers = dict(label_classifiers)
    self._labels = sorted(self._label_classifiers.keys())
  
  def labels(self):
    return self._labels

  def classify(self, feats):
    lbls = set()

    for label, classifier in self._label_classifiers.items():
      if classifier.classify(feats) == label:
        lbls.add(label)

    return lbls

Now we can construct this class using the binary classifiers we just created:

>>> from classification import MultiBinaryClassifier
>>> multi_classifier = MultiBinaryClassifier(*classifiers.items())

To evaluate this classifier, we can use precision and recall, but not accuracy. That's because the accuracy function assumes single values, and doesn't take into account partial matches. For example, if the multi_classifier returns three labels for a feature set, and two of them are correct but the third is not, then the accuracy() function would mark that as incorrect. So, instead of using accuracy, we will use masi distance, which measures the partial overlap between two sets using the formula from this paper:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.113.3752
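
To get a feel for the metric before using it, you can compute it directly on small sets. Identical sets and disjoint sets give the two extremes (the value for a partial overlap depends on the weighting used by your NLTK version):

>>> from nltk.metrics import masi_distance
>>> masi_distance(set(['crude', 'oil']), set(['crude', 'oil']))
0.0
>>> masi_distance(set(['crude', 'oil']), set(['gold']))
1.0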

The closer the masi distance is to 0, the better the match; a masi distance close to 1 means there is little or no overlap. A lower average masi distance, therefore, means more accurate partial matches. The multi_metrics() function in classification.py calculates the precision and recall of each label, along with the average masi distance:

import collections
from nltk import metrics

def multi_metrics(multi_classifier, test_feats):
  mds = []
  refsets = collections.defaultdict(set)
  testsets = collections.defaultdict(set)

  for i, (feat, labels) in enumerate(test_feats):
    for label in labels:
      refsets[label].add(i)

    guessed = multi_classifier.classify(feat)

    for label in guessed:
      testsets[label].add(i)

    mds.append(metrics.masi_distance(set(labels), guessed))

  avg_md = sum(mds) / float(len(mds))
  precisions = {}
  recalls = {}

  for label in multi_classifier.labels():
    precisions[label] = metrics.precision(refsets[label], testsets[label])
    recalls[label] = metrics.recall(refsets[label], testsets[label])

  return precisions, recalls, avg_md

Using this with the multi_classifier we just created gives us the following results:

>>> from classification import multi_metrics
>>> multi_precisions, multi_recalls, avg_md = multi_metrics(multi_classifier, multi_test_feats)
>>> avg_md
0.23310715863026216

Since lower is better, our average masi distance of about 0.23 isn't too bad: most classifications are at least partial matches. Let's take a look at a few precisions and recalls:

>>> multi_precisions['soybean']
0.7857142857142857
>>> multi_recalls['soybean']
0.3333333333333333
>>> len(reuters.fileids(categories=['soybean']))
111

>>> multi_precisions['sunseed']
1.0
>>> multi_recalls['sunseed']
0.2
>>> len(reuters.fileids(categories=['sunseed']))
16

In general, the labels that have more feature sets will have higher precision and recall, while those with fewer feature sets will have lower performance. Many of the categories have values of 0, because when there are not a lot of feature sets for a classifier to learn from, you can't expect it to perform well.
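
You can see how skewed the corpus is by counting the files per category (the exact counts assume the standard version of the corpus distributed with NLTK):

>>> counts = [(c, len(reuters.fileids(categories=[c]))) for c in reuters.categories()]
>>> sorted(counts, key=lambda p: p[1], reverse=True)[:2]
[('earn', 3964), ('acq', 2369)]

The largest categories have thousands of feature sets to learn from, while many others, such as sunseed, have only a handful.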

How it works...

The reuters_high_info_words() function is fairly simple; it constructs a list of (label, words) pairs, one for each category of the reuters corpus, then passes the list into the high_information_words() function to return the set of the most informative words in the reuters corpus.

With the resulting set of words, we create a feature detector function using the bag_of_words_in_set() function. This is then passed into the reuters_train_test_feats() function, which returns two lists, the first containing [(feats, labels)] for all the training files, and the second containing the same for all the test files.

Next, we train a binary classifier for each label using the train_binary_classifiers() function. This function constructs two lists for each label, one containing positive training feature sets and the other containing negative training feature sets. The positive feature sets are those feature sets that classify for the label. The negative feature sets for a label come from the positive feature sets for all the other labels. For example, a feature set that is positive for zinc and sunseed is a negative example for all the other 88 labels. Once we have positive and negative feature sets for each label, we can train a binary classifier for each label using the given training function.

With the resulting dictionary of binary classifiers, we create an instance of the MultiBinaryClassifier class. This class extends the nltk.classify.MultiClassifierI interface, which requires at least two functions:

  • The labels() function must return a list of possible labels.
  • The classify() function takes a single feature set and returns a set of labels. To create this set, we iterate over the binary classifiers, and any time a call to the classify() function returns its label, we add it to the set. If it returns something else, we continue.

The inheritance diagram (omitted here) shows MultiBinaryClassifier extending the nltk.classify.MultiClassifierI interface.

Finally, we evaluate the multi-label classifier using the multi_metrics() function. It is similar to the precision_recall() function from the Measuring precision and recall of a classifier recipe, but in this case, we know that the classifier is an instance of the MultiClassifierI interface and it can therefore return multiple labels. It also keeps track of the masi distance for each set of classification labels using the nltk.metrics.masi_distance() function. The multi_metrics() function returns three values:

  • A dictionary of precisions for each label
  • A dictionary of recalls for each label
  • The average masi distance across all the feature sets

There's more...

The nature of the reuters corpus introduces the class-imbalance problem. This problem occurs when some labels have very few feature sets, while other labels have many. The binary classifiers that have few positive instances to train on end up with far more negative instances, and are therefore strongly biased towards the negative label. There's nothing inherently wrong with this, as the bias reflects the data, but the negative instances can overwhelm the classifier to the point where it's nearly impossible to get a positive result. There are a number of advanced techniques for overcoming this problem, but they are outside the scope of this book. The paper available at http://www.ijetae.com/files/Volume2Issue4/IJETAE_0412_07.pdf provides a good starting reference for techniques to overcome this problem.
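
To give a flavor of the simplest mitigation, the following sketch (not part of the recipe's code) randomly duplicates positive feature sets until they match the number of negatives. It could be applied to postrain and negtrain inside train_binary_classifiers() before calling trainf():

import random

def oversample(postrain, negtrain):
  # Naive random oversampling: duplicate positive examples at random
  # until there are as many positives as negatives.
  postrain = list(postrain)
  while postrain and len(postrain) < len(negtrain):
    postrain.append(random.choice(postrain))
  return postrain + negtrain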

See also

The SklearnClassifier class is covered in the Training scikit-learn classifiers recipe in this chapter. The Measuring precision and recall of a classifier recipe shows how to evaluate a classifier, while the Calculating high information words recipe describes how to use only the best features.
