Training scikit-learn classifiers

Scikit-learn is one of the best machine learning libraries available for Python. It contains implementations of all sorts of machine learning algorithms for many different purposes, but they all follow the same fit/predict design pattern, sketched briefly after the following list:

  • Fit the model to the data
  • Use the model to make predictions
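
For example, here's a minimal sketch of that pattern using scikit-learn directly. The toy feature vectors and labels below are made up purely for illustration, and the exact output may differ depending on your scikit-learn and NumPy versions:

>>> from sklearn.naive_bayes import MultinomialNB
>>> X = [[2, 1, 0], [0, 1, 3]]  # hypothetical word count vectors
>>> y = ['pos', 'neg']          # matching class labels
>>> model = MultinomialNB().fit(X, y)   # fit the model to the data
>>> model.predict([[1, 1, 0]])          # use the model to make predictions
array(['pos'], dtype='<U3')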

We won't be accessing the scikit-learn models directly in this recipe. Instead, we'll be using NLTK's SklearnClassifier class, which is a wrapper class around a scikit-learn model to make it conform to NLTK's ClassifierI interface. This means that the SklearnClassifier class can be trained and used much like the classifiers we've used in the previous recipes in this chapter.

Note

I may use the terms scikit-learn and sklearn interchangeably in this recipe.

Getting ready

To use the SklearnClassifier class, you must have scikit-learn installed. Instructions are available online at http://scikit-learn.org/stable/install.html. If you have all the dependencies installed, such as NumPy and SciPy, you should be able to install scikit-learn with pip:

$ pip install scikit-learn

To test if everything is installed correctly, try to import the SklearnClassifier class:

>>> from nltk.classify.scikitlearn import SklearnClassifier

If the import fails, then you are still missing scikit-learn and its dependencies.

How to do it...

Training an SklearnClassifier class involves a slightly different series of steps than the classifiers covered in the previous recipes of this chapter:

  1. Create training features (covered in the previous recipes).
  2. Choose and import an sklearn algorithm.
  3. Construct an SklearnClassifier class with the chosen algorithm.
  4. Train the SklearnClassifier class with your training features.

The main difference from the NLTK classifiers is that, with them, steps 3 and 4 are usually combined: a single call to a class method such as NaiveBayesClassifier.train() both constructs and trains the classifier. Let's put this into practice using the MultinomialNB classifier from sklearn. Refer to the earlier recipe, Training a Naive Bayes classifier, for details on constructing train_feats and test_feats:

>>> from nltk.classify.scikitlearn import SklearnClassifier
>>> from sklearn.naive_bayes import MultinomialNB
>>> sk_classifier = SklearnClassifier(MultinomialNB())
>>> sk_classifier.train(train_feats)
<SklearnClassifier(MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))>

Now that we have a trained classifier, we can evaluate the accuracy:

>>> accuracy(sk_classifier, test_feats)
0.83

How it works...

The SklearnClassifier class is a small wrapper class whose main job is to convert NLTK feature dictionaries into sklearn-compatible feature vectors. Here's the complete class code, minus all comments and docstrings:

from nltk import compat
from nltk.classify.api import ClassifierI
from nltk.probability import DictionaryProbDist
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder

class SklearnClassifier(ClassifierI):
    def __init__(self, estimator, dtype=float, sparse=True):
        self._clf = estimator
        self._encoder = LabelEncoder()
        self._vectorizer = DictVectorizer(dtype=dtype, sparse=sparse)

    def batch_classify(self, featuresets):
        X = self._vectorizer.transform(featuresets)
        classes = self._encoder.classes_
        return [classes[i] for i in self._clf.predict(X)]

    def batch_prob_classify(self, featuresets):
        X = self._vectorizer.transform(featuresets)
        y_proba_list = self._clf.predict_proba(X)
        return [self._make_probdist(y_proba) for y_proba in y_proba_list]

    def labels(self):
        return list(self._encoder.classes_)

    def train(self, labeled_featuresets):
        X, y = list(compat.izip(*labeled_featuresets))
        X = self._vectorizer.fit_transform(X)
        y = self._encoder.fit_transform(y)
        self._clf.fit(X, y)
        return self

    def _make_probdist(self, y_proba):
        classes = self._encoder.classes_
        return DictionaryProbDist(dict((classes[i], p) for i, p in enumerate(y_proba)))

The class is initialized with an estimator, which is the algorithm we pass in, such as MultinomialNB. It then creates a LabelEncoder and DictVectorizer object. The LabelEncoder object transforms label strings to numbers. For example, the pos class may be encoded as 1, and the neg class may be encoded as 0. The DictVectorizer object is for transforming the NLTK feature dictionaries into sklearn compatible feature vectors.
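
To make this concrete, here's a small standalone sketch of these two objects at work. The feature dictionaries and labels are made up for illustration, and the exact output formatting may vary by version:

>>> from sklearn.feature_extraction import DictVectorizer
>>> from sklearn.preprocessing import LabelEncoder
>>> feats = [{'great': True, 'movie': True}, {'awful': True, 'movie': True}]
>>> labels = ['pos', 'neg']
>>> vectorizer = DictVectorizer(dtype=float, sparse=False)
>>> vectorizer.fit_transform(feats)  # one column per feature name
array([[ 0.,  1.,  1.],
       [ 1.,  0.,  1.]])
>>> vectorizer.vocabulary_           # maps feature names to column positions
{'awful': 0, 'great': 1, 'movie': 2}
>>> LabelEncoder().fit_transform(labels)  # label strings become integers
array([1, 0])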

In the train() method, the labeled feature sets are first encoded and transformed using the LabelEncoder and DictVectorizer objects. Then, the model we gave as an estimator, such as MultinomialNB, is fit to the data. Because the sk_classifier object is created before it is trained, you might forget to train it before you try to do any classification. Luckily, this will produce an exception with the message 'DictVectorizer' object has no attribute 'vocabulary_'. Since Python dictionaries are unordered (unlike vectors), the DictVectorizer object must maintain a vocabulary in order to know which position in the vector a feature value belongs to. This ensures that new feature dictionaries are vectorized in a manner consistent with the training features.

To classify a feature set, it is transformed to a vector and then passed to the trained model's predict() method. This is done in the batch_classify() method.

There's more...

The scikit-learn library contains many different classification algorithms, and this recipe covers only a few. However, not all of the classification algorithms are compatible with the SklearnClassifier class, because it uses sparse vectors by default. Sparse vectors are more efficient because they only store the data they need, using a kind of data compression. However, some algorithms, such as sklearn's DecisionTreeClassifier, require dense vectors, which store every entry in the vector, even when an entry has no value. If you try a different algorithm with the SklearnClassifier class and get an exception, this is probably why.
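
One possible workaround, based on the sparse keyword argument visible in the class code shown earlier, is to have the SklearnClassifier class produce dense vectors instead. The following is only a sketch (training and accuracy output omitted, since the results will depend on your training features), and be aware that dense vectors use much more memory:

>>> from sklearn.tree import DecisionTreeClassifier
>>> sk_classifier = SklearnClassifier(DecisionTreeClassifier(), sparse=False)
>>> sk_classifier.train(train_feats)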

Comparing Naive Bayes algorithms

As you saw earlier, the MultinomialNB algorithm got an accuracy of 83%. This is much higher than the 72.8% accuracy we got from NLTK's NaiveBayesClassifier class. The big difference between the two algorithms is that MultinomialNB can work with discrete feature values, such as word frequencies, whereas the NaiveBayesClassifier class assumes a small set of feature values, such as strings or Booleans. There is another sklearn Naive Bayes algorithm, BernoulliNB, which can also work with discrete values by binarizing them, so that the final values are 1 or 0. Our features are actually already binarized, because the feature values are True or False:

>>> from sklearn.naive_bayes import BernoulliNB
>>> sk_classifier = SklearnClassifier(BernoulliNB())
>>> sk_classifier.train(train_feats)
<SklearnClassifier(BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True))>
>>> accuracy(sk_classifier, test_feats)
0.812

Clearly, the sklearn algorithms perform better than NLTK's Naive Bayes implementation. The sklearn classifiers also have a much smaller memory footprint, and will produce much smaller pickle files on disk. Their classification speed is often slightly slower than that of the NaiveBayesClassifier class, but I think the accuracy and memory gains are quite worth it.
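
If you want to save a trained SklearnClassifier object for later use, the standard pickle module works just as it does for the NLTK classifiers. Here's a minimal sketch, where the filename is only an example:

>>> import pickle
>>> with open('sk_classifier.pickle', 'wb') as f:
...     pickle.dump(sk_classifier, f)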

Training with logistic regression

Earlier in this chapter, we covered the maximum entropy classifier. This algorithm is also known as logistic regression, and scikit-learn provides a corresponding implementation.

>>> from sklearn.linear_model import LogisticRegression
>>> sk_classifier = SklearnClassifier(LogisticRegression())
>>> sk_classifier.train(train_feats)
<SklearnClassifier(LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
              intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001))>
>>> accuracy(sk_classifier, test_feats)
0.892

Again, we see that the sklearn algorithm has better performance than NLTK's MaxentClassifier, which only had 72.2% accuracy. The logistic regression algorithm also has a much faster training time than the IIS or GIS algorithms, even when those algorithms have a limited number of iterations. This can be explained by sklearn's focus on optimized numeric processing using NumPy.

Training with LinearSVC

A third family of algorithms that NLTK does not support directly is Support Vector Machines, or SVM. These algorithms have been shown to be effective at learning on high-dimensional data, such as text classification, where every word feature counts as a dimension. You can learn more about support vector machines at https://en.wikipedia.org/wiki/Support_vector_machine. Here are some examples of using the sklearn implementations:

>>> from sklearn.svm import SVC
>>> sk_classifier = SklearnClassifier(SVC())
>>> sk_classifier.train(train_feats)
<SklearnClassifier(SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))>
>>> accuracy(sk_classifier, test_feats)
0.69 

>>> from sklearn.svm import LinearSVC
>>> sk_classifier = SklearnClassifier(LinearSVC())
>>> sk_classifier.train(train_feats)
<SklearnClassifier(LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
         intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
         random_state=None, tol=0.0001, verbose=0))>
>>> accuracy(sk_classifier, test_feats)
0.864

>>> from sklearn.svm import NuSVC
>>> sk_classifier = SklearnClassifier(NuSVC())
>>> sk_classifier.train(train_feats)
/Users/jacob/py3env/lib/python3.3/site-packages/scipy/sparse/compressed.py:119: UserWarning: indptr array has non-integer dtype (float64)
  % self.indptr.dtype.name)
<SklearnClassifier(NuSVC(cache_size=200, coef0=0.0, degree=3, gamma=0.0, kernel='rbf',
   max_iter=-1, nu=0.5, probability=False, random_state=None,
   shrinking=True, tol=0.001, verbose=False))>
>>> accuracy(sk_classifier, test_feats)
0.882

You can see that in this case, NuSVC is the most accurate SVM classifier, just above LinearSVC, while SVC is much less accurate than either. These accuracy differences are a result of the different algorithm implementations and the default parameters. You can learn more about these specific implementations at the following link:

http://scikit-learn.org/stable/modules/svm.html
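
The defaults are only a starting point; you can pass your own parameters when constructing the estimator. For example, the following sketch gives SVC a linear kernel instead of its default rbf kernel (accuracy not shown, as it will depend on your training features):

>>> from sklearn.svm import SVC
>>> sk_classifier = SklearnClassifier(SVC(kernel='linear'))
>>> sk_classifier.train(train_feats)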

See also

If you are interested in exploring more aspects of machine learning with Python, the scikit-learn documentation is a great place to start:

http://scikit-learn.org/stable/documentation.html

Earlier in this chapter, we covered the Training a Naive Bayes classifier and Training a maximum entropy classifier recipes. We will use the LinearSVC and NuSVC classifiers again in the following recipes.
