Scikit-learn is one of the best machine learning libraries available in any programming language. It contains all sorts of machine learning algorithms for many different purposes, but they all follow the same fit/predict design pattern:
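Here's a minimal sketch of that pattern, using MultinomialNB with made-up word count vectors (any other estimator is used the same way):

from sklearn.naive_bayes import MultinomialNB

# Toy word-count vectors and labels, made up purely for illustration.
train_vectors = [[3, 0, 1], [0, 2, 0], [4, 1, 0]]
train_labels = ['pos', 'neg', 'pos']

# Every scikit-learn estimator is used the same way: construct it,
# fit() it to training data, then predict() labels for new data.
classifier = MultinomialNB()
classifier.fit(train_vectors, train_labels)
print(classifier.predict([[2, 0, 1]]))  # -> ['pos']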
We won't be accessing the scikit-learn models directly in this recipe. Instead, we'll be using NLTK's SklearnClassifier class, which is a wrapper class around a scikit-learn model that makes it conform to NLTK's ClassifierI interface. This means that the SklearnClassifier class can be trained and used much like the classifiers we've used in the previous recipes in this chapter.
To use the SklearnClassifier class, you must have scikit-learn installed. Instructions are available online at http://scikit-learn.org/stable/install.html. If you have all the dependencies installed, such as NumPy and SciPy, you should be able to install scikit-learn with pip:
$ pip install scikit-learn
To test whether everything is installed correctly, try to import the module that provides the SklearnClassifier class:
>>> from nltk.classify import scikitlearn
If the import fails, then you are still missing scikit-learn or one of its dependencies.
Training an SklearnClassifier class takes a slightly different series of steps than the classifiers covered in the previous recipes of this chapter:

1. Create training features (covered in the previous recipes).
2. Choose and import an sklearn algorithm.
3. Construct an SklearnClassifier class with the chosen algorithm.
4. Train the SklearnClassifier class with your training features.

The main difference with NLTK classifiers is that steps 3 and 4 are usually combined. Let's put this into practice using the MultinomialNB classifier from sklearn. Refer to the earlier recipe, Training a Naive Bayes classifier, for details on constructing train_feats and test_feats:
>>> from nltk.classify.scikitlearn import SklearnClassifier
>>> from sklearn.naive_bayes import MultinomialNB
>>> sk_classifier = SklearnClassifier(MultinomialNB())
>>> sk_classifier.train(train_feats)
<SklearnClassifier(MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))>
Now that we have a trained classifier, we can evaluate the accuracy:
>>> accuracy(sk_classifier, test_feats)
0.83
The SklearnClassifier class is a small wrapper class whose main job is to convert NLTK feature dictionaries into sklearn-compatible feature vectors. Here's the complete class code, minus all comments, docstrings, and most imports:
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder

class SklearnClassifier(ClassifierI):
    def __init__(self, estimator, dtype=float, sparse=True):
        self._clf = estimator
        self._encoder = LabelEncoder()
        self._vectorizer = DictVectorizer(dtype=dtype, sparse=sparse)

    def batch_classify(self, featuresets):
        X = self._vectorizer.transform(featuresets)
        classes = self._encoder.classes_
        return [classes[i] for i in self._clf.predict(X)]

    def batch_prob_classify(self, featuresets):
        X = self._vectorizer.transform(featuresets)
        y_proba_list = self._clf.predict_proba(X)
        return [self._make_probdist(y_proba) for y_proba in y_proba_list]

    def labels(self):
        return list(self._encoder.classes_)

    def train(self, labeled_featuresets):
        X, y = list(compat.izip(*labeled_featuresets))
        X = self._vectorizer.fit_transform(X)
        y = self._encoder.fit_transform(y)
        self._clf.fit(X, y)
        return self

    def _make_probdist(self, y_proba):
        classes = self._encoder.classes_
        return DictionaryProbDist(dict((classes[i], p)
                                       for i, p in enumerate(y_proba)))
The class is initialized with an estimator, which is the algorithm we pass in, such as MultinomialNB. It then creates LabelEncoder and DictVectorizer objects. The LabelEncoder object transforms label strings to numbers. For example, the pos class may be encoded as 1, and the neg class may be encoded as 0. The DictVectorizer object is for transforming the NLTK feature dictionaries into sklearn-compatible feature vectors.
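To see what these two objects do, here's a small self-contained sketch with made-up feature dictionaries and labels:

from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder

# Made-up NLTK-style feature dictionaries and their labels.
feats = [{'great': True, 'movie': True}, {'awful': True, 'movie': True}]
labels = ['pos', 'neg']

# DictVectorizer maps each feature name to a column; boolean values
# become 1.0/0.0. Dense output is used here just to make it printable.
vectorizer = DictVectorizer(dtype=float, sparse=False)
print(vectorizer.fit_transform(feats))
# [[ 0.  1.  1.]
#  [ 1.  0.  1.]]  -- columns: awful, great, movie

# LabelEncoder maps the label strings to integers.
encoder = LabelEncoder()
print(encoder.fit_transform(labels))  # [1 0] -- 'neg' -> 0, 'pos' -> 1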
In the train() method, the labeled feature sets are first encoded and transformed using the LabelEncoder and DictVectorizer objects. Then, the model we gave as an estimator, such as MultinomialNB, is fit to the data. Because the sk_classifier object is created before it is trained, you might forget to train it before you try to do any classification. Luckily, this will produce an exception with the message 'DictVectorizer' object has no attribute 'vocabulary_'. Since Python dictionaries are unordered (unlike vectors), the DictVectorizer object must maintain a vocabulary in order to know which position in the vector a feature value belongs to. This ensures that new feature dictionaries are vectorized in a manner consistent with the training features.
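Here's a short sketch of that vocabulary behavior, again with made-up feature names:

from sklearn.feature_extraction import DictVectorizer

# The vocabulary learned during fitting fixes each feature's column, so
# new dictionaries are vectorized consistently; features never seen
# during training are simply dropped.
vectorizer = DictVectorizer(sparse=False)
vectorizer.fit([{'great': True, 'movie': True}, {'awful': True}])
print(vectorizer.vocabulary_)  # {'awful': 0, 'great': 1, 'movie': 2}
print(vectorizer.transform([{'movie': True, 'new_word': True}]))
# [[ 0.  0.  1.]]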
To classify a feature set, it is transformed into a vector and then passed to the trained model's predict() method. This is done within the batch_classify() method.
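For example, because SklearnClassifier implements NLTK's ClassifierI interface, you can pass a single feature dictionary to classify(), which delegates to batch_classify(). The following sketch assumes the sk_classifier trained earlier:

# Classify a single made-up feature dictionary; the label shown in the
# comment is illustrative, not the output of an actual run.
label = sk_classifier.classify({'great': True, 'movie': True})
print(label)  # e.g. 'pos'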
The scikit-learn library contains many different algorithms for classification, and this recipe covers only a few. But not all of the classification algorithms are compatible with the SklearnClassifier class, because it uses sparse vectors by default. Sparse vectors are more efficient because they only store the data they need, using a kind of data compression. However, some algorithms, such as sklearn's DecisionTreeClassifier, require dense vectors, which store every entry in the vector, even if it has no value. If you try a different algorithm with the SklearnClassifier class and get an exception, this is probably why.
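Because the __init__() method shown earlier accepts a sparse keyword argument, one possible workaround is to request dense vectors when constructing the wrapper. A sketch, assuming the same train_feats as before:

>>> from sklearn.tree import DecisionTreeClassifier
>>> sk_classifier = SklearnClassifier(DecisionTreeClassifier(), sparse=False)
>>> sk_classifier.train(train_feats)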
As you saw earlier, the MultinomialNB algorithm got an accuracy of 83%. This is much higher than the 72.8% accuracy we got from NLTK's NaiveBayesClassifier class. The big difference between these two algorithms is that MultinomialNB can work with discrete feature values, such as word frequencies, whereas the NaiveBayesClassifier class assumes a small set of feature values, such as strings or Booleans. There is another sklearn Naive Bayes algorithm, BernoulliNB, which can also work with discrete values by binarizing those values, so that the final values are 1 or 0. Our features are actually already binarized, because the feature values are True or False:
>>> from sklearn.naive_bayes import BernoulliNB
>>> sk_classifier = SklearnClassifier(BernoulliNB())
>>> sk_classifier.train(train_feats)
<SklearnClassifier(BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True))>
>>> accuracy(sk_classifier, test_feats)
0.812
Clearly, the sklearn algorithms perform better than NLTK's Naive Bayes implementation. The sklearn classifiers also have a much smaller memory footprint, and will produce much smaller pickle files on disk. Their classification speed is often slightly slower than the NaiveBayesClassifier class, but I think the accuracy and memory gains are quite worth it.
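To take advantage of those smaller pickle files, you can persist a trained wrapper with the standard pickle module; a minimal sketch (the filename is arbitrary):

import pickle

# Save the trained classifier to disk...
with open('sk_classifier.pickle', 'wb') as f:
    pickle.dump(sk_classifier, f)

# ...and load it back later without retraining.
with open('sk_classifier.pickle', 'rb') as f:
    sk_classifier = pickle.load(f)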
Earlier in this chapter, we covered the maximum entropy classifier. This algorithm is also known as logistic regression, and scikit-learn provides a corresponding implementation.
>>> from sklearn.linear_model import LogisticRegression
>>> sk_classifier = SklearnClassifier(LogisticRegression())
>>> sk_classifier.train(train_feats)
<SklearnClassifier(LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001))>
>>> accuracy(sk_classifier, test_feats)
0.892
Again, we see that the sklearn algorithm has better performance than NLTK's MaxentClassifier, which only had 72.2% accuracy. The logistic regression algorithm also has a much faster training time than the IIS or GIS algorithms, even when those algorithms have a limited number of iterations. This can be explained by sklearn's focus on optimized numeric processing using NumPy.
A third family of algorithms that NLTK does not support directly is Support Vector Machines, or SVM. These algorithms have been shown to be effective at learning on high-dimensional data, such as text classification, where every word feature counts as a dimension. You can learn more about support vector machines at https://en.wikipedia.org/wiki/Support_vector_machine. Here are some examples of using the sklearn implementations:
>>> from sklearn.svm import SVC
>>> sk_classifier = SklearnClassifier(SVC())
>>> sk_classifier.train(train_feats)
<SklearnClassifier(SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False))>
>>> accuracy(sk_classifier, test_feats)
0.69
>>> from sklearn.svm import LinearSVC
>>> sk_classifier = SklearnClassifier(LinearSVC())
>>> sk_classifier.train(train_feats)
<SklearnClassifier(LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))>
>>> accuracy(sk_classifier, test_feats)
0.864
>>> from sklearn.svm import NuSVC
>>> sk_classifier = SklearnClassifier(NuSVC())
>>> sk_classifier.train(train_feats)
/Users/jacob/py3env/lib/python3.3/site-packages/scipy/sparse/compressed.py:119: UserWarning: indptr array has non-integer dtype (float64)
  % self.indptr.dtype.name)
<SklearnClassifier(NuSVC(cache_size=200, coef0=0.0, degree=3, gamma=0.0, kernel='rbf', max_iter=-1, nu=0.5, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False))>
>>> accuracy(sk_classifier, test_feats)
0.882
You can see that in this case, NuSVC is the most accurate SVM classifier, just above LinearSVC, while SVC is much less accurate than either. These accuracy differences are a result of the different algorithm implementations and their default parameters. You can learn more about these specific implementations in scikit-learn's SVM documentation at http://scikit-learn.org/stable/modules/svm.html.
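For instance, SVC's default rbf kernel is only one of its options; you could experiment with its parameters, as in this sketch (results will depend on your data):

# A sketch: swap SVC's default rbf kernel for a linear one.
sk_classifier = SklearnClassifier(SVC(kernel='linear'))
sk_classifier.train(train_feats)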
If you are interested in exploring more aspects of machine learning with Python, the scikit-learn documentation is a great place to start: http://scikit-learn.org/stable/documentation.html
Earlier in this chapter, we covered the Training a Naive Bayes classifier and Training a maximum entropy classifier recipes. We will use the LinearSVC and NuSVC classifiers again in the following recipes.