Let's apply support vector machines to a classification problem. In recent years, support vector machines have been used successfully in the task of character recognition. Given an image, the classifier must predict the character that is depicted. Character recognition is a component of many optical character-recognition systems. Even small images require high-dimensional representations when raw pixel intensities are used as features. If the classes are linearly inseparable and must be mapped to a higher-dimensional feature space, the dimensions of the feature space can become even larger. Fortunately, SVMs are suited to working with such data efficiently. First, we will use scikit-learn to train a support vector machine to recognize handwritten digits. Then, we will work on a more challenging problem: recognizing alphanumeric characters in photographs.
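To give a sense of the scale involved, the following sketch counts features. It assumes 28 x 28 grayscale images and, as an illustrative choice that is not taken from the text, an explicit polynomial feature map containing every monomial of degree 3 or less. The function name is hypothetical.

```python
from math import comb


def count_polynomial_features(n_features, degree):
    """Count the monomials of total degree <= degree in n_features variables.

    The count includes the constant term.
    """
    return comb(n_features + degree, degree)


# One feature per pixel when raw intensities are used
raw = 28 * 28
# Size of an explicit degree-3 polynomial feature space built on those pixels
explicit = count_polynomial_features(raw, 3)

print(raw)       # 784
print(explicit)  # 80931145
```

A kernel lets the SVM work with inner products in such a space without constructing the mapped vectors, which is why the growth in dimensions need not make training impractical.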
The Modified National Institute of Standards and Technology (MNIST) database is a collection of 70,000 images of handwritten digits. The digits were sampled from documents written by employees of the US Census Bureau and American high school students. The images are grayscale and 28 x 28 pixels in size. Let's inspect some of the images using the following script:
>>> import matplotlib.pyplot as plt
>>> from sklearn.datasets import fetch_mldata
>>> import matplotlib.cm as cm
>>> digits = fetch_mldata('MNIST original', data_home='data/mnist').data
>>> counter = 1
>>> for i in range(1, 4):
...     for j in range(1, 6):
...         plt.subplot(3, 5, counter)
...         plt.imshow(digits[(i - 1) * 8000 + j].reshape((28, 28)), cmap=cm.Greys_r)
...         plt.axis('off')
...         counter += 1
...
>>> plt.show()
First, we load the data. scikit-learn provides the fetch_mldata convenience function to download the dataset if it is not found on disk and read it into an object. Then, we create a grid of subplots showing five instances each of the digits zero, one, and two. The script produces the following figure:
The MNIST dataset is partitioned into a training set of 60,000 images and a test set of 10,000 images. The dataset is commonly used to evaluate a variety of machine learning models; it is popular because little preprocessing is required. Let's use scikit-learn to build a classifier that can predict the digit depicted in an image.
First, we import the necessary classes:
from sklearn.datasets import fetch_mldata
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import scale
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
The script will fork additional processes during grid search, which requires execution from a __main__ block.
if __name__ == '__main__':
    data = fetch_mldata('MNIST original', data_home='data/mnist')
    X, y = data.data, data.target
    # Rescale pixel intensities from the range 0..255 to the range -1..1
    X = X / 255.0 * 2 - 1
Next, we load the data using the fetch_mldata convenience function. We scale the pixel intensities from the range 0 to 255 to the range -1 to 1, which centers the range of each feature around the origin. We then split the preprocessed data into training and test sets using the following line of code:
    X_train, X_test, y_train, y_test = train_test_split(X, y)
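The scaling and splitting steps can be tried on a small synthetic array. This is a sketch, assuming a scikit-learn release (0.18 or later) in which train_test_split is imported from sklearn.model_selection rather than sklearn.cross_validation; the random array merely stands in for the MNIST pixel data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Hypothetical stand-in for MNIST: 200 "images" of 784 pixel intensities in 0..255
X_raw = rng.randint(0, 256, size=(200, 784)).astype(np.float64)
y = rng.randint(0, 10, size=200)

# Same transformation as in the text: 0 maps to -1, 127.5 to 0, and 255 to 1
X = X_raw / 255.0 * 2 - 1

# With no test_size argument, 25 percent of the instances are held out
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X.min() >= -1.0, X.max() <= 1.0)
print(X_train.shape, X_test.shape)
```

With the default settings, train_test_split holds out a quarter of the instances, which for the 70,000 MNIST images is 17,500.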
Next, we instantiate an SVC, or support vector classifier, object. This object exposes an API like that of scikit-learn's other estimators; the classifier is trained using the fit method, and predictions are made using the predict method. If you consult the documentation for SVC, you will find that the estimator requires more hyperparameters than most of the other estimators we discussed. It is common for more powerful estimators to require more hyperparameters. The most interesting hyperparameters for SVC are set by the kernel, gamma, and C keyword arguments. The kernel keyword argument specifies the kernel to be used. scikit-learn provides implementations of the linear, polynomial, sigmoid, and radial basis function (RBF) kernels. The degree keyword argument should also be set when the polynomial kernel is used. C controls regularization; it plays a role similar to the lambda hyperparameter we used for logistic regression, except that larger values of C correspond to weaker regularization. The keyword argument gamma is the kernel coefficient for the sigmoid, polynomial, and RBF kernels. Setting these hyperparameters can be challenging, so we tune them by grid searching with the following code:
    pipeline = Pipeline([
        ('clf', SVC(kernel='rbf', gamma=0.01, C=100))
    ])
    print(X_train.shape)
    parameters = {
        'clf__gamma': (0.01, 0.03, 0.1, 0.3, 1),
        'clf__C': (0.1, 0.3, 1, 3, 10, 30),
    }
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=2, verbose=1, scoring='accuracy')
    # Search on the first 10,000 training instances to keep the running time manageable
    grid_search.fit(X_train[:10000], y_train[:10000])
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print(' %s: %r' % (param_name, best_parameters[param_name]))
    predictions = grid_search.predict(X_test)
    print(classification_report(y_test, predictions))
The following is the output of the preceding script:
Fitting 3 folds for each of 30 candidates, totalling 90 fits
[Parallel(n_jobs=2)]: Done   1 jobs       | elapsed:   7.7min
[Parallel(n_jobs=2)]: Done  50 jobs       | elapsed: 201.2min
[Parallel(n_jobs=2)]: Done  88 out of  90 | elapsed: 304.8min remaining:   6.9min
[Parallel(n_jobs=2)]: Done  90 out of  90 | elapsed: 309.2min finished
Best score: 0.966
Best parameters set:
 clf__C: 3
 clf__gamma: 0.01
             precision    recall  f1-score   support

        0.0       0.98      0.99      0.99      1758
        1.0       0.98      0.99      0.98      1968
        2.0       0.95      0.97      0.96      1727
        3.0       0.97      0.95      0.96      1803
        4.0       0.97      0.98      0.97      1714
        5.0       0.96      0.96      0.96      1535
        6.0       0.98      0.98      0.98      1758
        7.0       0.97      0.96      0.97      1840
        8.0       0.95      0.96      0.96      1668
        9.0       0.96      0.95      0.96      1729

avg / total       0.97      0.97      0.97     17500
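The f1-score column is the harmonic mean of the precision and recall columns. As a quick check, the sketch below recomputes it for the digit 3 row from the rounded values printed above. The helper name is hypothetical, and because the printed inputs are already rounded, other rows may differ in the last digit.

```python
def f1_score_from(precision, recall):
    """Return the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Digit 3 in the report: precision 0.97, recall 0.95
print('%0.2f' % f1_score_from(0.97, 0.95))  # 0.96
```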
The best model has an average F1 score of 0.97; this score can be increased further by training on more than the first ten thousand instances.
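For readers on a recent scikit-learn release, the same workflow can be sketched at a much smaller scale. This is not the experiment above: it assumes scikit-learn 0.18 or later, where GridSearchCV and train_test_split are imported from sklearn.model_selection, and it swaps in the 8 x 8 digit images bundled with scikit-learn (load_digits) so that it runs quickly without a download. The reduced parameter grid is also an illustrative choice.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Assumption: the bundled 8 x 8 digits stand in for MNIST
digits = load_digits()
# Pixel intensities in this dataset range from 0 to 16; rescale them to -1..1
X = digits.data / 16.0 * 2 - 1
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([('clf', SVC(kernel='rbf'))])
# A deliberately small grid: 2 x 2 candidates, 3 folds each
parameters = {
    'clf__gamma': (0.01, 0.1),
    'clf__C': (1, 10),
}
grid_search = GridSearchCV(pipeline, parameters, cv=3, scoring='accuracy')
grid_search.fit(X_train, y_train)

print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters: %r' % grid_search.best_params_)

# The fitted search object predicts with the best estimator it found
predictions = grid_search.predict(X_test)
print(classification_report(y_test, predictions))
```

Because n_jobs is left at its default, this version does not start worker processes and so does not need to be wrapped in a __main__ block.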
Now let's try a more challenging problem. We will classify alphanumeric characters in natural images. The Chars74K dataset, collected by T. E. de Campos, B. R. Babu, and M. Varma for Character Recognition in Natural Images, contains more than 74,000 images of the digits zero through nine and the uppercase and lowercase letters of the English alphabet. The following are three examples of images of the lowercase letter z. Chars74K can be downloaded from http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/.
Several types of images make up the collection. We will use 7,705 images of characters that were extracted from photographs of street scenes taken in Bangalore, India. In contrast to MNIST, the images in this portion of Chars74K depict the characters in a variety of fonts, colors, and perturbations. After expanding the archive, we will use the files in the English/Img/GoodImg/Bmp/ directory. First, we will import the necessary classes:
import os
import numpy as np
from sklearn.svm import SVC
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report
from PIL import Image
Next, we will define a function that resizes images using the Python Imaging Library:
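def resize_and_crop(image, size):
    # Compare the aspect ratio (width / height) of the image with that of the target size.
    # Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the equivalent filter there.
    img_ratio = image.size[0] / float(image.size[1])
    ratio = size[0] / float(size[1])
    if ratio > img_ratio:
        # The image is relatively taller than the target: match the width, then crop
        image = image.resize((size[0], size[0] * image.size[1] // image.size[0]), Image.ANTIALIAS)
        image = image.crop((0, 0, size[0], size[1]))
    elif ratio < img_ratio:
        # The image is relatively wider than the target: match the height, then crop
        image = image.resize((size[1] * image.size[0] // image.size[1], size[1]), Image.ANTIALIAS)
        image = image.crop((0, 0, size[0], size[1]))
    else:
        # The aspect ratios already agree: a plain resize is enough
        image = image.resize((size[0], size[1]), Image.ANTIALIAS)
    return image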
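To see what the two resizing branches do, the following sketch reproduces only the size arithmetic, without opening any images. The helper name and the example dimensions are hypothetical.

```python
def intermediate_size(image_size, target_size):
    """Return the (width, height) an image is scaled to before it is cropped."""
    width, height = image_size
    target_width, target_height = target_size
    img_ratio = width / float(height)
    ratio = target_width / float(target_height)
    if ratio > img_ratio:
        # Relatively taller than the target: the width matches, the height overshoots
        return (target_width, target_width * height // width)
    elif ratio < img_ratio:
        # Relatively wider than the target: the height matches, the width overshoots
        return (target_height * width // height, target_height)
    # Same aspect ratio: the image is scaled straight to the target size
    return (target_width, target_height)


print(intermediate_size((60, 40), (30, 30)))  # (45, 30)
print(intermediate_size((40, 60), (30, 30)))  # (30, 45)
print(intermediate_size((50, 50), (30, 30)))  # (30, 30)
```

In the first two cases, the subsequent crop keeps the top-left 30 x 30 region, so part of the scaled image is discarded rather than distorted.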
Then we will load the images for each of the 62 classes and convert them to grayscale. Unlike MNIST, the images of Chars74K do not have consistent dimensions, so we will resize them to 30 pixels on a side using the resize_and_crop function we defined. Finally, we will convert the processed images to a NumPy array:
X = []
y = []
for path, subdirs, files in os.walk('data/English/Img/GoodImg/Bmp/'):
    for filename in files:
        f = os.path.join(path, filename)
        img = Image.open(f).convert('L')  # convert to grayscale
        img_resized = resize_and_crop(img, (30, 30))
        img_resized = np.asarray(img_resized.getdata(), dtype=np.float64) \
            .reshape((img_resized.size[1] * img_resized.size[0], 1))
        # The class label is the part of the file name between 'img' and '-'
        target = filename[3:filename.index('-')]
        X.append(img_resized)
        y.append(target)
X = np.array(X)
# Drop the trailing axis so that each row holds one flattened image
X = X.reshape(X.shape[:2])

We will then train a support vector classifier with a polynomial kernel:

classifier = SVC(verbose=0, kernel='poly', degree=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print(classification_report(y_test, predictions))
The preceding script produces the following output:
             precision    recall  f1-score   support

        001       0.24      0.22      0.23        23
        002       0.24      0.45      0.32        20
...
        061       0.33      0.15      0.21        13
        062       0.08      0.25      0.12         8

avg / total       0.41      0.34      0.36      1927
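The class labels in the report are the three-digit codes sliced from the file names. The sketch below shows the slicing on a made-up file name and maps a code back to a character. The mapping assumes the usual Chars74K ordering (001 to 010 for the digits, 011 to 036 for the uppercase letters, and 037 to 062 for the lowercase letters), which the text does not state, so it should be checked against the dataset's own documentation. The helper name is hypothetical.

```python
import string

# Assumed class ordering: digits, then uppercase letters, then lowercase letters
CHARACTERS = string.digits + string.ascii_uppercase + string.ascii_lowercase


def label_to_character(label):
    """Map a three-digit class code such as '011' to the character it is assumed to denote."""
    return CHARACTERS[int(label) - 1]


# Hypothetical file name following the img<class>-<index> pattern
filename = 'img011-00012.png'
# Same slicing as in the loading loop: the text between 'img' and '-'
target = filename[3:filename.index('-')]

print(target)                      # 011
print(label_to_character(target))  # A
print(label_to_character('062'))   # z
```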
It is apparent that this is a more challenging task than classifying digits in MNIST. The appearances of the characters vary more widely, and the characters are perturbed more because the images were sampled from photographs rather than scanned documents. Furthermore, there are far fewer training instances for each class in Chars74K than there are in MNIST. The performance of the classifier could be improved by adding training data, preprocessing the images differently, or using more sophisticated feature representations.