In this recipe, we introduce support vector machines, or SVMs. These models can be used for classification and regression. Here, we illustrate how to use linear and nonlinear SVMs on a simple classification task. This recipe is inspired by an example in the scikit-learn documentation (see http://scikit-learn.org/stable/auto_examples/svm/plot_svm_nonlinear.html).
>>> import numpy as np
    import pandas as pd
    import sklearn
    import sklearn.datasets as ds
    import sklearn.model_selection as ms
    import sklearn.svm as svm
    import matplotlib.pyplot as plt
    %matplotlib inline
>>> X = np.random.randn(200, 2)
    y = X[:, 0] + X[:, 1] > 1
>>> # We train the classifier.
    est = svm.LinearSVC()
    est.fit(X, y)
>>> # We generate a grid in the square [-3,3]^2.
    xx, yy = np.meshgrid(np.linspace(-3, 3, 500),
                         np.linspace(-3, 3, 500))

    # This function takes an SVM estimator as input.
    def plot_decision_function(est, title):
        # We evaluate the decision function on the grid.
        Z = est.decision_function(np.c_[xx.ravel(),
                                        yy.ravel()])
        Z = Z.reshape(xx.shape)
        cmap = plt.cm.Blues

        # We display the decision function on the grid.
        fig, ax = plt.subplots(1, 1, figsize=(5, 5))
        ax.imshow(Z,
                  extent=(xx.min(), xx.max(),
                          yy.min(), yy.max()),
                  aspect='auto',
                  origin='lower',
                  cmap=cmap)

        # We display the boundaries.
        ax.contour(xx, yy, Z, levels=[0],
                   linewidths=2, colors='k')

        # We display the points with their true labels.
        ax.scatter(X[:, 0], X[:, 1], s=50,
                   c=.5 + .5 * y,
                   edgecolors='k',
                   lw=1, cmap=cmap,
                   vmin=0, vmax=1)
        ax.axhline(0, color='k', ls='--')
        ax.axvline(0, color='k', ls='--')
        ax.axis([-3, 3, -3, 3])
        ax.set_axis_off()
        ax.set_title(title)
>>> plot_decision_function(est, "Linearly separable, linear SVC")
The linear SVC tried to separate the points with a line and it did a pretty good job here.
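As a quick, optional check (not part of the original recipe), we can quantify this visual impression with a cross-validated accuracy score; the exact value depends on the random points generated above:

>>> # Optional check (assumption: not in the original recipe). We estimate
    # the accuracy of the linear SVC with cross-validation; the exact value
    # depends on the random points generated above.
    print("Score: {0:.3f}".format(
        ms.cross_val_score(est, X, y).mean()))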
We now modify the labels with an XOR function: a point's label is 1 if the coordinates have different signs. This classification is not linearly separable. Therefore, a linear SVC fails completely:

>>> y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)

    # We train the classifier.
    est = ms.GridSearchCV(svm.LinearSVC(),
                          {'C': np.logspace(-3., 3., 10)})
    est.fit(X, y)
    print("Score: {0:.1f}".format(
        ms.cross_val_score(est, X, y).mean()))

    # We plot the decision function.
    plot_decision_function(est, "XOR, linear SVC")

Score: 0.5
Fortunately, we can use a nonlinear SVC instead. Here, we keep the default RBF kernel of SVC and run a grid search on the C and gamma hyperparameters:

>>> y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)

    est = ms.GridSearchCV(
        svm.SVC(),
        {'C': np.logspace(-3., 3., 10),
         'gamma': np.logspace(-3., 3., 10)})
    est.fit(X, y)
    print("Score: {0:.3f}".format(
        ms.cross_val_score(est, X, y).mean()))

    plot_decision_function(
        est.best_estimator_, "XOR, non-linear SVC")

Score: 0.955
This time, the nonlinear SVC successfully managed to classify these nonlinearly separable points.
A two-class linear SVC tries to find a hyperplane (defined by a linear equation) that best separates the two sets of points (grouped according to their labels). There is also the constraint that this separating hyperplane should be as far as possible from the closest points of each class; this distance is called the margin. This method works best when such a hyperplane exists. Otherwise, it can fail completely, as we saw in the XOR example: XOR is a classic example of an operation that is not linearly separable.
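As a small sketch (not part of the original recipe, and reusing the X array defined above), we can look at the hyperplane learned by a linear SVC, which scikit-learn exposes through the coef_ and intercept_ attributes of the fitted estimator:

>>> # Sketch (assumptions: X from the recipe above; the variable names lin,
    # w, and b are ours). The learned hyperplane is w . x + b = 0.
    lin = svm.LinearSVC()
    lin.fit(X, X[:, 0] + X[:, 1] > 1)  # the linearly separable labels
    w, b = lin.coef_[0], lin.intercept_[0]
    print("w =", w, " b =", b)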
The SVM classes in scikit-learn have a C hyperparameter. This hyperparameter trades off misclassification of training examples against simplicity of the decision surface. A low C value makes the decision surface smooth, while a high C value aims at classifying all training examples correctly. This is another example where a hyperparameter quantifies the bias-variance trade-off. This hyperparameter can be chosen with cross-validation and grid search.
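For instance, assuming est is the GridSearchCV object fitted on the XOR example above, we can inspect the hyperparameters selected by the grid search (a small sketch, not from the original recipe):

>>> # Sketch (assumption: est is the fitted GridSearchCV from the nonlinear
    # XOR example above). best_params_ holds the selected hyperparameters,
    # and best_score_ the corresponding mean cross-validation score.
    print(est.best_params_)
    print(est.best_score_)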
The linear SVC can also be extended to multiclass problems. The multiclass SVC is directly implemented in scikit-learn.
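For example, here is a minimal sketch on the Iris dataset (an assumption of ours; it is not used elsewhere in this recipe), where SVC classifies the three species directly:

>>> # Minimal multiclass sketch (assumption: the Iris dataset, not part of
    # this recipe). SVC handles the three classes transparently, using a
    # one-vs-one scheme internally.
    iris = ds.load_iris()
    print(ms.cross_val_score(svm.SVC(), iris.data, iris.target).mean())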
The nonlinear SVC works by considering a nonlinear transformation $\phi(\mathbf{x})$ from the original space into a higher-dimensional space. This nonlinear transformation can increase the linear separability of the classes. In practice, all dot products $\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$ are replaced by the kernel $K(\mathbf{x}, \mathbf{x}')$.
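A small sketch of this idea (assuming the rbf_kernel helper from sklearn.metrics.pairwise, the XOR arrays X and y defined above, and an arbitrary gamma value): an SVC with kernel='precomputed' accepts the kernel matrix directly instead of the raw points:

>>> # Sketch (assumptions: rbf_kernel from sklearn.metrics.pairwise, the XOR
    # arrays X and y from above, and an arbitrary gamma=1.). We precompute
    # the kernel matrix K[i, j] = K(x_i, x_j) and pass it to the SVC.
    from sklearn.metrics.pairwise import rbf_kernel
    K = rbf_kernel(X, X, gamma=1.)
    clf = svm.SVC(kernel='precomputed')
    clf.fit(K, y)
    print(clf.score(K, y))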
There are several widely used nonlinear kernels. By default, SVC uses Gaussian radial basis functions (RBF):

$$K(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \, \|\mathbf{x} - \mathbf{x}'\|^2\right)$$

Here, $\gamma$ is a hyperparameter of the model that can be chosen with grid search and cross-validation.
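As an illustrative sketch (the gamma values below are ours, not from the original recipe), we can visualize how gamma controls the flexibility of the RBF decision boundary on the XOR data, reusing plot_decision_function from above:

>>> # Sketch (assumptions: the XOR arrays X, y and plot_decision_function
    # defined above; the gamma values are only illustrative). A small gamma
    # gives a smooth, almost linear boundary; a large gamma gives a boundary
    # that closely follows individual training points.
    for gamma in (.1, 10.):
        clf = svm.SVC(gamma=gamma).fit(X, y)
        plot_decision_function(clf, "RBF SVC, gamma={0}".format(gamma))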
The $\phi$ function does not need to be computed explicitly. This is the kernel trick; it suffices to know the kernel $K(\mathbf{x}, \mathbf{x}')$. The existence of a function $\phi$ corresponding to a given kernel $K$ is guaranteed by a mathematical theorem in functional analysis (Mercer's theorem).
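If one does want an explicit (approximate) feature map, scikit-learn's kernel approximation module provides one. The following sketch (RBFSampler and the variables below are assumptions, not part of this recipe) approximates $\phi$ for the RBF kernel with random Fourier features and feeds the transformed points to a linear SVC:

>>> # Optional sketch (assumptions: RBFSampler from sklearn.kernel_approximation
    # and the XOR arrays X, y from above; not part of the original recipe).
    # Random Fourier features give an explicit, approximate feature map phi,
    # so a linear SVC on the transformed points approximates a nonlinear SVC.
    from sklearn.kernel_approximation import RBFSampler
    phi = RBFSampler(gamma=1., n_components=100)
    X_phi = phi.fit_transform(X)
    print(ms.cross_val_score(svm.LinearSVC(), X_phi, y).mean())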
Here are a few references about support vector machines: