Chapter 9. From the Perceptron to Support Vector Machines

In the previous chapter we discussed the perceptron. As a binary classifier, the perceptron cannot effectively classify linearly inseparable feature representations. We encountered a similar problem in our discussion of multiple linear regression in Chapter 2, Linear Regression; we examined a dataset in which the response variable was not linearly related to the explanatory variables. To improve the accuracy of the model, we introduced a special case of multiple linear regression called polynomial regression. We created synthetic combinations of features, and were able to model a linear relationship between the response variable and the features in the higher-dimensional feature space.

While this method of increasing the dimensions of the feature space may seem like a promising technique to use when approximating nonlinear functions with linear models, it suffers from two related problems. The first is a computational problem; computing the mapped features and working with larger vectors requires more computing power. The second problem pertains to generalization; increasing the dimensions of the feature representation introduces the curse of dimensionality. Learning from high-dimensional feature representations requires exponentially more training data to avoid overfitting.

In this chapter, we will discuss a powerful model for classification and regression called the support vector machine (SVM). First, we will revisit mapping features to higher-dimensional spaces. Then, we will discuss how support vector machines mitigate the computation and generalization problems encountered when learning from the data mapped to higher-dimensional spaces. Entire books are devoted to describing support vector machines, and describing the optimization algorithms used to train SVMs requires more advanced math than we have used in previous chapters. Instead of working through toy examples in detail as we have done in previous chapters, we will try to develop an intuition for how support vector machines work in order to apply them effectively with scikit-learn.

Kernels and the kernel trick

Recall that the perceptron separates the instances of the positive class from the instances of the negative class using a hyperplane as a decision boundary. The decision boundary is given by the following equation:

$f(x) = \langle w, x \rangle + b$

Predictions are made using the following function:

$h(x) = \operatorname{sign}(f(x))$

Note that we previously expressed the inner product $\langle w, x \rangle$ as $w \cdot x$. To be consistent with the notational conventions used for support vector machines, we will adopt the former notation in this chapter.

While the proof is beyond the scope of this chapter, we can write the model differently. The following expression of the model is called the dual form. The expression we used previously is the primal form:

$f(x) = \sum_{i} \alpha_i y_i \langle x_i, x \rangle + b$

The most important difference between the primal and dual forms is that the primal form computes the inner product of the model parameters and the test instance's feature vector, while the dual form computes the inner product of the training instances and the test instance's feature vector. Shortly, we will exploit this property of the dual form to work with linearly inseparable classes. First, we must formalize our definition of mapping features to higher-dimensional spaces.
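
Before doing so, the following minimal NumPy sketch may help make the distinction concrete. It uses made-up training instances, labels, and coefficients (the variable names are ours, not part of any library) to show that the primal and dual forms compute the same decision value when the weight vector is the weighted sum of the training instances:

import numpy as np

# Made-up training instances, labels, dual coefficients, and bias
X_train = np.array([[1.0, 2.0], [3.0, 1.0], [0.5, 4.0]])
y_train = np.array([1.0, -1.0, 1.0])
alphas = np.array([0.4, 0.7, 0.1])
b = 0.5
x_test = np.array([2.0, 3.0])

# Primal form: build the weight vector, then take its inner product with the test instance
w = (alphas * y_train) @ X_train
primal = np.dot(w, x_test) + b

# Dual form: a weighted sum of inner products of the training instances with the test instance
dual = np.sum(alphas * y_train * (X_train @ x_test)) + b

print(primal, dual)  # Both print the same decision value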

In the section on polynomial regression in Chapter 2, Linear Regression, we mapped features to a higher-dimensional space in which they were linearly related to the response variable. The mapping increased the number of features by creating quadratic terms from combinations of the original features. These synthetic features allowed us to express a nonlinear function with a linear model. In general, a mapping is given by the following expression:

$x \to \phi(x), \quad \phi : \mathbb{R}^d \to \mathbb{R}^D$

The plot on the left in the following figure shows the original feature space of a linearly inseparable data set. The plot on the right shows that the data is linearly separable after mapping to a higher-dimensional space:

[Figure: a linearly inseparable data set in its original feature space (left) and the same data after mapping to a higher-dimensional space, where it is linearly separable (right)]

Let's return to the dual form of our decision boundary, and the observation that the feature vectors appear only inside of a dot product. We could map the data to a higher-dimensional space by applying the mapping to the feature vectors as follows:

$x \to \phi(x)$
$f(x) = \sum_{i} \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b$

As noted, this mapping allows us to express more complex models, but it introduces computation and generalization problems. Mapping the feature vectors and computing their dot products can require a prohibitively large amount of processing power.

Observe in the second equation that while we have mapped the feature vectors to a higher-dimensional space, the feature vectors still only appear as a dot product. The dot product is scalar; we do not require the mapped feature vectors once this scalar has been computed. If we can use a different method to produce the same scalar as the dot product of the mapped vectors, we can avoid the costly work of explicitly computing the dot product and mapping the feature vectors.

Fortunately, there is such a method called the kernel trick. A kernel is a function that, given the original feature vectors, returns the same value as the dot product of its corresponding mapped feature vectors. Kernels do not explicitly map the feature vectors to a higher-dimensional space, or calculate the dot product of the mapped vectors. Kernels produce the same value through a different series of operations that can often be computed more efficiently. Kernels are defined more formally in the following equation:

$K(x, z) = \langle \phi(x), \phi(z) \rangle$

Let's demonstrate how kernels work. Suppose that we have two feature vectors, x and z:

$x = (x_1, x_2)$
$z = (z_1, z_2)$

In our model, we wish to map the feature vectors to a higher-dimensional space using the following transformation:

$\phi(x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$

The dot product of the mapped feature vectors is equivalent to:

$\langle \phi(x), \phi(z) \rangle = \langle (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2), (z_1^2, z_2^2, \sqrt{2}\,z_1 z_2) \rangle = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2$

The kernel given by the following equation produces the same value as the dot product of the mapped feature vectors:

$K(x, z) = \langle x, z \rangle^2$
$K(x, z) = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = \langle \phi(x), \phi(z) \rangle$

Let's plug in values for the feature vectors to make this example more concrete:

$x = (4, 9)$
$z = (3, 3)$
$K(x, z) = (4 \times 3 + 9 \times 3)^2 = 39^2 = 1521$
$\langle \phi(x), \phi(z) \rangle = \langle (16, 81, 36\sqrt{2}), (9, 9, 9\sqrt{2}) \rangle = 144 + 729 + 648 = 1521$

The kernel $K(x, z)$ produced the same value as the dot product $\langle \phi(x), \phi(z) \rangle$ of the mapped feature vectors, but it never explicitly mapped the feature vectors to the higher-dimensional space, and it required fewer arithmetic operations. This example used only two-dimensional feature vectors; data sets with even a modest number of features can result in mapped feature spaces with massive dimensions. scikit-learn provides several commonly used kernels, including the polynomial, sigmoid, Gaussian, and linear kernels. Polynomial kernels are given by the following equation:
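
As a quick check, here is a minimal NumPy sketch (the functions phi and quadratic_kernel are our own names, not library functions) that computes both sides of the example above and confirms that they agree:

import numpy as np

def phi(v):
    # Explicitly map a two-dimensional vector to the three-dimensional quadratic feature space
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

def quadratic_kernel(a, b):
    # Compute the same value directly in the original two-dimensional space
    return np.dot(a, b) ** 2

x = np.array([4.0, 9.0])
z = np.array([3.0, 3.0])

print(quadratic_kernel(x, z))    # 1521.0, without ever mapping the vectors
print(np.dot(phi(x), phi(z)))    # 1521.0, up to floating-point rounding
print(np.isclose(quadratic_kernel(x, z), np.dot(phi(x), phi(z))))  # True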

$K(x, x') = (1 + \langle x, x' \rangle)^k$

Quadratic kernels, or polynomial kernels where k is equal to 2, are commonly used in natural language processing.
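
For reference, scikit-learn also exposes pairwise kernel functions directly. The following sketch computes a quadratic kernel matrix with sklearn.metrics.pairwise.polynomial_kernel, setting gamma and coef0 to 1 so that the kernel reduces to $(1 + \langle x, x' \rangle)^2$, and checks the result against an explicit NumPy computation:

import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

X = np.array([[4.0, 9.0], [1.0, 2.0]])
Z = np.array([[3.0, 3.0], [5.0, 0.0]])

# (1 + <x, z>)^2 for every pair of rows from X and Z
K = polynomial_kernel(X, Z, degree=2, gamma=1, coef0=1)
K_manual = (1 + X @ Z.T) ** 2

print(np.allclose(K, K_manual))  # True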

The sigmoid kernel is given by the following equation, in which $\gamma$ and $r$ are hyperparameters that can be tuned through cross-validation:

$K(x, x') = \tanh(\gamma \langle x, x' \rangle + r)$
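
One way to tune these hyperparameters through cross-validation is a grid search. The following sketch uses GridSearchCV on a synthetic data set; in scikit-learn's SVC, the sigmoid kernel's $\gamma$ and $r$ correspond to the gamma and coef0 parameters, and the candidate values below are arbitrary:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# A synthetic binary classification problem, purely for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate values for gamma and coef0 (the kernel's r); chosen arbitrarily here
param_grid = {'gamma': [0.01, 0.1, 1.0], 'coef0': [0.0, 1.0]}
grid = GridSearchCV(SVC(kernel='sigmoid'), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)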

The Gaussian kernel is a good first choice for problems requiring nonlinear models. The Gaussian kernel is a radial basis function. A decision boundary that is a hyperplane in the mapped feature space is similar to a decision boundary that is a hypersphere in the original space. The feature space produced by the Gaussian kernel can have an infinite number of dimensions, which would be impossible to represent explicitly. The Gaussian kernel is given by the following equation:

$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$

$\gamma$ is a hyperparameter. It is always important to scale the features when using support vector machines, but feature scaling is especially important when using the Gaussian kernel.
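
The following sketch, which uses a synthetic data set and arbitrary parameter values, illustrates both points: it standardizes the features with StandardScaler before fitting an SVC with the Gaussian kernel, which scikit-learn selects with kernel='rbf' and parameterizes with gamma:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale the features to zero mean and unit variance before the RBF-kernel SVM
model = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC(kernel='rbf', gamma=0.1, C=1.0))
])
model.fit(X_train, y_train)

print(model.score(X_test, y_test))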

Choosing a kernel can be challenging. Ideally, a kernel will measure the similarity between instances in a way that is useful to the task. While kernels are commonly used with support vector machines, they can also be used with any model that can be expressed in terms of the dot product of two feature vectors, including logistic regression, perceptrons, and principal component analysis. In the next section, we will address the second problem caused by mapping to high-dimensional feature spaces: generalization.
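
Before moving on, here is one example of using a kernel outside of an SVM: a sketch that applies scikit-learn's KernelPCA with the Gaussian kernel to the concentric circles data set, which is linearly inseparable in its original two-dimensional space (the gamma value below is arbitrary):

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: linearly inseparable in the original two-dimensional space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Kernel PCA with the Gaussian kernel implicitly maps the data to a
# higher-dimensional space and projects it onto new principal components
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X)

print(X_kpca.shape)  # (400, 2)

With a suitable gamma, the projection typically separates the two circles along the leading components.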
