Support vector machines

A support vector machine is a linear discriminative classifier that attempts to maximize the margin between classes during training. This approach is similar to the definition of a hyperplane through the training of logistic regression (refer to the Binomial classification section in Chapter 6, Regression and Regularization). The main difference is that the support vector machine computes the optimal separating hyperplane between groups or classes of observations. The hyperplane is indeed the equation that represents the model generated through training.

The quality of the SVM depends on the distance, known as the margin, between the different classes of observations. The accuracy of the classifier increases as the margin increases.

The linear SVM

First, let's apply the support vector machine to extract a linear model (classifier or regression) from a labeled set of observations. There are two scenarios for defining a linear model. The labeled observations are either of the following:

  • They are naturally segregated in the features space (the separable case)
  • They are intermingled with overlap (the nonseparable case)

It is easy to understand the concept of an optimal separating hyperplane in cases where the observations are naturally segregated.

The separable case – the hard margin

The concept of separating a training set of observations with a hyperplane is best explained with a two-dimensional (x, y) set of observations belonging to two classes, C1 and C2. The label y takes the value -1 or +1.

The separating hyperplane is defined by the linear equation w·xᵀ + w0 = 0, which sits midway between the boundary data points of the class C1 (H1: w·xᵀ + w0 + 1 = 0) and the class C2 (H2: w·xᵀ + w0 - 1 = 0). The planes H1 and H2 are the support vectors:

The visualization of the hard margin in the support vector machine

In the separable case, the support vectors fully segregate the observations into two distinct classes. The margin between the two support vectors is the same for all the observations and is known as the hard margin.
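
The following is a minimal Scala sketch of these hard margin conditions: it checks that a hypothetical trained model (w, w0) places every observation on or beyond the support vector of its own class, that is, y(w·xᵀ + w0) ≥ 1, and computes the width of the margin, 2/‖w‖. The weights, bias, and observations are illustrative placeholders, not values produced by an actual training run.

// A minimal sketch of the hard margin conditions, assuming a trained
// weight vector w and bias w0 (the values here are illustrative only).
object HardMarginSketch extends App {
  val w  = Array(2.0, -1.0)   // hypothetical model weights
  val w0 = 0.5                // hypothetical bias

  // Labeled observations: (features, label in {-1, +1})
  val obs = Seq(
    (Array(1.5,  0.2),  1),
    (Array(-0.8, 1.1), -1)
  )

  def dot(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => a * b }.sum

  // Hard margin condition: y * (w.x + w0) >= 1 for every observation
  val separable = obs.forall { case (x, y) => y * (dot(w, x) + w0) >= 1.0 }

  // Width of the hard margin between H1 and H2: 2 / ||w||
  val margin = 2.0 / math.sqrt(dot(w, w))

  println(s"All constraints satisfied: $separable, margin width: $margin")
}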

Note

The separable case

M1: The support vectors equation w is represented as:

H_1:\; w \cdot x^T + w_0 + 1 = 0, \qquad H_2:\; w \cdot x^T + w_0 - 1 = 0, \qquad \text{margin} = \frac{2}{\lVert w \rVert}

M2: The hard margin optimization problem is given by:

\min_{w,\, w_0} \; \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i^T + w_0) \ge 1, \;\; i = 1, \dots, N

The nonseparable case – the soft margin

In the nonseparable case, the support vectors cannot completely segregate observations through training. They merely become linear functions that penalize the few observations or outliers that are located outside (or beyond) their respective support vector H1 or H2. The penalty variable ξ, also known as the slack variable, increases if the outlier is further away from the support vector:

The visualization of the soft margin in the support vector machine

The observations that lie on the correct side of their own support vector do not have to be penalized; the condition is the same as for the hard margin, which means that the slack ξ is null. The technique penalizes the observations that belong to a class but are located beyond its support vector; the slack ξ increases as the observation gets closer to the support vector of the other class and beyond it. The margin, in this case, is known as a soft margin because the separating hyperplane is enforced through a slack variable.
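
As a concrete illustration, the following Scala sketch computes the slack ξ = max(0, 1 - y(w·xᵀ + w0)) for a few observations. The weight vector, bias, and data points are purely illustrative; in practice, (w, w0) comes from the soft margin optimization.

// A minimal sketch of the slack variable for a hypothetical linear model (w, w0).
object SoftMarginSketch extends App {
  val w  = Array(1.0, 1.0)
  val w0 = -0.5

  def dot(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => a * b }.sum

  // Slack xi = max(0, 1 - y*(w.x + w0)):
  //   xi == 0     -> the observation lies on the correct side of its support vector
  //   0 < xi <= 1 -> inside the margin but still correctly classified
  //   xi > 1      -> misclassified (beyond the separating hyperplane)
  def slack(x: Array[Double], y: Int): Double =
    math.max(0.0, 1.0 - y * (dot(w, x) + w0))

  val obs = Seq(
    (Array(2.0, 2.0),  1),   // well inside its own class
    (Array(0.9, 0.4),  1),   // inside the margin
    (Array(1.5, 1.0), -1)    // outlier on the wrong side
  )
  obs.foreach { case (x, y) => println(f"y=$y%2d  xi=${slack(x, y)}%.3f") }
}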

Note

The nonseparable case

M3: The optimization of the soft margin for a linear SVM with C formulation is defined as:

\min_{w,\, w_0,\, \xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i\,(w \cdot x_i^T + w_0) \ge 1 - \xi_i, \;\; \xi_i \ge 0

Here, C is the penalty (or inverse regularization) factor.

You may wonder how the minimization of the margin error is related to the loss function and the penalization factor introduced for ridge regression (refer to the Numerical optimization section in Chapter 6, Regression and Regularization). The second term in the formula corresponds to the ubiquitous loss function. You will certainly recognize the first term as the L2 regularization penalty with λ = 1/(2C).
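
The correspondence can be made explicit by folding the constraints of the C formulation into the hinge loss and dividing the objective by the constant C; this short derivation follows directly from the formulas above:

\min_{w,\, w_0} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \max\!\big(0,\; 1 - y_i\,(w \cdot x_i^T + w_0)\big)
\;\;\Longleftrightarrow\;\;
\min_{w,\, w_0} \; \lambda \lVert w \rVert^2 + \sum_{i=1}^{N} \max\!\big(0,\; 1 - y_i\,(w \cdot x_i^T + w_0)\big), \qquad \lambda = \frac{1}{2C}

Dividing the objective by C > 0 does not change its minimizer, so the two formulations produce the same model.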

The problem can be reformulated as the minimization of a function known as the primal problem [8:6].

Note

M4: The primal problem formulation of the support vector classifier using the L2 regularization is as follows:

\min_{w,\, w_0} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} L\big(y_i,\, w \cdot x_i^T + w_0\big), \qquad L(y, f) = \max(0,\; 1 - y\,f)

The C penalty factor is the inverse of the L2 regularization factor. The loss function L is known as the hinge loss. The formulation of the margin using the C penalty (or cost) parameter is known as the C-SVM formulation. C-SVM is sometimes called the C-Epsilon SVM formulation for the nonseparable case.

The ν-SVM (or Nu-SVM) is an alternative formulation to C-SVM. The formulation is more descriptive than C-SVM; ν represents the upper bound of the fraction of training observations that are poorly classified and the lower bound of the fraction of observations that lie on the support vectors [8:7].

Note

M5: The ν-SVM formulation of a linear SVM using the L2 regularization is defined as:

\min_{w,\, w_0,\, \xi,\, \rho} \; \frac{1}{2}\lVert w \rVert^2 - \nu\rho + \frac{1}{N}\sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i\,(w \cdot x_i^T + w_0) \ge \rho - \xi_i, \;\; \xi_i \ge 0, \;\; \rho \ge 0

Here, ρ is a margin factor used as an optimization variable.

The C-SVM formulation is used throughout this chapter for the binary and one-class support vector classifiers as well as for support vector regression.

Note

Sequential Minimal Optimization

The optimization problem consists of the minimization of a quadratic objective function (‖w‖²) subject to N linear constraints, N being the number of observations. The time complexity of the algorithm is O(N³). A more efficient algorithm, known as Sequential Minimal Optimization (SMO), has been introduced to reduce the time complexity to O(N²).

The nonlinear SVM

So far, we assumed that the separating hyperplane and therefore the support vectors are linear functions. Unfortunately, such assumptions are not always correct in the real world.

Max-margin classification

Support vector machines are known as large or maximum margin classifiers. The objective is to maximize the margin between the support vectors with hard constraints for separable (similarly, soft constraints with slack variables for nonseparable) cases.

The model parameters {wi} are rescaled during optimization to guarantee that the margin is at least 1. Such algorithms are known as maximum (or large) margin classifiers.

The problem of fitting a nonlinear model into the labeled observations using support vectors is not an easy task. A better alternative consists of mapping the problem to a new and higher dimensional space using a nonlinear transformation. The nonlinear separating hyperplane becomes a linear plane in the new space, as illustrated in the following diagram:

An illustration of the kernel trick in the SVM

The nonlinear SVM is implemented using a basis function ϕ(x). The formulation of the nonlinear C-SVM is very similar to the linear case. The only difference is that the constraint along the support vectors is expressed using the basis function ϕ (M6):

\min_{w,\, w_0,\, \xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i\,\big(w^T \phi(x_i) + w_0\big) \ge 1 - \xi_i, \;\; \xi_i \ge 0

The minimization of wᵀ·ϕ(x) in the preceding equation requires the computation of the inner product ϕ(x)ᵀ·ϕ(x'). The inner product of the basis functions is implemented using one of the kernel functions introduced in the first section. The optimization of the preceding convex problem computes the optimal hyperplane w* as the kernelized linear combination of the training samples y·ϕ(x) and the Lagrange multipliers. This formulation of the optimization problem is known as the SVM dual problem. The description of the dual problem is provided as a reference; it is well beyond the scope of this book [8:8].

Note

M7: The optimal hyperplane for the SVM dual problem is defined as:

w^{*} = \sum_{i=1}^{N} \alpha_i\, y_i\, \phi(x_i)

M8: The hard margin formulation for the SVM dual problem is defined as:

\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j\, y_i y_j\, \phi(x_i)^T \phi(x_j) \quad \text{subject to} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \;\; \alpha_i \ge 0
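
To make the role of the dual formulation concrete, here is a minimal Scala sketch of the resulting decision function f(x) = Σ αᵢ·yᵢ·K(xᵢ, x) + w0, in which only kernel evaluations are needed and ϕ(x) is never computed explicitly. The support vectors, multipliers αᵢ, and bias are illustrative placeholders, not the output of an actual dual optimization such as SMO.

// A sketch of the kernelized decision function derived from the dual problem.
object KernelizedDecision extends App {
  type DblVec = Array[Double]

  def dot(a: DblVec, b: DblVec): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Polynomial kernel of degree 2 with coefficient c0 = 1
  def polyKernel(a: DblVec, b: DblVec): Double = math.pow(dot(a, b) + 1.0, 2)

  // Hypothetical support vectors with their labels and Lagrange multipliers
  val supportVectors: Seq[(DblVec, Int, Double)] = Seq(
    (Array(1.0, 2.0),  1, 0.4),
    (Array(2.0, 0.5), -1, 0.4)
  )
  val w0 = 0.1

  // Only kernel evaluations are needed: phi(x) never appears explicitly
  def decision(x: DblVec): Double =
    supportVectors.map { case (xi, yi, alpha) => alpha * yi * polyKernel(xi, x) }.sum + w0

  def classify(x: DblVec): Int = if (decision(x) >= 0.0) 1 else -1

  println(classify(Array(0.8, 1.9)))   // falls in class +1 with these placeholders
}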

The kernel trick

The transformation (x,x') => K(x,x') maps a nonlinear problem into a linear problem in a higher dimensional space. It is known as the kernel trick.

Let's consider, for example, the polynomial kernel defined in the first section with a degree d = 2 and a coefficient C0 = 1 in a two-dimensional space. The polynomial kernel function of two vectors, x = [x1, x2] and x' = [x'1, x'2], is decomposed into a linear function in a six-dimensional space:

K(x, x') = (x \cdot x' + 1)^2 = \phi(x)^T \phi(x'), \qquad \phi(x) = \big(1,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2\big)
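
The decomposition can be verified numerically. The following Scala sketch, under the assumptions above (degree d = 2, C0 = 1, two dimensions), compares the polynomial kernel value with the inner product of the explicit six-dimensional mapping for two arbitrary, illustrative vectors.

// A small check that the degree-2 polynomial kernel equals the inner product
// of the explicit six-dimensional mapping phi.
object KernelTrickCheck extends App {
  val sqrt2 = math.sqrt(2.0)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // K(x, x') = (x . x' + 1)^2
  def kernel(x: Array[Double], z: Array[Double]): Double = math.pow(dot(x, z) + 1.0, 2)

  // phi(x) = (1, sqrt(2).x1, sqrt(2).x2, x1^2, sqrt(2).x1.x2, x2^2)
  def phi(x: Array[Double]): Array[Double] = Array(
    1.0, sqrt2 * x(0), sqrt2 * x(1), x(0) * x(0), sqrt2 * x(0) * x(1), x(1) * x(1)
  )

  val x = Array(0.5, -1.2)
  val z = Array(2.0, 0.7)
  // Both values should match up to floating point rounding
  println(s"K(x, x')       = ${kernel(x, z)}")
  println(s"phi(x).phi(x') = ${dot(phi(x), phi(z))}")
}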