This chapter introduces kernel functions, binary support vector classifiers, one-class support vector machines for anomaly detection, and support vector regression.
In the Binomial classification section in Chapter 6, Regression and Regularization, you learned about the concept of hyperplanes that segregate observations from the training set and estimate the linear decision boundary. Logistic regression has at least one limitation: it requires that the datasets be linearly separable through a predefined function (the sigmoid). This limitation is especially problematic for high-dimensional problems (a large number of features that are highly nonlinearly dependent). Support vector machines (SVMs) overcome this limitation by estimating the optimal separating hyperplane using kernel functions.
In this chapter, we will cover the following topics:
- Kernel functions and kernel methods
- The support vector machine as a binary classifier
- Anomaly detection with the one-class SVM
- Support vector regression (SVR)
Support vector machines are formulated as a convex optimization problem. The mathematical foundation of the related algorithms is described in this chapter, for reference.
Every machine learning model introduced in this book so far assumes that observations are represented by a feature vector of a fixed size. However, some real-world applications, such as text mining or genomics, do not lend themselves to this restriction. The critical element of the classification process is the definition of a similarity or a distance between two observations. Kernel functions allow developers to compute the similarity between observations without the need to encode them in feature vectors [8:1].
The concept of kernel methods may seem a bit odd at first to a novice. Let's consider the example of the classification of proteins. Proteins have different lengths and compositions, but this does not prevent scientists from classifying them [8:2].
A protein is represented using a traditional molecular notation with which biochemists are familiar. Geneticists describe proteins in terms of a sequence of characters known as the protein sequence annotation. The sequence annotation encodes the structure and composition of the protein. The following image illustrates the molecular (left) and encoded (right) representation of a protein:
The classification and the clustering of a set of proteins require the definition of a similarity factor or a distance used to evaluate and compare the proteins. For example, the pairwise similarity between three proteins can be defined as a normalized dot product of their sequence annotations:
You do not have to represent the entire sequence annotation of the proteins as a feature vector in order to establish that they belong to the same class. You only need to compare each element of each sequence, one by one, and compute the similarity. For the same reason, the estimation of the similarity does not require the two proteins to have the same length.
In this example, we do not have to assign a numerical value to each element of the annotation. Let's define an element of the protein annotation by its character c and position p (for example, K, 4). The dot product of two protein annotations x and x' of respective lengths n and n' is defined as the number of identical elements (same character and position) in the two annotations, divided by the maximum of the two lengths (M1):

M1: sim(x, x') = |{ i : x(i) = x'(i) }| / max(n, n')
The computation of the similarity for the three proteins produces the following results: sim(x, x') = 6/12 = 0.50, sim(x, x'') = 3/13 = 0.23, and sim(x', x'') = 4/13 = 0.31.
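The following is a minimal sketch of this similarity measure, assuming each annotation is encoded as a string in which the index of a character represents its position; the sim method is illustrative and not part of the book's library:

def sim(x: String, y: String): Double = {
  // Count the elements identical in both character and position
  val matches = x.zip(y).count { case (cx, cy) => cx == cy }
  // Normalize by the maximum of the two annotation lengths (M1)
  matches.toDouble / Math.max(x.length, y.length)
}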
Note also that the similarity of two identical annotations is 1.0 and the similarity of two completely different annotations is 0.0.
The visualization of similarity
It is usually more convenient to use a radial representation to visualize the similarity between features, as in the example of the protein annotations. The distance d(x, x') = 1/sim(x, x') is visualized as the angle or cosine between two features. The cosine metric is commonly used in text mining.
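For reference, here is a minimal sketch of the cosine similarity mentioned in the note, assuming the features are encoded as numeric vectors; the cosine function is illustrative and not part of the book's library:

def cosine(x: Vector[Double], y: Vector[Double]): Double = {
  val dot = x.zip(y).map { case (a, b) => a * b }.sum  // inner product x.y
  val normX = Math.sqrt(x.map(a => a * a).sum)         // norm |x|
  val normY = Math.sqrt(y.map(a => a * a).sum)         // norm |y|
  dot / (normX * normY)
}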
In this example, the similarity is known as a kernel function in the space of the sequence annotation of proteins.
Although the measure of similarity is very useful for understanding the concept of a kernel function, kernels have a broader definition. A kernel K(x, x') is a symmetric, nonnegative real function that takes two real arguments (the values of two features). There are many different types of kernel functions; the most common ones are listed later in this section.
The simple linear model for regression consists of the dot product of the regression parameters (weights) and the input data (refer to the Ordinary least squares regression section in Chapter 6, Regression and Regularization).
The model is, in fact, the linear combination of weights and the linear combination of inputs. The concept can be extended by defining a general regression model as a linear combination of nonlinear functions, known as basis functions (M2):

M2: y(x) = w0 + w1.φ1(x) + ... + wM.φM(x)
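To make the formula M2 concrete, here is a minimal sketch of a prediction using Gaussian basis functions; the gaussian and predict names, centers, widths, and weights are illustrative assumptions:

// Gaussian basis function centered at mu with width s
def gaussian(mu: Double, s: Double): Double => Double =
  (x: Double) => Math.exp(-0.5 * (x - mu) * (x - mu) / (s * s))

// General regression model (M2): w(0) is the intercept, w(j) weighs phi(j)
def predict(w: Vector[Double], phi: Vector[Double => Double])(x: Double): Double =
  w.head + w.tail.zip(phi).map { case (wj, phij) => wj * phij(x) }.sum

val phi = Vector(gaussian(0.0, 1.0), gaussian(1.0, 1.0))
val y = predict(Vector(0.1, 0.5, 0.4), phi)(0.5)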
The most commonly used basis functions are the power and Gaussian functions. The kernel function is defined as the dot product φ(x).φ(x') of the basis functions evaluated on two feature vectors x and x'. A partial list of kernel methods is as follows (a short Scala sketch of two of these kernels follows the list):
M3: The generic kernel function is defined as: K(x, x') = φ(x).φ(x')
M4: The linear kernel is defined as: K(x, x') = x.x'
M5: The polynomial kernel with the slope γ, degree n, and constant c is defined as: K(x, x') = (γ x.x' + c)^n
M6: The sigmoid kernel with the slope γ and constant c is defined as: K(x, x') = tanh(γ x.x' + c)
M7: The radial basis function kernel with the slope γ is defined as: K(x, x') = exp(-γ ||x - x'||²)
M8: The Laplacian kernel with the slope γ is defined as: K(x, x') = exp(-γ ||x - x'||)
M9: The log kernel with the degree n is defined as: K(x, x') = -log(1 + ||x - x'||^n)
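As an illustration, here is a minimal sketch of the linear (M4) and radial basis function (M7) kernels applied directly to feature vectors; these standalone functions are for clarity only and are distinct from the KF class introduced later in this chapter:

// M4: the linear kernel is a plain dot product
def linear(x: Vector[Double], y: Vector[Double]): Double =
  x.zip(y).map { case (a, b) => a * b }.sum

// M7: the radial basis function kernel with the slope gamma
def rbf(gamma: Double)(x: Vector[Double], y: Vector[Double]): Double = {
  val norm2 = x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
  Math.exp(-gamma * norm2)
}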
The list of discriminative kernel functions described earlier is just a subset of the kernel methods' universe; other types of kernels include probabilistic kernels and graph kernels.
The kernel functions play a very important role in support vector machines for nonlinear problems.
The concept of a kernel function is actually derived from differential geometry and, more specifically, from the notion of a manifold, which was introduced in the Non-linear models section under Dimension reduction in Chapter 4, Unsupervised Learning.
A manifold is a low-dimension feature space embedded in an observation space of a higher dimension. The dot (or inner) product of two observations, known as the Riemann metric, is computed on a Euclidean tangent space.
The kernel function is the composition of the dot product on the tangent space projected on the manifold using an exponential map, as shown in the following diagram:
A kernel function is the composition g o h of two functions:
- A function h that implements the Riemann metric (dot product) on the Euclidean tangent space
- A function g that implements the exponential map, projecting the result of the dot product onto the manifold
The KF class implements the kernel function as a composition of the functions g and h:
type F1 = Double => Double
type F2 = (Double, Double) => Double
type DblVector = Vector[Double] // type alias assumed from earlier chapters

case class KF[G](g: G, h: F2) {
  def metric(v: DblVector, w: DblVector)(implicit gf: G => F1): Double = //1
    g(v.zip(w).map { case (_v, _w) => h(_v, _w) }.sum) //2
}
The KF class is parameterized with a G type that can be converted to Function1[Double, Double]. Therefore, the computation of the metric (dot product) requires an implicit conversion from G to Function1 (line 1). The metric is computed by zipping the two vectors, mapping the h similarity function onto the pairs, and summing the resulting vector (line 2).
Let's define the monadic composition for the KF class:
val kfMonad = new _Monad[KF] {
  override def map[G, H](kf: KF[G])(f: G => H): KF[H] =
    KF[H](f(kf.g), kf.h) //3

  override def flatMap[G, H](kf: KF[G])(f: G => KF[H]): KF[H] =
    KF[H](f(kf.g).g, kf.h)
}
The creation of the kfMonad instance overrides the map and flatMap methods defined in the generic _Monad trait, as described in the Monads section in Chapter 1, Getting Started. The implementation of the unit method is not essential to the monadic composition and is, therefore, omitted.
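For readers who skipped Chapter 1, the following is a minimal sketch of the shape of the _Monad trait assumed here; the unit method is given a stub default so that the anonymous kfMonad instance compiles without overriding it:

trait _Monad[M[_]] {
  def unit[T](t: T): M[T] = ???  // stubbed: not needed for kernel composition
  def map[T, U](m: M[T])(f: T => U): M[U]
  def flatMap[T, U](m: M[T])(f: T => M[U]): M[U]
}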
The function argument of the map and flatMap methods applies only to the exponential map function g (line 3). The composition of two kernel functions kf1 = g1 o h and kf2 = g2 o h produces a kernel function kf3 = g2 o (g1 o h) = (g2 o g1) o h = g3 o h.
Interpretation of kernel functions' monadic composition
The visualization of the monadic composition of kernel functions on the manifold is quite intuitive. The composition of two kernel functions consists of composing their respective projections or exponential map functions g. The function g is directly related to the curvature of the manifold around the data point for which the metric is computed. The monadic composition of the kernel functions attempts to adjust the exponential map to fit the curvature of the manifold.
The next step is to define an implicit class to convert a kernel function of the KF type to its monadic representation so that it can access the map and flatMap methods (line 4):
implicit class kF2Monad[G](kf: KF[G]) { //4
  def map[H](f: G => H): KF[H] = kfMonad.map(kf)(f)
  def flatMap[H](f: G => KF[H]): KF[H] = kfMonad.flatMap(kf)(f)
}
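As a quick illustration of this conversion, the following snippet, which is ours and not part of the book's library, rescales the exponential map of a kernel through the monadic map method:

// Build a kernel from explicit g and h functions, then rescale its exponential map
val kf = KF[F1]((x: Double) => Math.exp(-x), (x: Double, y: Double) => x - y)
val scaled: KF[F1] = kf.map((g: F1) => (x: Double) => 2.0 * g(x))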
Let's implement the radial basis function kernel, RBF, and the polynomial kernel, Polynomial, by defining their respective g and h functions. The parameterized type for the kernel function is simply Function1[Double, Double]:
class RBF(s2: Double) extends KF[F1](
  (x: Double) => Math.exp(-0.5 * x * x / s2), // g: exponential map
  (x: Double, y: Double) => x - y)            // h: metric on the tangent space

class Polynomial(d: Int) extends KF[F1](
  (x: Double) => Math.pow(1.0 + x, d),        // g: exponential map
  (x: Double, y: Double) => x * y)            // h: metric on the tangent space
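A single kernel can also be evaluated on its own; in the following usage example (ours), the implicit G => F1 conversion required by metric resolves to the identity conversion, because G is already F1:

val rbf = new RBF(0.6)
// Computes exp(-0.5*s*s/0.6), where s is the sum of the pairwise differences
val d = rbf.metric(Vector(0.5, 0.2, 0.3), Vector(0.1, 0.7, 0.2))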
Here is an example of the composition of two kernel functions: kf1, an RBF kernel with a standard deviation of 0.6 (line 5), and kf2, a polynomial kernel with a degree of 3 (line 6):
val v = Vector[Double](0.5, 0.2, 0.3)
val w = Vector[Double](0.1, 0.7, 0.2)

val composed = for {
  kf1 <- new RBF(0.6)      //5
  kf2 <- new Polynomial(3) //6
} yield kf2

composed.metric(v, w) //7
Finally, the metric is computed on the composed kernel function (line 7).