This chapter introduces kernel functions, binary support vector classifiers, one-class support vector machines for anomaly detection, and support vector regression.
In the Binomial classification section of Chapter 6, Regression and Regularization, you learned the concept of hyperplanes used to segregate observations from the training set and estimate the linear decision boundary. Logistic regression has at least one limitation: it requires that the dataset be linearly separable using a predefined function (the sigmoid). This limitation is especially problematic for high-dimensional problems (a large number of features that are highly nonlinearly dependent). Support vector machines (SVMs) overcome this limitation by estimating the optimal separating hyperplane using kernel functions.
In this chapter, you will discover the following topics:
Support vector machines are formulated as a convex optimization problem. Therefore, the mathematical foundation of these algorithms is described for reference.
Every machine learning model introduced in this book so far assumes that observations are represented by a feature vector of a fixed size. However, some real-world applications such as text mining or genomics do not lend themselves to this restriction. The critical element of the process of classification is to define a similarity or a distance between two observations. Kernel functions allow developers to compute the similarity between observations without the need to encode them in feature vectors [8:1].
The concept of kernel methods may seem a bit odd at first to a novice. It is usually better understood through a concrete example. Let's consider the classification of proteins. Proteins have different lengths and compositions, but this does not prevent scientists from classifying them [8:2].
A protein is represented using a traditional molecular notation with which biochemists are familiar. Geneticists describe proteins in terms of a sequence of characters known as the protein sequence annotation. The sequence annotation encodes the structure and composition of the protein. The following picture illustrates the molecular (left) and encoded (right) representation of a protein:
The classification and the clustering of a set of proteins require the definition of a similarity factor or distance used to evaluate and compare the proteins. For example, the similarity between three proteins can be defined as a normalized dot product of their sequence annotation:
You do not have to represent the entire sequence annotation of the proteins as a feature vector in order to establish that they belong to the same class. You only need to compare each element of each sequence, one by one, and compute the similarity. For the same reason, the estimation of the similarity does not require the two proteins to have the same length.
In this example, we do not have to assign a numerical value to each element of the annotation. Let's represent an element of the protein annotation as its character c and position p (for example: K, 4). The dot product of the two protein annotations x and x' of respective lengths n and n' can be defined as the number of identical elements (character and position) between the two annotations divided by the maximum length between the two annotations:
The computation of the similarity for the three proteins produces the following results: sim(x, x') = 6/12 = 0.50, sim(x, x'') = 3/13 = 0.23, and sim(x', x'') = 4/13 = 0.31.
Note also that the similarity of two identical annotations is 1.0 and the similarity of two completely different annotations is 0.0.
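The similarity described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name annotation_similarity and the three annotations are hypothetical (they are not the proteins from the figure), and two empty annotations are arbitrarily treated as identical.

```python
def annotation_similarity(x: str, y: str) -> float:
    """Share of positions at which both annotations hold the same character,
    normalized by the length of the longer annotation."""
    if not x and not y:
        return 1.0  # assumption: two empty annotations count as identical
    # zip stops at the shorter annotation; positions beyond it cannot match
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / max(len(x), len(y))


# Hypothetical annotations of different lengths (illustration only)
x1 = "MKVLAAGIT"
x2 = "MKVLTAGI"
x3 = "GAVLK"

print(annotation_similarity(x1, x2))  # 7 matching positions out of 9
print(annotation_similarity(x1, x3))  # 2 matching positions out of 9
print(annotation_similarity(x1, x1))  # identical annotations
```

The function never converts an annotation into a fixed-size numeric vector, and it accepts annotations of different lengths, which are the two properties the example relies on.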
Visualization of similarity:
It is usually more convenient to use a radial representation to visualize the similarity between features, as in the example of protein annotations. The distance d(x,x') = 1/sim(x,x') is visualized as the angle or cosine between two features. The cosine metric is commonly used in text mining.
In this example, the similarity is known as a kernel function in the space of the sequence annotation of proteins.
Although the measure of similarity is very useful for understanding the concept of a kernel function, kernels have a broader definition. A kernel K(x, x') is a symmetric, positive semi-definite real-valued function that takes two real arguments (the values of two features). There are many different types of kernel functions, among which the most common are:
The simple linear model for regression consists of the dot product of the regression parameters (weights) and the input data (refer to the Ordinary least squares (OLS) regression section of Chapter 6, Regression and Regularization).
The model is in fact a linear combination of the inputs, weighted by the regression parameters. The concept can be extended by defining a general regression model as a linear combination of nonlinear functions of the input, known as basis functions:
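As a minimal sketch of such a general model, the prediction is the weighted sum of the basis functions evaluated at the input. The names predict and power_basis, as well as the weights, are hypothetical and chosen only for illustration; power functions are used here as the basis.

```python
from typing import Callable, Sequence


def predict(weights: Sequence[float],
            basis: Sequence[Callable[[float], float]],
            x: float) -> float:
    """Linear combination of basis functions: y = sum_j w_j * phi_j(x)."""
    return sum(w * phi(x) for w, phi in zip(weights, basis))


# Power basis functions phi_j(x) = x**j for j = 0, 1, 2.
# The default argument j=j freezes the exponent for each lambda.
power_basis = [lambda x, j=j: x ** j for j in range(3)]

# Hypothetical weights, giving the model y = 1 + 2x + 3x^2
weights = [1.0, 2.0, 3.0]

print(predict(weights, power_basis, 2.0))  # 1 + 4 + 12 = 17.0
```

The model remains linear in the weights even though each basis function is nonlinear in x, which is why the same fitting techniques as for the simple linear model still apply.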
The most commonly used basis functions are the power and Gaussian functions. The kernel function is defined as the dot product φ(x).φ(x') of the basis-function vectors of two feature vectors x and x'. A partial list of kernel methods is as follows:
The list of discriminative kernel functions described earlier is just a subset of the kernel methods universe. Other types of kernels include:
Kernel functions play a very important role in support vector machines for nonlinear problems.
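The definition of a kernel as the dot product φ(x).φ(x') can be checked numerically. The following sketch assumes two-dimensional inputs and uses two standard textbook kernels, the homogeneous polynomial kernel of degree 2 and the Gaussian kernel; they are not necessarily written in the same form as in the lists referred to earlier. All function names and the parameter sigma are hypothetical.

```python
import math


def dot(u, v):
    """Plain dot product of two equally sized sequences."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit degree-2 basis-function vector for a 2-dimensional input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: K(x, y) = (x.y)^2."""
    return dot(x, y) ** 2


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel: K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma * sigma))


x, y = (1.0, 2.0), (3.0, 0.5)

# Both values agree up to floating-point rounding
print(poly2_kernel(x, y), dot(phi(x), phi(y)))

# A feature compared with itself has the maximum Gaussian similarity
print(gaussian_kernel(x, x))
```

The polynomial kernel returns the same value as the dot product in the expanded feature space without ever building φ(x), which is what makes kernels practical when the expanded space is very large.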