Chapter 8. Kernel Models and Support Vector Machines

This chapter introduces kernel functions, binary support vector classifiers, one-class support vector machines for anomaly detection, and support vector regression.

In the Binomial classification section of Chapter 6, Regression and Regularization, you learned how hyperplanes segregate observations from the training set and how the linear decision boundary is estimated. Logistic regression has at least one limitation: it requires that the classes can be separated by a linear boundary passed through a fixed function (the sigmoid). This limitation is especially problematic for high-dimensional problems, in which a large number of features depend on each other in highly nonlinear ways. Support vector machines (SVMs) overcome this limitation by estimating the optimal separating hyperplane using kernel functions.

In this chapter, you will discover the following topics:

The impact of some of the SVM configuration parameters and the kernel method on the accuracy of the classification
How to apply the binary support vector classifier to estimate the risk for a public company to curtail or eliminate its dividend
How the support vector regression compares to the linear regression

Support vector machines are formulated as a convex optimization problem. Therefore, the mathematical foundation of these algorithms is described for reference.

Kernel functions

Every machine learning model introduced in this book so far assumes that observations are represented by a feature vector of a fixed size. However, some real-world applications such as text mining or genomics do not lend themselves to this restriction. The critical element of the process of classification is to define a similarity or a distance between two observations. Kernel functions allow developers to compute the similarity between observations without the need to encode them in feature vectors [8:1].

Overview

The concept of kernel methods may seem a bit odd to a novice at first. It is usually better understood through a concrete example. Let's consider the classification of proteins. Proteins have different lengths and compositions, but this does not prevent scientists from classifying them [8:2].

Note

Proteins:

Proteins are polymers of amino acids joined together by peptide bonds. Each amino acid is built around a central carbon atom bonded to a hydrogen atom, an amino group, a carboxyl group, and a side chain.

A protein is represented using a traditional molecular notation with which biochemists are familiar. Geneticists describe proteins in terms of a sequence of characters known as the protein sequence annotation. The sequence annotation encodes the structure and composition of the protein. The following picture illustrates the molecular (left) and encoded (right) representation of a protein:

Sequence annotation of a protein

The classification and the clustering of a set of proteins require the definition of a similarity factor or distance used to evaluate and compare the proteins. For example, the similarity between three proteins can be defined as a normalized dot product of their sequence annotation:

Similarity between the sequence annotations of three proteins

You do not have to represent the entire sequence annotation of the proteins as a feature vector in order to establish that they belong to the same class. You only need to compare each element of each sequence, one by one, and compute the similarity. For the same reason, the estimation of the similarity does not require the two proteins to have the same length.

In this example, we do not have to assign a numerical value to each element of the annotation. Let's represent an element of the protein annotation as its character c and position p (for example: K, 4). The dot product of two protein annotations x and x' of respective lengths n and n' can be defined as the number of identical elements (same character at the same position) in the two annotations divided by the length of the longer annotation:

\[ sim(x, x') = \frac{\left|\{\, p : c_p = c'_p \,\}\right|}{\max(n, n')} \]

The computation of the similarity for the three proteins yields sim(x, x') = 6/12 = 0.50, sim(x, x'') = 3/13 = 0.23, and sim(x', x'') = 4/13 = 0.31.

Note also that the similarity of two identical annotations is 1.0 and the similarity of two annotations with no matching element is 0.0.
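The similarity described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the function name annotation_similarity and the two sample annotations are hypothetical and are not the proteins shown in the figure.

```python
def annotation_similarity(x: str, y: str) -> float:
    """Share of positions holding the same character, relative to the longer annotation."""
    if not x and not y:
        raise ValueError("at least one annotation must be non-empty")
    # zip stops at the end of the shorter annotation, so the two
    # annotations are not required to have the same length.
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / max(len(x), len(y))


# Hypothetical annotations used for illustration only.
a = "MKTAYIAKQRQI"  # 12 elements
b = "MKTLYIGKQR"    # 10 elements

print(annotation_similarity(a, a))            # identical annotations: 1.0
print(annotation_similarity(a, b))            # 8 matching positions out of 12
print(annotation_similarity("AAAA", "CCCC"))  # no matching element: 0.0
```

No feature vector is built at any point: the two strings are compared element by element, which is why annotations of different lengths can still be compared.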

Tip

Visualization of similarity:

It is usually more convenient to use a radial representation to visualize the similarity between features, as in the example of proteins' annotations. The distance d(x,x') = 1/sim(x,x') is visualized as the angle or cosine between two features. The cosine metric is commonly used in text mining.

In this example, the similarity is known as a kernel function in the space of the sequence annotation of proteins.

Common discriminative kernels

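Although the measure of similarity is very useful to understand the concept of a kernel function, kernels have a broader definition. A kernel K(x, x') is a symmetric, real-valued function of two arguments (two observations or feature vectors); most kernels used in practice are also positive semi-definite. Note that a kernel value is not necessarily non-negative: the dot product of two vectors, for instance, can be negative. There are many different types of kernel functions, among which the most common are: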

The linear kernel (dot product): This is useful in the case of very high-dimensional data where problems can be expressed as a linear combination of the original features
The polynomial kernel: This extends the linear kernel for a combination of features that are not completely linear
The radial basis function (RBF): This is the most commonly applied kernel. It is appropriate where the labeled or target data is noisy and requires some level of regularization
The sigmoid kernel: This is used in conjunction with neural networks
The Laplacian kernel: This is a variant of RBF with a higher regularization impact on training data
The log kernel: This is used in image processing

Note

RBF terminology

In this presentation, and in the library used in its implementation, the radial basis function is a synonym for the Gaussian kernel function. However, RBF also refers to the family of exponential kernel functions that encompasses Gaussian, Laplacian, and exponential functions.

The simple linear model for regression consists of the dot product of the regression parameters (weights) and the input data (refer to the Ordinary least squares (OLS) regression section of Chapter 6, Regression and Regularization):

\[ y = w_0 + \sum_{j=1}^{d} w_j x_j \]

The model is in fact a linear combination of the inputs, with the weights as coefficients. The concept can be extended by defining a general regression model as a linear combination of nonlinear functions of the input, known as basis functions:

\[ y = w_0 + \sum_{j=1}^{m} w_j \phi_j(x) \]

The most commonly used basis functions are the power and Gaussian functions. The kernel function is described as the dot product φ(x).φ(x') of the basis-function vectors of two feature vectors x and x'. A partial list of kernel methods is as follows:

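Note

The generic kernel:

\[ K(x, x') = \phi(x)^T \phi(x') \]

The linear kernel:

\[ K(x, x') = x^T x' \]

The polynomial kernel with the slope γ, degree n, and constant c:

\[ K(x, x') = (\gamma\, x^T x' + c)^n \]

The sigmoid kernel with the slope γ and constant c:

\[ K(x, x') = \tanh(\gamma\, x^T x' + c) \]

The radial basis function kernel with the slope γ:

\[ K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2) \]

The Laplacian kernel with the slope γ:

\[ K(x, x') = \exp(-\gamma \lVert x - x' \rVert) \]

The log kernel with the degree n:

\[ K(x, x') = -\log(\lVert x - x' \rVert^n + 1) \]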
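The kernels listed above can be sketched with the Python standard library alone. The function names, the default parameter values, and the exact formulas are assumptions: the code uses the commonly published form of each kernel, and it measures distance with the Euclidean norm, whereas some libraries define the Laplacian kernel with the L1 norm.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dot(x: Vector, y: Vector) -> float:
    """Dot product of two vectors of the same size."""
    return sum(a * b for a, b in zip(x, y))


def distance(x: Vector, y: Vector) -> float:
    """Euclidean distance (assumed norm for the distance-based kernels)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def linear_kernel(x: Vector, y: Vector) -> float:
    return dot(x, y)


def polynomial_kernel(x: Vector, y: Vector, gamma: float = 1.0,
                      degree: int = 2, c: float = 0.0) -> float:
    return (gamma * dot(x, y) + c) ** degree


def sigmoid_kernel(x: Vector, y: Vector, gamma: float = 1.0,
                   c: float = 0.0) -> float:
    return math.tanh(gamma * dot(x, y) + c)


def rbf_kernel(x: Vector, y: Vector, gamma: float = 1.0) -> float:
    return math.exp(-gamma * distance(x, y) ** 2)


def laplacian_kernel(x: Vector, y: Vector, gamma: float = 1.0) -> float:
    return math.exp(-gamma * distance(x, y))


def log_kernel(x: Vector, y: Vector, degree: int = 2) -> float:
    return -math.log(distance(x, y) ** degree + 1.0)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v))                # dot product of u and v
print(polynomial_kernel(u, v, c=1.0))     # (u.v + 1) squared
print(rbf_kernel(u, u))                   # an observation compared with itself
```

Every function takes the two observations first and the kernel parameters as keyword arguments, so any of them can be passed around as a plain two-argument similarity function once its parameters are fixed.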

The list of discriminative kernel functions described earlier is just a subset of the kernel methods universe. Other types of kernels include:

Probabilistic kernels: These are kernels derived from generative models. Probabilistic models such as Gaussian processes can be used as a kernel function [8:3].
Smoothing kernels: These are nonparametric formulations that estimate a density by averaging over the nearest neighbor observations [8:4].
Reproducing kernel Hilbert spaces: This is the dot product of finite or infinite basis functions [8:5].

The kernel functions play a very important role in support vector machines for nonlinear problems.
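One reason for this role is that a kernel returns the dot product of the basis functions without ever computing them. The following check is a minimal sketch for one specific case chosen for illustration: two-dimensional inputs and a polynomial kernel of degree 2 with slope 1 and constant 0. The explicit map phi is the standard expansion for that case; the names phi and poly2_kernel are hypothetical.

```python
import math
from typing import Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def phi(x: Tuple[float, float]) -> Tuple[float, float, float]:
    """Explicit basis functions for the degree-2 polynomial kernel in 2D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly2_kernel(x: Tuple[float, float], y: Tuple[float, float]) -> float:
    """Degree-2 polynomial kernel (slope 1, constant 0), computed in input space."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))   # dot product in the 3-dimensional feature space
implicit = poly2_kernel(x, y)    # same quantity, computed from the 2D inputs
print(math.isclose(explicit, implicit))
```

The implicit form needs one dot product in the original space, while the explicit form grows with the number of basis functions. For kernels such as the RBF, the corresponding feature space is infinite-dimensional, so only the implicit form is practical.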
