Chapter 4. Machine Learning Using Bayesian Inference

Now that we have learned about Bayesian inference and R, it is time to use both for machine learning. In this chapter, we will give an overview of different machine learning techniques and discuss each of them in detail in subsequent chapters. Machine learning is a field at the intersection of computer science and statistics, and a sub-branch of artificial intelligence (AI). The name essentially comes from early work in AI, where researchers tried to develop machines that automatically learned the relationship between input and output variables from data alone. Once a machine is trained on a dataset for a given problem, it can be used as a black box to predict values of the output variables for new values of the input variables.

It is useful to set this learning process of a machine in a mathematical framework. Let X and Y be two random variables such that we seek a learning machine that learns the relationship between these two variables from data and predicts the value of Y, given the value of X. The system is fully characterized by a joint probability distribution P(X, Y); however, the form of this distribution is unknown. The goal of learning is to find a function f(X), which maps from X to Y, such that the predictions $\hat{Y} = f(X)$ contain as small an error as possible. To achieve this, one chooses a loss function L(Y, f(X)) and finds an f(X) that minimizes the expected or average loss over the joint distribution of X and Y, given by:

\[ E[L] = \int \int L(Y, f(X)) \, P(X, Y) \, dX \, dY \]

In statistical decision theory, minimizing this loss, with the expectation estimated as an average over the training data, is called empirical risk minimization. The typical loss function used is the square loss function, $L(Y, f(X)) = (Y - f(X))^2$, if Y is a continuous variable, and the hinge loss function, $L(Y, f(X)) = \max(0, 1 - Y f(X))$, if Y is a binary discrete variable with values $Y \in \{-1, 1\}$. The first case is typically called regression and the second case is called binary classification, as we will see later in this chapter.
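As a quick illustration, here is a minimal R sketch of the two loss functions (the function names and toy values are our own, purely for demonstration):

```r
# Square loss for a continuous target
square_loss <- function(y, f_x) (y - f_x)^2

# Hinge loss for a binary target coded as -1/+1
hinge_loss <- function(y, f_x) pmax(0, 1 - y * f_x)

square_loss(y = 2.5, f_x = 2.1)   # 0.16
hinge_loss(y = 1, f_x = 0.3)      # 0.7: correct sign but inside the margin
hinge_loss(y = -1, f_x = -2.0)    # 0: correct with a comfortable margin
```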

The mathematical framework described here is called supervised learning, where the machine is presented with a training dataset containing ground truth values in the form of pairs (Y, X). Let us consider the case of the square loss function again. Here, the learning task is to find an f(X) that minimizes the following:

\[ E[L] = \int \int (Y - f(X))^2 \, P(Y|X) P(X) \, dX \, dY \]

Since the objective is to predict values of Y for given values of X, we have used the conditional distribution P(Y|X) inside the integral, through the factorization P(X, Y) = P(Y|X)P(X). It can be shown that the minimization of the preceding loss function leads to the following solution:

\[ f(X) = E[Y|X] = \int Y \, P(Y|X) \, dY \]

The meaning of the preceding equation is that the best prediction of Y for any input value X is the mean or expectation, denoted by E[Y|X], of the conditional probability distribution P(Y|X), conditioned on that value of X.
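This fact is easy to verify numerically. The following R snippet (toy data of our own) searches for the constant prediction that minimizes the average square loss and recovers the sample mean:

```r
set.seed(5)
y <- rnorm(1000, mean = 3, sd = 1.5)           # samples from P(Y|X) at some fixed X

risk <- function(c) mean((y - c)^2)            # average square loss for prediction c
optimize(risk, interval = c(-10, 10))$minimum  # close to 3
mean(y)                                        # the sample mean, the same value
```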

In Chapter 3, Introducing Bayesian Inference, we mentioned maximum likelihood estimation (MLE) as a method for learning the parameters $\theta$ of any distribution $P(X|\theta)$. In general, MLE is equivalent to the minimization of the square loss function if the underlying distribution is a normal distribution.
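To see why, assume each observation is generated as $y_n = f(x_n) + \epsilon_n$ with normal noise $\epsilon_n \sim N(0, \sigma^2)$. The negative log-likelihood of a dataset D of N points is then:

\[ -\ln P(D|\theta) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \left( y_n - f(x_n) \right)^2 + \frac{N}{2} \ln(2\pi\sigma^2) \]

The second term does not depend on f, so maximizing the likelihood over f is the same as minimizing the sum of squared errors.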

Note that, in empirical risk minimization, we are learning the quantity E[Y|X], the mean of the conditional distribution, for a given value of X. We will use one particular machine learning task, linear regression, to explain the advantage of Bayesian inference over the classical method of learning. Before that, however, we will briefly explain some more general aspects of machine learning.

There are two types of supervised machine learning models, namely generative models and discriminative models. In the case of generative models, the algorithm tries to learn the joint probability of X and Y, which is P(X, Y), from data and uses it to estimate the conditional mean E[Y|X]. In the case of discriminative models, the algorithm tries to directly learn the desired function, which is the mean of P(Y|X), and no modeling of the distribution of the X variable is attempted.

Labeling values of the target variable in the training data is done manually. This makes supervised learning very expensive when one needs to use very large datasets as in the case of text analytics. However, very often, supervised learning methods produce the most accurate results.

If there is not enough labeled training data available for learning, one can still use machine learning through unsupervised learning. Here, the learning proceeds mainly through the discovery of patterns of association between variables in the dataset. Clustering data points that have similar features is a classic example.
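As a quick illustration, the following minimal R sketch clusters the built-in iris measurements with k-means (the choice of dataset and of three clusters is ours, purely for demonstration):

```r
# Cluster the four numeric iris measurements into three groups,
# without ever looking at the species labels
data(iris)
features <- iris[, 1:4]

set.seed(42)                        # make the random initialization reproducible
fit <- kmeans(features, centers = 3)

fit$size                            # number of points in each cluster
table(fit$cluster, iris$Species)    # the clusters largely recover the species
```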

Reinforcement learning is the third type of machine learning, where learning takes place in a dynamic environment in which the machine must perform certain actions based on its current state. Associated with each action is a reward. The machine needs to learn which action to take in each state so that the total reward is maximized. This is typically how a robot learns to perform tasks, such as driving a vehicle, in a real-life environment.
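A tiny tabular Q-learning sketch in R conveys the idea; the four-state chain environment, rewards, and learning settings below are all hypothetical choices of ours:

```r
# States 1..4 in a chain; moving "right" from state 4 yields reward 1
n_states <- 4
actions  <- c("left", "right")
Q <- matrix(0, nrow = n_states, ncol = length(actions),
            dimnames = list(NULL, actions))

alpha <- 0.1; gamma <- 0.9; epsilon <- 0.1  # learning rate, discount, exploration

step <- function(s, a) {
  s_new <- if (a == "right") min(s + 1, n_states) else max(s - 1, 1)
  list(s = s_new, r = if (s == n_states && a == "right") 1 else 0)
}

set.seed(1)
s <- 1
for (t in 1:5000) {
  a   <- if (runif(1) < epsilon) sample(actions, 1) else actions[which.max(Q[s, ])]
  out <- step(s, a)
  # Move Q(s, a) toward the reward plus the discounted best future value
  Q[s, a] <- Q[s, a] + alpha * (out$r + gamma * max(Q[out$s, ]) - Q[s, a])
  s <- out$s
}
Q  # "right" ends up with the higher value in every state
```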

Why Bayesian inference for machine learning?

We have already discussed the advantages of Bayesian statistics over classical statistics in the last chapter. In this chapter, we will see in more detail how some of the concepts of Bayesian inference that we learned in the last chapter are useful in the context of machine learning. For this purpose, we take one simple machine learning task, namely linear regression. Let us consider a learning task where we have a dataset D containing N pairs of points $(x_i, y_i)$, and the goal is to build a machine learning model using linear regression that can be used to predict values of $y$, given new values of $x$.

In linear regression, first, we assume that Y is of the following form:

\[ Y = F(X) + \epsilon \]

Here, F(X) is a function that captures the true relationship between X and Y, and $\epsilon$ is an error term that captures the inherent noise in the data. It is assumed that this noise is characterized by a normal distribution with mean 0 and variance $\sigma^2$. What this implies is that, even with an infinite training dataset from which we could learn the form of F(X) exactly, we could still only predict Y up to the additive noise term $\epsilon$. In practice, we will have only a finite training dataset D; hence, we will be able to learn only an approximation of F(X), denoted by $\hat{F}(X)$.

Note that we are discussing two types of errors here. One is the error term $\epsilon$ that is due to the inherent noise in the data, about which we can do nothing. The second is the error made in learning F(X) approximately, through the function $\hat{F}(X)$, from the dataset D.
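The following short R sketch separates the two on simulated data; the true function, noise level, and sample size are illustrative choices of ours:

```r
# Simulate data from a known true function F(x) = 2 + 3x with normal noise
set.seed(7)
n   <- 50
x   <- runif(n, 0, 10)
eps <- rnorm(n, mean = 0, sd = 2)   # inherent noise epsilon, sd = sigma
y   <- 2 + 3 * x + eps

# Learn the approximation F_hat from the finite sample
fit <- lm(y ~ x)
coef(fit)  # close to, but not exactly, (2, 3): this gap is the learning
           # error, which sits on top of the irreducible noise epsilon
```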

In general, $\hat{F}(X)$, which is the approximate mapping between the input variable X and the output variable Y, is a function of X with a set of parameters $\theta$. When $\hat{F}(X)$ is a linear function of the parameters $\theta$, we say the learning model is linear. It is a common misconception that linear regression covers only the case where $\hat{F}(X)$ is a linear function of X. The reason for requiring linearity in the parameters, and not in X, is that, during the minimization of the loss function, one actually minimizes over the parameter values to find the best $\hat{F}(X)$. Hence, a function that is linear in $\theta$ leads to an optimization problem that can be tackled analytically and numerically much more easily. Therefore, linear regression corresponds to the following:

\[ \hat{F}(X) = \sum_{i=1}^{M} \theta_i \phi_i(X) \]

This is an expansion over a set of M basis functions $\phi_i(X)$. Here, each basis function $\phi_i(X)$ is a function of X without any unknown parameters. In machine learning, these are called feature functions or model features. For the linear regression problem, the loss function, therefore, can be written as follows:

\[ L(\theta) = \sum_{n=1}^{N} \left( y_n - \theta^T B(x_n) \right)^2 \]

Here, $\theta^T$ is the transpose of the parameter vector $\theta$, and B(X) is the vector composed of the basis functions $\phi_i(X)$. Learning $\hat{F}(X)$ from a dataset amounts to estimating the values of $\theta$ by minimizing the loss function through some optimization scheme, such as gradient descent.
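A minimal R sketch of this whole procedure, using a polynomial basis and plain gradient descent, follows; the basis size, learning rate, and simulated data are illustrative choices of ours:

```r
# Simulated data
set.seed(3)
x <- runif(100, -1, 1)
y <- sin(pi * x) + rnorm(100, sd = 0.2)

# Design matrix B: polynomial basis functions phi_i(x) = x^(i-1), with M = 4
M <- 4
B <- sapply(1:M, function(i) x^(i - 1))       # N x M matrix

theta <- rep(0, M)                            # initialize parameters
eta   <- 0.05                                 # learning rate
for (step in 1:5000) {
  resid <- y - B %*% theta                    # y_n - theta^T B(x_n)
  grad  <- -2 * t(B) %*% resid / length(y)    # gradient of the mean squared loss
  theta <- theta - eta * grad
}
theta

# Compare with the closed-form least squares solution
qr.solve(B, y)
```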

One would like to include as many basis functions as possible in order to capture interesting patterns in the data. However, including too many basis functions or features will overfit the model, in the sense that it will start fitting even the noise contained in the data. Overfitting leads to poor predictions on new input data. Therefore, it is important to choose an optimum set of features that maximizes the predictive accuracy of the machine learning model. In machine learning based on classical statistics, this is achieved through what are called the bias-variance tradeoff and model regularization. In machine learning through Bayesian inference, by contrast, the accuracy of a predictive model can be maximized through Bayesian model averaging, without the need to impose model regularization or a bias-variance tradeoff. We will learn each of these concepts in the following sections.
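Overfitting is easy to see empirically. In the following R sketch (simulated data and basis sizes of our own choosing), the training error keeps falling as the number of polynomial features grows, while the error on held-out data eventually rises:

```r
set.seed(11)
x_train <- runif(30, -1, 1);  y_train <- sin(pi * x_train) + rnorm(30, sd = 0.2)
x_test  <- runif(200, -1, 1); y_test  <- sin(pi * x_test)  + rnorm(200, sd = 0.2)

for (M in c(2, 4, 12)) {
  fit  <- lm(y_train ~ poly(x_train, M))
  pred <- predict(fit, newdata = data.frame(x_train = x_test))
  cat("M =", M,
      " train MSE:", round(mean(fit$residuals^2), 3),
      " test MSE:",  round(mean((y_test - pred)^2), 3), "\n")
}
```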
