Chapter 5. Approximate Inference

This chapter introduces a second class of inference algorithms, perhaps the most important of all because of its versatility. The approach is completely different from what we have seen so far. Until now, we have seen two classes of algorithms: one based on an analytic resolution of the problem, in which the posterior distribution is calculated by hand, and another based on message propagation in a graph. In both cases, the result was computed exactly. With an analytic solution, the answer usually boils down to evaluating a function of the posterior distribution. With a message-passing algorithm, the posterior distribution is computed step by step by propagating messages on a graph. If the graph is not well suited to this type of algorithm, the computations can be extremely long and often intractable.

However, in many cases, we can trade a bit of accuracy for more speed. This is the main idea of approximate inference. Does it really matter if we are less accurate? It turns out that, in many applications, approximate inference is still very accurate. Moreover, it allows us to deal with more complex models and with many more types of distributions, something that is not always possible with the other approaches.

In this chapter, we will see one important class of algorithms, called sampling algorithms, also known as Monte-Carlo sampling. The main idea of this class of algorithms is to draw random samples from the posterior distribution in order to replace complex computations with simple statistics. For example, if we want to compute the posterior mean of a random variable, we can draw many samples from its posterior distribution and simply compute the mean of the values we obtain.

Monte-Carlo sampling is what really made the Bayesian revolution in science possible. Before it, many Bayesian models were hard, if not impossible, to compute.

More specifically, we will look at the following algorithms:

  • Rejection and importance sampling, as a basis for many other methods
  • Markov Chain Monte-Carlo and the Metropolis-Hastings algorithm

These two classes of algorithms will cover most of what we need to know about the Monte-Carlo method. However, new algorithms in this area are still regularly being developed.

Sampling from a distribution

We have a big problem with probabilistic graphical models in general: exact inference is often intractable. Models quickly become so complex that it is impossible to run anything in a reasonable amount of time, not to mention learning them. Remember that, for an algorithm as simple as EM, we need to compute a posterior distribution at each iteration. If the dataset is big, which is common nowadays, or if the model has many dimensions, which is also common, this becomes totally prohibitive. Moreover, we have so far limited ourselves to a small class of distributions, such as multinomial or Gaussian distributions. Even though these cover a wide range of applications, they are not always sufficient.

In this chapter, we consider a new class of algorithms based on the idea of sampling from a distribution. Sampling here means drawing values of the variables at random, following a particular distribution. For example, when one throws a die, one draws a sample from a multinomial distribution in which six values are possible, each with equal probability. The result is a number between 1 and 6. If the die is not fair (say 6 has a higher probability), then we will obtain more 6s than the other numbers. If we throw the die many times and compute the mean of all the results, we will presumably see a number above 3.5, the mean of a fair die, pulled toward 6. A minimal sketch of this idea is shown below.
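
The following Python sketch (using NumPy) illustrates this: it draws many throws of a fair die and of an unfair die and compares the empirical means. The exact bias on the 6 face is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# A fair die: six faces, equal probability.
fair_probs = np.full(6, 1 / 6)

# An unfair die in which 6 is twice as likely as any other face
# (the exact bias is an arbitrary choice for illustration).
unfair_probs = np.array([1, 1, 1, 1, 1, 2], dtype=float)
unfair_probs /= unfair_probs.sum()

faces = np.arange(1, 7)
n_throws = 100_000

fair_samples = rng.choice(faces, size=n_throws, p=fair_probs)
unfair_samples = rng.choice(faces, size=n_throws, p=unfair_probs)

print("fair die mean:  ", fair_samples.mean())    # close to 3.5
print("unfair die mean:", unfair_samples.mean())  # noticeably above 3.5
```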

In many situations, we are more interested in some properties of the distribution than in the distribution itself, for example its mean or variance. In other words, in many cases we want to compute the expectation of a function f(x) with respect to a probability distribution p(x). Here, x can be a random variable of any dimension. For example, p(x) can be a multivariate Gaussian distribution or a probabilistic graphical model.

In the case of continuous variables, we want to solve the generic problem of evaluating the expectation:

\mathbb{E}[f] = \int f(x)\, p(x)\, dx

[Figure: an example distribution p(x), shown in red, and a function f(x), shown in green.]

When x is discrete, the integral is replaced by a summation.

In the example in the figure above, the distribution p(x) is shown in red and the function f(x) in green. We immediately see that sampling from such a complex distribution will raise many problems.

The algorithms presented in this chapter will try to solve many of those problems.

The main idea in sampling is to replace the evaluation of an integral with a simpler sum over a set of samples drawn independently from the distribution p(x). The previous expectation can then be approximated by a finite sum:

\hat{f} = \frac{1}{N} \sum_{n=1}^{N} f\!\left(x^{(n)}\right)

If the samples x^{(n)} are drawn from the distribution p(x), then \mathbb{E}[\hat{f}] = \mathbb{E}[f]; that is, the estimator is unbiased. Similarly, the variance of the estimator is:

\operatorname{var}[\hat{f}] = \frac{1}{N}\, \mathbb{E}\!\left[(f - \mathbb{E}[f])^2\right] = \frac{\operatorname{var}[f]}{N}

In this method, the problem is to obtain independent samples. This is not always possible, and the number of effectively independent samples can be much smaller than the number of points drawn from the distribution. However, as the previous formula shows, the variance of the estimator \hat{f} does not depend on the dimensionality of x. It means that, even in high-dimensional problems such as graphical models, high accuracy can be obtained with a relatively small number of samples.
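
As a minimal illustration of this estimator, the following Python sketch approximates the expectation of f(x) = x² under a standard Gaussian, whose true value is 1. The choice of f and p here is arbitrary and only serves to show how the accuracy improves as the number of samples grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The function whose expectation we want under p(x);
    # here f(x) = x**2, so E[f] = 1 under a standard Gaussian.
    return x ** 2

for n_samples in (10, 100, 10_000, 1_000_000):
    # Draw independent samples from p(x) = N(0, 1)...
    x = rng.standard_normal(n_samples)
    # ...and replace the integral by the empirical mean of f over the samples.
    estimate = f(x).mean()
    print(f"N = {n_samples:>9,d}  estimate of E[f] = {estimate:.4f}")
```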

However, as we stated before, the main difficulty is sampling from the distribution p(x) in the first place, and this can sometimes be hard or even impossible. When the distribution p(x) is a directed graphical model (as are most of the models in this book), there is a simple technique to draw a sample, called ancestral sampling.

Given an ordering of the variables xi in the graph, from the top of the graph to the bottom, one can draw a sample from the graphical model by sampling each variable in turn and plugging the sampled values of its parents into the conditional distribution of each descendant. For example, let's say we have the following graph:

[Figure: a directed graph in which A and B are the parents of C, and C is the parent of D.]

We start by sampling from p(A) and p(B) independently, then assign the sampled values A=a and B=b and sample from p(C | A=a, B=b). Finally, the last sample is drawn from p(D | C=c). If none of the variables are observed, the procedure is very simple. If, however, a variable is observed, one technique is to keep only the samples that agree with the values of the observed variables. If we have A=a1, for example, then we keep only the samples in which we have been lucky enough to obtain A=a1. In this case, not all the samples are usable, and the gap between the number of drawn samples and the number of useful samples can be large, because every joint sample that disagrees with the observed values has to be discarded. The probability of a sample being accepted decreases with the number of observed nodes.
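
Here is a minimal Python sketch of ancestral sampling on this graph, with all variables assumed binary and with conditional probability tables that are made up purely for illustration. It also shows the discarding step when A is observed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical distributions for the graph A -> C <- B, C -> D (all binary).
p_A = np.array([0.6, 0.4])                      # p(A)
p_B = np.array([0.3, 0.7])                      # p(B)
p_C_given_AB = np.array([[[0.9, 0.1],           # p(C | A=0, B=0)
                          [0.4, 0.6]],          # p(C | A=0, B=1)
                         [[0.5, 0.5],           # p(C | A=1, B=0)
                          [0.2, 0.8]]])         # p(C | A=1, B=1)
p_D_given_C = np.array([[0.7, 0.3],             # p(D | C=0)
                        [0.1, 0.9]])            # p(D | C=1)

def ancestral_sample():
    # Sample the variables in topological order, feeding each sampled
    # value into the conditional distributions of its children.
    a = rng.choice(2, p=p_A)
    b = rng.choice(2, p=p_B)
    c = rng.choice(2, p=p_C_given_AB[a, b])
    d = rng.choice(2, p=p_D_given_C[c])
    return a, b, c, d

# Suppose A is observed to be 1: draw many samples and keep only those
# that agree with the observation, discarding the rest.
samples = [ancestral_sample() for _ in range(10_000)]
kept = [s for s in samples if s[0] == 1]
print(f"kept {len(kept)} of {len(samples)} samples")
```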
