Markov Chain Monte-Carlo

MCMC methods have their origin in physics with the work of Metropolis, Ulam, and Rosenbluth. It was in the 1980s that they began to have a significant impact on statistics. Many MCMC algorithms and methods have been proposed and they are among the most successful approaches to computing posterior distributions.

If we use the word framework rather than algorithm, it is because there is no single MCMC algorithm; instead, there are many, and the implementation strategy depends on the problem we need to solve.

Monte-Carlo methods have been used for more than half a century to solve many complicated estimation problems. However, their main weakness is, as with rejection and importance sampling, poor convergence in high-dimensional problems.

So Markov Chains were used from the start to assess the convergence and stability of those methods. But it wasn't until the 1980s and 1990s that they started to be used massively in statistical estimation.

General idea of the method

Markov Chain Monte-Carlo methods are based on the same idea as previously, where we have a complex distribution p(x) and a proposal distribution q(x). However, in this case, the state of the variable x is kept along the way and the proposal distribution depends on this current state, that is, we sample from q(x | xt-1). The sequence of sampled values of x forms a Markov Chain.

The main idea is that, at each step of the algorithm, we draw a sample from q(x | xt-1) and accept it according to a certain criterion. If the sample is accepted, the proposal q is re-centered on the new sampled value and we start again; otherwise, a new sample is drawn without changing q. We therefore want to choose a simple distribution q to keep the algorithm efficient.

As before, the problem we want to solve is the expectation of a function with respect to a complex distribution:

E(f) = ∫ f(x) p(x) dx

We assume that we can evaluate p(x), or at least evaluate it up to a normalizing constant (as is typically the case for a posterior distribution, where the normalizing constant is unknown). In order to solve the problem of sampling, the Metropolis-Hastings algorithm (proposed by Metropolis in 1953 and generalized by Hastings in 1970) gives a generic way to construct a Markov Chain that is ergodic and stationary with respect to the distribution p(x).

In other words, if xt ~ p(x) then xt+1 ~ p(x), and therefore the Markov Chain will converge to the distribution p(x).

The principle of MCMC algorithms is somewhat the opposite of rejection and importance sampling: instead of trying to cover the whole distribution at once with the proposal distribution, they explore the space of p(x) locally with a simpler distribution.

Let me give you an analogy, once explained by Professor Christian Robert from the University of Paris-Dauphine, France, and the University of Warwick, UK.

Imagine you are a visitor in a museum and suddenly there is a blackout. The gallery is completely dark. Your only way to look at the paintings is to use a small torch. The beam of the torch is very narrow so at any time you will only see a small part of the painting. But you can move your torch along the painting until you discover all of it. Then you will have the big picture. Of course, you can argue that a painting is more than the sum of its parts, but that's another story.

The Metropolis-Hastings algorithm

This algorithm builds a sequence of sampled values xt, such that this sequence converges to p(x). So the chain of values is a sample from p(x) and these values are approximately distributed according to p(x). However, because each value depends on its previous value, the first samples are also very dependent on the initial value x0. It is therefore recommended to discard the initial values and to give the algorithm a warm-up (or burn-in) period.

We recall a similar result from the previous algorithms. Based on the theory of Markov Chains, it is possible to show that, even if it is hard to determine when the algorithm reaches stationarity, the empirical average of the sampled values, defined by:

f̄L = (1/L) Σt f(xt),  t = 1, …, L

will converge almost surely to E(f). We recall that a sequence X1, X2, …, Xn of random variables converges almost surely to a random variable X if:

P( limn→∞ Xn = X ) = 1

Of course, we can't sample an infinite sequence of values, but we do have a theoretical guarantee of convergence. So, for the purpose of computing expectations, sampling from the Markov chain plays the same role as i.i.d. sampling from the distribution, even though successive samples are correlated. In practice, we need to sample a lot to obtain good results.

The Metropolis-Hastings algorithm works as follows:

  1. Draw a value xt ~ q(xt | xt-1) where q(x) is our simple proposal distribution.
  2. Compute the probability of acceptance with:
    ρ(xt, xt-1) = min( 1, [ p(xt) q(xt-1 | xt) ] / [ p(xt-1) q(xt | xt-1) ] )
  3. Take xt with probability ρ(xt, xt-1) and keep xt-1 with probability 1 − ρ(xt, xt-1) (a short R sketch of these steps follows this list).
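To make these three steps concrete, here is a minimal sketch of a single Metropolis-Hastings transition in R. This is not the implementation used later in this section; the names target.density, proposal.sample, and proposal.density are hypothetical placeholders for a target p(x) we can evaluate (possibly unnormalized) and a proposal q we can both sample from and evaluate:

mh.step <- function(x.prev, target.density, proposal.sample, proposal.density)
{
  # Step 1: draw a candidate from the proposal q(. | x.prev)
  x.cand <- proposal.sample(x.prev)

  # Step 2: acceptance probability rho(x.cand, x.prev), including the Hastings correction
  rho <- min(1, (target.density(x.cand) * proposal.density(x.prev, x.cand)) /
                (target.density(x.prev) * proposal.density(x.cand, x.prev)))

  # Step 3: accept the candidate with probability rho, otherwise keep the previous value
  if (runif(1) < rho)
    return(x.cand)
  else
    return(x.prev)
}

When the proposal is symmetric, as with the uniform proposal used later in this section, the two q terms cancel and rho reduces to min(1, p(x.cand)/p(x.prev)).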

Given the choice of q(x), this algorithm preserves the stationarity of the distribution p(x) in the Markov chain. In theory, this algorithm is guaranteed to converge for an arbitrary proposal distribution q(x). This is an impressive result, hence the popularity of the algorithm. However, in practice, things are a bit more complicated: convergence might happen very late in the process if, for example, the proposal distribution q(x) is too narrow. On the other hand, too wide a distribution leads to a very unstable algorithm: it will still converge, but it takes huge steps and can miss the most important part of the original distribution p(x) by leaving the area of high probability mass too early.

The sequence of samples xt represents a random walk and we can illustrate the previous problem of practical convergence by looking at what happens to such a trajectory in a simple example.

We want to sample from a two-dimensional Gaussian distribution, and we use a smaller two-dimensional Gaussian as the proposal distribution. In order to deal with multivariate Gaussian distributions, we need the MASS package:

library(MASS)

bigauss <- mvrnorm(50000, mu = c(0, 0), Sigma = matrix(c(1, .1, .1, 1), 2))
bigauss.estimate <- kde2d(bigauss[, 1], bigauss[, 2], n = 50)
contour(bigauss.estimate, nlevels = 6, lty = 2)

This code snippet plots a simple two-dimensional Gaussian distribution, as shown next. From there, we will draw each new point from a smaller Gaussian distribution centered at the previously drawn point. This is not yet an application of the Metropolis-Hastings algorithm, but just an example to visualize how the covariance of the proposal affects the random walk and where it can go in just a few iterations.

The starting point of the random walk is arbitrarily placed at the center of the big Gaussian distribution, that is, at coordinates (0,0) in this case:

[Figure: contour plot of the target two-dimensional Gaussian, with the starting point of the random walk at (0,0)]

The following figure shows the small proposal distribution in red contour lines:

[Figure: the small proposal distribution shown as red contour lines inside the target distribution]

The following code draws random samples from the small distribution in the center and updates its center with the previous value:

L <- 10                                     # number of random-walk steps
smallcov <- matrix(c(.1, .01, .01, .1), 2)  # covariance of the small proposal distribution
x <- c(0, 0)                                # starting point at the center
for (i in 1:L)
{
  x2 <- mvrnorm(1, mu = x, Sigma = smallcov)                 # draw the next point around the current one
  lines(c(x[1], x2[1]), c(x[2], x2[2]), t = 'p', pch = 20)   # plot the new point on the contour map
  x <- x2                                                    # move the walk to the new point
}

We show three examples, with 10, 100, and 1,000 points. Obviously, a pure random walk drifts away from the initial big distribution after only a few iterations; with 1,000 iterations, it is completely off:

[Figure: random-walk trajectories with 10, 100, and 1,000 points overlaid on the target distribution]

Now, let's complete the picture and implement the last bit of the Metropolis-Hastings algorithm in R. As we assumed that we can evaluate p(x), our target distribution, we will use here the dnorm function to do so.

To illustrate the behavior of this algorithm, we will take a simple Gaussian distribution as p(x) and an even simpler uniform distribution as q(x).

So the proposal distribution is simply:

q(xt | xt-1) = Uniform(xt-1 − alpha, xt-1 + alpha)

And p(x) = N(0,1), a simple Gaussian distribution:

p = function(x)
{
  # target distribution p(x) = N(0,1)
  dnorm(x, 0, 1)
}

mh = function(x, alpha)
{
  # one Metropolis-Hastings step: uniform proposal on [x - alpha, x + alpha]
  xt <- runif(1, x - alpha, x + alpha)
  if (runif(1) > p(xt) / p(x))   # accept with probability min(1, p(xt)/p(x))
    xt <- x                      # rejected: keep the previous value

  return(xt)
}

sampler = function(L, alpha)
{
  # run the chain for L iterations, starting at x[1] = 0
  x <- numeric(L)
  for (i in 2:L)
    x[i] <- mh(x[i - 1], alpha)

  return(x)
}

x0 <- seq(-4, 4, by = 0.01)   # grid on which to plot the true density p(x)

par(mfrow = c(2, 2))
for (l in c(10, 100, 1000, 10000))
{
  hist(sampler(l, 1), main = paste(l, "iterations"), breaks = 50, freq = F, xlim = c(-4, 4), ylim = c(0, 1))
  lines(x0, p(x0))
}

The first function evaluates p(x). It is followed by the Metropolis-Hastings step, which uses the uniform proposal distribution defined before.

Then we implement the sampler, which takes two parameters: L, the number of iterations, and alpha, the half-width of the uniform distribution. The bigger alpha is, the larger the area the proposal distribution will cover. Too large a value will result in jumps that are too big and poor results. Too small a value will have a serious impact on the numerical convergence of the algorithm, even if we know that, theoretically, it will converge.
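Before looking at the plots, we can also use the chain for what it is meant for: approximating expectations under p(x), as in the empirical average defined earlier. The following is a small sketch, assuming the p, mh, and sampler functions above; the seed and the chain length are arbitrary illustrative choices:

set.seed(42)                  # arbitrary seed, only for reproducibility
chain <- sampler(50000, 1)    # a long run of the chain with alpha = 1
mean(chain)                   # should be close to E(X) = 0 under N(0,1)
mean(chain^2)                 # should be close to E(X^2) = 1 under N(0,1)

In a real application, we would also discard the warm-up period discussed earlier, for example with chain <- chain[-(1:1000)] for a warm-up of 1,000 iterations.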

Then the last part of the code plots the results for 10, 100, 1,000, and 10,000 iterations. Running this code, we obtain the following graph. Your results will differ because the numbers are drawn at random, of course, but the overall plot will be similar:

[Figure: histograms of the sampled values after 10, 100, 1,000, and 10,000 iterations, with the target density p(x) overlaid]

It is clear that, with only 10 iterations, the results are poor. After a warm-up period of 100 iterations, it seems we are close to the mean value, but the variance of such a dataset will be completely off with respect to p(x). After 1,000 iterations, our dataset begins to be quite close to the target distribution. Finally, after 10,000 iterations, we have a superb histogram.

Next, we vary alpha, with 1,000 iterations. We use the following code:

par(mfrow = c(2, 2))
for (a in c(0.1, 0.5, 1, 10))
{
  hist(sampler(1000, a), main = paste("alpha=", a), breaks = 50, freq = F, xlim = c(-4, 4), ylim = c(0, 1))
  lines(x0, p(x0))
}

[Figure: histograms of 1,000 samples for alpha = 0.1, 0.5, 1, and 10, with the target density p(x) overlaid]

Again, we see interesting results: the theory says the algorithm will converge, but in practice it needs a bit of tuning first. It seems that alpha=0.5 and alpha=1 are large enough to cover the distribution. However, alpha=0.1 is too narrow and can't explore the space fast enough in only 1,000 iterations. On the other hand, alpha=10 gives a bi-modal-looking histogram; the jumps are too big.
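A common way to quantify this tuning problem is to monitor the empirical acceptance rate of the chain. The following is a small sketch, assuming the sampler function defined above; since the proposal is continuous, two consecutive identical values in the chain indicate a rejection:

acceptance.rate <- function(L, alpha)
{
  x <- sampler(L, alpha)
  mean(diff(x) != 0)   # fraction of iterations in which the candidate was accepted
}

sapply(c(0.1, 0.5, 1, 10), function(a) acceptance.rate(10000, a))

A rate close to 1 means almost every candidate is accepted and the chain moves in tiny steps (alpha too small), while a rate close to 0 means most candidates are rejected (alpha too large); intermediate values usually give the best exploration.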

If we run the same experiment with many more iterations, such as 50,000, we see a stabilization of the algorithm, and most of the proposal distributions lead to histograms close to the ideal solution. Again, alpha=0.1 and alpha=10 remain a bit weak, but the overall result is more than acceptable this time:

[Figure: histograms of 50,000 samples for the same values of alpha, with the target density p(x) overlaid]