In the Gibbs sampling algorithm, we start by reducing all the factors with the observed variables. After this, we generate a first sample for the unobserved variables from the prior using some sampling method, for example, by using a mutilated Bayesian network. After generating the first sample, we iterate over the unobserved variables, generating a new value for each variable given our current sample values for all the other variables.
Let's take the example of our restaurant model to make this clearer. Assume that we have already observed that the cost of the restaurant is high. So, after reducing with this evidence, we will have the CPDs $P(Q)$, $P(L)$, $P(C = \text{high} \mid Q, L)$, and $P(N \mid C = \text{high}, L)$. We start by generating our first sample with forward sampling; let's say our first samples are $Q = q^{(0)}$, $L = l^{(0)}$, and $N = n^{(0)}$. We will now iterate over all of our unobserved variables N, L, and Q. Starting with N, we will sample it from the distribution $P(N \mid Q = q^{(0)}, L = l^{(0)}, C = \text{high})$. As we are computing the distribution over a single variable, we can compute it very easily as follows:

$$P(N \mid Q = q^{(0)}, L = l^{(0)}, C = \text{high}) = \frac{P(N, Q = q^{(0)}, L = l^{(0)}, C = \text{high})}{\sum_{n} P(N = n, Q = q^{(0)}, L = l^{(0)}, C = \text{high})}$$
Now that we have sampled a new value $N = n^{(1)}$ from the distribution $P(N \mid Q = q^{(0)}, L = l^{(0)}, C = \text{high})$, we continue with the iteration and sample L by conditioning the distribution on this new value of N, that is, from $P(L \mid N = n^{(1)}, Q = q^{(0)}, C = \text{high})$. Similarly, we go on generating samples for the remaining variables.
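To make these steps concrete, here is a minimal sketch of such a Gibbs sampler in Python. The CPD values below are made-up placeholders (the actual restaurant model tables are not reproduced here), all the variables are assumed to be binary, and the helper `sample_conditional` normalizes the joint over a single variable, exactly as in the equation above:

```python
import random

# Made-up factors for a restaurant-like model, with the evidence C = high
# already reduced into them: p_q, p_l are priors, p_c_high[q, l] stands in
# for P(C = high | Q = q, L = l), p_n[n, l] for P(N = n | C = high, L = l).
p_q = [0.6, 0.4]
p_l = [0.7, 0.3]
p_c_high = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}
p_n = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}

def joint(s):
    """Unnormalized P(Q, L, N, C = high) for a full assignment s."""
    q, l, n = s['Q'], s['L'], s['N']
    return p_q[q] * p_l[l] * p_c_high[(q, l)] * p_n[(n, l)]

def sample_conditional(var, state):
    """Sample `var` from P(var | all other variables, C = high) by
    normalizing the joint over the two states of `var`."""
    weights = [joint(dict(state, **{var: v})) for v in (0, 1)]
    return 0 if random.random() * sum(weights) < weights[0] else 1

state = {'Q': 0, 'L': 0, 'N': 0}   # first sample (forward sampling in the text)
samples = []
for _ in range(5000):
    for var in ('N', 'L', 'Q'):    # iterate over the unobserved variables
        state[var] = sample_conditional(var, state)
    samples.append(dict(state))

# Estimate, for example, P(N = 1 | C = high) from the collected samples:
print(sum(s['N'] for s in samples) / len(samples))
```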
The thing to notice here is that, unlike forward sampling, we are taking the evidence into consideration while sampling. This method will not give us the true posterior right away, as we started by sampling from the prior distribution; however, because every step conditions on the evidence, the samples we generate keep moving closer to the posterior distribution as we repeat the process.
In later sections, we will formalize this concept using the Markov chain Monte Carlo method, which will allow us to generate samples much closer to the posterior distribution.
In the case of graphical models, a Markov chain is a graph over the states of the variables X, whose edges represent the probability of transitioning from one state to another. So, an edge represents the probability of transitioning from the state $x$ to the state $x'$, represented by $\mathcal{T}(x \to x')$:

$$\mathcal{T}(x \to x') = P(X^{(t+1)} = x' \mid X^{(t)} = x)$$
Let's take the example of a drunk man walking along a road. The position of the person on the road can be represented by a random variable. Let's say he starts at point 0 and can walk forward up to +4 or backward up to -4, but there are walls beyond these points, so even if he tries to go past them, he stays at the same point. At each step, the probability of moving forward is 0.4, the probability of moving backward is 0.4, and the probability of staying in the same position is 0.2; that is, $\mathcal{T}(x \to x+1) = 0.4$, $\mathcal{T}(x \to x-1) = 0.4$, and $\mathcal{T}(x \to x) = 0.2$, respectively. Also, $\mathcal{T}(+4 \to +4) = \mathcal{T}(-4 \to -4) = 0.6$, as the road is blocked by the walls.
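As a small sketch (the positions and probabilities are the ones from the example above, with names of our choosing), the transition model can be written down as a dictionary mapping each state to its successor distribution:

```python
def build_transition_model(n=4):
    """Transition model T(x -> x') for the walk on positions -n..+n:
    forward/backward with probability 0.4 each, staying with 0.2, and
    any move blocked by a wall redirected into staying."""
    T = {}
    for x in range(-n, n + 1):
        probs = {x: 0.2, x + 1: 0.4, x - 1: 0.4}
        for y in (x - 1, x + 1):
            if abs(y) > n:                  # blocked by the wall:
                probs[x] += probs.pop(y)    # that mass goes to staying
        T[x] = probs
    return T

T = build_transition_model()
print(T[0])   # {0: 0.2, 1: 0.4, -1: 0.4}
print(T[4])   # {4: 0.6, 3: 0.4}
```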
We can consider the position of the man at any given time t to be a random variable, represented by $X^{(t)}$. The distribution over his position at the next time instant can then be computed as follows:

$$P^{(t+1)}(X^{(t+1)} = x') = \sum_{x} P^{(t)}(X^{(t)} = x)\, \mathcal{T}(x \to x')$$
Putting the earlier equation in words, we can say that the probability of the person being at point $x'$ at time (t + 1) is equal to the sum, over all states $x$, of the probability of the person being in state $x$ at time t multiplied by the probability of transitioning from $x$ to $x'$.
Let's now try computing a few probability values for the man's position. We know that the man started from point 0, so at time t = 0, $P^{(0)}(X^{(0)} = 0) = 1$. Now, at time t = 1, the probability of the man being at point 0 is $0.2$, and the probability of his being at +1 or -1 is $0.4$ each. Moving on, at time t = 2, the probability of the man being at point 0 is $0.2 \times 0.2 + 0.4 \times 0.4 + 0.4 \times 0.4 = 0.36$, at point +1 or -1 it is $0.2 \times 0.4 + 0.4 \times 0.2 = 0.16$ each, and at point +2 or -2 it is $0.4 \times 0.4 = 0.16$ each. We can now see that the probability of being at different states spreads out with each time instant, and eventually the distribution approaches the uniform distribution.
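The same numbers fall out of iterating the update equation directly. A short sketch, reusing the `build_transition_model` helper from above:

```python
def step(dist, T):
    """One application of P(t+1)(x') = sum_x P(t)(x) * T(x -> x')."""
    new_dist = {x: 0.0 for x in T}
    for x, p in dist.items():
        for x_next, t_prob in T[x].items():
            new_dist[x_next] += p * t_prob
    return new_dist

dist = {x: 0.0 for x in T}
dist[0] = 1.0                       # the man starts at point 0
for t in (1, 2):
    dist = step(dist, T)
    print(t, {x: round(p, 2) for x, p in dist.items() if p > 0})
# 1 {-1: 0.4, 0: 0.2, 1: 0.4}
# 2 {-2: 0.16, -1: 0.16, 0: 0.36, 1: 0.16, 2: 0.16}
# Iterating further flattens the distribution towards uniform (1/9 each).
```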
To sample from the Markov chain, we can simply select a state at each instant of time using the distribution for that instant. However, Markov chains are not a very good method if we want to sample from a uniform distribution, because for the range $-n, \ldots, +n$ it takes on average about $n^2$ steps to get close to the uniform distribution. So now, let's try to find out when a Markov chain converges and what the distribution at convergence is.
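In practice, sampling a trajectory just means repeatedly drawing the next state from the current state's successor distribution; a minimal sketch, again using the `T` built above:

```python
import random

def walk(T, start=0, steps=10):
    """Sample a trajectory by repeatedly drawing the next state from T."""
    x, path = start, [start]
    for _ in range(steps):
        states, probs = zip(*T[x].items())
        x = random.choices(states, weights=probs)[0]
        path.append(x)
    return path

print(walk(T))   # for example: [0, -1, 0, 0, 1, 2, 1, 1, 0, -1, -1]
```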
To make the computation simpler, let's take the example of a similar but much smaller Markov chain, as shown in Fig 4.21:
At equilibrium, we can say that for any state $x'$, $P^{(t)}(x')$ should be almost equal to $P^{(t+1)}(x')$:

$$P^{(t)}(X^{(t)} = x') \approx P^{(t+1)}(X^{(t+1)} = x') = \sum_{x} P^{(t)}(X^{(t)} = x)\, \mathcal{T}(x \to x')$$
At equilibrium, the distribution is known as the stationary distribution and is represented by $\pi$. We can write this condition as follows:

$$\pi(X = x') = \sum_{x} \pi(X = x)\, \mathcal{T}(x \to x')$$
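As a quick numerical check of this condition, using the `step` and `build_transition_model` helpers sketched earlier: for the drunkard's-walk chain, the uniform distribution over the nine positions is stationary, since one more transition step leaves it unchanged.

```python
T = build_transition_model()
pi = {x: 1 / 9 for x in T}          # uniform over the nine positions
new_pi = step(pi, T)
print(all(abs(new_pi[x] - pi[x]) < 1e-12 for x in T))   # True
```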
Now, let's try to compute the stationary distribution for the Markov chain in Fig 4.21. Writing the stationarity condition for each of its states, with the transition probabilities taken from the figure, gives us one linear equation per state.
For this to be a legal probability distribution, it should also satisfy:

$$\sum_{x} \pi(X = x) = 1$$
We can now easily solve this set of linear equations to get the stationary distribution of the chain.
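Here is a sketch of the same computation in code. As the transition values of Fig 4.21 are not reproduced here, the matrix below is only an illustrative three-state example; we solve the balance equations $\pi = \pi T$ together with the normalization constraint as one linear system, replacing one of the (linearly dependent) balance equations with the normalization row.

```python
import numpy as np

# Illustrative 3-state transition matrix (each row sums to 1);
# T[i, j] stands for T(x_i -> x_j).  Substitute the values from Fig 4.21.
T = np.array([[0.25, 0.75, 0.00],
              [0.00, 0.70, 0.30],
              [0.50, 0.00, 0.50]])

A = T.T - np.eye(3)                 # stationarity: (T^T - I) pi = 0
A[-1, :] = 1.0                      # replace one equation by sum(pi) = 1
b = np.array([0.0, 0.0, 1.0])
pi = np.linalg.solve(A, b)
print(pi)                           # [0.2 0.5 0.3] for this matrix
print(pi @ T)                       # equals pi again, as expected
```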
In this case, we got a unique solution for the distribution, but in general, we cannot guarantee that we will always get a converged distribution. For a finite-state Markov chain, we can check the following two conditions to verify that the distribution converges:

- It is possible to get from any state to any other state using transitions of positive probability.
- For every state, there is a positive probability of transitioning from the state to itself (a self-loop), so the chain cannot get stuck cycling through its states with a fixed period.
These two conditions are sufficient, but not necessary, to guarantee convergence of the distribution.
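A simple numerical way to check that a chain satisfying these conditions converges is to look at the powers of its transition matrix: for such a (regular) chain, some power $T^k$ has all entries strictly positive. A sketch, reusing the matrix from the previous snippet:

```python
import numpy as np

def is_regular(T, max_power=None):
    """Return True if some power of T has all strictly positive entries,
    which guarantees convergence to a unique stationary distribution."""
    n = len(T)
    max_power = max_power or n * n   # a crude but safe bound for small chains
    P = np.eye(n)
    for _ in range(max_power):
        P = P @ T
        if np.all(P > 0):
            return True
    return False

print(is_regular(np.array([[0.25, 0.75, 0.00],
                           [0.00, 0.70, 0.30],
                           [0.50, 0.00, 0.50]])))   # True
```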