In the previous chapter, we saw algorithms for exact inference on graphical models. The computational complexity of exact inference is exponential in the treewidth of the network, so for large networks with high treewidth, exact inference becomes infeasible. Also, in many real-life problems, we are not particularly concerned with the exact probabilities of random variables; rather, we are much more interested in the relative probabilities of their states. Therefore, in this chapter, we will discuss algorithms to perform approximate inference over networks. There are many algorithms for approximate inference, but the approach to finding an approximate distribution remains the same in all of them: we define a target class Q of "easy" distributions, and then from this class, we try to find the distribution that is closest to our actual distribution and answer inference queries from this estimated distribution.
In this chapter, we will develop this optimization view of approximate inference step by step.
Let's start with a little recap of exact inference. Assume that we have a factorized distribution of the following form:

$$P_\Phi(X_1, \ldots, X_n) = \frac{1}{Z} \prod_{i} \phi_i(C_i)$$

Here, $Z$ is the partition function, the $\phi_i$ are the factors in the network, and $C_i$ is the scope of the factor $\phi_i$. In the case of exact inference, we computed $P_\Phi$ and then answered queries over this distribution.
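As a quick, made-up illustration of this factorized form, the following sketch builds the exact joint for two binary variables from two factors; the factor values are arbitrary and chosen only to show the role of the partition function $Z$:

```python
import numpy as np

# Hypothetical example: two binary variables A, B with factors
# phi1(A) and phi2(A, B); the joint is their normalized product.
phi1 = np.array([2.0, 1.0])            # phi1[a]
phi2 = np.array([[5.0, 1.0],
                 [1.0, 5.0]])          # phi2[a, b]

unnormalized = phi1[:, None] * phi2    # product of all factors, shape (2, 2)
Z = unnormalized.sum()                 # partition function
P = unnormalized / Z                   # exact joint distribution P(A, B)

print(Z)        # 18.0
print(P.sum())  # 1.0
```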
In the case of belief propagation, the end result of running the algorithm was a set of beliefs on the clusters and sepsets. This set of beliefs was able to represent the joint distribution $P_\Phi$. So, in the case of exact inference, we tried to find a set of calibrated beliefs that represented our joint distribution exactly. For approximate algorithms, we will instead try to select, from all the sets of beliefs that conform to the cluster tree, the set that is best able to represent our original distribution $P_\Phi$.
So now, the question is: how do we compare the similarity of two distributions? There are many methods that we can use to compute the relative similarity of two distributions, for example, the Euclidean distance, the $L_1$ distance, and the relative entropy. However, the problem with most of these methods is that we need to answer hard queries on $P_\Phi$ to compute the distance, and the whole purpose of approximate inference is to avoid computing the exact joint distribution. By using relative entropy to measure the similarity between the distributions, we can avoid answering hard queries on $P_\Phi$. Now, let's see how relative entropy is defined over distributions.
The relative entropy between two distributions $P_1$ and $P_2$ is defined as follows:

$$D(P_1 \| P_2) = E_{P_1}\left[\ln \frac{P_1(X)}{P_2(X)}\right] = \sum_{x} P_1(x) \ln \frac{P_1(x)}{P_2(x)}$$
The relative entropy is always non-negative and is 0 only when $P_1 = P_2$. Also, relative entropy is not a symmetric quantity, so in general $D(P_1 \| P_2) \neq D(P_2 \| P_1)$.
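The following short sketch (with arbitrary example distributions) computes the relative entropy for discrete distributions and illustrates both of these properties:

```python
import numpy as np

def relative_entropy(p1, p2):
    """D(P1 || P2) = sum_x P1(x) * ln(P1(x) / P2(x)) for discrete distributions."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    mask = p1 > 0                      # terms with P1(x) = 0 contribute 0
    return float(np.sum(p1[mask] * np.log(p1[mask] / p2[mask])))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

print(relative_entropy(p, p))  # 0.0: zero only when the distributions are equal
print(relative_entropy(p, q))  # ~0.085
print(relative_entropy(q, p))  # ~0.092 -> D(P||Q) != D(Q||P), not symmetric
```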
Now, in our case of approximate inference, we will use $D(Q \| P_\Phi)$ (not $D(P_\Phi \| Q)$) because computing $D(P_\Phi \| Q)$ requires taking an expectation with respect to $P_\Phi$, which amounts to performing exact inference on it. Then, we can find the distribution $Q$ that minimizes $D(Q \| P_\Phi)$.
Summarizing our complete optimization problem: let's assume that we have a cluster tree $T$ for a distribution $P_\Phi$ and are given the following set of beliefs:

$$Q = \{\beta_i : i \in V_T\} \cup \{\mu_{i,j} : (i,j) \in E_T\}$$
Here, $C_i$ denotes the clusters in $T$, $\beta_i$ denotes the beliefs over $C_i$, and $\mu_{i,j}$ denotes the beliefs over the sepset $S_{i,j} = C_i \cap C_j$. This set of beliefs represents a distribution $Q$ as follows:

$$Q = \frac{\prod_{i \in V_T} \beta_i(C_i)}{\prod_{(i,j) \in E_T} \mu_{i,j}(S_{i,j})}$$
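To make this concrete, here is a small sketch on a hypothetical two-cluster tree with clusters $C_1 = \{A, B\}$ and $C_2 = \{B, C\}$ and sepset $S_{1,2} = \{B\}$; the belief values are invented, but calibrated:

```python
import numpy as np

# Hypothetical calibrated beliefs on a two-cluster tree:
# clusters C1 = {A, B}, C2 = {B, C}, sepset S12 = {B}.
beta1 = np.array([[0.30, 0.20],
                  [0.10, 0.40]])       # beta1[a, b], sums to 1
beta2 = np.array([[0.28, 0.12],
                  [0.18, 0.42]])       # beta2[b, c], sums to 1
mu12 = beta1.sum(axis=0)               # sepset belief over B: [0.4, 0.6]

# Q(A, B, C) = beta1(A, B) * beta2(B, C) / mu12(B)
Q = beta1[:, :, None] * beta2[None, :, :] / mu12[None, :, None]
print(Q.sum())            # 1.0 -> the beliefs define a valid joint distribution
print(Q.sum(axis=(0, 2))) # [0.4, 0.6] -> Q's marginal over B recovers mu12
```

Because the beliefs are calibrated, dividing out the sepset belief exactly cancels the double counting of $B$, so $Q$ sums to 1.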
As the cluster tree is calibrated, it satisfies the marginal consistency constraints; that is, for each edge $(i, j)$, the sepset belief $\mu_{i,j}$ is the marginal of both $\beta_i$ and $\beta_j$. Therefore, the set of calibrated beliefs $Q$ satisfies the following equations:

$$\mu_{i,j}(S_{i,j}) = \sum_{C_i \setminus S_{i,j}} \beta_i(C_i) = \sum_{C_j \setminus S_{i,j}} \beta_j(C_j)$$
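Continuing the sketch above, this consistency constraint can be verified directly by marginalizing each cluster belief onto the sepset:

```python
# Marginal consistency check: the sepset belief must be the marginal
# of each adjacent cluster belief over the sepset variable B.
from_beta1 = beta1.sum(axis=0)   # marginalize A out of beta1(A, B)
from_beta2 = beta2.sum(axis=1)   # marginalize C out of beta2(B, C)
assert np.allclose(from_beta1, from_beta2)   # both equal mu12 = [0.4, 0.6]
```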
Now, we can define our optimization problem as selecting, from the space of calibrated belief sets, the set of beliefs $Q$ that minimizes $D(Q \| P_\Phi)$ subject to the following constraints:

$$\mu_{i,j}(S_{i,j}) = \sum_{C_i \setminus S_{i,j}} \beta_i(C_i) \quad \forall (i,j) \in E_T$$

$$\sum_{C_i} \beta_i(C_i) = 1 \quad \forall i \in V_T$$
To solve this optimization problem, we examine the different configurations of beliefs that satisfy these constraints and select the one that minimizes our objective function, $D(Q \| P_\Phi)$.
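As a toy illustration of this idea (not the actual cluster-tree optimization, which searches over belief sets under the constraints above), the following sketch scans a one-parameter family of candidate distributions and keeps the one minimizing $D(Q \| P)$:

```python
import numpy as np

def kl(q, p):
    """D(Q || P) for discrete distributions."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Fixed target distribution over one binary variable (arbitrary example).
P = np.array([0.75, 0.25])

# Candidate family Q(theta) = [theta, 1 - theta]; pick the minimizer.
thetas = np.linspace(0.01, 0.99, 99)
best = min(thetas, key=lambda t: kl(np.array([t, 1 - t]), P))
print(best)  # ~0.75: the minimizer recovers P, where D(Q || P) = 0
```

Here, the candidate family is rich enough to contain $P$ itself, so the minimum is 0; in general, $Q$ is restricted to a simpler class, and the minimizer is only an approximation of $P_\Phi$.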