The junction tree algorithm

In this section, we will give an overview of the main algorithm in probabilistic graphical models: the junction tree algorithm. The name arises from the fact that, before performing any numerical computations, we transform the graph of the probabilistic graphical model into a tree with a set of properties that allow the efficient computation of posterior probabilities.

One of the main features of this algorithm is that it computes not only the posterior distribution of the variables in the query, but also the posterior distribution of all the other unobserved variables. Therefore, for the same computational price, one can obtain any posterior distribution.

In order to achieve this, the junction tree algorithm combines the efficiency of belief propagation and the sum-product algorithm, which we saw before, with the generality of the variable elimination procedure. Indeed, variable elimination works on any type of tree (but not on graphs with loops), and the sum-product algorithm caches intermediary results in order to make the computations more efficient. Because variable elimination only works on trees, we need to transform a graph with loops into a tree representing an equivalent factorization of the distribution.

The junction tree algorithm is based on the following consideration. Let's take our favorite example again, P(A, B, C, D) = P(A)P(B | A)P(C | B)P(D | C), and apply the Bayes rule to each factor:

P(A,B,C,D) = P(A) . [P(A,B)/P(A)] . [P(B,C)/P(B)] . [P(C,D)/P(C)] = P(A,B)P(B,C)P(C,D) / [P(B)P(C)]

This formula is very interesting because the denominators involve the variables in the intersection of the sets {A,B},{B,C} and {B,C},{C,D}. This reparameterization of the initial factorization is a first hint at how to transform a graph and how to perform inference on the transformed graph. Indeed, P(B) and P(C) are the probability distributions of the separators between the aforementioned sets, that is, the intersections of clusters of variables.

Of course, this is not the general principle but a useful observation for building a tree from a graph and performing inference.
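To make this observation concrete, here is a minimal R sketch (the numbers are made up for illustration and do not come from the text) that builds the joint distribution of the chain A, B, C, D both from the original factorization and from the cluster/separator reparameterization, and checks that the two agree:

pA = c(0.3, 0.7)                                      # P(A)
pBgA = matrix(c(0.9, 0.1, 0.2, 0.8), 2, byrow=TRUE)   # P(B|A), rows indexed by A
pCgB = matrix(c(0.6, 0.4, 0.3, 0.7), 2, byrow=TRUE)   # P(C|B)
pDgC = matrix(c(0.5, 0.5, 0.1, 0.9), 2, byrow=TRUE)   # P(D|C)

joint = array(0, dim=c(2,2,2,2))                      # joint from the original factorization
for (a in 1:2) for (b in 1:2) for (c in 1:2) for (d in 1:2)
  joint[a,b,c,d] = pA[a] * pBgA[a,b] * pCgB[b,c] * pDgC[c,d]

pAB = apply(joint, c(1,2), sum)                       # cluster marginals P(A,B), P(B,C), P(C,D)
pBC = apply(joint, c(2,3), sum)
pCD = apply(joint, c(3,4), sum)
pB = apply(joint, 2, sum)                             # separator marginals P(B) and P(C)
pC = apply(joint, 3, sum)

joint2 = array(0, dim=c(2,2,2,2))                     # joint from the reparameterization
for (a in 1:2) for (b in 1:2) for (c in 1:2) for (d in 1:2)
  joint2[a,b,c,d] = pAB[a,b] * pBC[b,c] * pCD[c,d] / (pB[b] * pC[c])

all.equal(joint, joint2)                              # TRUE: both give the same distribution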

The building of a junction tree will go through four phases in order to transform the graph into a tree:

  1. Moralization of the graph, which consists in joining each pair of parents of each node with an undirected edge:
    [Figure: the moralized graph]
  2. Then the graph is transformed into an undirected graph where each arrow is replaced by a simple edge. The effect of the first two operations is that each variable (a node in the graph) and its parents are now in the same clique, that is, a subgraph in which all the nodes are connected to each other.
  3. Then the graph is triangulated: when a graph has loops, variable elimination and its re-representation in terms of the induced graph are equivalent to adding an edge between two variables belonging to the same undirected loop. We saw a simple example before: we eliminated variable A and obtained a new graph. When a graph has a loop, this elimination step is equivalent to adding a chord between two nodes, and we therefore need to perform it on the graph before the next step. In the next graph, the dashed lines are due to triangulation, while the plain lines are due to the two previous steps:
    [Figure: the triangulated graph; dashed edges are added by the triangulation]
  4. Finally, the last step transforms the triangulated graph into a cluster tree in which each node represents a factor on a subset of variables. The subsets are determined by the cliques of the graph. Between each pair of cluster nodes, we will have another type of node called a separator. Recall the first simple example we saw at the beginning of this section, in which we reparameterized the model by putting all the separators in the denominator: this is exactly what we are doing here, but on any type of graph (a short code sketch of these four steps follows this list). The cluster tree is computed by:
    • Finding the cliques of the triangulated graph and merging the nodes of each clique into a single cluster node.
    • Computing a maximum spanning tree on this cluster graph, where the weight of an edge is the number of variables shared by the two clusters it connects. The junction tree is this maximum spanning tree.
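As a code sketch of these four steps, the gRbase package (on which gRain is built) provides the moralize(), triangulate(), and rip() functions; the small graph below and the dag() formula syntax (each term reading child:parents) are assumptions made for illustration only:

library(gRbase)

g = dag(~A + C:A + D:A + B:C:D)   # a small DAG: A -> C, A -> D, C -> B, D -> B
mg = moralize(g)                  # steps 1 and 2: marry the parents, drop the directions
tg = triangulate(mg)              # step 3: add chords to break long undirected loops
rip(tg)                           # step 4: the cliques and separators of the junction tree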

So, from the cluster tree we obtain at the end, which is also called a junction tree, we have two types of nodes: cluster nodes and separator nodes. More generally, in the same spirit as our initial example, the probability distribution represented by a junction tree is equal to:

P(X) = ∏c φ(c) / ∏s φ(s)

where φ(c) is a factor on each cluster c of the junction tree and φ(s) is a factor on each separator s of the junction tree. Let's look at the full transformation from an example in Bayesian Reasoning and Machine Learning, D. Barber, Cambridge University Press, 2012.

The initial graph is as follows:

[Figure: the initial directed graph]

Now, the triangulated and undirected graph based on this initial graph is as follows:

[Figure: the moralized and triangulated graph]

Finally, the junction tree is as follows:

[Figure: the junction tree, with cluster and separator nodes]

Inference on the junction tree is realized by passing messages from one cluster to the next in two passes: from the root to the leaves and then from the leaves back to the root. After this full schedule of updates between clusters, each cluster contains the posterior probability distribution of the variables it holds (such as, for example, P(ABC) in the top node of our example). Finally, finding the posterior distribution of any variable boils down to applying the Bayes rule to one of these clusters and marginalizing out the few variables we are not interested in.
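For example, once message passing is done, extracting a marginal from a cluster is just a matter of summing over the other variables of that cluster. Here is a minimal R sketch, with a made-up cluster table over A, B, and C (not taken from the example):

pABC = array(runif(8), dim=c(2,2,2))   # a made-up, unnormalized table over A, B and C
pABC = pABC / sum(pABC)                # normalize it into a proper joint distribution
pA = apply(pABC, 1, sum)               # P(A): sum out B and C (dimensions 2 and 3)
pA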

Implementing a junction tree algorithm is a complex task but, fortunately, several R packages contain a full implementation. And you have already used one of them: in the first chapter, we saw small examples of Bayesian inference with the gRain package, whose inference algorithm is the junction tree algorithm.

So, as an exercise, we will build on and experiment with one of our previous examples, in which we have the variables A, B, C, D, E, and F. For the sake of simplicity, we will consider each variable to be binary so that we won't have too many values to deal with. We will assume the following factorization:

P(ABCDEF) = P(F).P(C|F).P(E|F).P(A|C).P(D|E).P(B|A,D)

This is represented by the following graph:

[Figure: the directed graph of the factorization over A, B, C, D, E, and F]

We first start by loading the gRain package into R:

library(gRain)

And then we create our set of random variables from A to F:

val=c("true","false")
F = cptable(~F, values=c(10,90),levels=val)
C = cptable(~C|F, values=c(10,90,20,80),levels=val)
E = cptable(~E|F, values=c(50,50,30,70),levels=val)
A = cptable(~A|C, values=c(50,50,70,30),levels=val)
D = cptable(~D|E, values=c(60,40,70,30),levels=val)
B = cptable(~B|A:D, values=c(60,40,70,30,20,80,10,90),levels=val)

As you may remember, the cptable function creates a conditional probability table, which is a factor over discrete variables. The probabilities associated with each variable are purely subjective and only serve the purpose of the example.
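One detail worth knowing, which can be checked against the printed table of B further below: in the values vector, the child variable varies fastest, then the parents in the order they are listed. For C given F, values=c(10,90,20,80) therefore means P(C=true|F=true)=0.1, P(C=false|F=true)=0.9, P(C=true|F=false)=0.2, and P(C=false|F=false)=0.8. A quick way to verify the ordering is to compile just these two tables and print the result:

print(compileCPT(list(F, C))$C)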

Because we specified the parents of each variable every time we created a conditional probability table, we have also defined our graph completely. Therefore, the next step is to compute the junction tree. In most packages, computing the junction tree is done by calling a single function, because the algorithm just does everything at once.

Here, we will perform the following:

plist = compileCPT(list(F,E,C,A,D,B))
plist

We check that the list of variables has been correctly compiled into a probabilistic graphical model; the previous code gives the following output:

CPTspec with probabilities:
 P( F )
 P( E | F )
 P( C | F )
 P( A | C )
 P( D | E )
 P( B | A D )

This is indeed the factorization of our distribution, as stated before. If we want to check further, we can look at the conditional probability tables of a few variables:

print(plist$F)
print(plist$B)

The result is, as expected, the conditional probability table:

F
 true false
0.1   0.9
, , D = true

       A
B       true false
  true   0.6   0.7
  false  0.4   0.3

, , D = false

       A
B       true false
  true   0.2   0.1
  false  0.8   0.9

The second output is a bit more complex, but if you look carefully you will see that you have two distributions, P(B|A,D=true) and P(B|A,D=false), which is a more readable presentation of P(B|A,D).

We finally create the graph and run the junction tree algorithm by calling:

jtree = grain(plist)

Again, check the result we obtain:

jtree
Independence network: Compiled: FALSE Propagated: FALSE
  Nodes: chr [1:6] "F" "E" "C" "A" "D" "B"

At this point, you might think, "Is that all?" Well, yes it is. Now that you have the junction tree representation of the graph you can perform any possible inference. And moreover, you only need to compute the junction tree once. Then all queries can be computed with the same junction tree. Of course, if you change the graph, then you need to recompute the junction tree.
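As a side note, the grain object above reports Compiled: FALSE and Propagated: FALSE because gRain compiles the junction tree and propagates the messages lazily, when the first query is made. You can also trigger both steps explicitly; here is a minimal sketch, assuming the compile() and propagate() methods provided by your version of gRain:

jtreec = compile(jtree)      # build the junction tree structure explicitly
jtreec = propagate(jtreec)   # run the two message-passing passes over the tree
jtreec                       # should now report Compiled: TRUE and Propagated: TRUE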

Let's perform a few queries:

querygrain(jtree, nodes=c("F"), type="marginal")
$F
F
 true false
0.1   0.9

Of course, if you ask for the marginal distribution of F, you will obtain the initial conditional probability table, because F has no parents. At least we know it works!

 querygrain(jtree, nodes=c("C"), type="marginal")
$C
C
 true false
0.19  0.81

This is more interesting because it computes the marginal of C, while we only stated the conditional distribution of C given F. We didn't need a complex algorithm such as the junction tree algorithm to compute such a small marginal; the variable elimination algorithm we saw earlier would have been enough.
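We can verify this number by hand from the tables defined above, since P(C=true) = P(C=true|F=true)P(F=true) + P(C=true|F=false)P(F=false):

0.1 * 0.1 + 0.2 * 0.9   # = 0.19, matching the marginal returned by the junction tree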

But if you ask for the marginal of B, then variable elimination will not work because of the loop in the graph. However, the junction tree algorithm gives the following:

 querygrain(jtree, nodes=c("B"), type="marginal")
$B
B
    true    false
0.478564 0.521436

And we can ask for a more complex distribution, such as the joint distribution of B and A:

querygrain(jtree, nodes=c("A","B"), type="joint")
       B
A           true    false
  true  0.309272 0.352728
  false 0.169292 0.168708

In fact, any combination can be given such as A,B,C:

querygrain(jtree, nodes=c("A","B","C"), type="joint")
, , B = true

       A
C           true    false
  true  0.044420 0.047630
  false 0.264852 0.121662

, , B = false

       A
C           true    false
  true  0.050580 0.047370
  false 0.302148 0.121338

Now we want to observe a variable and compute the posterior distribution of the others. Let's say F=true, and we want to propagate this information down to the rest of the network:

jtree2 = setEvidence(jtree, evidence=list(F="true"))

And we query the network again:

querygrain(jtree, nodes=c("F"), type="marginal")
$F
F
 true false
0.1   0.9
querygrain(jtree2, nodes=c("F"), type="marginal")
$F
F
 true false
    1     0

This query is most interesting: in the first query, on jtree, we have the marginal of F, and in the second query, on jtree2, we have … P(F=true) = 1! Indeed, we set evidence in the network saying that F=true, so the probability of this value is now 1.
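gRain can also return the probability of the evidence itself; as a small sketch, assuming the pEvidence() function available in recent versions of the package:

pEvidence(jtree2)   # probability of the evidence F=true, which should equal the prior P(F=true)=0.1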

More interestingly, we can ask for any joint or marginal now:

querygrain(jtree, nodes=c("A"), type="marginal")
$A
A
 true false
0.662 0.338

querygrain(jtree2, nodes=c("A"), type="marginal")
$A
A
 true false
 0.68  0.32

Here we see that knowing F=true changed the marginal distribution of A from its prior value (the second query is again on jtree2, the tree with evidence).
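Both numbers can be checked by hand from the tables we defined, using P(C=true)=0.19 for the prior and P(C|F=true) for the posterior:

0.5 * 0.19 + 0.7 * 0.81   # P(A=true) = sum over C of P(A=true|C)P(C) = 0.662
0.5 * 0.1 + 0.7 * 0.9     # P(A=true|F=true) = sum over C of P(A=true|C)P(C|F=true) = 0.68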

And we can query any other variable (and see that the results are different):

querygrain(jtree, nodes=c("B"), type="marginal")
$B
B
    true    false
0.478564 0.521436

querygrain(jtree2, nodes=c("B"), type="marginal")
$B
B
  true  false
0.4696 0.5304

Finally, we can set more evidence and propagate it back and forth in the network to compute inverse probabilities as well:

jtree3 = setEvidence(jtree, evidence=list(F="true",A="false"))

Here we say that F=true and A=false, and query the network again, looking at the difference before and after setting the evidence:

querygrain(jtree, nodes=c("C"), type="marginal")
$C
C
 true false
 0.19  0.81

querygrain(jtree2, nodes=c("C"), type="marginal")
$C
C
     true     false
0.0989819 0.9010181

querygrain(jtree3, nodes=c("C"), type="marginal")
$C
C
   true   false
0.15625 0.84375

As expected, knowing the values of A and F drastically changes the probability distribution of C. As an exercise, I let the reader set evidence on F (and then on F and B) to see what happens to the posterior distribution of A.
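One possible starting point for this exercise (the outputs are intentionally omitted, and the chosen evidence values are only an example):

jtree4 = setEvidence(jtree, evidence=list(F="true"))
jtree5 = setEvidence(jtree, evidence=list(F="true", B="true"))
querygrain(jtree4, nodes=c("A"), type="marginal")
querygrain(jtree5, nodes=c("A"), type="marginal")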
