Belief propagation

In the previous section, we saw that the basic operation of the variable elimination algorithm is the manipulation of factors. First, we create a factor ψ by multiplying existing factors. Then, we eliminate a variable in ψ to generate a new factor τ, which is then used to create another factor. From a different perspective, we can say that a factor ψ_i is a computational data structure, which takes messages δ_{j→i} generated by other factors ψ_j, and generates a message δ_{i→k} which is used by another factor ψ_k.

Clique tree

Before we go into a detailed discussion of the belief propagation algorithm, let's discuss the graphical model that provides the basic framework for it, the clique tree, also known as the junction tree.

The clique tree T is an undirected graph over a set of factors Φ, where each node represents a cluster of random variables and the edges connect clusters whose scopes have a nonempty intersection. Thus, each edge between a pair of clusters C_i and C_j is associated with a sepset S_{i,j} = C_i ∩ C_j. For each cluster C_i, we also define the cluster potential ψ_i, which is the factor representing all the variables present in it.

This can be generalized. Let's assume there is a variable X such that X ∈ C_i and X ∈ C_j. Then, X is also present in every cluster on the path between C_i and C_j in T. This is known as the running intersection property. We can see an example in the following figure:


Fig 3.4: A simple cluster tree with clusters {A, B, C} and {C, D}. The sepset associated with the edge is {C}.

In pgmpy, we can define a clique tree or junction tree in the following way:

# Firstly import JunctionTree class
In [1]: from pgmpy.models import JunctionTree
In [2]: junction_tree = JunctionTree() 

# Each node in the junction tree is a cluster of random variables
# represented as a tuple
In [3]: junction_tree.add_nodes_from([('A', 'B', 'C'), 
                                      ('C', 'D')])
In [4]: junction_tree.add_edge(('A', 'B', 'C'), ('C', 'D'))

In [5]: junction_tree.add_edge(('A', 'B', 'C'), ('D', 'E', 'F'))
        ValueError: No sepset found between these two edges.

Tip

As discussed previously, the junction tree contains undirected edges only between those clusters whose scopes have a nonempty intersection. So, if we try to add an edge between two nodes whose scopes have an empty intersection, a ValueError is raised.
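The running intersection property introduced earlier can also be checked programmatically. The following is a minimal sketch (not part of pgmpy's API) that represents a cluster tree as a networkx graph whose nodes are tuples of variable names and verifies the property for every pair of clusters:

import networkx as nx

def satisfies_running_intersection(tree):
    """Check that every variable shared by two clusters also appears in
    all clusters on the path between them."""
    nodes = list(tree.nodes())
    for i, ci in enumerate(nodes):
        for cj in nodes[i + 1:]:
            common = set(ci) & set(cj)
            path = nx.shortest_path(tree, ci, cj)
            if not all(common <= set(ck) for ck in path):
                return False
    return True

# The cluster tree defined above: clusters {A, B, C} and {C, D}
tree = nx.Graph([(('A', 'B', 'C'), ('C', 'D'))])
print(satisfies_running_intersection(tree))  # True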

Constructing a clique tree

In the previous section on variable elimination, we saw that an induced graph created by variable elimination is a chordal graph. The converse of it also holds true; that is, any chordal graph can be used as a basis for inference.

We previously discussed chordal graphs, triangulation techniques (the process of constructing a chordal graph that incorporates an existing graph), and their implementation in pgmpy. To construct a clique tree from the chordal graph, we need to find the maximal cliques in it. There are multiple ways of doing this. One of the simplest is to use maximum cardinality search (which we discussed in the previous section) to obtain the maximal cliques.

Then, these maximal cliques are assigned as nodes in the clique tree. Finally, to determine the edges of the clique tree, we use the maximum spanning tree algorithm. We build an undirected graph whose nodes are the maximal cliques of the chordal graph, where every pair of nodes C_i and C_j with intersecting scopes is connected by an edge whose weight is |C_i ∩ C_j|, the size of their sepset. Then, by applying the maximum spanning tree algorithm, we find the tree over these nodes whose total edge weight is maximum.

The cluster potential for each cluster in the clique tree can be computed as the product of the factors associated with that cluster. For example, in the following figure, ψ(A, B, C) (the cluster potential associated with the cluster (A, B, C)) is computed as the product of P(A), P(C), and P(B | A, C). To compute ψ(B, D, E) (the cluster potential associated with (B, D, E)), we use P(E | B, D), P(B), and P(D). P(B) is computed by marginalizing ψ(A, B, C) with respect to A and C.


Fig 3.5: The cluster potential of the clusters present in the clique tree
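To make this concrete, here is a minimal pgmpy sketch of the first cluster potential in Fig 3.5. The numerical values are made up for illustration, and the sketch assumes that pgmpy's Factor objects can be combined with the * product operator:

from pgmpy.factors import Factor

# Hypothetical tables for P(A), P(C), and P(B | A, C)
phi_a = Factor(['A'], [2], [0.3, 0.7])
phi_c = Factor(['C'], [2], [0.6, 0.4])
phi_b_ac = Factor(['B', 'A', 'C'], [2, 2, 2],
                  [0.9, 0.5, 0.4, 0.1,
                   0.1, 0.5, 0.6, 0.9])

# The cluster potential of (A, B, C) is the product of the factors
# assigned to that cluster
psi_abc = phi_a * phi_c * phi_b_ac
print(psi_abc)
# P(B) for the second cluster would then follow by summing A and C out
# of psi_abc (for example, with its marginalize method).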

These steps can be summarized as follows (a networkx sketch of steps 1 to 4 follows the list):

  1. Triangulate the graph G induced by the set of factors Φ to create a chordal graph G*.
  2. Find the maximal cliques in G* and assign them as nodes to an undirected graph.
  3. Assign to the edge between two nodes of this undirected graph a weight equal to the number of elements in the sepset of the two nodes.
  4. Construct the clique tree using the maximum spanning tree algorithm.
  5. Compute the cluster potential for each cluster as the product of factors associated with the nodes present in the cluster.
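The construction steps can be sketched directly with networkx. The sketch below is an illustration, not pgmpy's internal implementation, and it assumes the networkx helpers complete_to_chordal_graph, find_cliques, and maximum_spanning_tree are available:

import itertools
import networkx as nx

# Moralized (undirected) graph of the late-for-school network
G = nx.Graph([('rain', 'traffic_jam'), ('accident', 'traffic_jam'),
              ('rain', 'accident'),                         # moral edge
              ('traffic_jam', 'long_queues'),
              ('traffic_jam', 'late_for_school'),
              ('getting_up_late', 'late_for_school'),
              ('traffic_jam', 'getting_up_late')])          # moral edge

# Step 1: triangulate to obtain a chordal graph
H, _ = nx.complete_to_chordal_graph(G)

# Step 2: the maximal cliques of the chordal graph become the nodes
cliques = [tuple(sorted(c)) for c in nx.find_cliques(H)]
cluster_graph = nx.Graph()
cluster_graph.add_nodes_from(cliques)

# Step 3: weight each candidate edge by the size of the sepset
for c1, c2 in itertools.combinations(cliques, 2):
    sepset = set(c1) & set(c2)
    if sepset:
        cluster_graph.add_edge(c1, c2, weight=len(sepset))

# Step 4: the clique tree is a maximum spanning tree of this graph
clique_tree = nx.maximum_spanning_tree(cluster_graph)
print(clique_tree.edges())

Since all three cliques of this network share only traffic_jam, every candidate edge has weight 1, so any spanning tree (including the one shown later in Fig 3.7) is a valid clique tree.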

In pgmpy, each graphical model class has a method called to_junction_tree(), which creates a clique tree (or junction tree) corresponding to the graphical model. Here's an example:

In [1]: from pgmpy.models import BayesianModel, MarkovModel
In [2]: from pgmpy.factors import TabularCPD, Factor

# Create a bayesian model as we did in the previous chapters
In [3]: model = BayesianModel(
                      [('rain', 'traffic_jam'),
                       ('accident', 'traffic_jam'),
                       ('traffic_jam', 'long_queues'), 
                       ('traffic_jam', 'late_for_school'),
                       ('getting_up_late', 'late_for_school')])
In [4]: cpd_rain = TabularCPD('rain', 2, [[0.4], [0.6]])
In [5]: cpd_accident = TabularCPD('accident', 2, [[0.2], [0.8]])
In [6]: cpd_traffic_jam = TabularCPD(
                            'traffic_jam', 2,
                            [[0.9, 0.6, 0.7, 0.1], 
                            [0.1, 0.4, 0.3, 0.9]],                                                                        
                            evidence=['rain', 'accident'],
                            evidence_card=[2, 2])
In [7]: cpd_getting_up_late = TabularCPD('getting_up_late', 2,
                                         [[0.6], [0.4]])
In [8]: cpd_late_for_school = TabularCPD(
                            'late_for_school', 2,                                                             
                            [[0.9, 0.45, 0.8, 0.1],                                             
                            [0.1, 0.55, 0.2, 0.9]],
                            evidence=['getting_up_late',                       
                                      'traffic_jam'],
                            evidence_card=[2, 2])
In [9]: cpd_long_queues = TabularCPD('long_queues', 2,                          
                                     [[0.9, 0.2],
                                      [0.1, 0.8]],
                                     evidence=['traffic_jam'],
                                     evidence_card=[2])
In [10]: model.add_cpds(cpd_rain, cpd_accident, 
                        cpd_traffic_jam, cpd_getting_up_late, 
                        cpd_late_for_school, cpd_long_queues)
In [11]: junction_tree_bm = model.to_junction_tree() 
In [12]: type(junction_tree_bm)
Out[12]: pgmpy.models.JunctionTree.JunctionTree

In [13]: junction_tree_bm.nodes()
Out[13]:
[('traffic_jam', 'getting_up_late', 'late_for_school'),
 ('traffic_jam', 'rain', 'accident'),
 ('traffic_jam', 'long_queues')]

In [14]: junction_tree_bm.edges()
Out[14]: 
[(('traffic_jam', 'long_queues'),
  ('traffic_jam', 'late_for_school', 'getting_up_late')),
 (('traffic_jam', 'long_queues'), ('traffic_jam', 'rain', 
'accident'))]

Tip

The to_junction_tree() method is available in the FactorGraph and MarkovModel classes as well.

Message passing

Let's go back to the previous example of the Bayesian network for a student being late for school:


Fig 3.6: Bayesian network for a student being late for school.

In the previous section, we saw how to construct a clique tree for this Bayesian network. The following figure shows the clique tree for this network:


Fig 3.7: Clique tree constructed from the Bayesian network presented in Fig 3.3

As we discussed earlier, in the belief propagation algorithm we consider a factor ψ_i to be a computational data structure that takes a message δ_{j→i} generated from another factor ψ_j, and produces a message δ_{i→k}, which can be further passed on to another factor ψ_k, and so on.

Let's go into the details of what each of these message terms means. Let's begin with a very simple example of finding the probability of being late for school (L). To compute the probability of L, we need to eliminate all the other random variables, namely accident (A), rain (R), traffic jam (J), getting up late (G), and long queues (Q). We can see that the variables A and R are present only in the cluster C_1 = {A, R, J}, and Q is present only in C_3 = {J, Q}, but J is present in all three clusters, namely C_1, C_2 = {J, L, G}, and C_3. So, both A and R can be eliminated from C_1 by just marginalizing its cluster potential ψ_1 with respect to A and R. Similarly, we can eliminate Q from C_3 by marginalizing ψ_3. However, to eliminate J, we can't just eliminate it from C_1, C_2, or C_3 alone. Instead, we need contributions from all three.

Eliminating the random variables A and R by marginalizing the cluster potential ψ_1 corresponding to C_1, we get the following:

δ_{1→2}(J) = ∑_{A,R} ψ_1(A, R, J)

Similarly, marginalizing the cluster potential ψ_3 corresponding to C_3 with respect to Q, we get the following:

δ_{3→2}(J) = ∑_{Q} ψ_3(J, Q)

Now, to eliminate J and G, we must use δ_{1→2}, δ_{3→2}, and ψ_2. Eliminating J and G, we get the following:

P(L) = ∑_{J,G} ψ_2(J, L, G) · δ_{1→2}(J) · δ_{3→2}(J)

From the perspective of message passing, we can see that C_1 produces an output message δ_{1→2}. Similarly, C_3 produces a message δ_{3→2}. These messages are then used as input messages for C_2 to compute the belief of the cluster C_2. The belief β_i of a cluster C_i is defined as the product of its cluster potential ψ_i with all the incoming messages to that cluster. Thus:

β_2(J, L, G) = ψ_2(J, L, G) · δ_{1→2}(J) · δ_{3→2}(J)

So, the equation we derived earlier,

P(L) = ∑_{J,G} ψ_2(J, L, G) · δ_{1→2}(J) · δ_{3→2}(J)

can be re-framed as follows:

P(L) = ∑_{J,G} β_2(J, L, G)

Fig 3.8 shows message propagation from the clusters C_1 and C_3 to the cluster C_2:


Fig 3.8: Message propagation from the clusters C_1 and C_3 to the cluster C_2
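The message computations above can be reproduced numerically. The following is a minimal sketch using plain numpy arrays built from the CPDs defined earlier; the reshapes assume a particular ordering of the CPD columns, so treat the exact numbers as illustrative. The point is the message-passing arithmetic, not pgmpy's conventions:

import numpy as np

p_r = np.array([0.4, 0.6])                                   # P(R)
p_a = np.array([0.2, 0.8])                                   # P(A)
p_j_ra = np.array([[0.9, 0.6, 0.7, 0.1],
                   [0.1, 0.4, 0.3, 0.9]]).reshape(2, 2, 2)   # P(J | R, A), indexed (J, R, A)
p_g = np.array([0.6, 0.4])                                   # P(G)
p_l_gj = np.array([[0.9, 0.45, 0.8, 0.1],
                   [0.1, 0.55, 0.2, 0.9]]).reshape(2, 2, 2)  # P(L | G, J), indexed (L, G, J)
p_q_j = np.array([[0.9, 0.2],
                  [0.1, 0.8]])                               # P(Q | J), indexed (Q, J)

# Cluster potentials
psi1 = np.einsum('r,a,jra->raj', p_r, p_a, p_j_ra)   # psi_1(R, A, J)
psi2 = np.einsum('g,lgj->jlg', p_g, p_l_gj)          # psi_2(J, L, G)
psi3 = p_q_j.T                                       # psi_3(J, Q)

# Messages into the root cluster C_2 = {J, L, G}
delta_1_2 = psi1.sum(axis=(0, 1))    # eliminate R and A; a function of J
delta_3_2 = psi3.sum(axis=1)         # eliminate Q; a function of J

# Belief of the root and the marginal P(L)
beta2 = psi2 * delta_1_2[:, None, None] * delta_3_2[:, None, None]
p_l = beta2.sum(axis=(0, 2))         # eliminate J and G
print(p_l, p_l.sum())                # the marginal sums to 1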

Now, let's consider another example, where we compute the probability of long queues (Q). We have to eliminate all the other random variables except Q. Using the same approach as discussed earlier, we first marginalize ψ_1 with respect to A and R, and compute δ_{1→2} as follows:

δ_{1→2}(J) = ∑_{A,R} ψ_1(A, R, J)

As discussed earlier, to eliminate the variable J, we need contributions from C_1, C_2, and C_3, so we can't simply eliminate J from C_1. The other two random variables L and G are only present in C_2, so we can easily eliminate them from C_2. However, to eliminate L and G from C_2, we can't simply marginalize ψ_2. We have to consider the contribution of δ_{1→2} (the outgoing message from C_1) as well, because J is present in both the clusters C_1 and C_2. Thus, eliminating L and G creates the message δ_{2→3} as follows:

δ_{2→3}(J) = ∑_{L,G} ψ_2(J, L, G) · δ_{1→2}(J)

Finally, we can eliminate J by marginalizing the following belief β_3 of C_3:

β_3(J, Q) = ψ_3(J, Q) · δ_{2→3}(J)

We can eliminate it as follows:

P(Q) = ∑_{J} β_3(J, Q)

Fig 3.9 shows message passing from C_1 to C_2 and from C_2 to C_3:


Fig 3.9: Message passing from C_1 to C_2 and from C_2 to C_3
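Continuing the numpy sketch above, the second example (with C_3 = {J, Q} as the root) only needs the extra message from C_2 to C_3:

delta_2_3 = (psi2 * delta_1_2[:, None, None]).sum(axis=(1, 2))   # eliminate L and G
beta3 = psi3 * delta_2_3[:, None]                                # belief of C_3
p_q = beta3.sum(axis=0)                                          # eliminate J
print(p_q, p_q.sum())                                            # sums to 1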

In the previous examples, we saw how to perform variable elimination in a clique tree. This algorithm can be stated in a more generalized form. We saw that variable elimination in a clique tree induces a directed flow of messages between the clusters present in it, with all the messages directed towards a single cluster, where the final computation is to be done. This final cluster can be considered as the root. In our first example, the root was C_2, while in the second example, it was C_3. The notions of directionality and root also create the notions of upstream and downstream. If C_i is on the path from C_j to the root, then we can say that C_i is upstream from C_j, and C_j is downstream from C_i:


Fig 3.10: C_i is upstream from C_j and C_j is downstream from C_i

We also saw in the second example that C_2 was not able to send a message to C_3 until it had received the message from C_1, as the generation of δ_{2→3} also depends on δ_{1→2}. This introduces the notion of a ready cluster. A cluster is said to be ready to transmit messages to its upstream cliques if it has received all the incoming messages from its downstream cliques.

The message δ_{j→i} from the cluster j to the cluster i can be defined as the factor formed by the following sum-product message passing computation:

δ_{j→i} = ∑_{C_j − S_{i,j}} ( ψ_j · ∏_{k ∈ Nb(j), k ≠ i} δ_{k→j} )

We can now define the belief β_i of a cluster C_i. It is defined as the product of all the incoming messages δ_{j→i} from its neighbors with its own cluster potential:

β_i = ψ_i · ∏_{j ∈ Nb(i)} δ_{j→i}

Here, i is upstream from each such neighbor j, that is, the messages flow from the downstream clusters towards i.

All these discussions for running the algorithm can be summarized in the following steps:

  1. Identify the root (this is the cluster where the final computation is to be made).
  2. Start with the leaf nodes of the tree. The output message of each leaf node can be computed by marginalizing its belief. The belief of a leaf node is simply its cluster potential, as it has no incoming messages.
  3. As and when the other clusters of the clique tree become ready, compute their outgoing messages and propagate them upstream.
  4. Repeat step 3 until the root node has received all the incoming messages (a runnable sketch of these steps follows this list).
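The generalized upward pass can be written as a short recursion. The sketch below reuses psi1, psi2, and psi3 from the numpy example above; the factor representation and helper names are choices of the sketch, not pgmpy API:

import numpy as np
import networkx as nx

def multiply(f1, f2):
    """Factor product of two (variables, table) factors."""
    (v1, t1), (v2, t2) = f1, f2
    out = tuple(dict.fromkeys(v1 + v2))                # union of scopes
    sym = {v: chr(97 + i) for i, v in enumerate(out)}  # one letter per variable
    spec = '{},{}->{}'.format(''.join(sym[v] for v in v1),
                              ''.join(sym[v] for v in v2),
                              ''.join(sym[v] for v in out))
    return out, np.einsum(spec, t1, t2)

def marginalize_out(factor, variables):
    """Sum the given variables out of a (variables, table) factor."""
    scope, table = factor
    keep = tuple(v for v in scope if v not in variables)
    axes = tuple(i for i, v in enumerate(scope) if v in variables)
    return keep, table.sum(axis=axes)

def message(tree, sender, receiver):
    """Sum-product message: combine the sender's potential with the
    messages from its other neighbors, then sum out everything that is
    not in the sepset."""
    belief = tree.nodes[sender]['potential']
    for neighbor in tree.neighbors(sender):
        if neighbor != receiver:
            belief = multiply(belief, message(tree, neighbor, sender))
    return marginalize_out(belief, set(sender) - set(receiver))

# The clique tree of Fig 3.7 as a chain C_1 - C_2 - C_3
C1, C2, C3 = ('R', 'A', 'J'), ('J', 'L', 'G'), ('J', 'Q')
tree = nx.Graph([(C1, C2), (C2, C3)])
tree.nodes[C1]['potential'] = (C1, psi1)
tree.nodes[C2]['potential'] = (C2, psi2)
tree.nodes[C3]['potential'] = (C3, psi3)

# Belief of the root C_2: its potential times all incoming messages
belief = tree.nodes[C2]['potential']
for neighbor in tree.neighbors(C2):
    belief = multiply(belief, message(tree, neighbor, C2))

# P(L) is obtained by summing J and G out of the root belief
print(marginalize_out(belief, {'J', 'G'}))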

Clique tree calibration

In the previous section, we discussed how to compute the probability of any variable using belief propagation. Now, let's look at the broader picture. What if we wanted to compute the probability of more than one random variable? For example, say we want to know the probability of long queues as well as a traffic jam. One naive solution would be to do a belief propagation in the clique tree by considering each cluster as a root. However, can we do better?

Consider the previous two examples we have discussed. The first one had C_2 as the root, while the other had C_3. We saw that in both cases, the message computed from the cluster C_1 to the cluster C_2 (that is, δ_{1→2}) was the same, irrespective of the root node. Generalizing this, we can conclude that the message δ_{i→j} from the cluster C_i to the cluster C_j will be the same as long as the root is on the C_j side of the edge, and vice versa. Thus, for a given edge in the clique tree between two clusters C_i and C_j, we have only two messages to compute, depending on the directionality of the edge (δ_{i→j} and δ_{j→i}). For a given clique tree with c clusters, we have (c − 1) edges between these clusters. Thus, we only need to compute 2(c − 1) messages.

As we have seen in the previous section, a cluster can propagate a message upstream as soon as it is ready, that is, when it has received all the incoming messages from downstream. So, we can compute both messages for each edge asynchronously. This can be done in two phases, one being an upward pass and the other being a downward pass. In the upward pass (Fig 3.11), we consider a cluster as a root and send all the messages to the root. Once the root has all the messages, we can compute its belief. For the downward pass (Fig 3.12), we can compute appropriate messages from the root to its children using its belief. This phase will continue until there is no message to be transmitted, that is, until we have reached the leaf nodes. This is shown in Fig 3.11:


Fig 3.11: Upward pass

Fig 3.11 shows an upward pass where cluster {E, F, G} is considered as the root node. All the messages from the other nodes are transmitted towards it.

The following figure shows a downward pass where the appropriate message from the root is transmitted to all the children. This will continue until all the leaves are reached:


Fig 3.12: Downward pass

When both the upward pass and the downward pass are completed, all adjacent clusters in the clique tree are said to be calibrated. Two adjacent clusters i and j are said to be calibrated when the following condition is satisfied:

∑_{C_i − S_{i,j}} β_i = ∑_{C_j − S_{i,j}} β_j

In a broader sense, it can then be said that the clique tree is calibrated. When a clique tree is calibrated, we have two types of beliefs: the first being cluster beliefs and the second being sepset beliefs. The sepset belief μ_{i,j} for a sepset S_{i,j} can be defined as follows:

μ_{i,j} = ∑_{C_i − S_{i,j}} β_i = ∑_{C_j − S_{i,j}} β_j
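Continuing the numpy sketch from earlier, the calibration condition can be checked numerically for the edge between C_1 = {A, R, J} and C_2 = {J, L, G}, whose sepset is {J}:

delta_2_1 = (psi2 * delta_3_2[:, None, None]).sum(axis=(1, 2))  # message C_2 -> C_1
beta1 = psi1 * delta_2_1[None, None, :]                         # belief of C_1
print(np.allclose(beta1.sum(axis=(0, 1)),      # beta_1 marginalized onto {J}
                  beta2.sum(axis=(1, 2))))     # beta_2 marginalized onto {J}: True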

Message passing with division

Until now, we have viewed message passing in the clique tree from the perspective of variable elimination. In this section, we will look at the implementation of message passing from a different perspective, that is, from the perspective of clique beliefs and sepset beliefs. Before we go into the details of the algorithm, let's discuss another important operation on factors, called factor division.

Factor division

A factor division between two factors φ_1(X, Y) and φ_2(Y), where X and Y are disjoint sets of variables, is defined as follows:

ψ(X, Y) = φ_1(X, Y) / φ_2(Y)

Here, we define 0/0 = 0 (and, as the example below shows, division by 0 is treated as 0 as well). This operation is similar to the factor product, except that we divide instead of multiplying. Further, unlike the factor product, we can't divide factors not having any common variables in their scope. For example, consider the following two factors:

The factor φ_1(a, b):

a     b     φ_1(a, b)
a_0   b_0   0
a_0   b_1   1
a_0   b_2   2
a_1   b_0   3
a_1   b_1   4
a_1   b_2   5

The factor φ_2(b):

b     φ_2(b)
b_0   0
b_1   1
b_2   2

Dividing φ_1 by φ_2, we get the following:

a     b     φ_1(a, b) / φ_2(b)
a_0   b_0   0
a_0   b_1   1
a_0   b_2   1
a_1   b_0   0
a_1   b_1   4
a_1   b_2   2.5

In pgmpy, factor division can be implemented as follows:

In [1]: from pgmpy.factors import Factor
In [2]: phi1 = Factor(['a', 'b'], [2, 3], range(6))
In [3]: phi2 = Factor(['b'], [3], range(3))
In [4]: psi = phi1 / phi2
In [5]: print(psi)
╒═════╤═════╤════════════╕
│ a   │ b   │   phi(a,b) │
╞═════╪═════╪════════════╡
│ a_0 │ b_0 │     0.0000 │
├─────┼─────┼────────────┤
│ a_0 │ b_1 │     1.0000 │
├─────┼─────┼────────────┤
│ a_0 │ b_2 │     1.0000 │
├─────┼─────┼────────────┤
│ a_1 │ b_0 │     0.0000 │
├─────┼─────┼────────────┤
│ a_1 │ b_1 │     4.0000 │
├─────┼─────┼────────────┤
│ a_1 │ b_2 │     2.5000 │
╘═════╧═════╧════════════╛

Let's go back to our original discussion regarding message passing using division. As we saw earlier, for any edge between the clusters C_i and C_j, we need to compute two messages, δ_{i→j} and δ_{j→i}. Let's assume that the first message was passed from C_i to C_j, that is, δ_{i→j}. A return message from C_j to C_i would then only be passed once C_j has received all the messages from its neighbors.

Once C_j has received all the messages from its neighbors, we can compute its belief β_j as follows:

β_j = ψ_j · ∏_{k ∈ Nb(j)} δ_{k→j}

In the previous section, we also saw that the message from C_j to C_i can be computed as follows:

δ_{j→i} = ∑_{C_j − S_{i,j}} ( ψ_j · ∏_{k ∈ Nb(j), k ≠ i} δ_{k→j} )

From the preceding mathematical formulation, we can deduce that the belief of C_j, that is, β_j, can't be used directly to compute the message from C_j to C_i, as it already includes the message from C_i to C_j in it:

β_j = δ_{i→j} · ( ψ_j · ∏_{k ∈ Nb(j), k ≠ i} δ_{k→j} )

That is:

∑_{C_j − S_{i,j}} β_j = δ_{i→j} · ∑_{C_j − S_{i,j}} ( ψ_j · ∏_{k ∈ Nb(j), k ≠ i} δ_{k→j} ) = δ_{i→j} · δ_{j→i}

Thus, from the preceding equation, we can conclude that the message from C_j to C_i can be computed by simply dividing the final belief of C_j, marginalized onto the sepset, by the message from C_i to C_j, that is, δ_{i→j}:

δ_{j→i} = ( ∑_{C_j − S_{i,j}} β_j ) / δ_{i→j}
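In the numpy sketch from earlier, this is easy to verify: dividing the marginalized belief of C_2 by the message it received from C_3 recovers exactly the message delta_2_3 that was computed directly:

print(np.allclose(beta2.sum(axis=(1, 2)) / delta_3_2, delta_2_3))   # True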

Finally, the message passing algorithm using this process can be summarized as follows:

  1. For each cluster C_i, initialize the cluster belief β_i as its cluster potential ψ_i, and initialize the sepset potential between each pair of adjacent clusters C_i and C_j, that is, μ_{i,j}, as 1.
  2. In each iteration, the cluster belief β_i is updated by multiplying it with the messages from its neighbors, and the sepset potential μ_{i,j} is used to store the previous message passed along the edge (i, j), irrespective of the direction of the message.
  3. Whenever a new message is passed along an edge, it is divided by the old message to ensure that we don't count this message twice (as we discussed earlier).

    Steps 2 and 3 can formally be stated in the following way for each iteration. First, the sending cluster's belief is marginalized onto the sepset:

    σ_{i→j} = ∑_{C_i − S_{i,j}} β_i
  4. Here, we marginalize the belief to get the message to be passed. However, as we discussed earlier, this message includes the previous message passed along the edge, so we divide it by the message stored in μ_{i,j}:
    σ_{i→j} / μ_{i,j}
  5. Update the belief of the receiving cluster by multiplying it with this message:
    β_j ← β_j · σ_{i→j} / μ_{i,j}
  6. Update the sepset belief:
    μ_{i,j} ← σ_{i→j}
  7. Repeat steps 2 and 3 until the tree is calibrated, that is, until for each pair of adjacent clusters (i, j):
    ∑_{C_i − S_{i,j}} β_i = ∑_{C_j − S_{i,j}} β_j = μ_{i,j}

As this algorithm updates the belief of a cluster using the beliefs of its neighbors, we call it the belief update message passing algorithm. It is also known as the Lauritzen-Spiegelhalter algorithm.

In pgmpy, this can be implemented as follows:

In [1]: from pgmpy.models import BayesianModel
In [2]: from pgmpy.factors import TabularCPD, Factor
In [3]: from pgmpy.inference import BeliefPropagation

# Create a bayesian model as we did in the previous chapters
In [4]: model = BayesianModel(
                      [('rain', 'traffic_jam'), 
                       ('accident', 'traffic_jam'),
                       ('traffic_jam', 'long_queues'), 
                       ('traffic_jam', 'late_for_school'),
                       ('getting_up_late', 'late_for_school')])

In [5]: cpd_rain = TabularCPD('rain', 2, [[0.4], [0.6]])
In [6]: cpd_accident = TabularCPD('accident', 2, [[0.2], [0.8]])
In [7]: cpd_traffic_jam = TabularCPD('traffic_jam', 2,
                                     [[0.9, 0.6, 0.7, 0.1],
                                      [0.1, 0.4, 0.3, 0.9]],                                                             
                                     evidence=['rain',
                                               'accident'],
                                     evidence_card=[2, 2])
In [8]: cpd_getting_up_late = TabularCPD('getting_up_late', 2,
                                         [[0.6], [0.4]])
In [9]: cpd_late_for_school = TabularCPD(
                         'late_for_school', 2,
                         [[0.9, 0.45, 0.8, 0.1], 
                          [0.1, 0.55, 0.2, 0.9]],
                         evidence=['getting_up_late','traffic_jam'],
                         evidence_card=[2, 2])
In [10]: cpd_long_queues = TabularCPD('long_queues', 2,
                                      [[0.9, 0.2], 
                                      [0.1, 0.8]],
                                      evidence=['traffic_jam'],
                                      evidence_card=[2])

In [11]: model.add_cpds(cpd_rain, cpd_accident, 
                        cpd_traffic_jam, cpd_getting_up_late,  
                        cpd_late_for_school, cpd_long_queues)

In [12]: belief_propagation = BeliefPropagation(model)

# To calibrate the clique tree, use calibrate() method
In [13]: belief_propagation.calibrate()

# To get cluster (or clique) beliefs use the corresponding getters
In [14]: belief_propagation.get_clique_beliefs()
Out[14]:
{('traffic_jam', 'late_for_school', 'getting_up_late'): <Factor representing phi(getting_up_late:2, late_for_school:2, traffic_jam:2) at 0x7f565ee0db38>,
 ('traffic_jam', 'long_queues'): <Factor representing phi(long_queues:2, traffic_jam:2) at 0x7f565ee0dc88>,
 ('traffic_jam', 'rain', 'accident'): <Factor representing phi(rain:2, accident:2, traffic_jam:2) at 0x7f565ee0d4a8>}

# To get the sepset beliefs use the corresponding getters 
In [15]: belief_propagation.get_sepset_beliefs()
Out[15]: {frozenset({('traffic_jam', 'long_queues'),
                    ('traffic_jam', 'rain', 'accident')}): <Factor 
representing phi(traffic_jam:2) at 0x7f565ee0def0>,
         frozenset({('traffic_jam', 'late_for_school', 
'getting_up_late'),
                   ('traffic_jam', 'long_queues')}): <Factor representing phi(traffic_jam:2) at 0x7f565ee0dc18>}

Querying variables that are not in the same cluster

In the previous section, we saw how to compute the probability of variables present in the same cluster. Now, let's consider a situation where we want to compute the joint probability of being late for school (L) and long queues (Q). These two variables are not present in the same cluster. One naive approach would be to force our clique tree to have these two variables in the same cluster; however, such a clique tree would no longer be optimal, which would negate the advantages of the belief propagation algorithm. The other approach is to perform variable elimination over the calibrated clique tree.

The algorithm for querying variables that are not present in the same cluster can be summarized as follows:

  1. Select a subtree T' of the calibrated clique tree T such that the query variables Y ⊆ Scope[T']. Let Φ be the set of factors on which variable elimination is to be performed. Select a cluster C_r of T' as the root node and add its belief β_r to Φ. Then, for each node C_i in T' except the root node, compute the following factor, where C_j is the upstream neighbor of C_i:
    β_i / μ_{i,j}
  2. Add it to Φ. Let Z be the set of random variables present in Scope[T'], except for the query variables. Perform variable elimination on the set of factors Φ with respect to the variables Z (a numerical sketch follows this list).
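Continuing the numpy sketch once more, a joint query over L and Q (which sit in different clusters) can be answered from the calibrated beliefs beta2 and beta3 by multiplying them, dividing by the sepset belief over {J}, and summing out the remaining variables:

mu_23 = beta3.sum(axis=1)                                # sepset belief over {J}
joint_23 = (beta2[:, :, :, None] * beta3[:, None, None, :]
            / mu_23[:, None, None, None])                # indexed (J, L, G, Q)
p_lq = joint_23.sum(axis=(0, 2))                         # sum out J and G
print(p_lq, p_lq.sum())                                  # P(L, Q); sums to 1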

In pgmpy, this can be implemented as follows:

In [15]: belief_propagation.query(
                         variables=['no_of_people'],                
                         evidence={'location': 1, 'quality': 1})
Out[15]: {'no_of_people': <Factor representing phi(no_of_people:2)
                                               at 0x7f565ee0def0>}

MAP inference

Until now, we have been doing conditional probability queries only on the model. However, sometimes, rather than knowing the probability of some given states of variables, we might be interested in finding the states of the variables corresponding to the maximum probability in the joint distribution. This type of problem often arises when we want to predict the states of variables in our model, which is our general machine learning problem. So, let's take the example of our restaurant model. Let's assume that for some restaurant we know of, the quality is good, the cost is low, and the location is good, and we want to predict the number of people coming to the restaurant. In this case, rather than querying for the probabilities of states of the number of people, we would like to query for the state that has the highest probability, given that the quality is good, the cost is low, and the location is good. Similarly, in the case of speech recognition, given a signal, we are interested in finding the most likely utterance rather than the probability of individual phonemes.

Putting the MAP problem more formally, we are given a distribution P_Φ defined by a set of factors Φ, along with its unnormalized measure P̃_Φ, and we want to find an assignment ξ whose probability is maximum:

ξ^MAP = argmax_ξ P̃_Φ(ξ)

In the earlier equation, we used the unnormalized measure to compute ξ^MAP, as it helps us avoid computing the full distribution; computing the partition function Z requires all the values of the distribution. Overall, the MAP problem is to find the assignment ξ for which P̃_Φ(ξ) is maximum.

A number of algorithms have been proposed to find the most likely assignment. Most of these use local maximum assignments and the graph structure to find the globally most likely assignment.

We define the max-marginal of a function f relative to a set of variables Y as follows:

MaxMarg_f(y) = max_{ξ : ξ⟨Y⟩ = y} f(ξ)

In simple words, MaxMarg_f(y) returns the unnormalized probability value of the most likely assignment that agrees with Y = y. Most of the algorithms work by first computing this set of local max-marginals, one per cluster, and then using them to compute the globally most likely assignment, as we will see in the next sections.
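As a small numerical illustration (continuing the numpy sketch with psi1, psi2, and psi3 from earlier, which here multiply to the normalized joint because they were built from CPDs), the MAP assignment and a max-marginal relative to {J} can be computed by brute force:

joint_all = (psi1[:, :, :, None, None, None] *   # indexed (R, A, J, L, G, Q)
             psi2[None, None, :, :, :, None] *
             psi3[None, None, :, None, None, :])

# MAP assignment: the single joint state with the highest probability
idx = np.unravel_index(joint_all.argmax(), joint_all.shape)
print(dict(zip(['R', 'A', 'J', 'L', 'G', 'Q'], idx)))

# Max-marginal relative to {J}: maximize over all the other variables
print(joint_all.max(axis=(0, 1, 3, 4, 5)))

The BP-based algorithms mentioned above compute such max-marginals cluster by cluster instead of enumerating the full joint, but the quantities they produce are the ones shown here.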
