Variable elimination

The previous example is quite impressive and the model seems complex. In the following sections we are going to see how to deal with such complex problems and how to perform inference on them, whatever the model is. In practice, we will see that things are not as idyllic as they seem and there are a few restrictions. Moreover, as we saw in the first chapter, exact inference is in general an NP-hard problem, which leads to algorithms with exponential worst-case time complexity.

Nevertheless there are dynamic programming algorithms that can be used to achieve a high degree of efficiency in many problems of inference.

We recall that inference means computing the posterior distribution of a subset of variables, given observed values of another subset of the variables of the model. Solving this problem in general means we can choose any two disjoint subsets.

Let X be the set of all the variables in the graphical model and let Y and E be two disjoint subsets of variables of X. We will use Y as the query subset, that is, the variables whose posterior distribution we want to know, and E as the observation subset, that is, the variables for which we have an observation, also called the evidence (hence the E).

Therefore the general form of a query is P(Y | E=e) = P(Y, e) / P(e), according to the Bayes theorem, as seen in Chapter 1, Probabilistic Reasoning. In fact, P(Y, e) can be considered as a function of Y such that f(y) = P(Y=y, E=e), that is, the probability of having Y=y and E=e at the same time.

Finally, we can define W = X \ (Y ∪ E), that is, the subset of variables in the graphical model that are neither considered for the query nor observed. Then we can compute P(Y, e) = Σ_w P(Y, E=e, W=w). Indeed, if we marginalize over the variables in W, we are left with P(Y, E).

If we apply the same reasoning, we can also compute the probability of the evidence P(E=e), for example P(e) = Σ_y P(Y=y, E=e).

Therefore the general mechanism of Bayesian inference is to fix the observed variables to their evidence values, marginalize out the unwanted variables, and normalize, so that we are left with a distribution over the query's variables.
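To make the mechanism concrete, here is a minimal sketch in Python (the toy joint distribution, its numbers, and the variable roles below are my own illustration, not part of the model discussed so far) that answers a query P(Y | E=e) by summing the nuisance variable out of a small joint table and normalizing:

import numpy as np

# Hypothetical joint P(Y, E, W) over three binary variables, stored with one axis
# per variable: axis 0 is the query Y, axis 1 the evidence E, axis 2 the nuisance W.
joint = np.array([[[0.06, 0.09], [0.15, 0.10]],
                  [[0.12, 0.18], [0.20, 0.10]]])

e = 1                                # observed value of E
p_y_e = joint[:, e, :].sum(axis=1)   # sum W out: P(Y, E=e)
p_e = p_y_e.sum()                    # P(E=e)
posterior = p_y_e / p_e              # P(Y | E=e), sums to 1
print(posterior)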

Let's take a simple example with the following graph:

[Figure: a chain graphical model with four nodes, A → B → C → D]

This graphical model encodes the probability distribution:

P(ABCD) = P(A).P(B|A).P(C|B).P(D|C)

It is a very simple chain that will serve the purpose of our presentation of the variable elimination algorithm. To each node of the graph, as we saw before, we associate a potential function, which in the case of a directed graphical model such as this one is simply the conditional probability of the variable given its parents: P(A), P(B|A), P(C|B) and P(D|C). While P(A) can be read directly from the graph's associated functions, P(B) has to be computed by marginalizing over A: P(B) = Σ_a P(A=a).P(B|A=a).
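As a quick illustration (the numbers are made up and NumPy is only one convenient way to write it), this marginalization is nothing more than a vector-matrix product:

import numpy as np

p_a = np.array([0.3, 0.7])            # P(A), with k = 2 values
p_b_given_a = np.array([[0.9, 0.1],   # P(B | A=a1)
                        [0.2, 0.8]])  # P(B | A=a2); rows indexed by A, columns by B
p_b = p_a @ p_b_given_a               # P(B) = Σ_a P(A=a).P(B|A=a)
print(p_b)                            # [0.41, 0.59]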

It looks simple but it can become a very computing-intensive operation to run (OK, not that much in this case, but you get the point). If |A| = k and |B| = m (that is, A is a variable with k possible values and B one with m values), doing the previous sum requires 2m.k - m operations. To understand this, we can write the sum:

P(B=b) = P(A=a1).P(B=b|A=a1) + P(A=a2).P(B=b|A=a2) + ... + P(A=ak).P(B=b|A=ak)

And this has to be computed for each of the m values of B.
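For instance, with purely illustrative sizes k = 4 and m = 3, each of the m = 3 values of B costs k = 4 multiplications and k - 1 = 3 additions, so the total is m.(2k - 1) = 2m.k - m = 21 operations.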

After doing this operation, we marginalized out A and we could say that we obtained an equivalent graphical model such as this one:

[Figure: the reduced chain B → C → D, where P(B) now summarizes the information from A]

Here, the distribution of B has been updated with the information from A. If we want to find the marginal distribution of C, we can apply the same algorithm again and obtain P(C), and once more to obtain P(D). So in the end, what we've done in order to obtain P(D) is the following summation:

P(D) = Σ_C Σ_B Σ_A P(A).P(B|A).P(C|B).P(D|C)

However, because each summation only involves the variables that actually appear in the terms being summed, we can rewrite the sum as:

P(D) = Σ_C P(D|C) Σ_B P(C|B) Σ_A P(A).P(B|A)

This considerably simplifies the computations because they have to be done on local distributions only. As an exercise, I will let the reader show that, for a graphical model representing a chain of n variables, each taking k possible values, the complexity of the computations goes from O(kⁿ) down to O(n.k²) only. Remember that the big O notation means the upper bound of the computation time is proportional to the formula in the parentheses of the O function (also called the worst-case time complexity). Obviously, this scheme is efficient.
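The following sketch checks this on the chain A → B → C → D with random toy conditional probability tables (the helper random_cpt, the seed, and k = 3 are arbitrary choices of mine): it computes P(D) once by building the full joint and summing, and once by eliminating A, B, and C from left to right.

import numpy as np

k = 3                                          # every variable takes k values
rng = np.random.default_rng(0)

def random_cpt(rows, cols):
    # Random conditional table with each row summing to 1.
    t = rng.random((rows, cols))
    return t / t.sum(axis=1, keepdims=True)

p_a = rng.random(k); p_a /= p_a.sum()          # P(A)
p_b_a = random_cpt(k, k)                       # P(B | A)
p_c_b = random_cpt(k, k)                       # P(C | B)
p_d_c = random_cpt(k, k)                       # P(D | C)

# Naive scheme: build the full joint P(A,B,C,D) (k**4 entries) and sum A, B, C out.
joint = (p_a[:, None, None, None] * p_b_a[:, :, None, None]
         * p_c_b[None, :, :, None] * p_d_c[None, None, :, :])
p_d_naive = joint.sum(axis=(0, 1, 2))

# Variable elimination: sum A out, then B, then C, reusing each intermediate result.
p_b = p_a @ p_b_a                              # Σ_A P(A).P(B|A)
p_c = p_b @ p_c_b                              # Σ_B P(B).P(C|B)
p_d = p_c @ p_d_c                              # Σ_C P(C).P(D|C)

print(np.allclose(p_d, p_d_naive))             # True: both orderings agree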

The main scheme that appears in this example is that we can sum variables out and then reuse the result for the next step. Ideally, we want to apply the same idea to any graph and eliminate variables step by step by caching intermediate results to use them again. This can be achieved because, thanks to the structure of the graphical model, each sub-expression in the summation only depends on a few variables and because we can cache the results along the path in the graph.
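As a rough sketch of that general mechanism, the code below represents each potential as a (variables, table) pair and eliminates one variable at a time by multiplying the factors that mention it and summing it out; the multiply and eliminate helpers are my own minimal versions for illustration, not a standard library API.

from functools import reduce
import numpy as np

def multiply(f1, f2):
    # Pointwise product of two factors over the union of their variables.
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = list(dict.fromkeys(list(vars1) + list(vars2)))
    def align(vs, t):
        # Permute t's axes to follow out_vars and insert size-1 axes for missing variables.
        perm = [vs.index(v) for v in out_vars if v in vs]
        shape = [t.shape[vs.index(v)] if v in vs else 1 for v in out_vars]
        return np.transpose(t, perm).reshape(shape)
    return out_vars, align(list(vars1), t1) * align(list(vars2), t2)

def eliminate(factors, var):
    # Multiply every factor mentioning var, sum var out, and keep the cached result.
    touched = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    vs, t = reduce(multiply, touched)
    return rest + [([v for v in vs if v != var], t.sum(axis=vs.index(var)))]

# The chain A -> B -> C -> D again, with made-up conditional probability tables.
factors = [(["A"], np.array([0.6, 0.4])),
           (["A", "B"], np.array([[0.7, 0.3], [0.1, 0.9]])),
           (["B", "C"], np.array([[0.8, 0.2], [0.5, 0.5]])),
           (["C", "D"], np.array([[0.9, 0.1], [0.3, 0.7]]))]

for v in ["A", "B", "C"]:            # elimination order
    factors = eliminate(factors, v)

print(factors)                       # a single factor is left: (['D'], P(D))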
