Principles of the EM algorithm

Because the latent variables are not observed, the likelihood function of such a model is a marginal distribution in which we have to sum out (or integrate out) the hidden variables. This marginalization creates dependencies between the variables and makes the problem harder to solve.

The EM algorithm deals with this problem essentially by filling in the missing data with their expected values, given a distribution. When we iterate this process over and over, it converges to the maximum likelihood solution. This filling-in is achieved by computing the posterior probability distribution of the hidden variables given the current set of parameters and the observed variables. This is what is done in the E-step (E for Expectation). In the M-step (M for Maximization), the parameters of the model are adjusted, and we iterate again with a new E-step. We go on until we see a convergence in the parameters, or a convergence in the growth of the likelihood. Moreover, the EM algorithm guarantees that, after each EM step, the likelihood of the model cannot go down. This means that, when the likelihood only increases by a tiny amount, we can conclude that the algorithm has converged to a (local) maximum and we can stop. The tiny amount depends on the application, but it's not unusual to stop when the likelihood hasn't moved by more than 10^-2 to 10^-4. It's just a rule of thumb, and the reader is encouraged to plot the likelihood curve to better understand the EM algorithm's behavior for his or her particular case.
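
To make this stopping rule concrete, here is a minimal sketch of the outer EM loop in R. The functions e_step, m_step, and log_likelihood are hypothetical placeholders for model-specific routines (they are not defined in this chapter), and the 1e-4 threshold is just the rule of thumb mentioned above.

    # Generic EM driver: alternate E and M steps until the log-likelihood
    # stops increasing by more than a small threshold.
    # e_step(), m_step() and log_likelihood() are hypothetical placeholders
    # for model-specific routines.
    run_em <- function(data, theta0, tol = 1e-4, max_iter = 100) {
      theta  <- theta0
      old_ll <- -Inf
      ll     <- -Inf
      for (iter in seq_len(max_iter)) {
        q     <- e_step(data, theta)       # posterior over the hidden variables
        theta <- m_step(data, q)           # re-estimate the parameters
        ll    <- log_likelihood(data, theta)
        if (ll - old_ll < tol) break       # the likelihood never decreases
        old_ll <- ll
      }
      list(theta = theta, loglik = ll, iterations = iter)
    }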

Derivation of the EM algorithm

Let's assume we have a dataset D of N iid points, and that the parameters of the graphical model are called θ. At this point, if you're wondering what this θ is, it is a big variable containing all the parameters for every variable of the graphical model. As it would be tedious to write out such a long collection, we simply gather them into a single high-dimensional variable.

Let D = {x1, …, xN} and θ ∈ R^d. The likelihood of the graphical model is defined by:

L(θ; D) = p(D | θ) = ∏n p(xn | θ),    and so    log L(θ; D) = Σn log p(xn | θ)

In the complete case, where all the variables are observed, the likelihood function can be decomposed as follows:

log L(θ; D) = Σn Σi log p(xin | pa(Xi), θi)

where Xi is the i-th node of the graph, xin its value in the n-th data point, and pa(Xi) the set of its parents.

We find again the result we saw before: in the case of a graphical model, the log-likelihood can be rewritten as the sum of the local log-likelihoods, one for each variable, that is, for each node in the graph.
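
As a small illustration of this decomposition, the following sketch computes the complete-data log-likelihood of a two-node network B → A as the sum of the local log-likelihoods of nodes B and A; the data frame and the two distributions are made-up values for illustration only.

    # Complete-data log-likelihood of a two-node network B -> A,
    # computed as the sum of the local log-likelihoods of each node.
    # All numbers below are made up for illustration.
    d <- data.frame(A = c("a1", "a2", "a1", "a1"),
                    B = c("b1", "b1", "b2", "b2"),
                    stringsAsFactors = FALSE)
    p_B  <- c(b1 = 0.6, b2 = 0.4)                     # P(B)
    p_AB <- matrix(c(0.7, 0.3,                        # P(A | B = b1)
                     0.2, 0.8),                       # P(A | B = b2)
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("b1", "b2"), c("a1", "a2")))
    ll_B <- sum(log(p_B[d$B]))                        # local term for node B
    ll_A <- sum(log(p_AB[cbind(d$B, d$A)]))           # local term for node A
    ll_B + ll_A                                       # full log-likelihood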

However, when we have hidden variables, we can't have this nice result. Let's call the observed variables x and the hidden variables y. The log-likelihood of the model can be written as:

L(θ) = log p(x | θ) = log Σy p(x, y | θ)

Here, X = {x, y} is the set of all the variables. And here is our main problem: the sum inside the log function is not nice to compute. In fact, in order to obtain the likelihood function, we have to marginalize out the hidden variables y. Having this sum inside the log function can potentially make all the variables inside it dependent on each other, so we would lose the benefit of the nice factorization given by the graphical model. In the end, computing the likelihood would become intractable.

But if we take any distribution q(y) over the hidden variables, we can use it to define a lower bound on the log-likelihood function. Let's see why:

L(θ) = log Σy p(x, y | θ)                                      (1)
     = log Σy q(y) · [ p(x, y | θ) / q(y) ]                    (2)
     ≥ Σy q(y) log [ p(x, y | θ) / q(y) ]                      (3)
     = Σy q(y) log p(x, y | θ) − Σy q(y) log q(y)              (4)

This reasoning needs a bit of explanation now, line by line:

  1. This is the standard definition of the log-likelihood over x where we marginalized out the hidden variables y.
  2. Here we introduce q(y) in the numerator and the denominator so that they cancel each other out.
  3. Thanks to this, we can apply Jensen's inequality to obtain a lower bound. The right-hand side formula is the lower bound on L(θ).
  4. The lower bound is expanded into two terms, where the rightmost term is independent of θ and x.

So in the end, this new function F(q, θ) is a lower bound on the log-likelihood: F(q, θ) ≤ L(θ). It is the expected complete log-likelihood under q plus a term independent of θ, the entropy of q (the rightmost term in line 4).
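
A quick numerical check of this inequality, with made-up numbers, may help convince you: for any choice of q, the quantity on line 3 never exceeds the log of the marginal.

    # Sanity check of F(q, theta) <= L(theta) on a made-up example with a
    # single observation x and a hidden variable y taking three values.
    p_xy <- c(0.10, 0.25, 0.05)        # p(x, y | theta) for y = 1, 2, 3
    L    <- log(sum(p_xy))             # L(theta) = log p(x | theta)
    q    <- c(0.2, 0.5, 0.3)           # any distribution over y
    F_q  <- sum(q * log(p_xy / q))     # the lower bound (line 3 above)
    F_q <= L                           # TRUE; equality only when q = p(y | x, theta)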

The way the EM algorithm works is by alternating optimizations in two steps:

  • E-step: qk = argmaxq F(q, θk-1)
  • M-step: θk = argmaxθ F(qk, θ)

The algorithm is usually initialized with a random set of parameters: θ0.

What this algorithm does is first find a new distribution over the hidden variables q(y) given the current set of parameters θk-1, and then find the maximum likelihood estimate of the parameters θk using this distribution q. In step 1, the E-step, the maximum over q is obtained by setting qk(y) = p(y | x, θk-1). And, at this point, the lower bound becomes an equality:

F(qk, θk-1) = L(θk-1)

This result is very important because it guarantees that the likelihood can only go up or stay the same, step after step. It means that, using the current parameters θk-1, we infer the distribution p(y | x, θk-1) of the hidden variables given the observed variables. Any inference algorithm can be used at this stage, like those we saw in the previous chapter. Moreover, this creates an expected complete set of observations given the current parameters.
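
To see why the bound becomes tight (a small calculation, using only the definitions above), substitute qk(y) = p(y | x, θk-1) into F and use the fact that p(x, y | θ) / p(y | x, θ) = p(x | θ):

F(qk, θk-1) = Σy p(y | x, θk-1) log [ p(x, y | θk-1) / p(y | x, θk-1) ]
            = Σy p(y | x, θk-1) log p(x | θk-1)
            = log p(x | θk-1)
            = L(θk-1)

where the last step uses Σy p(y | x, θk-1) = 1.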

The maximum in the M-step is obtained by maximizing the first term in line 4 of the previous derivation; that is, the expected log-likelihood under the distribution q, the one we just computed.

So, at the start of each loop of the EM algorithm, the E-step leaves θ unchanged and makes F = L, and the M-step can only increase F. We therefore know that an EM iteration will never decrease the likelihood: the log-likelihood will only increase or stay the same. In practice, we will usually see the log-likelihood converge. When the difference between two iterations is very small, we can stop the algorithm and consider the current solution to be good enough.

Applying EM to graphical models

In practice, we will consider a graphical model with discrete variables, as we have been doing so far. For example, let's say that somewhere in the graph we have two variables A and B, such that B is the parent of A. So locally we have the distribution P(A | B) associated with node A.

We recall that the maximum likelihood estimate θA|B is computed by:

θa|b = N(A = a, B = b) / N(B = b)

where N(·) is the number of observations in the dataset matching the given configuration.

This is what we saw and implemented before in R. So far, nothing new. Remember, we used the ddply function to efficiently compute this in R in one call. You could also use the aggregate function to obtain the same result.
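
As a reminder of what this computation looks like in base R, here is a minimal sketch using aggregate on a made-up data frame with columns A and B (the data and column names are illustrative, not taken from the earlier chapters):

    # Complete-data maximum likelihood estimate of P(A | B).
    # The data frame below is made up for illustration.
    d <- data.frame(A = c("a1", "a2", "a1", "a1", "a2"),
                    B = c("b1", "b1", "b2", "b2", "b2"),
                    stringsAsFactors = FALSE)
    counts <- aggregate(rep(1, nrow(d)), by = list(A = d$A, B = d$B), FUN = sum)
    names(counts)[3] <- "n"
    # theta_{a|b} = N(A = a, B = b) / N(B = b): normalize within each value of B
    counts$theta <- ave(counts$n, counts$B, FUN = function(n) n / sum(n))
    counts
    # Equivalently: prop.table(table(A = d$A, B = d$B), margin = 2)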

But this formula is only valid when A and B are fully observed, and here they are not. Fortunately, using the EM algorithm to overcome this problem is very simple.

In this case, the M-step is:

θa|b = Σn p(A = a, B = b | xn, θk-1) / Σn p(B = b | xn, θk-1)

where the sums run over the N data points.

But how do we obtain these two probability distributions? We obtain them in the E-step, using the observed variables and our preferred inference algorithm. As for the parameters used for A and B, they are the parameters from the previous step of the EM algorithm.
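
Putting the two ideas together, the sketch below shows what this update could look like in R. The array post_AB is assumed to hold, for every data point, the posterior p(A, B | xn, θk-1) returned by whichever inference algorithm you use in the E-step; this interface is hypothetical and only meant to illustrate the expected-count computation.

    # M-step update of theta_{a|b} from expected counts.
    # post_AB is assumed to be an N x |A| x |B| array with
    # post_AB[n, a, b] = p(A = a, B = b | x_n, theta_{k-1}),
    # computed by the inference algorithm during the E-step.
    m_step_A_given_B <- function(post_AB) {
      expected_AB <- apply(post_AB, c(2, 3), sum)   # sum_n p(A = a, B = b | x_n)
      expected_B  <- colSums(expected_AB)           # sum_n p(B = b | x_n)
      sweep(expected_AB, 2, expected_B, "/")        # theta_{a|b}, columns sum to 1
    }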

In conclusion, in plain English, we recall the steps of the EM algorithm:

  1. Initialize the graphical model with random parameters. Just be sure that each distribution sums to 1, of course. Random parameters seem to give better results than a uniform distribution, but I'm just giving you a practical tip here.
  2. Until convergence of the log-likelihood, do the following:
    • E-step: Compute the posterior distribution, using your preferred inference algorithm, of all the hidden variables. This is the q distribution.
    • M-step: Compute the new set of parameters of the graphical model using the inferred distribution we saw before.
    • Update the log-likelihood and check whether it converged, usually by checking whether the difference between the current likelihood and the previous one is smaller than a predefined threshold.

And so the M-step acts as if the hidden variables were observed, by using the expected distribution over them.
