The Gaussian mixture model

The Gaussian mixture model is the first example of a latent variable model we will see. Latent variables, also called hidden variables, are variables that are present in the model but never observed.

The notion of using unobserved variables can be surprising at first: how can we estimate the parameters of the distribution of a variable we never observe? In fact, we might even wonder what the real meaning of such a latent variable is.

For example, let's say we observe data represented by a group of random variables. These data points tend to aggregate into clusters depending on their underlying meaning. For instance, we could observe physiological traits of animals and group the data points by species, such as dogs, cats, cows, and so on. If we think in terms of a generative model, we can say that, by choosing a group such as pony, we will observe features that are specific to this group and not to another group such as cats. However, none of the physiological variables carries an explicit reference to the fact that it comes from the pony group or the cat group.

This distinction only exists because we want to group things together; the groups themselves are not part of the real world. However, grouping data like this, by pony or cat features, helps a lot in understanding animals in general. This is the kind of grouping we want to capture with a latent variable.

Using latent variables in a model can be problematic at first because there is no data to estimate their distribution. However, we saw before that algorithms such as EM can be very helpful when it comes to solving this task.

Moreover, we can simplify the model by introducing some conditional independence between features or a hierarchy between variables, thus making the model easier to interpret or easier to compute.

The Gaussian mixture model is a latent variable model mainly used for density estimation problems. Roughly speaking, the main assumption is that a random process will first choose a Gaussian at random, according to a multinomial distribution, and then (according to this selected Gaussian distribution) choose a data point at random. It is a very simple two-step process, and it also simplifies the problem of estimating a complex dataset distribution: instead of searching for one very complex distribution, the model approximates it with a set of simple Gaussian distributions, joined together by the latent variable. It is similar to a divide-and-conquer approach.
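To make this two-step generative process concrete, here is a minimal R sketch of how a single data point could be drawn from such a mixture (all parameter values are purely illustrative; mvrnorm from the MASS package is used to sample from a multivariate Gaussian):

library(MASS)   # provides mvrnorm

# illustrative parameters of a two-component mixture in two dimensions
lambda <- c(0.3, 0.7)                                     # mixing proportions
mu     <- list(c(0, 0), c(4, 4))                          # component means
sigma  <- list(diag(2), matrix(c(1, 0.5, 0.5, 1), 2, 2))  # component covariances

# step 1: choose a component at random according to the multinomial distribution
z <- sample(1:2, size = 1, prob = lambda)

# step 2: choose a data point at random from the selected Gaussian
x <- mvrnorm(1, mu[[z]], sigma[[z]])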

Definition

In this model we will call X the variable we can observe and we will call Z a multinomial random variable that is hidden. The model is defined by the following probability distribution:

p(X \mid \Theta) = \sum_{i=1}^{K} \pi_i \, p(X \mid Z_i = 1, \theta_i)

Here, πi are the mixing proportions and p(X | Zi = 1, θi) are the mixture components, where K is the number of components in the mixture. In this formula, because Z is a multinomial random variable, we have p(Zi = 1 | πi) = πi, where Zi is the ith component of Z.

Finally, Θ is the set of all parameters of the model and θi are the parameters of the ith mixture component, that is, of the distribution of X given Zi = 1.

When X has a Gaussian distribution, we can write the preceding distribution as:

p(X \mid \Theta) = \sum_{i=1}^{K} \pi_i \, \mathcal{N}(X \mid \mu_i, \Sigma_i)

Here, μi and Σi are the mean vectors and covariance matrices of the Gaussian components of the variable X. If we expand the formula (writing d for the dimension of X), we obtain:

p(X \mid \Theta) = \sum_{i=1}^{K} \frac{\pi_i}{(2\pi)^{d/2} \, |\Sigma_i|^{1/2}} \exp\left( -\tfrac{1}{2} (X - \mu_i)^T \Sigma_i^{-1} (X - \mu_i) \right)

This result looks dense, so it is time to draw the corresponding probabilistic graphical model and see the same structure in a simpler form:

[Figure: graphical model of the Gaussian mixture, with the latent node Z (drawn in white) pointing to the observed node X]

This represents the probability distribution P(X,Z) where, this time, the Z node is white to indicate that it is hidden or latent. If we marginalize the variable Z out of this model to obtain the distribution of X, we obtain exactly the previous formula. Also note that, in this model, X given Z has a multivariate Gaussian distribution.
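To see what this density looks like numerically, here is a minimal sketch in R, reusing the illustrative two-component parameters from the previous sketch and assuming the mvtnorm package is available for dmvnorm, the multivariate Gaussian density:

library(mvtnorm)   # provides dmvnorm, the multivariate Gaussian density

# mixture density p(x | Theta): a weighted sum of Gaussian densities
dmix <- function(x, lambda, mu, sigma) {
  sum(sapply(seq_along(lambda),
             function(i) lambda[i] * dmvnorm(x, mu[[i]], sigma[[i]])))
}

dmix(c(1, 1), lambda, mu, sigma)   # density of the mixture at the point (1, 1)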

From this distribution it is also simple to compute the posterior of Z given X. This is a value of interest: we want to know the probability of Z being in state i after observing X; in other words, it tells us which component the observed data point is likely to come from. Indeed, let's first draw a mixture of three Gaussians in R with the following code. This time we will use the package called mixtools (and mvrnorm from the MASS package to generate the data):

library(MASS)    # provides mvrnorm to sample from a multivariate Gaussian

N <- 400

# three clusters of N points each, drawn from two-dimensional Gaussians
X <- list(
  mvrnorm(N, c(1,1), matrix(c(1,-0.5,-0.5,1),2,2)/4),
  mvrnorm(N, c(3,3), matrix(c(2,0.5,0.5,1),2,2)/4),
  mvrnorm(N, c(5,5), matrix(c(1,-0.5,-0.5,4),2,2)/4))

# empty plot, then one symbol and color per cluster
plot(0,0,xlim=c(-1,7),ylim=c(-1,7),type='n')

for(i in 1:3)
  points(X[[i]], pch=18+i, col=1+i)

This little program simply generates three sets of data from two-dimensional multivariate Gaussian distributions. It plots the three datasets on the same graph and, as expected, the points form three clusters, as shown in the next figure. If we regroup the three small datasets into one big dataset, an interesting problem is to find the parameters of the three clusters. In this example, we have an ideal situation because we generated an equal number of points for each cluster, but in real applications this will rarely be the case.

[Figure: scatter plot of the three simulated clusters, one symbol and color per cluster]

The posterior probability of the hidden variable Z can be written as:

p(Z_i = 1 \mid X, \Theta) = \frac{\pi_i \, \mathcal{N}(X \mid \mu_i, \Sigma_i)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(X \mid \mu_j, \Sigma_j)}

This is again a simple application of Bayes' rule.
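As a minimal sketch, and with the same illustrative two-component parameters and dmvnorm function as before, this posterior (often called the responsibility of each component) can be computed as follows:

# posterior p(Zi = 1 | x, Theta): Bayes' rule applied to the mixture
responsibilities <- function(x, lambda, mu, sigma) {
  num <- sapply(seq_along(lambda),
                function(i) lambda[i] * dmvnorm(x, mu[[i]], sigma[[i]]))
  num / sum(num)   # normalize so that the probabilities sum to one
}

responsibilities(c(1, 1), lambda, mu, sigma)   # probability of each component given the point (1, 1)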

The next step is to estimate the parameters Θ of the model, assuming again that the data is i.i.d. If we call D = {xn} the dataset, we can, as before, write the log-likelihood of the model:

\log p(D \mid \Theta) = \sum_{n=1}^{N} \log \left( \sum_{i=1}^{K} \pi_i \, \mathcal{N}(x_n \mid \mu_i, \Sigma_i) \right)

This log-likelihood is hard to optimize directly, so we need an adequate optimization algorithm. As mentioned before, the fact that a variable is hidden leads us to use the EM algorithm.
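Before turning to EM, note that the log-likelihood itself is easy to evaluate for a fixed set of parameters. Here is a minimal sketch, where x is a matrix with one observation per row and dmvnorm comes from the mvtnorm package as in the previous sketches:

# log-likelihood of a Gaussian mixture for a dataset x (one observation per row)
loglik <- function(x, lambda, mu, sigma) {
  sum(apply(x, 1, function(xn) {
    log(sum(sapply(seq_along(lambda),
                   function(i) lambda[i] * dmvnorm(xn, mu[[i]], sigma[[i]]))))
  }))
}

The hard part is not evaluating this quantity but maximizing it over Θ, which is exactly what EM does.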

From the previous example, we will assume that the variable Z has three states {z1, z2, z3}, one for each of the three Gaussian components. The only constraint in the Gaussian mixture model is that one has to assume the number of Gaussians beforehand. Other models exist in which the number of Gaussian components is also a random variable and the learning algorithm tries to discover the most probable number of components while at the same time finding the mean and covariance matrix of each component.

Here we use the same code as previously:

library(mixtools)
library(MASS)    # for mvrnorm

N <- 400

X <- list(
  mvrnorm(N, c(1,1), matrix(c(1,-0.5,-0.5,1),2,2)/4),
  mvrnorm(N, c(3,3), matrix(c(2,0.5,0.5,1),2,2)/4),
  mvrnorm(N, c(5,5), matrix(c(1,-0.5,-0.5,4),2,2)/4))
x <- do.call(rbind, X)                      # transform X into a matrix
model2 <- mvnormalmixEM(x, verb=TRUE)       # default number of components (k=2)
model3 <- mvnormalmixEM(x, k=3, verb=TRUE)  # three components

It will take some time to compute the results. The parameter verb=TRUE displays the result of each iteration of the EM algorithm. What is interesting to look at is the log-likelihood. In the first case (model2), the log-likelihood goes from approximately -3711 to -3684 in 27 steps. Your results might be different because, remember, we generate the dataset at random using mvrnorm.

The problem with model2 is that the number of components is taken by default to be 2 (you can run help(mvnormalmixEM) in R and look at the parameter k), and we know we have three components in this mixture. However, model3 has k=3 components, matching the real dataset, and obviously the log-likelihood will be better, that is, closer to zero. It goes from approximately -3996 to -3305 in 41 iterations (again, it might be slightly different on your computer). So it seems the convergence has been far better in the second case, when we assume the correct number of components.
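To compare the two fits numerically without reading the whole iteration trace, we can look at the log-likelihood values stored in the fitted objects (mvnormalmixEM returns the final value in loglik and the per-iteration values in all.loglik; the exact numbers will differ from run to run because the dataset is random):

model2$loglik               # final log-likelihood with two components
model3$loglik               # final log-likelihood with three components
length(model3$all.loglik)   # number of log-likelihood values recorded during EM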

We can now plot the log-likelihood evolution of the EM algorithm to understand the difference between the two models:

plot(model2, xlim=c(0,50), ylim=c(-4000,-3000))
par(new=TRUE)   # superimpose the next plot on the same axes
plot(model3, lty=3, xlim=c(0,50), ylim=c(-4000,-3000))

Note

Note that, by fixing the limits of the graph, we can easily superimpose the two plots. The dashed line is the graph corresponding to the model with three components. It is clear that its log-likelihood gets closer to zero as the algorithm progresses; however, it takes more iterations to reach this result.

[Figure: log-likelihood per EM iteration for model2 (solid line) and model3 (dashed line)]

By looking at the results in model3, we can have a better understanding of the model that has been found by the EM algorithm:

model3$lambda
[1] 0.3358283 0.3342840 0.3298877

The proportions of each component are, as expected, very close to our initial proportions. You can change the number of points for each component and check again:

X <- list(
    mvrnorm(100, c(1,1), matrix(c(1,-0.5,-0.5,1),2,2)/4),
    mvrnorm(200, c(3,3), matrix(c(2,0.5,0.5,1),2,2)/4),
    mvrnorm(300, c(5,5), matrix(c(1,-0.5,-0.5,4),2,2)/4))
x <- do.call(rbind,X)

And then let's rerun it with mixtools:

model3.2 <- mvnormalmixEM(x, k=3,verb=TRUE)

We can see the log-likelihood going from approximately -1925 to -1691 in 84 iterations. But now the proportions are 0.3378457, 0.1651263, and 0.4970280, which indeed correspond, up to the ordering of the components, to the proportions we initially set in our toy dataset (200/600, 100/600, and 300/600).
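A quick way to check this is to compare the sorted estimated proportions with the proportions used to generate the toy dataset (100, 200, and 300 points out of 600):

sort(model3.2$lambda)     # estimated mixing proportions, in increasing order
c(100, 200, 300) / 600    # true proportions: 0.1667, 0.3333, 0.5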

Again we can check the other parameters, and see they are similar to those we set up in our dataset. In a real-world application, we don't have any idea of the location and covariance of each component, of course. But this example shows that the EM algorithm usually converges to the desired values:

model3$mu
[[1]]
[1] 3.025684 3.031763
[[2]]
[1] 0.9854002 1.0289426
[[3]]
[1] 4.989129 5.076438
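The estimated covariance matrices can be inspected in the same way, through the sigma field of the fitted object, and compared, up to the ordering of the components, with the matrices we passed to mvrnorm:

model3$sigma                       # list of estimated covariance matrices
matrix(c(1,-0.5,-0.5,1),2,2)/4     # covariance used for the cluster centered at (1,1)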

Now it's time to look at the result more graphically to really understand which components have been found by the EM algorithm.

First of all, we plot model3 and display its three components with the command plot(model3, which=2):

[Figure: the three Gaussian components found by the EM algorithm for model3, drawn over the data points]

Then we display, for comparison purposes, model2 and model3.2:

[Figure: the two Gaussian components found for model2]

The following figure shows model3.2:

[Figure: the three Gaussian components found for model3.2, fitted on the dataset with unequal cluster sizes]
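For reference, the two comparison figures above can be reproduced by applying the same density plot to each fitted model, exactly as we did for model3:

plot(model2, which=2)     # the two components found by model2
plot(model3.2, which=2)   # the three components found by model3.2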

And now we conclude from our observations:

  • model3 and model3.2 are extremely similar, as expected.
  • model2, for which we chose to have only two components, seems to have made an acceptable choice of components. Indeed, the two bottom clusters have almost the same orientation, so the algorithm converged toward a solution in which one Gaussian embeds the two bottom clusters and another Gaussian embeds the top one, which has a different orientation. It is a good result.