Maximum likelihood

This section introduces a simple algorithm to learn all the parameters of the graphical models we have seen so far. In the first section, we had our first experience of learning such a model, and we concluded that the parameters can be learned locally for each variable. That means that, for each variable x with parents pa(x) in the graph, and for each combination of the values of pa(x), we compute the frequencies of each value of x. If the dataset is complete enough, this procedure yields the maximum likelihood estimate of the parameters of the graphical model.

For each variable x in the graphical model, and for each combination c of the values of the parents pa(x) of x:

  • Extract all the data points corresponding to the values in c
  • Compute a histogram Hc of the values of x over these data points
  • Normalize Hc and assign it as the maximum likelihood estimate of P(x | pa(x) = c)

Is that it? Yes, it is; that's all you have to do. The difficult part is the extraction of the data points, a problem you can solve in R with the ddply or aggregate functions, as the short sketch below illustrates.
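Before looking at the full implementation later in this section, here is a minimal sketch of this counting step, using a small made-up data frame called toy (it is not part of the Nursery example) and base R only:

  # A made-up data frame with a child variable y and its single parent x
  toy <- data.frame(
      x = c("a", "a", "a", "b", "b", "b"),
      y = c("t", "t", "u", "t", "u", "u"))

  # Count the values of y for each value of its parent x, then normalize
  # column by column: each column of the result is the maximum likelihood
  # estimate of P(y | x).
  prop.table(table(toy$y, toy$x), margin = 2)

  # The same counts computed with aggregate(), as mentioned above.
  toy$count <- 1
  aggregate(count ~ y + x, data = toy, FUN = sum)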

But why is it so simple? Before looking at an algorithm in R, let's see why this algorithm works.

How are empirical and model distribution related?

A graphical model represents a joint probability distribution over a set of variables X. But a given graphical model cannot represent every joint distribution: its structure constrains the form of the distributions it can encode. Here we are interested in directed probabilistic graphical models as we defined them before. The definition can also be seen as a constraint on the type of probability distribution we want to represent, and in this case the constraint is:

$p(x) = \prod_{i} p(x_i \mid pa(x_i))$

So far, this is our well-known definition of a directed graphical model.

Definition: Empirical Distribution

Let $\chi = \{x^1, \ldots, x^N\}$ be a set of data points, which are states of a variable X; then the empirical distribution has its mass evenly distributed over the data points and zero elsewhere. Source: Bayesian Reasoning and Machine Learning, D. Barber, 2012, Cambridge University Press.

From $\chi$, assuming the data points are independently and identically distributed, the empirical distribution is $q(x) = \frac{1}{N}\sum_{n=1}^{N}\delta(x, x^n)$, where $\delta(x, x^n)$ equals 1 if $x = x^n$ and 0 otherwise. In other words, it is the distribution that associates with each possible state of X its frequency as computed from the dataset, and zero if the state does not appear in the dataset.
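As a quick illustration (the sample below is made up), the empirical distribution of a discrete variable can be computed in R in one line:

  # A hypothetical sample of a discrete variable X
  x_samples <- c("a", "b", "a", "a", "c", "b")

  # Empirical distribution q(x): the frequency of each observed state,
  # implicitly zero for any state that never appears in the sample.
  q <- table(x_samples) / length(x_samples)
  print(q)  # a = 0.5, b = 0.333, c = 0.167 (rounded)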

Let's consider the relationship between the empirical distribution q(x) and the model distribution p(x).

The Kullback-Leibler divergence (also called relative entropy) is a non-symmetric measure of the difference between two probability distributions q and p, noted KL(q || p). It can be interpreted as the expected number of extra bits needed to encode samples from q when using a code optimized for p instead of a code optimized for q. Intuitively, if the two distributions are equal then the Kullback-Leibler divergence is zero.
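To make this concrete, here is a minimal sketch of a KL divergence computation between two discrete distributions in R; the vectors q and p below are made up, and we assume p is strictly positive wherever q is:

  # KL(q || p) for two discrete distributions given as probability vectors
  # over the same states. Terms where q is zero contribute nothing.
  # The natural logarithm is used, so the result is in nats; use log2 for bits.
  kl <- function(q, p)
  {
      nz <- q > 0
      sum(q[nz] * log(q[nz] / p[nz]))
  }

  q <- c(0.5, 0.3, 0.2)   # made-up empirical distribution
  p <- c(0.4, 0.4, 0.2)   # made-up model distribution

  kl(q, p)   # a small positive number (about 0.025)
  kl(q, q)   # exactly 0: identical distributions have zero divergence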

The KL divergence between the empirical distribution and a distribution p(x) is:

$\mathrm{KL}(q \,\|\, p) = \sum_{x} q(x) \log q(x) - \sum_{x} q(x) \log p(x)$

The log-likelihood of the model p(x) being $L = \sum_{n=1}^{N} \log p(x^n)$, we can see in the previous formula that the right-most term, $\sum_{x} q(x)\log p(x) = \frac{1}{N}\sum_{n=1}^{N}\log p(x^n) = \frac{L}{N}$, is the expected log-likelihood of the model p(x) under the empirical distribution q(x). We can therefore write:

$\mathrm{KL}(q \,\|\, p) = \sum_{x} q(x) \log q(x) - \frac{1}{N}\sum_{n=1}^{N} \log p(x^n)$

And because the term $\sum_{x} q(x)\log q(x)$ does not depend on p(x), the model distribution, we can treat it as a constant and write:

$\mathrm{KL}(q \,\|\, p) = -\frac{1}{N}\sum_{n=1}^{N} \log p(x^n) + \text{constant}$

Therefore, as maximizing the likelihood is equivalent to maximizing the log-likelihood, and given that the second term in the previous formula is a constant, maximizing the log-likelihood will also minimize the KL divergence between the empirical distribution q and the model distribution p. It simply means that finding the maximum likelihood parameters for p(x) is equivalent to minimizing the KL divergence between the empirical distribution and the model distribution.

If we have no constraint at all on p(x), then the solution is to take p(x)=q(x).

But recall that we have a few constraints: p(x) has to be a graphical model. So now, let's put the real p(x) from the graphical model into this formula to see what happens.

$\mathrm{KL}(q \,\|\, p) = -\sum_{i}\sum_{x} q(x) \log p(x_i \mid pa(x_i)) + \text{constant}$

Don't be scared by this double sum; just remember that $\log p(x) = \sum_{i} \log p(x_i \mid pa(x_i))$; that is, we only took the logarithm of the factorized distribution of the graphical model. This big expression can be simplified by noting that, for each term of the outer sum (each variable $x_i$), the inner sum over x only depends on $x_i$ and $pa(x_i)$, so the other variables can be summed out of q, and we can now write:

$\mathrm{KL}(q \,\|\, p) = -\sum_{i}\sum_{x_i,\, pa(x_i)} q(x_i, pa(x_i)) \log p(x_i \mid pa(x_i)) + \text{constant}$

The inner sum now computes the expected log-likelihood of $p(x_i \mid pa(x_i))$ under the marginal distribution $q(x_i, pa(x_i))$, that is, under q restricted to the subset of variables $x_i$, $pa(x_i)$.

Now let's factorize $q(x_i, pa(x_i))$ as $q(x_i \mid pa(x_i))\, q(pa(x_i))$ and add back into the formula terms in $\log q$ that are part of the constant (they do not depend on p):

$\mathrm{KL}(q \,\|\, p) = \sum_{i}\sum_{pa(x_i)} q(pa(x_i)) \left[ \sum_{x_i} q(x_i \mid pa(x_i)) \log q(x_i \mid pa(x_i)) - \sum_{x_i} q(x_i \mid pa(x_i)) \log p(x_i \mid pa(x_i)) \right] + \text{constant}$

Again, this formula looks big but, if you look carefully inside the brackets, you will recognize the formula of a KL divergence, this time between $q(x_i \mid pa(x_i))$ and $p(x_i \mid pa(x_i))$. This beautiful result means we can simplify the formula again into:

$\mathrm{KL}(q \,\|\, p) = \sum_{i}\sum_{pa(x_i)} q(pa(x_i))\, \mathrm{KL}\!\left( q(x_i \mid pa(x_i)) \,\|\, p(x_i \mid pa(x_i)) \right) + \text{constant}$

What we have in this final formula is a weighted sum of KL divergences. The weights q(pa(xi)) and the KL divergences are all non-negative, so minimizing this sum amounts to minimizing each of its terms. And therefore, minimizing this sum also corresponds to the maximum likelihood estimation of p(x), as we saw before. But if you look carefully at what is inside this sum, you will see more KL divergences, one for each local distribution associated with each node of the graph, and we have to minimize all of them. It simply means that, if we want to minimize the whole KL divergence between q and p (and obtain the maximum likelihood estimate of p, our graphical model), we just have to do the same thing on each node, independently of the other nodes, one by one. And minimizing each of those KL divergences, as we saw before, is equivalent to counting and computing frequencies. Therefore, the maximum likelihood estimator for a directed graphical model is obtained by counting the data points (that is, computing frequencies) at each node of the graph, selecting only the data points matching each combination of values of the parents pa(xi), as the counting formula below makes explicit.
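Written out explicitly, and using N(·) to denote the number of data points matching a given configuration (a counting notation introduced here for convenience only), the estimator is simply a ratio of counts:

$\hat{p}(x_i = v \mid pa(x_i) = c) = \frac{N(x_i = v,\ pa(x_i) = c)}{N(pa(x_i) = c)}$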

The ML algorithm and its implementation in R

At this point, we are able to write a simple algorithm in R to learn the parameters of the graph. In this section, we will use the Nursery dataset, again from the UCI repository (https://archive.ics.uci.edu/ml/datasets/Nursery). The algorithm will not rely on any specific graphical model library, only on the graph and Rgraphviz packages and common R libraries such as plyr.

The dataset has nine variables, related to nursery-school applications. The dataset was recorded in the 1980s in Ljubljana, Slovenia to rank applications to nurseries when the demand was too high, in order to build an expert system to objectively explain why an application was accepted or rejected. All the variables are categorical, which means we will only focus on discrete variables.

Our aim in this section is to illustrate what we have learnt so far in this chapter by application, so we will not try to make a perfect expert system. For this reason, we will use a simple graph to illustrate this example:

[Figure: the simple graphical model used for the Nursery example, with the class node connected to the eight application variables]

The R code is as follows, including the learning function where we use two new packages, graph and Rgraphviz. You will note that lines are numbered (but not part of the code), for ease of reading:

  1 library(graph)
  2 library(Rgraphviz)
  3 library(plyr)
  4
  5 data0 <- data.frame(
  6     x=c("a","a","a","a","b","b","b","b"),
  7     y=c("t","t","u","u","t","t","u","u"),
  8     z=c("c","d","c","d","c","d","c","d"))
  9
 10 edges0 <- list(x=list(edges=2),y=list(edges=3),z=list())
 11 g0 <- graphNEL(nodes=names(data0),edgeL=edges0,edgemod="directed")
 12 plot(g0)
 13
 14 data1 <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/nursery/nursery.data", col.names=c("parents","has_nurs","form","children","housing","finance","social","health","class"))
 15 edges1 <- list( parents=list(), has_nurs=list(), form=list(), children=list(),
 16          housing=list(), finance=list(), social=list(), health=list(),
 17          class=list(edges=1:8) )
 18 g1 <- graphNEL(nodes=names(data1), edgeL=edges1,edgemod="directed")
 19 plot(g1)

From lines 1 to 3, we load the necessary packages. Then in line 5 we create a simple dataset to test the learning function. This dataset has three variables and will result in 50% probabilities for each combination.

In lines 10 and 11, we create the edges and the corresponding graph. In line 12, we plot the graph and obtain the following graphical model:

[Figure: the graphical model g0, a simple chain x → y → z]

From lines 14 to 19, we instantiate a second graphical model, this time using the Nursery dataset. The output of the plot function is:

[Figure: the graphical model g1 for the Nursery dataset, with class pointing to the eight other variables]

Next we have one simple function to compute the conditional probability table for each variable with parents (or not):

  1 make_cpt<-function(df,pa)
  2 {
  3     prob <- nrow(df)
  4     parents <- data.frame(df[1,pa])
  5     names(parents) <- pa
  6
  7     data.frame(parents,prob)
  8 }

This function is in fact called later by the ddply function (from the plyr package) and computes the frequency of one combination of the variable of interest and its parents. This is done in line 3 by a call to the nrow function.

The frequency here is only the number of times the same combination appears in the dataset. Because the combination is unique when this function is called (thanks to ddply), we extract the values of all the parents from the first row only, in line 4.
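As a quick illustration of this counting pattern (using the toy dataset data0 defined earlier and a simple anonymous function instead of make_cpt), ddply splits the data frame by the requested columns and applies the function to each subset:

  # Count how many rows of data0 exist for each combination of y and x.
  # Each subset received by the anonymous function contains only the rows
  # matching one combination, so nrow() gives the count for that combination.
  ddply(data0, c("y", "x"), function(df) c(count = nrow(df)))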

Finally, the main learning function is as follows. The code is not optimized but, on the contrary, is kept very explicit and simple so we can understand each piece:

  1 learn <- function(g,data)
  2 {
  3     rg <- reverseEdgeDirections(g)
  4     result <- list()
  5
  6     for(var in rg@nodes)
  7     {
  8         pa <- unlist(adj(rg,var))
  9         if(length(pa)>0)
 10         {
 11             X <- ddply(data, c(var,pa), make_cpt, pa)
 12             Y <- ddply(data, pa, make_cpt, pa)
 13             for(i in 1:nrow(Y))
 14             {
 15                 c <- sapply(1:nrow(X), function(j) all(X[j,pa] == Y[i,pa]))
 16                 c <- which(c)
 17                 X$prob[c] <- X$prob[c]/Y$prob[i]
 18             }
 19         }
 20         else
 21         {
 22             X <- ddply(data,var, function(df) c(prob=nrow(df)))
 23             X$prob <- X$prob/sum(X$prob)
 24         }
 25
 26         result[[length(result)+1]] <- X
 27     }
 28
 29     return(result)
 30 }

This function takes two parameters: the graph g and the dataset data. Then, in line 3, it reverses the direction of all edges, to be able to find the parents of each variable in line 8, using the function adj() from the graph package. There is nothing in the theory that says you have to reverse the graph; it's just a convenient way to find the parents.

In line 6, it starts to learn each variable independently, as we saw in the previous section. We deal with two cases: one where the variable has no parents (and we therefore compute the marginal distribution of the variable) and one where the variable has parents (and we therefore compute a conditional probability table). See line 9.

In line 11, for each possible combination of values of the variable and its parents, we compute the frequencies (more accurately, the counts) for P(var, pa(var)). In line 12, we do the same thing for P(pa(var)).

Finally, from lines 13 to 18, we apply the Bayes formula to obtain the conditional probability tables, transforming the counts into probabilities (or rather frequencies, in this case too). Lines 22 and 23 do the same thing for marginal probability tables.
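The normalization performed in lines 13 to 18 can also be reproduced on the toy dataset with base R only; this little check is not part of the learning function, it simply shows what the loop computes for the variable y:

  # P(y | x) on the toy dataset: count the (y, x) combinations and
  # normalize each column (one column per value of the parent x).
  prop.table(xtabs(~ y + x, data = data0), margin = 2)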

The result for each variable is stored in a list called result.

So let's use this function with the two datasets, look at the results, and analyze them.

Application

First we load and run the previous code in R, and then execute the following:

learn(g0,data0)

The result will be:

[[1]]
  x prob
1 a  0.5
2 b  0.5

[[2]]
  y x prob
1 t a  0.5
2 t b  0.5
3 u a  0.5
4 u b  0.5

[[3]]
  z y prob
1 c t  0.5
2 c u  0.5
3 d t  0.5
4 d u  0.5

This tells us that P(x = a) = 0.5 and P(x = b) = 0.5. Looking at the dataset, we indeed have an equal number of a and b values for the variable x. This is good.

The other tables are for P(y | x) and P(z | y). Remember that a conditional probability table is not a single probability distribution: it has to be read per value of the parents. For example, P(y = t | x = a) = 0.5 and P(y = u | x = a) = 0.5, and the sum is, as expected, 1.
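If you want to check this normalization programmatically, one small sketch (assuming the code above has already been run) is to sum the probabilities of the second table grouped by the value of the parent:

  # Each group of rows sharing the same parent value x must sum to 1.
  res <- learn(g0, data0)
  aggregate(prob ~ x, data = res[[2]], FUN = sum)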

Now let's apply the learn function to the Nursery dataset and look at the results:

learn(g1,data1)

To simplify the output, we only show a few variables and start with class:

       class         prob
1   not_recom       3.333591e-01
2   priority        3.291921e-01
3   recommend       7.716645e-05
4   spec_prior      3.120611e-01
5   very_recom      2.531059e-02

This is a marginal probability table, as expected from the graph we saw before. The sum is 1 and we can see that some values have higher probabilities than others. This table would be used by the expert system to draw conclusions. Of course, our model is very simple, and a more realistic model could have a different graph with different values. We encourage the reader to modify the graph g1 to test different options, as sketched below.
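For instance, one possible modification, given here purely as an illustration, keeps the same nodes but makes class the parent of only housing and health; the indices refer to the position of the nodes in names(data1):

  # Alternative structure: class is the parent of housing (node 5) and
  # health (node 8) only; all the other variables have no parents.
  edges2 <- list( parents=list(), has_nurs=list(), form=list(), children=list(),
           housing=list(), finance=list(), social=list(), health=list(),
           class=list(edges=c(5,8)) )
  g2 <- graphNEL(nodes=names(data1), edgeL=edges2, edgemod="directed")
  plot(g2)
  learn(g2, data1)

The rest of this section continues with the results obtained from the original graph g1.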

If we look at the finance variable, we have the following table:

     finance      class      prob
1 convenient  not_recom 0.5000000
2 convenient   priority 0.5260197
3 convenient  recommend 1.0000000
4 convenient spec_prior 0.4589515
5 convenient very_recom 0.6646341
6     inconv  not_recom 0.5000000
7     inconv   priority 0.4739803
8     inconv spec_prior 0.5410485
9     inconv very_recom 0.3353659

This table is bigger than the one we saw before and behaves as expected. However, there is a small problem. This maximum likelihood procedure is by no means a Bayesian procedure but simply a frequentist one. While it works in most cases, it can sometimes run into difficulties. Here, in row 3 of the table, we see that P(finance = convenient | class = recommend) = 1.

While having a probability equal to one is not a problem in itself, it is annoying. It is due to the fact that we only had one example of this specific combination in the dataset, which led to this extreme result. This is not a desirable outcome, because we want to be able to reach all possible scenarios, not fall into a unique scenario with probability 1.

We will see later in the book that it is interesting, in many cases, to add a prior distribution on all the parameters of the model, to prevent them from ever having a probability of 0 or 1 and to be able to explore as many scenarios as possible.
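As a small preview, and not the full Bayesian treatment developed later in the book, the simplest fix is to add a pseudo-count to every cell of the table before normalizing, a trick known as Laplace (or add-one) smoothing. The sketch below applies it to the toy dataset; the pseudo-count alpha is chosen arbitrarily:

  # Laplace (add-one) smoothing of P(y | x) on the toy dataset: add alpha
  # to every count before normalizing, so that no entry of the table can
  # ever be exactly 0 or 1 as long as alpha > 0.
  alpha  <- 1
  counts <- table(data0$y, data0$x) + alpha
  prop.table(counts, margin = 2)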
