Chapter 5. Model Learning – Parameter Estimation in Bayesian Networks

So far in our discussion, we have always assumed that we already know both the network structure and the parameters associated with it. However, constructing such models by hand requires a lot of domain knowledge. In most real-life problems, what we have instead is a set of recorded observations of the variables. So, in this chapter, we will learn how to create models from the data we have.

To understand this problem, let's say that the domain is governed by some underlying distribution, P*. This distribution is induced by the network model, M*. Also, we are provided with a dataset, D = {d[1], d[2], ..., d[M]}, of M samples. As these data points are obtained from observations of the actual domain, we can say that they have been sampled from the distribution P*. We can further assume that each sample was drawn from P* independently of the others. Such data samples are known as independently and identically distributed (IID) samples.
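To make the IID assumption concrete, here is a minimal Python sketch (the distribution and its values are made up for illustration) that draws M independent samples from one fixed discrete distribution, which is exactly the process our dataset D is assumed to come from:

import numpy as np

# A hypothetical underlying distribution P* over four joint assignments.
states = np.array(["(0,0)", "(0,1)", "(1,0)", "(1,1)"])
p_star = np.array([0.4, 0.1, 0.2, 0.3])

# Each of the M draws uses the same distribution P* and is independent of
# the other draws -- this is the IID assumption.
M = 1000
rng = np.random.default_rng(42)
data = rng.choice(states, size=M, p=p_star)

# With enough samples, the empirical frequencies approach P*.
for state, prob in zip(states, p_star):
    print(state, (data == state).mean(), "vs", prob)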

Now, we want to select a model M̃ from the family of models over the given variables, such that M̃ induces a probability distribution P̃ that is close to the underlying distribution P* of our domain.

In this chapter, we will discuss the following topics:

  • General ideas in learning
  • Maximum likelihood parameter estimation
  • Bayesian parameter estimation
  • Maximum likelihood structure learning
  • Bayesian structure learning

General ideas in learning

Before we discuss the specific methods used to learn graphical models, in this section we will briefly discuss some general ideas related to learning.

The goals of learning

The perfect solution to our learning task would be to find a model, M̃, whose induced probability distribution is exactly the underlying distribution of our data. However, this is rarely possible in real life because of computational costs and limited data. So, as we can't recover the exact underlying distribution, we optimize our learning task for the goal at hand. To make this clearer, consider two different situations. In the first case, we want to learn the model to answer conditional queries over some specific variables, whereas in the second case, we want to answer queries involving any of the variables of the network. In the first case, we would like to optimize our learning over the variables we intend to query, at the cost of getting a less accurate distribution over the other variables. In the second case, however, we want our learned model to be as close to the underlying model as possible, because we have to answer queries over all the variables. Hence, we see that the goal of learning has a huge effect on the learning task.

Density estimation

One of the most common reasons to learn a graphical model is to perform inference tasks. In this case, we would like our learned model, M̃, to induce a distribution, P̃, that is as close to the underlying distribution P* as possible. To measure the distance between these two distributions, we can use the following relative entropy distance measure:

D(P^* \| \tilde{P}) = E_{\xi \sim P^*}\left[\log \frac{P^*(\xi)}{\tilde{P}(\xi)}\right]

However, the problem with this measure is that we need to know P* to compute the relative entropy. To work around this, we decompose the equation as follows:

D(P^* \| \tilde{P}) = E_{\xi \sim P^*}\left[\log P^*(\xi)\right] - E_{\xi \sim P^*}\left[\log \tilde{P}(\xi)\right]

Here, we see that the first term depends only on P*, and hence is unchanged for any choice of model. Therefore, we can ignore it and compare our models on the basis of the second term alone, E_{\xi \sim P^*}[\log \tilde{P}(\xi)], preferring the models that make this term as large as possible. This term is commonly known as the expected log-likelihood. It measures how probable the observed data points are under our model; therefore, a model with a higher likelihood for the given data is closer to the underlying distribution of the data.
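As a quick illustration of why this simplification works, the following sketch (with made-up discrete distributions) compares two candidate models against a known P*: the model with the larger expected log-likelihood is also the one with the smaller relative entropy, since the two quantities differ only by a term that is constant across models:

import numpy as np

p_star = np.array([0.4, 0.1, 0.2, 0.3])       # underlying distribution P*
model_a = np.array([0.35, 0.15, 0.2, 0.3])    # candidate model A
model_b = np.array([0.25, 0.25, 0.25, 0.25])  # candidate model B

def relative_entropy(p, q):
    # D(P || Q) = sum_x P(x) * log(P(x) / Q(x))
    return np.sum(p * np.log(p / q))

def expected_log_likelihood(p, q):
    # E_{x ~ P}[log Q(x)] -- the only model-dependent term
    return np.sum(p * np.log(q))

for name, q in [("A", model_a), ("B", model_b)]:
    print(name, relative_entropy(p_star, q), expected_log_likelihood(p_star, q))

# Model A wins on both measures at once: it has the higher expected
# log-likelihood and the lower relative entropy.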

So, in our learning problem, we are interested in the likelihood of the data given a model M, that is, P(D : M). For convenience, we usually use the log-likelihood, denoted as ℓ(D : M) = log P(D : M). We also define the log-loss as the negative of the log-likelihood. Log-loss is an example of a loss function. A loss function, loss(ξ : M), determines the loss that our model makes on a particular data point, ξ. Therefore, for better learning, we try to find a model that minimizes the expected loss, also known as the risk:

E_{\xi \sim P^*}\left[loss(\xi : M)\right]

However, as P* is not known, we can approximate this expected loss by averaging over the sampled data points, giving the empirical risk:

E_{D}\left[loss(\xi : M)\right] = \frac{1}{M} \sum_{m=1}^{M} loss(d[m] : M)
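The following sketch (again with a made-up P* and model) shows this approximation in action: the sample average of the log-loss over points drawn from P* comes out close to the true risk computed directly from P*:

import numpy as np

rng = np.random.default_rng(0)

p_star = np.array([0.4, 0.1, 0.2, 0.3])     # underlying distribution P*
q_model = np.array([0.35, 0.15, 0.2, 0.3])  # some learned model

# True risk under log-loss: E_{x ~ P*}[-log Q(x)].
true_risk = -np.sum(p_star * np.log(q_model))

# Empirical risk: average the per-sample losses over M sampled points.
samples = rng.choice(4, size=10_000, p=p_star)
empirical_risk = np.mean(-np.log(q_model[samples]))

print(true_risk, empirical_risk)  # the two values should be close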

Taking the example of log-loss and considering a dataset, D = {d[1], d[2], ..., d[M]}, of IID samples, we have the following equation:

P(D : M) = \prod_{m=1}^{M} P(d[m] : M)

Taking the logarithm of the preceding expression, we get the following equation:

\log P(D : M) = \sum_{m=1}^{M} \log P(d[m] : M)

As we saw earlier, this term is just the negative of the total log-loss over the dataset. Hence, maximizing the log-likelihood of the data is the same as minimizing the empirical risk when log-loss is used as the loss function.
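As a quick numeric check of this derivation (the per-sample probabilities here are made up):

import numpy as np

# Hypothetical probabilities P(d[m] : M) our model assigns to each sample.
probs = np.array([0.30, 0.25, 0.10, 0.35])

# For IID data, the likelihood is the product of per-sample probabilities,
likelihood = np.prod(probs)

# so the log-likelihood is the sum of per-sample log-probabilities,
log_likelihood = np.sum(np.log(probs))

# and the empirical risk under log-loss is its negative, divided by M.
empirical_log_loss = -log_likelihood / len(probs)

print(likelihood, log_likelihood, empirical_log_loss)
assert np.isclose(np.log(likelihood), log_likelihood)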

Predicting the specific probability values

In the preceding section, we tried to learn the complete underlying probability distribution, P*. For this, we used the log-likelihood function to select the most accurate model. The log-likelihood function uses complete assignments of all the variables to compute how likely the observed data is under our model. Thus, models learned in this way can be used to answer a whole range of conditional or marginal probability queries over the variables of the model.

In many cases, though, we are more interested in answering a single conditional probability query. Let's take the example of a simple classification problem using the Iris dataset for the classification of flower species. We are provided with five variables, namely sepallength, sepalwidth, petallength, petalwidth, and flowerspecies. Now, we want to predict the species of a flower using its sepal length, sepal width, petal length, and petal width. So, in this case, we always want to answer a specific conditional distribution over the variables, that is, P(flowerspecies | sepallength, sepalwidth, petallength, petalwidth). More precisely, we are interested in the MAP query over the variable flowerspecies when all the other variables are given. In real life, we have a lot of problems like this, where we want to answer only some specific queries using our learned model.
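As a minimal sketch of such a prediction, assume we have already computed the conditional distribution over flowerspecies for one observed flower (the species names and probability values below are made up): the MAP prediction is simply the most probable assignment given the evidence.

import numpy as np

# Hypothetical conditional distribution P(flowerspecies | measurements)
# for one observed flower.
species = ["setosa", "versicolor", "virginica"]
posterior = np.array([0.05, 0.70, 0.25])

# A MAP query returns the single most probable assignment rather than
# the full distribution.
map_prediction = species[int(np.argmax(posterior))]
print(map_prediction)  # versicolor

Inference engines for Bayesian networks typically expose this operation directly as a MAP query, so in practice we would query the learned model rather than build the posterior by hand.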

Therefore, in such cases, we can select a different loss function that would better represent our problem. For example, in this case, we can use a classification error, also known as the 0/1 error:

E_{(x, y) \sim P^*}\left[\mathbb{1}\{h(x) \neq y\}\right]

Here, \mathbb{1}\{\cdot\} is the indicator function; h(x) represents the value predicted for x using the hypothesis h; and y is the actual or target value.

In simple terms, this error function computes the probability, over samples drawn from P*, that our model selects the wrong label. This error function is suitable when we want to predict a single variable, or perhaps a couple of variables. However, when we want to predict a large number of variables, say in the case of image segmentation, we would not like to penalize the whole prediction for getting the value of a single pixel wrong. A more suitable error function in such cases is the Hamming loss, which counts the fraction of variables in each prediction whose values were predicted incorrectly.
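To make the difference between the two loss functions concrete, here is a small sketch over made-up predictions for a five-variable prediction task: the 0/1 error charges a full unit for any mistake within a sample, while the Hamming loss charges only for the fraction of variables that are wrong:

import numpy as np

# Hypothetical predictions and targets for two samples of five variables
# each (think of each column as one pixel label in a tiny segmentation).
predicted = np.array([[0, 1, 1, 0, 1],
                      [1, 1, 0, 0, 1]])
target    = np.array([[0, 1, 0, 0, 1],
                      [1, 1, 0, 0, 1]])

# 0/1 error: a sample counts as wrong if ANY variable is mispredicted.
zero_one_error = np.mean(np.any(predicted != target, axis=1))

# Hamming loss: the fraction of individual variables predicted wrong,
# so one wrong pixel out of five costs 0.2 instead of 1.0.
hamming_loss = np.mean(predicted != target)

print(zero_one_error)  # 0.5 -- the first sample has one wrong variable
print(hamming_loss)    # 0.1 -- only 1 of 10 variable predictions is wrong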

Therefore, if we know in advance that we are going to use our model for a specific prediction task, we can always optimize our model for those variables.

Knowledge discovery

Another problem that we might want to tackle through learning is knowledge discovery, where we want to understand the relationships between the variables. In this case, we mostly focus on recovering the correct network structure. As it turns out, this is very difficult to achieve with good confidence. Only when we have a large amount of data may we be able to construct a network structure with good confidence. Moreover, in the case of Bayesian networks, there are many I-equivalent structures for any given structure; therefore, we can at best hope to learn a structure that is I-equivalent to the original one from the data. Coming to the case when we don't have enough data, we will not be able to say anything very confidently about the relationships between the variables. For example, let's say that our data shows a weak correlation between two variables; as we don't have enough data, we can't confirm this relationship, since it might simply be due to noise in our data.

Thus, we can conclude that in a knowledge discovery task, the most important thing is to focus on the confidence with which we predict the network structure. In later sections, we will discuss how to approach such problems.
