So far in our discussion, we have assumed that we already know both the network model and the parameters associated with it. However, constructing these models by hand requires a lot of domain knowledge. In most real-life problems, we instead have some recorded observations of the variables. So, in this chapter, we will learn to create models from the data we have.
To understand this problem, let's say that the domain is governed by some underlying distribution, P*. This distribution is induced by the network model, M*. Also, we are provided with a dataset, D = {ξ[1], ξ[2], …, ξ[M]}, of M samples. As these data points are obtained from our observations of the actual model, we can say that they have been sampled from the distribution, P*. Also, we can assume that all the data samples have been sampled independently from P*. Such data samples are known as independent and identically distributed (IID) samples.
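To make the IID assumption concrete, here is a minimal sketch of drawing IID samples from a known distribution. The distribution here is a made-up example over a single binary variable, not one induced by an actual network model:

```python
import random

# Hypothetical "true" distribution P* over a single binary variable
# (in practice, P* is induced by the unknown network model M*).
p_star = {"yes": 0.7, "no": 0.3}

def sample_iid(dist, m, seed=0):
    """Draw m independent, identically distributed samples from dist."""
    rng = random.Random(seed)
    return rng.choices(list(dist), weights=list(dist.values()), k=m)

data = sample_iid(p_star, 1000)
# With enough IID samples, empirical frequencies approach P*.
print(data.count("yes") / len(data))
```

Each call to the generator is independent of the previous ones, which is exactly what lets us treat the likelihood of the dataset as a product over individual samples later in the chapter.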
Now, we want to select a model, M̃, from the family of models over the given variables, such that this model induces a probability distribution, P̃, that is close to the underlying distribution, P*, of our domain.
In this chapter, we will discuss the following topics:
Before we discuss the specific methods for learning graphical models, in this section, we will briefly discuss some general ideas related to learning.
The perfect solution to our learning task would be to find a model, M̃, whose induced distribution is exactly the underlying distribution of our data. However, this is rarely possible in real life because of computational costs and lack of data. As we can't recover the exact underlying distribution, we instead optimize our learning task depending on the goal of learning. To make this clearer, consider two different situations. In the first case, we want to learn the model to answer conditional queries over some specific variables, whereas in the second case, we want to answer multiple queries involving all the variables of the network. In the first case, we would optimize our learning for the variables over which we want to answer queries, at the cost of getting a less accurate distribution over the other variables. In the second case, however, we want our learned model to be as close to the underlying model as possible, because we have to answer queries over all the variables. Hence, we see that our goal of learning has a huge effect on our learning task.
One of the most common reasons to learn a graphical model is to perform inference tasks. In this case, we would like our learned model, M̃, to induce a distribution, P̃, that is as close to the underlying distribution, P*, as possible. To measure the distance between these two distributions, we can use the following relative entropy distance measure:

D(P* ‖ P̃) = E_{ξ ∼ P*}[log (P*(ξ) / P̃(ξ))]
However, the problem with this measure is that we also need to know P* to compute the relative entropy. To solve this problem, we simplify the equation as follows:

D(P* ‖ P̃) = E_{ξ ∼ P*}[log P*(ξ)] − E_{ξ ∼ P*}[log P̃(ξ)] = −H(P*) − E_{ξ ∼ P*}[log P̃(ξ)]
Here, we see that the first term depends only on P*, and hence, it is unchanged for any choice of model. Therefore, we can ignore this term and compare our models only on the basis of the second term, E_{ξ ∼ P*}[log P̃(ξ)], preferring the models that make this term as large as possible. This term is commonly known as the expected log-likelihood. It encodes how probable the given data points are under our model; therefore, a model with a high likelihood value for the given data is closer to the underlying distribution of the data.
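The decomposition above can be checked numerically. This is a small sketch over a three-outcome toy distribution; P* and the two candidate models are made-up numbers, chosen only to show that ranking models by expected log-likelihood is the same as ranking them by relative entropy:

```python
import math

# Toy "true" distribution P* and two hypothetical candidate models.
p_star = [0.5, 0.3, 0.2]
q1 = [0.4, 0.4, 0.2]
q2 = [0.1, 0.1, 0.8]

def kl(p, q):
    """Relative entropy D(p || q) = sum_x p(x) log(p(x)/q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def expected_log_likelihood(p, q):
    """E_{x ~ p}[log q(x)] -- the term we maximize over models."""
    return sum(pi * math.log(qi) for pi, qi in zip(p, q))

# D(P* || Q) = -H(P*) - E_{P*}[log Q]: the entropy term is the same
# for every candidate Q, so it can be ignored when comparing models.
neg_entropy = sum(pi * math.log(pi) for pi in p_star)
for q in (q1, q2):
    assert abs(kl(p_star, q) - (neg_entropy - expected_log_likelihood(p_star, q))) < 1e-12

# The model with the larger expected log-likelihood is the one with
# the smaller KL distance to P*.
print(expected_log_likelihood(p_star, q1) > expected_log_likelihood(p_star, q2))
print(kl(p_star, q1) < kl(p_star, q2))
```

Both comparisons agree, which is why we can drop the (unknown) entropy term of P* entirely.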
So, in our learning problem, we are interested in the likelihood of the data, D, given a model, M, that is, P(D | M). For convenience, we usually use the log-likelihood, denoted as log P(D | M). We also define the log-loss as the negative of the log-likelihood. Log-loss is an example of a loss function. A loss function, loss(ξ, M), determines the loss that our model makes on a particular data point, ξ. Therefore, for better learning, we try to find a model that minimizes the expected loss, also known as the risk:

E_{ξ ∼ P*}[loss(ξ, M)]
However, as P* is not known, we can approximate this expected loss by averaging over the sampled data points:

(1/M) Σ_{m=1}^{M} loss(ξ[m], M)
Taking the example of log-loss and considering a data set, D = {ξ[1], ξ[2], …, ξ[M]}, of IID samples, we have the following equation:

P(D | M) = Π_{m=1}^{M} P(ξ[m] | M)
Taking the logarithm of the preceding expression, we get the following equation:

log P(D | M) = Σ_{m=1}^{M} log P(ξ[m] | M)
As we saw earlier, this term is the negative of the empirical log-loss. Hence, using log-loss as the loss function gives us a good intuition for the empirical risk.
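The relationship between likelihood, log-likelihood, and empirical log-loss can be sketched in a few lines. The model distribution and the dataset here are made-up illustrations over a single binary variable:

```python
import math

# Hypothetical model distribution over a binary variable, and IID data.
p_model = {"yes": 0.7, "no": 0.3}
data = ["yes", "yes", "no", "yes", "no"]

# Likelihood of the data set: a product over the IID samples.
likelihood = math.prod(p_model[x] for x in data)

# Log-likelihood: the product becomes a sum of log-probabilities.
log_likelihood = sum(math.log(p_model[x]) for x in data)

# Empirical risk under log-loss: negative average log-likelihood.
empirical_risk = -log_likelihood / len(data)

assert abs(math.log(likelihood) - log_likelihood) < 1e-12
print(round(empirical_risk, 4))
```

Working in log-space is also the standard numerical trick: the raw likelihood is a product of many probabilities and underflows quickly as M grows, while the sum of logs stays well-behaved.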
In the preceding section, we tried to learn the complete underlying probability distribution, P*. For this, we used the log-likelihood function to select the most accurate model. The log-likelihood function uses complete assignments of all the variables to compute how likely the given data is under our model. Thus, models learned in this way can be used to answer a whole range of conditional or marginal probability queries over the variables of the model.
In many cases, though, we are more interested in answering a single conditional probability query. Let's take the example of a simple classification problem using the Iris dataset for the classification of flower species. We are provided with five variables, namely sepallength, sepalwidth, petallength, petalwidth, and flowerspecies. Now, we want to predict the species of a flower using the sepal length, sepal width, petal length, and petal width of a given flower. So, in this case, we always want to answer a specific conditional distribution over the variables, that is, P(flowerspecies | sepallength, sepalwidth, petallength, petalwidth). More precisely, we are interested in the MAP queries over the variable, flowerspecies, when all the other variables are given. In real life, we have a lot of problems like this, where we want to answer only some specific queries from our learned model.
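A MAP query of this kind reduces to taking the arg max of the learned conditional distribution. The sketch below uses a made-up conditional distribution over the three Iris species for one particular feature assignment; the probability values are illustrative, not learned from the actual dataset:

```python
# Hypothetical learned conditional distribution
# P(flowerspecies | sepallength, sepalwidth, petallength, petalwidth)
# for one specific assignment of the four feature variables.
p_species_given_features = {
    "setosa": 0.02,
    "versicolor": 0.85,
    "virginica": 0.13,
}

# A MAP query returns the single most probable assignment rather than
# the full distribution.
map_assignment = max(p_species_given_features, key=p_species_given_features.get)
print(map_assignment)  # versicolor
```

Note that a model optimized only for this query may answer it well while inducing a poor distribution over the remaining variables, which is exactly the trade-off discussed above.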
Therefore, in such cases, we can select a different loss function that better represents our problem. For example, in this case, we can use the classification error, also known as the 0/1 error:

E_{(x, y) ∼ P*}[ 1{h_M(x) ≠ y} ]
Here, 1{·} is an indicator function, h_M(x) represents the value predicted using the hypothesis, h_M, and y is the actual or target value.
In simple terms, this error function computes the probability, over instances sampled from P*, that our model selects the wrong label. This error function is good for the case when we want to predict a single variable, or maybe a couple of variables. However, in cases when we want to predict a large number of variables, let's say in the case of image segmentation, we would not like to penalize the whole model for wrongly predicting the value of a single pixel. One suitable error function in such cases is the Hamming loss, which instead counts the fraction of variables in each prediction that were predicted wrong.
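The difference between the two loss functions shows up as soon as a prediction is wrong on just one of many variables. This is a minimal sketch with a made-up segmentation-style example of eight pixel labels:

```python
def zero_one_loss(y_pred, y_true):
    """0/1 error: 1 if the joint prediction is wrong anywhere, else 0."""
    return int(y_pred != y_true)

def hamming_loss(y_pred, y_true):
    """Hamming loss: fraction of individual variables predicted wrong."""
    return sum(p != t for p, t in zip(y_pred, y_true)) / len(y_true)

# Hypothetical prediction over 8 pixel labels, wrong on exactly one pixel.
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1]

print(zero_one_loss(y_pred, y_true))  # 1: the whole prediction is penalized
print(hamming_loss(y_pred, y_true))   # 0.125: only one of eight variables is wrong
```

The 0/1 error treats the almost-correct prediction as a total failure, while the Hamming loss charges it proportionally to the single wrong pixel.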
Therefore, if we know in advance that we are going to use our model for a specific prediction task, we can always optimize our model for those variables.
Another problem that we might want to tackle through learning is knowledge discovery, in which we would like to know the relationships between the variables. In this case, we mostly focus on predicting the correct network structure. As it turns out, this is very difficult to achieve with good confidence; only when we have a large amount of data may we be able to construct a network structure with good confidence. Also, in the case of Bayesian networks, there are many I-equivalent structures for any given structure. Therefore, we can at best hope to learn a structure that is I-equivalent to the original one from the data. When we don't have enough data, we will not be able to say anything very confidently about the relationships between the variables. For example, let's say that our data shows a weak correlation between two variables; as we don't have enough data, we can't confirm this relationship, as it might be due to some noise in our data.
Thus, we can conclude that in the case of a knowledge discovery task, the most important thing is to focus on the confidence with which we predict the network structure. In the later sections, we will discuss how to approach such problems.