Structure learning in Bayesian networks

In the previous sections, we assumed that we already knew the network structure and tried to estimate the parameters of the network from the data. However, it is quite possible that we neither know the network structure nor have the domain knowledge to construct it. Hence, in this section, we will discuss how to construct the model structure from the given data.

Constructing the model from data is a difficult problem. Let's take the example of tossing two coins and representing the outcome of the first with the variable X and the second with the variable Y. We know that if the coins are fair, these two random variables should be independent of each other. However, to recover this independence condition purely from data, we would need every outcome to occur an equal number of times in the dataset, which we will rarely see in real life.

So, in general, we need to make some assumptions about the dependencies. The assumptions that we make largely depend on our learning task, as we discussed in the previous sections. In the case of knowledge discovery, we would like to know the dependencies between the variables; therefore, we would like our network structure to be as accurate as possible. We know that a distribution can have many P-maps; therefore, the best we can do is obtain a structure that is I-equivalent to the original network, $\mathcal{G}^*$. As we mentioned earlier, it is really hard to get the exact network structure, so we will often face situations where we have to decide whether or not to include a less-probable edge in our model. However, if we include too many or too few edges, we will end up not learning a good model. Therefore, this decision usually depends on our application.

Other than knowledge discovery, we very often use models for density estimation. In this case, we would like to learn a model that captures the underlying probability distribution, $P^*$, and is able to make predictions over new data points. We may think that adding less-probable edges to the model will help us here, as a more complex model should be able to represent $P^*$ more closely. However, it turns out that our intuition is wrong in this case.

Let's get back to our two-coin tossing example and consider that our dataset consists of 50 samples with the following frequencies: 11 heads/heads, 10 heads/tails, 14 tails/heads, and 15 tails/tails. As all the frequencies are not equal, the data suggests that the two variables, X and Y, are not independent. So, let's consider that while learning from this data, we added an edge between X and Y. In this case, using the maximum likelihood estimator, we get the following parameters: $P(X=H) = 21/50$, $P(X=T) = 29/50$, $P(Y=H \mid X=H) = 11/21$, $P(Y=T \mid X=H) = 10/21$, $P(Y=H \mid X=T) = 14/29$, and $P(Y=T \mid X=T) = 15/29$. Whereas, in the case when we do not consider any edge between X and Y, we get the following parameters for Y: $P(Y=H) = 25/50$ and $P(Y=T) = 25/50$. It was definitely possible for the probabilities to be skewed, even without any edge between the two variables. In a more complex model, it is more probable that the parameters will be skewed. This happens because in more complex models, we have less data to compute each parameter because of the conditioning. So, for example, while computing $P(Y \mid X=T)$, we will only consider data samples for which $X = T$. Hence, we are left with only 29 samples, whereas in the case of the model with no edges, we considered all 50 samples for computing the probability values.
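As a concrete check of these numbers, here is a minimal plain-Python sketch (the variable names are ours, chosen for illustration) that recomputes the maximum likelihood estimates from the stated counts:

```python
# MLE parameters for the two-coin example, computed from the counts in the
# text: 11 heads/heads, 10 heads/tails, 14 tails/heads, 15 tails/tails.
counts = {('H', 'H'): 11, ('H', 'T'): 10, ('T', 'H'): 14, ('T', 'T'): 15}
M = sum(counts.values())  # 50 samples in total

# Marginal counts for X (first coin) and Y (second coin)
M_x = {x: sum(c for (xi, _), c in counts.items() if xi == x) for x in 'HT'}
M_y = {y: sum(c for (_, yi), c in counts.items() if yi == y) for y in 'HT'}

# With an edge X -> Y: P(X=x) = M[x] / M and P(Y=y | X=x) = M[x, y] / M[x]
p_x = {x: M_x[x] / M for x in 'HT'}
p_y_given_x = {(x, y): counts[(x, y)] / M_x[x] for x in 'HT' for y in 'HT'}

# Without the edge: P(Y=y) = M[y] / M, estimated from all 50 samples
p_y = {y: M_y[y] / M for y in 'HT'}

print(p_x['T'])                 # 0.58 -- only 29 samples have X = tails
print(p_y_given_x[('T', 'H')])  # P(Y=H | X=T) = 14/29, from those 29 samples
print(p_y['H'])                 # 0.5, computed from all 50 samples
```

Note how the conditional estimates each use only a fraction of the data, while the marginal estimate for Y uses all of it.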

Hence, it's often better to consider simpler models in the case of density estimation problems. A simpler model might not be able to represent the underlying distribution, $P^*$, very well, but it can generalize much better over the dataset and will give much better results on new data points.

Methods for learning the structure

In general, there are three main ways to learn the structure. We will give a short introduction to each of them in this section; we will go into details in the later sections.

  • Constraint-based structure learning: The constraint-based structure learning method works on the basis of considering a Bayesian network to be a set of dependence conditions between the random variables. So, in this method, we try to find the dependence conditions from the data given to us. Using these conditions, we then try to construct a network. One major drawback of this method is that if we get a wrong result from our dependence tests, our whole learning fails.
  • Score-based structure learning: In this method, we consider the Bayesian network as a statistical model. We then define a hypothesis space of possible structures and a scoring function that tells us how close a structure is to the underlying one. Based on these scores, we then try to select the model that represents the underlying structure most closely. As this learning method considers the whole model at once, it is able to give better results than constraint-based learning. The problem with this approach is that our hypothesis space can be very large, so finding the optimal structure is hard. Hence, we generally resort to heuristic search techniques.
  • Bayesian model averaging: In this method, we try to apply concepts similar to the ones we saw in earlier sections to learn many structures, and then use an ensemble of all these structures. As the number of network structures can be huge, we sometimes have to use some approximate methods to do this.

Constraint-based structure learning

In this method, we try to construct the network structure using the independence conditions obtained from the data. In other words, we try to construct a minimal I-map, given the independence conditions.

Hence, once we have the independence conditions, we can construct a network structure using the algorithm that we discussed earlier, but how do we answer these independence queries from our data?

As we can expect, this question has been extensively studied in statistics and there are numerous methods to answer such queries. We will discuss one of these methods, known as hypothesis testing. We know that if two random variables are independent, they should satisfy the following condition:

$$ P(X, Y) = P(X) \cdot P(Y) $$

So, in our case, we would like to check whether $P(X, Y) = P(X) \cdot P(Y)$. However, in real-life problems, we don't know $P(X)$ and $P(Y)$, and therefore, we use the empirical distribution $\hat{P}$ computed from the data to check our hypothesis:

$$ \hat{P}(X, Y) = \hat{P}(X) \cdot \hat{P}(Y) $$

Now, using the data samples, we can check whether this equation holds for our data or not. To do this, we will need a decision rule that will tell us whether the two variables are independent given the data samples.

A decision rule should basically compare the two distributions and give a verdict on whether the independence holds or not. If we go for a very liberal decision rule, it will declare the two variables independent even when they are not. Similarly, if we choose a very strict decision rule, it will declare the two variables dependent even when they are independent. A standard way to design a decision function is to measure the distribution's deviance from the independence condition.

For two random variables, X and Y, to be independent, we can expect their joint count, $M[x, y]$, in the dataset to be somewhere around $M \cdot \hat{P}(x) \cdot \hat{P}(y)$. Here, M is the total number of data samples that we have. Especially in cases where M is large, this condition should be satisfied. Based on this intuition, we will now derive a deviance measure, commonly known as the $\chi^2$ statistic:

$$ d_{\chi^2}(D) = \sum_{x, y} \frac{\left( M[x, y] - M \cdot \hat{P}(x) \cdot \hat{P}(y) \right)^2}{M \cdot \hat{P}(x) \cdot \hat{P}(y)} $$

We can clearly see here that when our data fits the independence assumption perfectly, $d_{\chi^2}(D) = 0$, and the farther the data is from this assumption, the greater the value it returns.
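This deviance can be computed directly from a table of counts. The following is a small sketch (the function name `chi2_deviance` is ours, not a library API):

```python
def chi2_deviance(counts, M):
    """Chi-squared deviance: sum over (x, y) of
    (M[x,y] - M * Phat(x) * Phat(y))**2 / (M * Phat(x) * Phat(y))."""
    xs = {x for x, _ in counts}
    ys = {y for _, y in counts}
    p_x = {x: sum(counts.get((x, y), 0) for y in ys) / M for x in xs}
    p_y = {y: sum(counts.get((x, y), 0) for x in xs) / M for y in ys}
    d = 0.0
    for x in xs:
        for y in ys:
            expected = M * p_x[x] * p_y[y]
            d += (counts.get((x, y), 0) - expected) ** 2 / expected
    return d

# Perfectly independent counts give a deviance of exactly 0
print(chi2_deviance({('H', 'H'): 10, ('H', 'T'): 10,
                     ('T', 'H'): 10, ('T', 'T'): 10}, 40))  # 0.0

# The skewed counts from the coin example give a small positive deviance
print(chi2_deviance({('H', 'H'): 11, ('H', 'T'): 10,
                     ('T', 'H'): 14, ('T', 'T'): 15}, 50))
```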

Another deviance measure based on counts is the mutual information, $d_I(D)$, defined as follows:

$$ d_I(D) = I_{\hat{P}}(X; Y) = \sum_{x, y} \hat{P}(x, y) \log \frac{\hat{P}(x, y)}{\hat{P}(x) \hat{P}(y)} $$

Now, using any such deviance measure $d$, we can define a threshold $t$, based on which our decision function will decide whether two random variables are independent or not:

$$ R_{d,t}(D) = \begin{cases} \text{independent} & \text{if } d(D) \le t \\ \text{dependent} & \text{if } d(D) > t \end{cases} $$

So, simply put, if the deviance measure is more than the threshold that we have given, the decision function will return the variables as dependent, and if not, they will be independent.
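Putting a deviance measure and a threshold together, the decision rule can be sketched as follows, here using the mutual information deviance (the function names are ours, chosen for illustration):

```python
import math

def mutual_info_deviance(counts, M):
    # Empirical mutual information I_Phat(X; Y) used as a deviance measure:
    # sum over (x, y) of Phat(x, y) * log(Phat(x, y) / (Phat(x) * Phat(y))),
    # written here in terms of the raw counts.
    M_x, M_y = {}, {}
    for (x, y), c in counts.items():
        M_x[x] = M_x.get(x, 0) + c
        M_y[y] = M_y.get(y, 0) + c
    return sum((c / M) * math.log((c * M) / (M_x[x] * M_y[y]))
               for (x, y), c in counts.items() if c > 0)

def decide_independent(counts, M, t):
    # Decision rule: declare X and Y independent iff the deviance d(D) <= t
    return mutual_info_deviance(counts, M) <= t

# The 50-sample coin dataset from earlier in the chapter
counts = {('H', 'H'): 11, ('H', 'T'): 10, ('T', 'H'): 14, ('T', 'T'): 15}
print(decide_independent(counts, 50, t=0.01))  # True: the deviance is tiny
```

For these counts the deviance comes out well below 0.01, so the rule declares the coins independent despite the slightly uneven frequencies.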

Structure score learning

As we discussed earlier, the score-based method uses a score function to score all the structures in our hypothesis space. Then, using the scores of all the models, we try to select the most optimal structure. So, in this learning method, the most important decision that we have to make is which scoring function to choose. Let's discuss two of the most commonly-used scoring functions.

The likelihood score

As we discussed earlier, the likelihood function gives us the probability of our data given the parameters of the model. So, we would like to select a model that has the maximum likelihood. In the case of structure learning, we want to learn both the structure and the parameters of the network. Therefore, our hypothesis space is much larger than what we saw in the case of parameter learning.

Let's denote our graph and its parameters as $\langle \mathcal{G}, \theta_{\mathcal{G}} \rangle$. Now, we want to find the pair that maximizes the likelihood:

$$ \max_{\mathcal{G}, \theta_{\mathcal{G}}} L(\langle \mathcal{G}, \theta_{\mathcal{G}} \rangle : D) = \max_{\mathcal{G}} \left[ \max_{\theta_{\mathcal{G}}} L(\langle \mathcal{G}, \theta_{\mathcal{G}} \rangle : D) \right] $$
$$ = \max_{\mathcal{G}} L(\langle \mathcal{G}, \hat{\theta}_{\mathcal{G}} \rangle : D) $$

Here, $\hat{\theta}_{\mathcal{G}}$ represents the maximum likelihood parameters for the graph $\mathcal{G}$. Therefore, in simple words, we want to find a graph that has the maximum likelihood when we use the maximum likelihood parameters for it.

To get more insight into this method, let's take the previous example of tossing two coins. So, there are two possibilities for the network structure. One where both the random variables, X and Y, are independent, and thus have no edge between them. The other possibility is a network structure where X is the parent of Y, that is, $X \rightarrow Y$. Considering the network of the independent case to be $\mathcal{G}_0$, we can get its likelihood score as follows:

$$ \ell(\hat{\theta}_{\mathcal{G}_0} : D) = \sum_{m} \left[ \log \hat{\theta}_{x[m]} + \log \hat{\theta}_{y[m]} \right] $$

Considering the other model, $X \rightarrow Y$, as $\mathcal{G}_1$, we can write its likelihood score as follows:

$$ \ell(\hat{\theta}_{\mathcal{G}_1} : D) = \sum_{m} \left[ \log \hat{\theta}_{x[m]} + \log \hat{\theta}_{y[m] \mid x[m]} \right] $$

Here, the $\hat{\theta}$ values are the maximum likelihood estimates. Let's consider the difference of these likelihood scores:

$$ \ell(\hat{\theta}_{\mathcal{G}_1} : D) - \ell(\hat{\theta}_{\mathcal{G}_0} : D) = \sum_{m} \left[ \log \hat{\theta}_{y[m] \mid x[m]} - \log \hat{\theta}_{y[m]} \right] $$
$$ = \sum_{x, y} M[x, y] \log \hat{\theta}_{y \mid x} - \sum_{y} M[y] \log \hat{\theta}_{y} $$

Now, let $\hat{P}$ be the empirical distribution over the data. Therefore, we can say that $\hat{\theta}_{y \mid x} = \hat{P}(y \mid x)$ and also $\hat{\theta}_{y} = \hat{P}(y)$. Also, we have $M[x, y] = M \cdot \hat{P}(x, y)$ and $M[y] = M \cdot \hat{P}(y)$. Replacing these values in the preceding equation, we get the following equation:

$$ \ell(\hat{\theta}_{\mathcal{G}_1} : D) - \ell(\hat{\theta}_{\mathcal{G}_0} : D) = M \cdot I_{\hat{P}}(X; Y) $$

Here, $I_{\hat{P}}(X; Y)$ is the mutual information between X and Y in the distribution $\hat{P}$. Hence, we see here that higher mutual information means a stronger connection between the variables X and Y, and therefore, the model $\mathcal{G}_1$ is the more suitable choice.
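We can verify this identity numerically on the 50-sample coin dataset from earlier. The following sketch (variable names are ours) computes the score difference sample-wise and compares it with M times the empirical mutual information:

```python
import math

counts = {('H', 'H'): 11, ('H', 'T'): 10, ('T', 'H'): 14, ('T', 'T'): 15}
M = sum(counts.values())
M_x = {x: sum(c for (xi, _), c in counts.items() if xi == x) for x in 'HT'}
M_y = {y: sum(c for (_, yi), c in counts.items() if yi == y) for y in 'HT'}

# Score difference: identical samples contribute identical terms, so we sum
# log Phat(y|x) - log Phat(y), weighted by the count of each (x, y) pair
score_diff = sum(c * (math.log(c / M_x[x]) - math.log(M_y[y] / M))
                 for (x, y), c in counts.items())

# M times the empirical mutual information I_Phat(X; Y)
mi = sum((c / M) * math.log((c / M) / ((M_x[x] / M) * (M_y[y] / M)))
         for (x, y), c in counts.items())

print(math.isclose(score_diff, M * mi))  # True: the two quantities agree
```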

We can actually generalize this for arbitrary networks. We already know that we can write the log-likelihood function as follows:

$$ \ell(\hat{\theta}_{\mathcal{G}} : D) = \sum_{i} \sum_{m} \log \hat{\theta}_{x_i[m] \mid pa_i[m]} $$

Here, $pa_i[m]$ is the assignment to the parents $Pa_{X_i}$ of $X_i$ in the m-th sample. Let's consider a single term from the earlier equation and decompose it in terms of the empirical distribution $\hat{P}$:

$$ \sum_{m} \log \hat{\theta}_{x_i[m] \mid pa_i[m]} = M \left[ I_{\hat{P}}(X_i; Pa_{X_i}) - H_{\hat{P}}(X_i) \right] $$

Here, the mutual information term, $I_{\hat{P}}(X_i; Pa_{X_i})$, is 0 when $Pa_{X_i} = \emptyset$. Also, the second term in the equation, the entropy $H_{\hat{P}}(X_i)$, doesn't depend on the network structure, and therefore, we can ignore it when comparing the likelihoods of models.

This result tells us that the likelihood score of a structure measures the strength of the dependencies between the variables and their parents.

So far in this section, we have seen how the likelihood score works; however, when it comes to generalizing to newer data points, the likelihood score gives poor results. We can take the example of tossing two coins. As we saw earlier, the difference of their likelihood scores is as follows:

$$ \ell(\hat{\theta}_{\mathcal{G}_1} : D) - \ell(\hat{\theta}_{\mathcal{G}_0} : D) = M \cdot I_{\hat{P}}(X; Y) $$

As we know, the mutual information between two variables is always non-negative. Hence, the likelihood score of the network $\mathcal{G}_1$ will always be at least as high as that of the independent case. Hence, we see that the likelihood score always gives preference to more complex models over simpler ones.

Also, as the variables in our data samples are never exactly independent because of noise, the likelihood score will always select the fully-connected graph over all the variables, as it is the most complex structure. Hence, it will overfit the training data and not give good prediction results on new queries.

Let's see an example of tossing two coins using pgmpy:

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from pgmpy.models import BayesianModel
In [4]: from pgmpy.estimators import MaximumLikelihoodEstimator

# Generating random data
In [5]: raw_data = np.random.randint(low=0, high=2,
                                     size=(1000, 2))
In [6]: data = pd.DataFrame(raw_data, columns=['X', 'Y'])

In [7]: coin_model = BayesianModel([('X', 'Y')])
In [8]: coin_model.fit(data, estimator=MaximumLikelihoodEstimator)

In [9]: coin_model.get_cpds()
Out[9]:
[<TabularCPD representing P(X: 2) at 0x7f57bd99a588>,
 <TabularCPD representing P(Y: 2 | X: 2) at 0x7f57bd99a198>]

In [10]: coin_model.get_nodes()
Out[10]: ['X', 'Y']

In [11]: coin_model.get_edges()
Out[11]: [('X', 'Y')]

The Bayesian score

In the preceding section, we saw scoring based on likelihood and also saw how it is prone to overfitting. Now, in this section, we will discuss another method of scoring, from a Bayesian perspective. As we saw in the case of parameter learning, we will have to assign prior probabilities in this case as well. So, we will assign a prior probability, $P(\mathcal{G})$, to the structure of the network, and a prior probability, $P(\theta_{\mathcal{G}} \mid \mathcal{G})$, to the parameters of this network structure.

From the Bayes' rule, we know the following equation:

$$ P(\mathcal{G} \mid D) = \frac{P(D \mid \mathcal{G}) \, P(\mathcal{G})}{P(D)} $$

Here again, the denominator is just a normalizing factor, and therefore, we will ignore it and define the Bayesian score as follows:

$$ \text{score}_B(\mathcal{G} : D) = \log P(D \mid \mathcal{G}) + \log P(\mathcal{G}) $$

The addition of a prior distribution term in the scoring function allows us to have control over the complexity of the model. We assign smaller prior values to more complex models, and thus, we are able to penalize the complex models.

The other term in our scoring function, the marginal likelihood $P(D \mid \mathcal{G})$, takes care of the uncertainty in the parameters:

$$ P(D \mid \mathcal{G}) = \int P(D \mid \theta_{\mathcal{G}}, \mathcal{G}) \, P(\theta_{\mathcal{G}} \mid \mathcal{G}) \, d\theta_{\mathcal{G}} $$

Here, $P(D \mid \theta_{\mathcal{G}}, \mathcal{G})$ is the likelihood of the data given the network and its parameters, and $P(\theta_{\mathcal{G}} \mid \mathcal{G})$ is our prior distribution over the different values of $\theta_{\mathcal{G}}$ for the given network $\mathcal{G}$.

The Bayesian approach does tell us which parameter $\theta_{\mathcal{G}}$ is most probable given the dataset D, but the posterior also gives us a range of choices along with how likely each of them is. By integrating $P(D \mid \theta_{\mathcal{G}}, \mathcal{G})$ over $\theta_{\mathcal{G}}$, we are thus measuring the expected likelihood, averaged over our uncertainty in the parameters $\theta_{\mathcal{G}}$.

Now, let's see how to compute the marginal likelihood in simpler cases. Let's consider a single binary random variable, X, with a $Beta(\alpha_1, \alpha_0)$ prior over its parameter $\theta$. Also, consider that our dataset contains M[1] heads and M[0] tails. The maximum likelihood value is as follows:

$$ P(D \mid \hat{\theta}) = \left( \frac{M[1]}{M} \right)^{M[1]} \left( \frac{M[0]}{M} \right)^{M[0]} $$

Now, let's consider the marginal likelihood. We need to compute the probability of our data, $P(x[1], \ldots, x[M])$, given the prior. Using the chain rule, we have the following equation:

$$ P(x[1], \ldots, x[M]) = P(x[1]) \cdot P(x[2] \mid x[1]) \cdots P(x[M] \mid x[1], \ldots, x[M-1]) $$

Using the Beta prior, we have the following equation:

$$ P(x[m+1] = H \mid x[1], \ldots, x[m]) = \frac{M_m[1] + \alpha_1}{m + \alpha} $$

Here, $M_m[1]$ is the number of heads in the first m samples of the dataset, and $\alpha = \alpha_1 + \alpha_0$. We can take an example of the dataset $D = \langle H, T, T, H, H \rangle$:

$$ P(x[1], \ldots, x[5]) = \frac{\alpha_1}{\alpha} \cdot \frac{\alpha_0}{\alpha + 1} \cdot \frac{\alpha_0 + 1}{\alpha + 2} \cdot \frac{\alpha_1 + 1}{\alpha + 3} \cdot \frac{\alpha_1 + 2}{\alpha + 4} $$

Using the values $\alpha_1 = 1$ and $\alpha_0 = 1$, we have the following equation:

$$ P(x[1], \ldots, x[5]) = \frac{1}{2} \cdot \frac{1}{3} \cdot \frac{2}{4} \cdot \frac{2}{5} \cdot \frac{3}{6} = \frac{1}{60} \approx 0.017 $$
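The chain-rule product is easy to compute exactly with rational arithmetic. Here is a small sketch (the function name `marginal_likelihood` is ours) that evaluates it for the dataset ⟨H, T, T, H, H⟩ under a uniform Beta(1, 1) prior:

```python
from fractions import Fraction

def marginal_likelihood(samples, a1, a0):
    """Marginal likelihood of a coin-toss sequence under a Beta(a1, a0)
    prior, computed term by term via the chain rule: the next factor is
    (heads so far + a1) / (samples so far + a1 + a0) for a head,
    and analogously with tails for a tail."""
    p = Fraction(1)
    heads = tails = 0
    for s in samples:
        total = heads + tails + a1 + a0
        if s == 'H':
            p *= Fraction(heads + a1, total)
            heads += 1
        else:
            p *= Fraction(tails + a0, total)
            tails += 1
    return p

print(marginal_likelihood('HTTHH', 1, 1))  # 1/60
```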

This value is significantly lower than the maximum likelihood value, $(3/5)^3 (2/5)^2 \approx 0.035$.

In general, for a binomial distribution with a Beta prior, we have the following equation:

$$ P(D) = \frac{\left[ \alpha_1 (\alpha_1 + 1) \cdots (\alpha_1 + M[1] - 1) \right] \left[ \alpha_0 (\alpha_0 + 1) \cdots (\alpha_0 + M[0] - 1) \right]}{\alpha (\alpha + 1) \cdots (\alpha + M - 1)} $$

Note here that each of the terms inside the square brackets is a product of a sequence of numbers of the form $x(x+1) \cdots (x+n-1)$. If $x$ is an integer, we can write this term as $\frac{(x+n-1)!}{(x-1)!}$, but in this case, we don't know whether $x$ is an integer. So, we use the generalized gamma function, which satisfies $\Gamma(x+1) = x \, \Gamma(x)$, to represent such terms:

$$ x(x+1) \cdots (x+n-1) = \frac{\Gamma(x+n)}{\Gamma(x)} $$

Using this result in our earlier equation, we get the following equation:

$$ P(D) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + M)} \cdot \frac{\Gamma(\alpha_1 + M[1])}{\Gamma(\alpha_1)} \cdot \frac{\Gamma(\alpha_0 + M[0])}{\Gamma(\alpha_0)} $$
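Since the gamma function grows very quickly, implementations usually evaluate this closed form in log space. Here is a sketch using Python's `math.lgamma` (the function name `log_marginal_likelihood` is ours), checked against the ⟨H, T, T, H, H⟩ example with a uniform Beta(1, 1) prior:

```python
import math

def log_marginal_likelihood(M1, M0, a1, a0):
    """log P(D) for a binomial with a Beta(a1, a0) prior:
    Gamma(a)/Gamma(a+M) * Gamma(a1+M1)/Gamma(a1) * Gamma(a0+M0)/Gamma(a0),
    computed with lgamma for numerical stability."""
    a, M = a1 + a0, M1 + M0
    return (math.lgamma(a) - math.lgamma(a + M)
            + math.lgamma(a1 + M1) - math.lgamma(a1)
            + math.lgamma(a0 + M0) - math.lgamma(a0))

# Dataset <H, T, T, H, H>: 3 heads, 2 tails, uniform Beta(1, 1) prior
print(math.exp(log_marginal_likelihood(3, 2, 1, 1)))  # ~0.016667, i.e. 1/60
```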

We can have a generalized formula for multinomial distributions as well:

$$ P(D) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + M)} \cdot \prod_{k} \frac{\Gamma(\alpha_k + M[k])}{\Gamma(\alpha_k)}, \quad \text{where } \alpha = \sum_{k} \alpha_k $$