Structure learning in Bayesian networks

In the previous sections, we assumed that we already knew the network structure and tried to estimate the parameters of the network from the data. However, it is quite possible that we neither know the network structure nor have the domain knowledge to construct it. Hence, in this section, we will discuss how to construct the model structure from the given data.

Constructing the model from data is a difficult problem. Let's take the example of tossing two coins and representing the outcome of the first with the variable X and the second with the variable Y. We know that if the coins are fair, these two random variables should be independent of each other. However, to recover this independence condition purely from data, we would need every outcome to occur an equal number of times in the dataset, which we will rarely see in real life.

So, in general, we need to make some assumptions about the dependencies. The assumptions that we make largely depend on our learning task, as we discussed in the previous sections. In the case of knowledge discovery, we would like to know the dependencies between the variables; therefore, we would like our network structure to be as accurate as possible. We know that a distribution can have many P-maps; therefore, the best we can do is obtain a structure that is I-equivalent to the original network, $\mathcal{G}^*$. As we mentioned earlier, it is really hard to get the exact network structure, so we will often face situations where we have to decide whether or not to include a less-probable edge in our model. However, if we include too many or too few edges, we will end up not learning a good model. Therefore, this decision usually depends on our application.

Other than knowledge discovery, we very often use models for density estimation. In this case, we would like to learn a model that captures the underlying probability distribution, $P^*$, and is able to make predictions over new data points. We may think that adding less-probable edges to the model will help us here, as a more complex model should be able to represent $P^*$ more closely. However, it turns out that our intuition is wrong in this case.

Let's get back to our two-coin tossing example and consider that our dataset consists of 50 samples with the following frequencies: 11 heads/heads, 10 heads/tails, 14 tails/heads, and 15 tails/tails. As all the frequencies are not equal, the data suggests that the two variables, X and Y, are not independent. So, let's consider that while learning from this data, we added an edge between X and Y. In this case, using the maximum likelihood estimator, we get the following parameters: $P(X=H) = 21/50$, $P(X=T) = 29/50$, $P(Y=H \mid X=H) = 11/21$, $P(Y=T \mid X=H) = 10/21$, $P(Y=H \mid X=T) = 14/29$, and $P(Y=T \mid X=T) = 15/29$. Whereas, in the case when we do not consider any edge between X and Y, we get the following parameters for Y: $P(Y=H) = 25/50$ and $P(Y=T) = 25/50$. It was definitely possible for the probabilities to be skewed, even without any edge between the two variables. In a more complex model, it is more probable that the parameters will be skewed. This happens because in more complex models, we have less data to compute each parameter because of the conditioning. So, for example, while computing $P(Y \mid X=T)$, we will only consider data samples for which $X = T$. Hence, we are left with only 29 samples, whereas in the case of the model with no edges, we considered all 50 samples for computing the probability values.
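As a concrete check of these numbers, here is a minimal plain-Python sketch (the variable names are ours, chosen for illustration) that recomputes the maximum likelihood estimates from the stated counts:

```python
# MLE parameters for the two-coin example, computed from the counts in the
# text: 11 heads/heads, 10 heads/tails, 14 tails/heads, 15 tails/tails.
counts = {('H', 'H'): 11, ('H', 'T'): 10, ('T', 'H'): 14, ('T', 'T'): 15}
M = sum(counts.values())  # 50 samples in total

# Marginal counts for X (first coin) and Y (second coin)
M_x = {x: sum(c for (xi, _), c in counts.items() if xi == x) for x in 'HT'}
M_y = {y: sum(c for (_, yi), c in counts.items() if yi == y) for y in 'HT'}

# With an edge X -> Y: P(X=x) = M[x] / M and P(Y=y | X=x) = M[x, y] / M[x]
p_x = {x: M_x[x] / M for x in 'HT'}
p_y_given_x = {(x, y): counts[(x, y)] / M_x[x] for x in 'HT' for y in 'HT'}

# Without the edge: P(Y=y) = M[y] / M, estimated from all 50 samples
p_y = {y: M_y[y] / M for y in 'HT'}

print(p_x['T'])                 # 0.58 -- only 29 samples have X = tails
print(p_y_given_x[('T', 'H')])  # P(Y=H | X=T) = 14/29, from those 29 samples
print(p_y['H'])                 # 0.5, computed from all 50 samples
```

Note how the conditional estimates each use only a fraction of the data, while the marginal estimate for Y uses all of it.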

Hence, it's often better to consider simpler models in the case of density estimation problems. A simpler model might not be able to represent the underlying distribution, $P^*$, very well, but it can generalize much better over the dataset and will give much better results on new data points.

Methods for learning the structure

In general, there are three main ways to learn the structure. We will give a short introduction to each of them in this section; we will go into details in the later sections.

  • Constraint-based structure learning: The constraint-based structure learning method works on the basis of considering a Bayesian network to be a set of dependence conditions between the random variables. So, in this method, we try to find the dependence conditions from the data given to us. Using these conditions, we then try to construct a network. One major drawback of this method is that if we get a wrong result from our dependence tests, our whole learning fails.
  • Score-based structure learning: In this method, we consider the Bayesian network as a statistical model. We then define a hypothesis space of possible structures and a scoring function that tells us how close a structure is to the underlying one. Based on these scores, we then try to select the model that represents the underlying structure most closely. As this learning method considers the whole model at once, it is able to give better results than constraint-based learning. The problem with this approach is that our hypothesis space can be very large, so finding the optimal structure is hard. Hence, we generally resort to heuristic search techniques.
  • Bayesian model averaging: In this method, we try to apply concepts similar to the ones we saw in earlier sections to learn many structures, and then use an ensemble of all these structures. As the number of network structures can be huge, we sometimes have to use some approximate methods to do this.

Constraint-based structure learning

In this method, we try to construct the network structure using the independence conditions obtained from the data. In other words, we try to construct a minimal I-map, given the independence conditions.

Hence, once we have the independence conditions, we can construct a network structure using the algorithm that we discussed earlier, but how do we answer these independence queries from our data?

As we can expect, this question has been extensively studied in statistics and there are numerous methods to answer such queries. We will discuss one of these methods, known as hypothesis testing. We know that if two random variables are independent, they should satisfy the following condition:

$$ P(X, Y) = P(X) \cdot P(Y) $$

So, in our case, we would like to check whether $P(X, Y) = P(X) \cdot P(Y)$. However, in real-life problems, we don't know $P(X)$ and $P(Y)$, and therefore, we use the empirical distribution $\hat{P}$ computed from the data to check our hypothesis:

$$ \hat{P}(X, Y) = \hat{P}(X) \cdot \hat{P}(Y) $$

Now, using the data samples, we can check whether this equation holds for our data or not. To do this, we will need a decision rule that will tell us whether the two variables are independent given the data samples.

A decision rule should basically compare the two distributions and give a verdict on whether the independence holds or not. If we go for a very liberal decision rule, it will declare the two variables independent even when they are not. Similarly, if we choose a very strict decision rule, it will declare the two variables dependent even when they are independent. A standard way to design a decision function is to measure the distribution's deviance from the independence condition.

For two random variables, X and Y, to be independent, we can expect their joint count, $M[x, y]$, in the dataset to be somewhere around $M \cdot \hat{P}(x) \cdot \hat{P}(y)$. Here, M is the total number of data samples that we have. Especially in cases where M is large, this condition should be satisfied. Based on this intuition, we will now derive a deviance measure, commonly known as the $\chi^2$ statistic:

$$ d_{\chi^2}(D) = \sum_{x, y} \frac{\left( M[x, y] - M \cdot \hat{P}(x) \cdot \hat{P}(y) \right)^2}{M \cdot \hat{P}(x) \cdot \hat{P}(y)} $$

We can clearly see here that when our data fits the independence assumption perfectly, $d_{\chi^2}(D) = 0$, and the farther the data is from this assumption, the greater the value it returns.
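This deviance can be computed directly from a table of counts. The following is a small sketch (the function name `chi2_deviance` is ours, not a library API):

```python
def chi2_deviance(counts, M):
    """Chi-squared deviance: sum over (x, y) of
    (M[x,y] - M * Phat(x) * Phat(y))**2 / (M * Phat(x) * Phat(y))."""
    xs = {x for x, _ in counts}
    ys = {y for _, y in counts}
    p_x = {x: sum(counts.get((x, y), 0) for y in ys) / M for x in xs}
    p_y = {y: sum(counts.get((x, y), 0) for x in xs) / M for y in ys}
    d = 0.0
    for x in xs:
        for y in ys:
            expected = M * p_x[x] * p_y[y]
            d += (counts.get((x, y), 0) - expected) ** 2 / expected
    return d

# Perfectly independent counts give a deviance of exactly 0
print(chi2_deviance({('H', 'H'): 10, ('H', 'T'): 10,
                     ('T', 'H'): 10, ('T', 'T'): 10}, 40))  # 0.0

# The skewed counts from the coin example give a small positive deviance
print(chi2_deviance({('H', 'H'): 11, ('H', 'T'): 10,
                     ('T', 'H'): 14, ('T', 'T'): 15}, 50))
```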

Another deviance measure based on counts is the mutual information, $d_I(D)$, defined as follows:

$$ d_I(D) = I_{\hat{P}}(X; Y) = \sum_{x, y} \hat{P}(x, y) \log \frac{\hat{P}(x, y)}{\hat{P}(x) \hat{P}(y)} $$

Now, using any such deviance measure $d$, we can define a threshold $t$, based on which our decision function will decide whether two random variables are independent or not:

$$ R_{d,t}(D) = \begin{cases} \text{independent} & \text{if } d(D) \le t \\ \text{dependent} & \text{if } d(D) > t \end{cases} $$

So, simply put, if the deviance measure is more than the threshold that we have given, the decision function will return the variables as dependent, and if not, they will be independent.
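Putting a deviance measure and a threshold together, the decision rule can be sketched as follows, here using the mutual information deviance (the function names are ours, chosen for illustration):

```python
import math

def mutual_info_deviance(counts, M):
    # Empirical mutual information I_Phat(X; Y) used as a deviance measure:
    # sum over (x, y) of Phat(x, y) * log(Phat(x, y) / (Phat(x) * Phat(y))),
    # written here in terms of the raw counts.
    M_x, M_y = {}, {}
    for (x, y), c in counts.items():
        M_x[x] = M_x.get(x, 0) + c
        M_y[y] = M_y.get(y, 0) + c
    return sum((c / M) * math.log((c * M) / (M_x[x] * M_y[y]))
               for (x, y), c in counts.items() if c > 0)

def decide_independent(counts, M, t):
    # Decision rule: declare X and Y independent iff the deviance d(D) <= t
    return mutual_info_deviance(counts, M) <= t

# The 50-sample coin dataset from earlier in the chapter
counts = {('H', 'H'): 11, ('H', 'T'): 10, ('T', 'H'): 14, ('T', 'T'): 15}
print(decide_independent(counts, 50, t=0.01))  # True: the deviance is tiny
```

For these counts the deviance comes out well below 0.01, so the rule declares the coins independent despite the slightly uneven frequencies.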

Structure score learning

As we discussed earlier, the score-based method uses a score function to score all the structures in our hypothesis space. Then, using the scores of all the models, we try to select the most optimal structure. So, in this learning method, the most important decision that we have to make is which scoring function to choose. Let's discuss two of the most commonly-used scoring functions.

The likelihood score

As we discussed earlier, the likelihood function gives us the probability of our data given the parameters of the model. So, we would like to select a model that has the maximum likelihood. In the case of structure learning, we want to learn both the structure and the parameters of the network. Therefore, our hypothesis space is much larger than what we saw in the case of parameter learning.

Let's denote our graph and its parameters as $\langle \mathcal{G}, \theta_{\mathcal{G}} \rangle$. Now, we want to find the pair that maximizes the likelihood:

$$ \max_{\mathcal{G}, \theta_{\mathcal{G}}} L(\langle \mathcal{G}, \theta_{\mathcal{G}} \rangle : D) = \max_{\mathcal{G}} \left[ \max_{\theta_{\mathcal{G}}} L(\langle \mathcal{G}, \theta_{\mathcal{G}} \rangle : D) \right] $$
$$ = \max_{\mathcal{G}} L(\langle \mathcal{G}, \hat{\theta}_{\mathcal{G}} \rangle : D) $$

Here, $\hat{\theta}_{\mathcal{G}}$ represents the maximum likelihood parameters for the graph $\mathcal{G}$. Therefore, in simple words, we want to find a graph that has the maximum likelihood when we use the maximum likelihood parameters for it.

To get more insight into this method, let's take the previous example of tossing two coins. So, there are two possibilities for the network structure. One where both the random variables, X and Y, are independent, and thus have no edge between them. The other possibility is a network structure where X is the parent of Y, that is, $X \rightarrow Y$. Considering the network of the independent case to be $\mathcal{G}_0$, we can get its likelihood score as follows:

$$ \ell(\hat{\theta}_{\mathcal{G}_0} : D) = \sum_{m} \left[ \log \hat{\theta}_{x[m]} + \log \hat{\theta}_{y[m]} \right] $$

Considering the other model, $X \rightarrow Y$, as $\mathcal{G}_1$, we can write its likelihood score as follows:

$$ \ell(\hat{\theta}_{\mathcal{G}_1} : D) = \sum_{m} \left[ \log \hat{\theta}_{x[m]} + \log \hat{\theta}_{y[m] \mid x[m]} \right] $$

Here, the $\hat{\theta}$ values are the maximum likelihood estimates. Let's consider the difference of these likelihood scores:

$$ \ell(\hat{\theta}_{\mathcal{G}_1} : D) - \ell(\hat{\theta}_{\mathcal{G}_0} : D) = \sum_{m} \left[ \log \hat{\theta}_{y[m] \mid x[m]} - \log \hat{\theta}_{y[m]} \right] $$
$$ = \sum_{x, y} M[x, y] \log \hat{\theta}_{y \mid x} - \sum_{y} M[y] \log \hat{\theta}_{y} $$

Now, let $\hat{P}$ be the empirical distribution over the data. Therefore, we can say that $\hat{\theta}_{y \mid x} = \hat{P}(y \mid x)$ and also $\hat{\theta}_{y} = \hat{P}(y)$. Also, we have $M[x, y] = M \cdot \hat{P}(x, y)$ and $M[y] = M \cdot \hat{P}(y)$. Replacing these values in the preceding equation, we get the following equation:

$$ \ell(\hat{\theta}_{\mathcal{G}_1} : D) - \ell(\hat{\theta}_{\mathcal{G}_0} : D) = M \cdot I_{\hat{P}}(X; Y) $$

Here, $I_{\hat{P}}(X; Y)$ is the mutual information between X and Y in the distribution $\hat{P}$. Hence, we see here that higher mutual information means a stronger connection between the variables X and Y, and therefore, the model $\mathcal{G}_1$ is the more suitable choice.
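We can verify this identity numerically on the 50-sample coin dataset from earlier. The following sketch (variable names are ours) computes the score difference sample-wise and compares it with M times the empirical mutual information:

```python
import math

counts = {('H', 'H'): 11, ('H', 'T'): 10, ('T', 'H'): 14, ('T', 'T'): 15}
M = sum(counts.values())
M_x = {x: sum(c for (xi, _), c in counts.items() if xi == x) for x in 'HT'}
M_y = {y: sum(c for (_, yi), c in counts.items() if yi == y) for y in 'HT'}

# Score difference: identical samples contribute identical terms, so we sum
# log Phat(y|x) - log Phat(y), weighted by the count of each (x, y) pair
score_diff = sum(c * (math.log(c / M_x[x]) - math.log(M_y[y] / M))
                 for (x, y), c in counts.items())

# M times the empirical mutual information I_Phat(X; Y)
mi = sum((c / M) * math.log((c / M) / ((M_x[x] / M) * (M_y[y] / M)))
         for (x, y), c in counts.items())

print(math.isclose(score_diff, M * mi))  # True: the two quantities agree
```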

We can actually generalize this for arbitrary networks. We already know that we can write the log-likelihood function as follows:

$$ \ell(\hat{\theta}_{\mathcal{G}} : D) = \sum_{i} \sum_{m} \log \hat{\theta}_{x_i[m] \mid pa_i[m]} $$

Here, $pa_i[m]$ is the assignment to the parents $Pa_{X_i}$ of $X_i$ in the m-th sample. Let's consider a single term from the earlier equation and decompose it in terms of the empirical distribution $\hat{P}$:

$$ \sum_{m} \log \hat{\theta}_{x_i[m] \mid pa_i[m]} = M \left[ I_{\hat{P}}(X_i; Pa_{X_i}) - H_{\hat{P}}(X_i) \right] $$

Here, the mutual information term, $I_{\hat{P}}(X_i; Pa_{X_i})$, is 0 when $Pa_{X_i} = \emptyset$. Also, the second term in the equation, the entropy $H_{\hat{P}}(X_i)$, doesn't depend on the network structure, and therefore, we can ignore it when comparing the likelihoods of models.

This result tells us that the likelihood score of a structure measures the strength of the dependencies between the variables and their parents.

So far in this section, we have seen how the likelihood score works; however, when it comes to generalizing to newer data points, the likelihood score gives poor results. We can take the example of tossing two coins. As we saw earlier, the difference of their likelihood scores is as follows:

$$ \ell(\hat{\theta}_{\mathcal{G}_1} : D) - \ell(\hat{\theta}_{\mathcal{G}_0} : D) = M \cdot I_{\hat{P}}(X; Y) $$

As we know, the mutual information between two variables is always non-negative. Hence, the likelihood score of the network $\mathcal{G}_1$ will always be at least as high as that of the independent case. Hence, we see that the likelihood score always gives preference to more complex models over simpler ones.

Also, as the variables in our data samples are never exactly independent because of noise, the likelihood score will always select the fully-connected graph over all the variables, as it is the most complex structure. Hence, it will overfit the training data and not give good prediction results on new queries.

Let's see an example of tossing two coins using pgmpy:

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from pgmpy.models import BayesianModel
In [4]: from pgmpy.estimators import MaximumLikelihoodEstimator

# Generating random data
In [5]: raw_data = np.random.randint(low=0, high=2,
                                     size=(1000, 2))
In [6]: data = pd.DataFrame(raw_data, columns=['X', 'Y'])

In [7]: coin_model = BayesianModel([('X', 'Y')])
In [8]: coin_model.fit(data, estimator=MaximumLikelihoodEstimator)

In [9]: coin_model.get_cpds()
Out[9]:
[<TabularCPD representing P(X: 2) at 0x7f57bd99a588>,
 <TabularCPD representing P(Y: 2 | X: 2) at 0x7f57bd99a198>]

In [10]: coin_model.get_nodes()
Out[10]: ['X', 'Y']

In [11]: coin_model.get_edges()
Out[11]: [('X', 'Y')]

The Bayesian score

In the preceding section, we saw scoring based on likelihood and also saw how it is prone to overfitting. Now, in this section, we will discuss another method of scoring, from a Bayesian perspective. As we saw in the case of parameter learning, we will have to assign prior probabilities in this case as well. So, we will assign a prior probability, $P(\mathcal{G})$, to the structure of the network, and a prior probability, $P(\theta_{\mathcal{G}} \mid \mathcal{G})$, to the parameters of this network structure.

From the Bayes' rule, we know the following equation:

$$ P(\mathcal{G} \mid D) = \frac{P(D \mid \mathcal{G}) \, P(\mathcal{G})}{P(D)} $$

Here again, the denominator is just a normalizing factor, and therefore, we will ignore it and define the Bayesian score as follows:

$$ \text{score}_B(\mathcal{G} : D) = \log P(D \mid \mathcal{G}) + \log P(\mathcal{G}) $$

The addition of a prior distribution term in the scoring function allows us to have control over the complexity of the model. We assign smaller prior values to more complex models, and thus, we are able to penalize the complex models.

The other term in our scoring function, the marginal likelihood $P(D \mid \mathcal{G})$, takes care of the uncertainty in the parameters:

$$ P(D \mid \mathcal{G}) = \int P(D \mid \theta_{\mathcal{G}}, \mathcal{G}) \, P(\theta_{\mathcal{G}} \mid \mathcal{G}) \, d\theta_{\mathcal{G}} $$

Here, $P(D \mid \theta_{\mathcal{G}}, \mathcal{G})$ is the likelihood of the data given the network and its parameters, and $P(\theta_{\mathcal{G}} \mid \mathcal{G})$ is our prior distribution over the different values of $\theta_{\mathcal{G}}$ for the given network $\mathcal{G}$.

The Bayesian approach does tell us which parameter $\theta_{\mathcal{G}}$ is most probable given the dataset D, but the posterior also gives us a range of choices along with how likely each of them is. By integrating $P(D \mid \theta_{\mathcal{G}}, \mathcal{G})$ over $\theta_{\mathcal{G}}$, we are thus measuring the expected likelihood, averaged over our uncertainty in the parameters $\theta_{\mathcal{G}}$.

Now, let's see how to compute the marginal likelihood in simpler cases. Let's consider a single binary random variable, X, with a $Beta(\alpha_1, \alpha_0)$ prior over its parameter $\theta$. Also, consider that our dataset contains M[1] heads and M[0] tails. The maximum likelihood value is as follows:

$$ P(D \mid \hat{\theta}) = \left( \frac{M[1]}{M} \right)^{M[1]} \left( \frac{M[0]}{M} \right)^{M[0]} $$

Now, let's consider the marginal likelihood. We need to compute the probability of our data, $P(x[1], \ldots, x[M])$, given the prior. Using the chain rule, we have the following equation:

$$ P(x[1], \ldots, x[M]) = P(x[1]) \cdot P(x[2] \mid x[1]) \cdots P(x[M] \mid x[1], \ldots, x[M-1]) $$

Using the Beta prior, we have the following equation:

$$ P(x[m+1] = H \mid x[1], \ldots, x[m]) = \frac{M_m[1] + \alpha_1}{m + \alpha} $$

Here, $M_m[1]$ is the number of heads in the first m samples of the dataset, and $\alpha = \alpha_1 + \alpha_0$. We can take an example of the dataset $D = \langle H, T, T, H, H \rangle$:

$$ P(x[1], \ldots, x[5]) = \frac{\alpha_1}{\alpha} \cdot \frac{\alpha_0}{\alpha + 1} \cdot \frac{\alpha_0 + 1}{\alpha + 2} \cdot \frac{\alpha_1 + 1}{\alpha + 3} \cdot \frac{\alpha_1 + 2}{\alpha + 4} $$

Using the values $\alpha_1 = 1$ and $\alpha_0 = 1$, we have the following equation:

$$ P(x[1], \ldots, x[5]) = \frac{1}{2} \cdot \frac{1}{3} \cdot \frac{2}{4} \cdot \frac{2}{5} \cdot \frac{3}{6} = \frac{1}{60} \approx 0.017 $$
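The chain-rule product is easy to compute exactly with rational arithmetic. Here is a small sketch (the function name `marginal_likelihood` is ours) that evaluates it for the dataset ⟨H, T, T, H, H⟩ under a uniform Beta(1, 1) prior:

```python
from fractions import Fraction

def marginal_likelihood(samples, a1, a0):
    """Marginal likelihood of a coin-toss sequence under a Beta(a1, a0)
    prior, computed term by term via the chain rule: the next factor is
    (heads so far + a1) / (samples so far + a1 + a0) for a head,
    and analogously with tails for a tail."""
    p = Fraction(1)
    heads = tails = 0
    for s in samples:
        total = heads + tails + a1 + a0
        if s == 'H':
            p *= Fraction(heads + a1, total)
            heads += 1
        else:
            p *= Fraction(tails + a0, total)
            tails += 1
    return p

print(marginal_likelihood('HTTHH', 1, 1))  # 1/60
```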

This value is significantly lower than the maximum likelihood value, $(3/5)^3 (2/5)^2 \approx 0.035$.

In general, for a binomial distribution with a Beta prior, we have the following equation:

$$ P(D) = \frac{\left[ \alpha_1 (\alpha_1 + 1) \cdots (\alpha_1 + M[1] - 1) \right] \left[ \alpha_0 (\alpha_0 + 1) \cdots (\alpha_0 + M[0] - 1) \right]}{\alpha (\alpha + 1) \cdots (\alpha + M - 1)} $$

Note here that each of the terms inside the square brackets is a product of a sequence of numbers of the form $x(x+1) \cdots (x+n-1)$. If $x$ is an integer, we can write this term as $\frac{(x+n-1)!}{(x-1)!}$, but in this case, we don't know whether $x$ is an integer. So, we use the generalized gamma function, which satisfies $\Gamma(x+1) = x \, \Gamma(x)$, to represent such terms:

$$ x(x+1) \cdots (x+n-1) = \frac{\Gamma(x+n)}{\Gamma(x)} $$

Using this result in our earlier equation, we get the following equation:

$$ P(D) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + M)} \cdot \frac{\Gamma(\alpha_1 + M[1])}{\Gamma(\alpha_1)} \cdot \frac{\Gamma(\alpha_0 + M[0])}{\Gamma(\alpha_0)} $$
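Since the gamma function grows very quickly, implementations usually evaluate this closed form in log space. Here is a sketch using Python's `math.lgamma` (the function name `log_marginal_likelihood` is ours), checked against the ⟨H, T, T, H, H⟩ example with a uniform Beta(1, 1) prior:

```python
import math

def log_marginal_likelihood(M1, M0, a1, a0):
    """log P(D) for a binomial with a Beta(a1, a0) prior:
    Gamma(a)/Gamma(a+M) * Gamma(a1+M1)/Gamma(a1) * Gamma(a0+M0)/Gamma(a0),
    computed with lgamma for numerical stability."""
    a, M = a1 + a0, M1 + M0
    return (math.lgamma(a) - math.lgamma(a + M)
            + math.lgamma(a1 + M1) - math.lgamma(a1)
            + math.lgamma(a0 + M0) - math.lgamma(a0))

# Dataset <H, T, T, H, H>: 3 heads, 2 tails, uniform Beta(1, 1) prior
print(math.exp(log_marginal_likelihood(3, 2, 1, 1)))  # ~0.016667, i.e. 1/60
```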

We can have a generalized formula for multinomial distributions as well:

$$ P(D) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + M)} \cdot \prod_{k} \frac{\Gamma(\alpha_k + M[k])}{\Gamma(\alpha_k)}, \quad \text{where } \alpha = \sum_{k} \alpha_k $$