Bayesian parameter estimation

In the preceding section, we discussed the method of estimating parameters using maximum likelihood, but as it turns out, the maximum likelihood method has several drawbacks. Let's consider the case of tossing a fair coin 10 times and say that we got heads three times. For this dataset, maximum likelihood gives us the parameter θ = 0.3, but our prior knowledge about the coin being fair says that this should not be true. Also, if we got the same results by tossing a biased coin, we would get exactly the same parameter value. Maximum likelihood fails to account for the situation where, because of our prior knowledge, our estimate of the probability of getting heads with a fair coin should differ from our estimate for a biased coin, even if both produced the same dataset.

Another problem with the maximum likelihood estimate is that it fails to distinguish between the case when we get three heads out of 10 tosses and the case when we get 30,000 heads out of 100,000 tosses. In both of these cases, the parameter θ will be 0.3 according to maximum likelihood, but in reality, we should be much more confident about this estimate in the second case.

So, to account for these shortcomings, we move on to another approach that uses Bayesian statistics to estimate the parameters. In the Bayesian approach, we first create a probability distribution representing our prior knowledge about how likely we are to believe in the different choices of parameters. After this, we combine the prior knowledge with the dataset and create a joint distribution that captures our prior beliefs as well as the information from the data. Coming back to the example of coin flipping, let's say that we have a prior distribution, P(θ). Also, from the data, we define our likelihood as follows:

$$P(x[1], \ldots, x[M] \mid \theta) = \theta^{M[1]} (1 - \theta)^{M[0]}$$

Now, we can use this to define a joint distribution over the data, D, and the parameter, θ:

$$P(x[1], \ldots, x[M], \theta) = P(x[1], \ldots, x[M] \mid \theta)\, P(\theta) = \theta^{M[1]} (1 - \theta)^{M[0]}\, P(\theta)$$

Here, M[1] is the number of heads in the data and M[0] is the number of tails. Using the preceding equation, we can compute the posterior distribution over θ:

$$P(\theta \mid x[1], \ldots, x[M]) = \frac{P(x[1], \ldots, x[M] \mid \theta)\, P(\theta)}{P(x[1], \ldots, x[M])}$$

Here, the first term of the numerator is known as the likelihood, the second is known as the prior, and the denominator is the normalizing factor.

In the case of Bayesian estimation, if we take a uniform prior, the posterior is simply proportional to the likelihood used in the maximum likelihood approach. However, instead of selecting any particular value of θ, we will try to predict the outcome of the next coin toss given all the previous data samples:

$$P(x[M+1] \mid x[1], \ldots, x[M]) = \int P(x[M+1] \mid \theta, x[1], \ldots, x[M])\, P(\theta \mid x[1], \ldots, x[M])\, d\theta$$
$$= \int P(x[M+1] \mid \theta)\, P(\theta \mid x[1], \ldots, x[M])\, d\theta$$

In simple words, here we are integrating over the posterior distribution of θ to find the probability of heads for the next toss.

Now, applying this concept of the Bayesian estimator to our coin tossing example, let's assume that we have a uniform prior over θ, which can take values in the interval [0, 1]. Then, P(θ | x[1], ..., x[M]) will be proportional to the likelihood, θ^{M[1]} (1 - θ)^{M[0]}. Let's put this value in the integral:

$$P(x[M+1] = H \mid x[1], \ldots, x[M]) = \frac{\int_0^1 \theta \cdot \theta^{M[1]} (1 - \theta)^{M[0]}\, d\theta}{\int_0^1 \theta^{M[1]} (1 - \theta)^{M[0]}\, d\theta}$$

Solving this equation, we finally get the following equation:

$$P(x[M+1] = H \mid x[1], \ldots, x[M]) = \frac{M[1] + 1}{M[1] + M[0] + 2} = \frac{M[1] + 1}{M + 2}$$

This prediction is known as the Bayesian estimator. We can clearly see from the preceding equation that as the number of samples increases, the estimate comes closer and closer to the maximum likelihood estimate. The estimator that corresponds to a uniform prior is often referred to as Laplace's correction.
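To see this numerically, here is a short sketch (not part of the original example; it assumes SciPy is installed) that computes the prediction for the 3-heads-out-of-10 dataset both by numerical integration and by Laplace's correction, and compares it with the maximum likelihood estimate:

from scipy import integrate

M1, M0 = 3, 7  # observed heads and tails from the example above

# Unnormalized posterior under a uniform prior: proportional to the likelihood
def posterior(theta):
    return theta ** M1 * (1 - theta) ** M0

# P(next toss = heads | data) = integral of theta * posterior over [0, 1],
# divided by the normalizing integral of the posterior
num, _ = integrate.quad(lambda t: t * posterior(t), 0, 1)
den, _ = integrate.quad(posterior, 0, 1)

print(num / den)                      # ~0.3333, by numerical integration
print((M1 + 1) / float(M1 + M0 + 2))  # 0.3333..., Laplace's correction
print(M1 / float(M1 + M0))            # 0.3, the maximum likelihood estimate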

Priors

In the preceding section, we discussed the case when we have uniform priors. As we saw, in the case of uniform priors, the estimator is not very different from the maximum likelihood estimator. Therefore, in this section, we will move on to discuss the case when we have a non-uniform prior. We will work through our coin tossing example again, this time using a Beta distribution as our prior.

A Beta distribution is defined in the following way:

$$\theta \sim Beta(\alpha_1, \alpha_0) \quad \text{if} \quad P(\theta) = \gamma\, \theta^{\alpha_1 - 1} (1 - \theta)^{\alpha_0 - 1}$$

Here, α_1 and α_0 are the parameters of the distribution, and the constant, γ, is a normalizing constant, which is defined as follows:

$$\gamma = \frac{\Gamma(\alpha_1 + \alpha_0)}{\Gamma(\alpha_1)\, \Gamma(\alpha_0)}$$

Here, the gamma function, Γ(x), is defined as follows:

$$\Gamma(x) = \int_0^{\infty} t^{x - 1} e^{-t}\, dt$$
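As a quick illustration (assuming SciPy is available), we can check that the gamma function generalizes the factorial, which is the property we rely on when the hyperparameters are integers:

from scipy.special import gamma
from math import factorial

# Gamma generalizes the factorial: Gamma(n) = (n - 1)! for positive integers n
for n in range(1, 6):
    print(n, gamma(n), factorial(n - 1))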

For now, before we start making observations, let's consider that the hyperparameters, α_1 and α_0, correspond to the imaginary number of heads and tails that we assume to have already seen. To make our statement more concrete, let's consider the example of a single coin toss and assume that our prior distribution is θ ~ Beta(α_1, α_0). Now, let's try to compute the marginal probability of getting heads:

$$P(x[1] = H) = \int_0^1 P(x[1] = H \mid \theta)\, P(\theta)\, d\theta = \int_0^1 \theta\, P(\theta)\, d\theta = \frac{\alpha_1}{\alpha_1 + \alpha_0}$$

So, this conclusion shows that our statement about the hyperparameters is correct. Now, extending this computation to the case when we have seen M[1] heads and M[0] tails, we get the following equation:

$$P(\theta \mid x[1], \ldots, x[M]) \propto \theta^{M[1]} (1 - \theta)^{M[0]} \cdot \theta^{\alpha_1 - 1} (1 - \theta)^{\alpha_0 - 1} = \theta^{\alpha_1 + M[1] - 1} (1 - \theta)^{\alpha_0 + M[0] - 1}$$

This equation shows that if the prior distribution is a Beta distribution, the posterior distribution also turns out to be a Beta distribution. Now, using these properties, we can easily compute the probability over the next toss:

$$P(x[M+1] = H \mid x[1], \ldots, x[M]) = \frac{\alpha_1 + M[1]}{\alpha_1 + \alpha_0 + M}$$

Here, M = M[1] + M[0], and this posterior represents that we have effectively already seen α_1 + M[1] heads and α_0 + M[0] tails.
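The following sketch (assuming SciPy is available; the pseudo-counts and observed counts are illustrative values, not from the text) verifies this closed form using scipy.stats.beta:

from scipy import stats

alpha_1, alpha_0 = 5, 5   # imaginary (pseudo) counts of heads and tails
M1, M0 = 3, 7             # observed heads and tails
M = M1 + M0

# The posterior is again a Beta distribution: Beta(alpha_1 + M1, alpha_0 + M0)
posterior = stats.beta(alpha_1 + M1, alpha_0 + M0)

# The probability of heads on the next toss is the posterior mean
print(posterior.mean())                               # 0.4
print((alpha_1 + M1) / float(alpha_1 + alpha_0 + M))  # 8 / 20 = 0.4, the closed form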

Bayesian parameter estimation for Bayesian networks

Again, let's take our simple example of the network X → Y and our training data D = {(x[1], y[1]), (x[2], y[2]), ..., (x[M], y[M])}. We also have the unknown parameters θ_X and θ_{Y|X}. We can think of a dependency network over the parameters and the data, as shown in Fig 5.2.

This dependency structure gives us a lot of information about the dataset and our parameters. We can easily see from the network that different data instances are independent of each other given the parameters. So, x[1] and y[1] are d-separated from x[2] and y[2] when θ_X and θ_{Y|X} are given.

Also, when all the x[m] and y[m] values are observed, the parameters, θ_X and θ_{Y|X}, are d-separated. We can very easily prove this statement, as any path between θ_X and θ_{Y|X} is of the following form:

$$\theta_X \rightarrow x[m] \rightarrow y[m] \leftarrow \theta_{Y \mid X}$$

When x[m] and y[m] are observed, influence cannot flow along this path between θ_X and θ_{Y|X}, because the observed x[m] blocks it. So, if these two parameters are independent a priori, they will also be independent a posteriori. This d-separation condition leads us to the following result:

$$P(\theta_X, \theta_{Y \mid X} \mid D) = P(\theta_X \mid D)\, P(\theta_{Y \mid X} \mid D)$$

This condition is similar to what we saw in the case of the maximum likelihood estimation. This will allow us to break up the estimation problem into smaller and simpler problems, as shown in the following figure:


Fig 5.2: Network structure of data samples and parameters of the network

Now, using the preceding results, let's formalize our problem and see how these results help us solve it. We are provided with a network structure, G, whose parameters are θ. We need to assign a prior distribution over the network parameters, P(θ | G). We define the posterior distribution over these parameters as follows:

$$P(\theta \mid D, G) = \frac{P(D \mid \theta, G)\, P(\theta \mid G)}{P(D \mid G)}$$

Here, the term P(θ | G) is our prior distribution, P(D | θ, G) is the likelihood function, P(θ | D, G) is our posterior distribution, and P(D | G) is the normalizing constant.

As we had discussed earlier, we can split our likelihood function as follows:

$$P(D \mid \theta, G) = \prod_i \prod_m P\left(x_i[m] \mid pa_{X_i}[m],\ \theta_{X_i \mid Pa_{X_i}}\right)$$

Also, let's consider that our parameters are independent:

$$P(\theta \mid G) = \prod_i P\left(\theta_{X_i \mid Pa_{X_i}} \mid G\right)$$

Combining these two equations, we get the following equation:

$$P(\theta \mid D, G) = \frac{1}{P(D \mid G)} \prod_i \left[ \prod_m P\left(x_i[m] \mid pa_{X_i}[m],\ \theta_{X_i \mid Pa_{X_i}}\right) \right] P\left(\theta_{X_i \mid Pa_{X_i}} \mid G\right)$$

In the preceding equation, we can see that each of the product terms depends only on a single local set of parameters. With this result, let's now try to find the probability of a new data instance given our previous observations:

$$P(x[M+1] \mid D, G) = \int P(x[M+1] \mid \theta, D, G)\, P(\theta \mid D, G)\, d\theta$$

As we saw earlier, all the data instances are independent of each other given the parameters, so we get the following equation:

$$P(x[M+1] \mid D, G) = \int P(x[M+1] \mid \theta, G)\, P(\theta \mid D, G)\, d\theta$$

Using the network factorization and parameter independence, we can further decompose this probability as follows:

$$P(x[M+1] \mid D, G) = \int \prod_i P\left(x_i[M+1] \mid pa_{X_i}[M+1],\ \theta_{X_i \mid Pa_{X_i}}\right) P(\theta \mid D, G)\, d\theta$$
$$= \prod_i \int P\left(x_i[M+1] \mid pa_{X_i}[M+1],\ \theta_{X_i \mid Pa_{X_i}}\right) P\left(\theta_{X_i \mid Pa_{X_i}} \mid D, G\right)\, d\theta_{X_i \mid Pa_{X_i}}$$

Now, using this equation, we can solve the prediction problem for each of the variables separately.

Now, let's see an example of learning the parameters of a network using this Bayesian approach on the late-for-school model:

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from pgmpy.models import BayesianModel
In [4]: from pgmpy.estimators import BayesianEstimator

# Generating some random data
In [5]: raw_data = np.random.randint(low=0, high=2,
                                     size=(1000, 6))
In [6]: print(raw_data)
Out[6]:
array([[1, 0, 1, 1, 1, 0],
       [1, 0, 1, 1, 1, 1],
       [0, 1, 0, 0, 1, 1],
       ..., 
       [1, 1, 1, 0, 1, 0],
       [0, 0, 1, 1, 0, 1],
       [1, 1, 0, 0, 1, 1]])
In [7]: data = pd.DataFrame(raw_data, columns=['A', 'R', 'J',
                                               'G', 'L', 'Q'])

# Creating the network structures
In [8]: student_model = BayesianModel([('A', 'J'), ('R', 'J'),
                                       ('J', 'Q'), ('J', 'L'),
                                       ('G', 'L')])
In [9]: student_model.fit(data, estimator=BayesianEstimator)
In [10]: student_model.get_cpds()
Out[10]:
[<TabularCPD representing P(A: 2) at 0x7f92892304fa>,
 <TabularCPD representing P(R: 2) at 0x7f9286c9323b>,
 <TabularCPD representing P(G: 2) at 0x7f9436c9833b>,
 <TabularCPD representing P(J: 2 | A: 2, R: 2) at 0x7f9286523a34>,
 <TabularCPD representing P(L: 2 | J: 2, G: 2) at 0x7f9286a932b30>,
 <TabularCPD representing P(Q: 2 | J: 2) at 0x7f9286d12904>]

In [11]: print(student_model.get_cpds('A'))
Out[11]:
╒═════╤═════╕
│ A_0 │ 0.44│
├─────┼─────┤
│ A_1 │ 0.56│
╘═════╧═════╛

Therefore, to learn the parameters from data using the Bayesian approach, we just need to pass BayesianEstimator as the estimator type to the fit method.
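The estimator can also be given an explicit prior, which plays the role of the pseudo-counts discussed in the Priors section. The keyword argument names below (prior_type and equivalent_sample_size) are taken from recent pgmpy releases and may differ in older versions, so treat this as a sketch rather than exact book code:

# Sketch: the same structure, fit with an explicit BDeu prior. The BDeu
# prior spreads a total of 10 imaginary samples over the states of each CPD,
# analogous to the Beta pseudo-counts alpha_1 and alpha_0 above.
model_with_prior = BayesianModel([('A', 'J'), ('R', 'J'), ('J', 'Q'),
                                  ('J', 'L'), ('G', 'L')])
model_with_prior.fit(data, estimator=BayesianEstimator,
                     prior_type='BDeu',
                     equivalent_sample_size=10)
print(model_with_prior.get_cpds('J'))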
