Parameter learning

In the previous sections, we discussed the general concepts related to learning. In this section, we will discuss the problem of learning parameters. In this case, we already know the network structure and we have a dataset, $D$, of full assignments over the variables. There are two major approaches to estimating the parameters: maximum likelihood estimation and the Bayesian approach.

Maximum likelihood estimation

Let's take the example of a biased coin. We want to predict the outcome of this coin using previous data that we have about the outcomes of tossing it. So, let's say that, previously, we tossed the coin 1000 times, getting heads 330 times and tails 670 times. Based on this observation, we can define a parameter, $\theta$, which represents our chance of getting a heads or a tails in the next toss. In the simplest case, we can take this parameter, $\theta$, to be the probability of getting a heads or a tails. Considering $\theta$ to be the probability of getting a heads, we have $\theta = \frac{330}{1000} = 0.33$. Now, using this parameter, we can estimate the outcome of our next toss. Also, as we increase the number of data samples used to compute the parameter, we become more confident about the estimate.
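As a quick illustration, here is a minimal Python sketch (using the counts assumed above) of how this estimate is computed from raw outcome counts:

# Estimating theta as the relative frequency of heads
# (330 heads and 670 tails, as in the example above)
n_heads = 330
n_tails = 670

theta = n_heads / (n_heads + n_tails)
print(theta)   # 0.33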

Putting this all formally, let's consider that we have a set of independent and identically distributed coin tosses, $D = \{x[1], x[2], \ldots, x[M]\}$. Each $x[m]$ can either take the value heads ($H$) with probability $\theta$, or tails ($T$) with probability $1 - \theta$. We want to find a good value for the parameter, $\theta$, so that we can predict the outcomes of future tosses. As we discussed in the previous sections, we usually approach a learning task by defining a hypothesis space, $\Theta$, and an optimization function. In this case, as we are trying to estimate the probability of a single binary random variable, we can define our hypothesis space as follows:

$$\Theta = \{\theta : \theta \in [0, 1]\}$$

Now, let's take an example dataset, namely $D = \{H, T, T, H, H\}$. When the value of $\theta$ is given, we can compute the probability of observing this data. We can easily say that $P(x[m] = H : \theta) = \theta$ and $P(x[m] = T : \theta) = 1 - \theta$. Also, $P(D : \theta) = \prod_{m} P(x[m] : \theta)$, as all the observations are independent. Now, consider the following equation:

$$P(D : \theta) = \theta \cdot (1 - \theta) \cdot (1 - \theta) \cdot \theta \cdot \theta = \theta^{3} (1 - \theta)^{2}$$

This is the probability of the data given our parameter, $\theta$, which is also known as the likelihood, $L(\theta : D)$, as we discussed in the earlier section. Now, as we want our parameter to agree with the data as much as possible, we would like the likelihood, $L(\theta : D)$, to be as high as possible. Plotting the curve of $L(\theta : D)$ over our hypothesis space, we get the following curve:

Fig 5.1: Curve showing the variation of the likelihood $L(\theta : D)$ with $\theta$

From the curve in Fig 5.1, we can now easily see that we get the maximum likelihood at $\theta = 0.6$.
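To make this concrete, the following sketch (assuming the five-toss dataset $D = \{H, T, T, H, H\}$ used above) evaluates the likelihood $\theta^{3}(1 - \theta)^{2}$ on a grid over the hypothesis space and reports where it is largest:

import numpy as np

# Likelihood of D = {H, T, T, H, H}, that is, 3 heads and 2 tails
def likelihood(theta):
    return theta**3 * (1 - theta)**2

# Evaluate the likelihood on a fine grid over the hypothesis space [0, 1]
thetas = np.linspace(0, 1, 1001)
values = likelihood(thetas)

print(thetas[np.argmax(values)])   # 0.6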

Now, let's try to generalize this computation. Let's consider that in our dataset, $D$, we have $M_H$ number of heads and $M_T$ number of tails:

$$M_H = |\{m : x[m] = H\}|$$
$$M_T = |\{m : x[m] = T\}|$$

From the example we saw earlier, we can now easily derive the following equation:

$$L(\theta : D) = \theta^{M_H} (1 - \theta)^{M_T}$$

Now, we would like to maximize this likelihood to get the optimum value for $\theta$. However, as it turns out, it is much easier to work with the log-likelihood, and as the log-likelihood is monotonically related to the likelihood function, the optimum value of $\theta$ for the likelihood function is the same as that for the log-likelihood function. So, first taking the log of the preceding function, we get the following equation:

$$\log L(\theta : D) = M_H \log \theta + M_T \log (1 - \theta)$$

To find the maxima, we now take the derivative of this function and equate it to 0. We get the following result:

$$\frac{\partial}{\partial \theta} \log L(\theta : D) = \frac{M_H}{\theta} - \frac{M_T}{1 - \theta} = 0$$
$$\hat{\theta} = \frac{M_H}{M_H + M_T}$$

Hence, we get our maximum likelihood parameter for the generalized case.
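The closed-form result is easy to turn into code. The following sketch (the function names are illustrative) computes $\hat{\theta} = M_H / (M_H + M_T)$ and cross-checks it against a brute-force maximization of the log-likelihood on a grid:

import numpy as np

def mle_theta(m_heads, m_tails):
    # Closed-form maximum likelihood estimate for a biased coin
    return m_heads / (m_heads + m_tails)

def log_likelihood(theta, m_heads, m_tails):
    # log L(theta : D) = M_H * log(theta) + M_T * log(1 - theta)
    return m_heads * np.log(theta) + m_tails * np.log(1 - theta)

m_heads, m_tails = 330, 670
print(mle_theta(m_heads, m_tails))          # 0.33

# Cross-check: the grid maximum of the log-likelihood lands at the same value
thetas = np.linspace(0.001, 0.999, 999)
print(thetas[np.argmax(log_likelihood(thetas, m_heads, m_tails))])  # ~0.33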

Maximum likelihood principle

In the preceding section, we saw how to apply the maximum likelihood estimator in a simple single-variable case. In this section, we will discuss how to apply it to a broader range of learning problems and how to use it to learn the parameters of a Bayesian network.

Now, let's define our generalized learning problem. We assume that we are provided with a dataset, $D$, containing IID samples of a set of variables, $\mathcal{X}$. We also assume that we know the sample space of the data, that is, we know the variables and the values they can take. For our learning task, we are provided with a parametric model whose parameters we want to learn. A parametric model is defined as a function, $P(\xi : \theta)$, that assigns a probability to each instance $\xi$ when a set of parameters, $\theta$, is given. As this parametric model is a probability distribution, it should be non-negative and should sum up to 1:

$$P(\xi : \theta) \geq 0 \quad \text{and} \quad \sum_{\xi} P(\xi : \theta) = 1$$

As we have defined our learning problem, we can now move on to applying the maximum likelihood principle to it. First of all, we need to define the parameter space for our model. Let's take a few examples to make defining this space clearer.

Let's consider the case of a multinomial distribution, $P$, defined over a variable, $X$, which can take the values $x^1, x^2, \ldots, x^k$. The distribution is parameterized by $\theta = \{\theta_1, \theta_2, \ldots, \theta_k\}$ and is represented as follows:

$$P(X = x^i : \theta) = \theta_i$$

The parameter space, $\Theta$, for this model can now be defined as follows:

$$\Theta = \left\{ \theta \in [0, 1]^k : \sum_{i=1}^{k} \theta_i = 1 \right\}$$
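To make this parameter space concrete, here is a small sketch (the observations are made up) that estimates the multinomial parameters as relative frequencies, which is their maximum likelihood estimate, and checks that the result lies inside $\Theta$:

import numpy as np

# Hypothetical observations of a variable X that takes the values 0, 1, or 2
samples = np.array([0, 2, 1, 1, 0, 2, 2, 2, 1, 0])

# Maximum likelihood estimate: relative frequency of each value
counts = np.bincount(samples, minlength=3)
theta = counts / counts.sum()
print(theta)                       # [0.3 0.3 0.4]

# The estimate lies inside the parameter space: each theta_i in [0, 1], sum is 1
print(np.all((theta >= 0) & (theta <= 1)), np.isclose(theta.sum(), 1.0))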

We can take another example of a Gaussian distribution on a random variable, X, such that X can take values from the real line. The distribution is defined as follows:

$$P(x : \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

For this model, our parameters are $\mu$ and $\sigma$. On defining $\theta = \langle \mu, \sigma \rangle$, our parameter space can be defined as $\Theta = \mathbb{R} \times \mathbb{R}^{+}$.
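For the Gaussian, the maximum likelihood estimates of $\mu$ and $\sigma$ turn out to be the sample mean and the (biased) sample standard deviation. A minimal sketch with made-up data:

import numpy as np

# Hypothetical real-valued observations of X
x = np.array([2.1, 1.8, 2.5, 2.0, 1.7, 2.3])

# Maximum likelihood estimates: sample mean and biased standard deviation
mu_hat = x.mean()
sigma_hat = x.std()       # np.std uses ddof=0 by default, which is the MLE
print(mu_hat, sigma_hat)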

Now that we have seen how to define our parameter space, the next step is to define our likelihood function. We can define the likelihood function on our data, $D$, as $L(\theta : D)$, and it can be expressed as follows:

$$L(\theta : D) = \prod_{m} P(\xi[m] : \theta)$$

Now, using the earlier parameter space and likelihood function, we can move forward and compute the maxima of the likelihood or log-likelihood function to find the most optimal value of our parameter, $\theta$. Taking the logarithm of both sides of the likelihood function, we get the following equation:

$$\log L(\theta : D) = \sum_{m} \log P(\xi[m] : \theta)$$

Now, let's take the derivative of this function and equate it to 0 to find the maxima:

$$\frac{\partial}{\partial \theta} \log L(\theta : D) = \sum_{m} \frac{\partial}{\partial \theta} \log P(\xi[m] : \theta) = 0$$

We can then solve this equation to get our desired $\hat{\theta}$.
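When this equation has no convenient closed-form solution, we can maximize the log-likelihood numerically instead. The following sketch (a generic recipe, not taken from the text) fits the Gaussian parameters of the previous example by minimizing the negative log-likelihood with scipy:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical real-valued observations of X
x = np.array([2.1, 1.8, 2.5, 2.0, 1.7, 2.3])

def neg_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:                 # stay inside the parameter space
        return np.inf
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 1.0], method='Nelder-Mead')
print(result.x)                    # close to (x.mean(), x.std())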

The maximum likelihood estimate for Bayesian networks

Let's now move to the problem of estimating the parameters in a Bayesian network. In the case of Bayesian networks, the network structure helps us reduce the parameter estimation problem to a set of unrelated problems, and each of these problems can be solved using techniques discussed in the previous sections.

Let's take a simple example of the network $X \rightarrow Y$. For this network, we can think of the parameters $\theta_{x^0}$ and $\theta_{x^1}$, which specify the probability of the variable $X$; $\theta_{y^0 \mid x^0}$ and $\theta_{y^1 \mid x^0}$, which specify the probability of $Y$ given $X = x^0$; and $\theta_{y^0 \mid x^1}$ and $\theta_{y^1 \mid x^1}$, representing the probability of $Y$ given $X = x^1$.

Consider that we have samples in the form of $D = \{(x[m], y[m])\}_{m=1}^{M}$, where $x[m]$ denotes an assignment to the variable $X$ and $y[m]$ denotes an assignment to the variable $Y$. Using this, we can define our likelihood function as follows:

$$L(\theta : D) = \prod_{m} P(x[m], y[m] : \theta)$$

Utilizing the network structure, we can write the joint distribution, P(X, Y), as follows:

$$P(X, Y) = P(X)\, P(Y \mid X)$$

Replacing the joint distribution in the preceding equation with this product form, we get the following equation:

$$L(\theta : D) = \prod_{m} P(x[m] : \theta_X)\, P(y[m] \mid x[m] : \theta_{Y \mid X}) = \left( \prod_{m} P(x[m] : \theta_X) \right) \left( \prod_{m} P(y[m] \mid x[m] : \theta_{Y \mid X}) \right)$$

So, we see that the Bayesian network's structure helped us decompose the likelihood function into simpler terms. We now have a separate term for each variable, each representing how well that variable is predicted given its parents and its parameters.

Here, the first term is the same as what we saw in previous sections. The second term can be decomposed further:

$$\prod_{m} P(y[m] \mid x[m] : \theta_{Y \mid X}) = \prod_{m : x[m] = x^0} P(y[m] \mid x^0 : \theta_{Y \mid x^0}) \cdot \prod_{m : x[m] = x^1} P(y[m] \mid x^1 : \theta_{Y \mid x^1})$$

Thus, we see that we can decompose the likelihood function into a term for each group of parameters. Actually, we can simplify this even further. Just consider a single term again:

$$\prod_{m : x[m] = x^1} P(y[m] \mid x^1 : \theta_{Y \mid x^1})$$

Each term in this product can take only two values. When $y[m] = y^0$, it is equal to $\theta_{y^0 \mid x^1}$, and when $y[m] = y^1$, it is equal to $\theta_{y^1 \mid x^1}$. Thus, we get the value $\theta_{y^0 \mid x^1}$ in those cases where $x[m] = x^1$ and $y[m] = y^0$. Let's denote the number of such cases by $M[x^1, y^0]$, and similarly the number of cases where $x[m] = x^1$ and $y[m] = y^1$ by $M[x^1, y^1]$. Thus, we can rewrite the earlier equation as follows:

$$\prod_{m : x[m] = x^1} P(y[m] \mid x^1 : \theta_{Y \mid x^1}) = \theta_{y^0 \mid x^1}^{M[x^1, y^0]} \cdot \theta_{y^1 \mid x^1}^{M[x^1, y^1]}$$

From our preceding discussion, we know that to maximize the likelihood, we can set the parameters as follows:

$$\hat{\theta}_{y^1 \mid x^1} = \frac{M[x^1, y^1]}{M[x^1, y^0] + M[x^1, y^1]} = \frac{M[x^1, y^1]}{M[x^1]}$$

Now, using this equation, we can find all the parameters of the Bayesian network by simply counting the occurrences of the different states of the variables in the data.
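To see what this counting looks like in practice, here is a small sketch (using pandas on synthetic binary data) that estimates $\hat{\theta}_{y^1 \mid x^1}$ by counting, exactly as the formula above prescribes; pgmpy automates the same computation in the code that follows:

import numpy as np
import pandas as pd

# Synthetic binary data for the network X -> Y
data = pd.DataFrame(np.random.randint(low=0, high=2, size=(1000, 2)),
                    columns=['X', 'Y'])

# M[x1] and M[x1, y1]: counts of the relevant assignments in the data
m_x1 = (data['X'] == 1).sum()
m_x1_y1 = ((data['X'] == 1) & (data['Y'] == 1)).sum()

# Maximum likelihood estimate of P(Y = 1 | X = 1)
theta_y1_given_x1 = m_x1_y1 / m_x1
print(theta_y1_given_x1)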

Now, let's see some code examples of how to learn parameters using pgmpy:

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from pgmpy.models import BayesianModel
In [4]: from pgmpy.estimators import MaximumLikelihoodEstimator

# Generating some random data
In [5]: raw_data = np.random.randint(low=0, high=2, size=(1000, 2))
In [6]: print(raw_data)
Out[6]:
array([[1, 1],
       [1, 1],
       [0, 1],
       ..., 
       [0, 0],
       [0, 0],
       [0, 0]])
In [7]: data = pd.DataFrame(raw_data, columns=['X', 'Y'])
In [8]: print(data)
Out[8]:
     X  Y
0    1  1
1    1  1
2    0  1
3    1  0
..  .. ..
996  1  1
997  0  0
998  0  0
999  0  0

[1000 rows x 2 columns]

# Two coin tossing model assuming that they are dependent.
In [9]: coin_model = BayesianModel([('X', 'Y')])
In [10]: coin_model.fit(data, 
                        estimator=MaximumLikelihoodEstimator)
In [11]: cpd_x = coin_model.get_cpds('X')
In [12]: print(cpd_x)
Out[12]:
╒═════╤═════╕
│ x_0 │ 0.46│
├─────┼─────┤
│ x_1 │ 0.54│
╘═════╧═════╛
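As a quick sanity check (this call is not part of the original example), the learned CPD for X should simply match the relative frequencies of the two states of X in the data:

# Relative frequencies of X; these should match the CPD printed above
print(data['X'].value_counts(normalize=True))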

Similarly, we can take the example of the late-for-school model:

In [13]: raw_data = np.random.randint(low=0, high=2,
                                      size=(1000, 6))
In [14]: data = pd.DataFrame(raw_data, columns=['A', 'R', 'J',
                                                'G', 'L', 'Q'])
In [15]: student_model = BayesianModel([('A', 'J'), ('R', 'J'),
                                        ('J', 'Q'), ('J', 'L'),
                                        ('G', 'L')])
In [16]: student_model.fit(data,
                           estimator=MaximumLikelihoodEstimator)
In [17]: student_model.get_cpds()
Out[17]:
[<TabularCPD representing P(A: 2) at 0x7f9286b1fa10>,
 <TabularCPD representing P(R: 2) at 0x7f9283b12310>,
 <TabularCPD representing P(G: 2) at 0x7f9383b15110>,
 <TabularCPD representing P(J: 2 | A: 2, R: 2) at 0x7f9286b33290>,
 <TabularCPD representing P(Q: 2 | J: 2) at 0x7f92863c3290>,
 <TabularCPD representing P(L: 2 | G: 2, J: 2) at 0x7f9282c49340>]

So, learning parameters from data is very easy in pgmpy and requires just a call to the fit method.
