In the previous sections, we discussed the general concepts related to learning. Now, in this section, we will discuss the problem of learning parameters. In this case, we already know the network structure and we have a dataset, D, of full assignments over the variables. There are two major approaches to estimating the parameters: maximum likelihood estimation and the Bayesian approach.
Let's take the example of a biased coin. We want to predict the outcome of this coin using previous data that we have about the outcomes of tossing it. So, let's consider that, previously, we tossed the coin 1000 times and we got heads 330 times and tails 670 times. Based on this observation, we can define a parameter, θ, which represents our chances of getting a heads or a tails in the next toss. In the simplest case, we can take this parameter, θ, to be the probability of getting a heads or a tails. Considering θ to be the probability of getting a heads, we have θ = 330/1000 = 0.33. Now, using this parameter, we are able to estimate the outcome of our next toss. Also, as we increase the number of data samples used to compute the parameter, we become more confident about it.
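This estimate is simply the fraction of heads in the observed data. As a quick sketch of the computation (using the counts from the tossing data above):

```python
# Counts observed in the 1000 tosses described above
heads = 330
tails = 670

# Estimate theta as the fraction of tosses that came up heads
theta = heads / (heads + tails)
print(theta)  # 0.33
```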
Putting this all formally, let's consider that we have a set of independent and identically distributed coin tosses, X[1], X[2], ..., X[M]. Each X[i] can either take the value heads (H), with probability θ, or tails (T), with probability 1 − θ. We want to find a good value for the parameter, θ, so that we can predict the outcomes of future tosses. As we discussed in the previous sections, we usually approach a learning task by defining a hypothesis space, Θ, and an optimization function. In this case, as we are trying to learn the probability of a single random variable, we can define our hypothesis space as follows:

Θ = {θ : 0 ≤ θ ≤ 1}
Now, let's take the dataset that we already have, namely D. When the value of θ is given, we can compute the probability of observing this data. We can easily say that P(X[i] = H) = θ and P(X[i] = T) = 1 − θ. Also, P(X[1], X[2], ..., X[M]) = Π_i P(X[i]), as all the observations are independent. Now, consider the following equation:

L(θ : D) = Π_{i=1}^{M} P(X[i] | θ) = θ^330 (1 − θ)^670
This is the probability of our data to conform with our parameter, θ, which is also known as the likelihood, as we had discussed in the earlier section. Now, as we want our parameter to agree with the data as much as possible, we would like the likelihood, L(θ : D), to be as high as possible. Plotting the curve of L(θ : D) within our hypothesis space, we get the following curve:
From the curve in Fig 5.1, we can now easily see that we get the maximum likelihood at θ = 0.33.
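We can also find this maximum numerically instead of reading it off the plot. A quick sketch using numpy, evaluating the likelihood of 330 heads and 670 tails over a grid of values of θ:

```python
import numpy as np

# Evaluate the likelihood theta^330 * (1 - theta)^670 on a grid of
# theta values between 0 and 1 (endpoints excluded to avoid 0^0 issues)
thetas = np.arange(1, 1000) / 1000
likelihood = thetas**330 * (1 - thetas)**670

# The grid point with the highest likelihood
theta_mle = thetas[np.argmax(likelihood)]
print(theta_mle)  # 0.33
```

The raw likelihood here is a tiny number (around 1e-276), which is why in practice the log-likelihood is preferred; it is still representable for this example.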
Now, let's try to generalize this computation. Let's consider that in our dataset, D, we have M[H] heads and M[T] tails.
From the example we saw earlier, we can now easily derive the following equation:

L(θ : D) = θ^M[H] (1 − θ)^M[T]
Now, we would like to maximize this likelihood to get the optimum value for θ. However, as it turns out, it is much easier to work with the log-likelihood, and as the log-likelihood is monotonically related to the likelihood function, the optimum value of θ for the likelihood function is the same as that for the log-likelihood function. So, first taking the log of the preceding function, we get the following equation:

log L(θ : D) = M[H] log θ + M[T] log(1 − θ)
To find the maxima, we now take the derivative of this function with respect to θ and equate it to 0. We get the following result:

∂/∂θ log L(θ : D) = M[H]/θ − M[T]/(1 − θ) = 0

θ = M[H] / (M[H] + M[T])
Hence, we get our maximum likelihood parameter for the generalized case.
In the preceding section, we saw how to apply the maximum likelihood estimator in a simple single variable case. In this section, we will now discuss how to apply this to a broader range of learning problems and how to use this to learn the parameters in the case of a Bayesian network.
Now, let's define our generalized learning problem. We assume that we are provided with a dataset, D, containing IID samples over a set of variables, X. We also assume that we know the sample space of the data; that is, we know the variables and the values that they can take. For our learning, we are provided with a parametric model whose parameters we want to learn. A parametric model is defined as a function, P(ξ; θ), that assigns a probability to an assignment, ξ, when a set of parameters, θ, is given. As this parametric model is a probability distribution, it should be non-negative and should sum up to 1:

Σ_ξ P(ξ; θ) = 1
As we have defined our learning problem, we will now move on to applying our maximum likelihood principle on this. So, first of all, we need to define the parameter space for our model. Let's take a few examples to make defining the space clearer.
Let's consider the case of a multinomial distribution, P, which is defined over a variable, X, that can take the values x1, x2, ..., xk. The distribution is represented as follows:

P(xi; θ) = θ_i, where θ = {θ_1, θ_2, ..., θ_k}
The parameter space, Θ, for this model can now be defined as follows:

Θ = {θ ∈ [0, 1]^k : Σ_{i=1}^{k} θ_i = 1}
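For this multinomial model, maximizing the likelihood again yields the relative frequency of each observed value, just as in the coin example. A small illustrative sketch (the sample data here is made up):

```python
from collections import Counter

# Hypothetical observations of a variable X with three possible values
samples = ['x1', 'x2', 'x1', 'x3', 'x1', 'x2']

counts = Counter(samples)
N = len(samples)

# MLE: theta_i = M[x_i] / N, the relative frequency of each value
theta = {value: count / N for value, count in counts.items()}
print(theta)
```

Note that the resulting estimates are non-negative and sum to 1, so they lie in the parameter space Θ defined above.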
We can take another example of a Gaussian distribution on a random variable, X, such that X can take values from the real line. The distribution is defined as follows:

P(x; μ, σ) = (1 / (σ √(2π))) exp(−(x − μ)² / (2σ²))
For this model, our parameters are μ and σ. On defining θ = ⟨μ, σ⟩, our parameter space can be defined as Θ = ℝ × ℝ⁺.
Now that we have seen how to define our parameter space, the next step is to define our likelihood function. We can define our likelihood function on our data, D, as L(θ : D), and it can be expressed as follows:

L(θ : D) = Π_{m=1}^{M} P(ξ[m]; θ)
Now, using the earlier parameter space and likelihood functions, we can move forward and compute the maxima of the likelihood or log-likelihood function to find the optimal value of our parameter, θ. Taking the logarithm of both sides of the likelihood function, we get the following equation:

log L(θ : D) = Σ_{m=1}^{M} log P(ξ[m]; θ)
Now, let's equate the derivative of this with respect to θ to 0 to find the maxima:

∂/∂θ Σ_{m=1}^{M} log P(ξ[m]; θ) = 0
We can then solve this equation to get our desired θ.
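For the Gaussian model above, carrying out this maximization analytically gives the sample mean for μ and the biased (ddof = 0) sample standard deviation for σ. A quick numerical sketch with synthetic data:

```python
import numpy as np

# Draw synthetic data from a Gaussian with known parameters
rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)

# Maximum likelihood estimates of the Gaussian parameters
mu_hat = data.mean()
sigma_hat = data.std()  # ddof=0 gives the MLE, not the unbiased estimate
print(mu_hat, sigma_hat)  # both close to the true values 5.0 and 2.0
```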
Let's now move to the problem of estimating the parameters in a Bayesian network. In the case of Bayesian networks, the network structure helps us reduce the parameter estimation problem to a set of unrelated problems, and each of these problems can be solved using techniques discussed in the previous sections.
Let's take a simple example of the network X → Y. For this network, we can think of the parameters θ_x0 and θ_x1, which specify the probability of the variable X; θ_y0|x0 and θ_y1|x0, which specify the probability of Y given X = x0; and θ_y0|x1 and θ_y1|x1, representing the probability of Y given X = x1.
Consider that we have the samples in the form of ⟨x[m], y[m]⟩, where x[m] denotes an assignment to the variable X and y[m] denotes an assignment to the variable Y. Using this, we can define our likelihood function as follows:

L(θ : D) = Π_{m} P(x[m], y[m]; θ)
Utilizing the network structure, we can write the joint distribution, P(X, Y), as follows:

P(X, Y) = P(X) P(Y | X)
Replacing the joint distribution in the preceding equation using this product form, we get the following equation:

L(θ : D) = Π_{m} P(x[m]; θ) P(y[m] | x[m]; θ) = ( Π_{m} P(x[m]; θ_X) ) ( Π_{m} P(y[m] | x[m]; θ_Y|X) )
So, we see that the Bayesian network's structure helped us decompose the likelihood function into simpler terms. We now have a separate term for each variable, each representing how well that variable is predicted given its parents and the associated parameters.
Here, the first term is the same as what we saw in the previous sections. The second term can be decomposed further:

Π_{m} P(y[m] | x[m]; θ_Y|X) = ( Π_{m : x[m]=x0} P(y[m] | x0; θ_Y|x0) ) ( Π_{m : x[m]=x1} P(y[m] | x1; θ_Y|x1) )
Thus, we see that we can decompose the likelihood function into a term for each group of parameters. Actually, we can simplify this even further. Consider a single term again:

Π_{m : x[m]=x0} P(y[m] | x0; θ_Y|x0)
Each factor in this term can take only two values. When y[m] = y1, it is equal to θ_y1|x0, and when y[m] = y0, it is equal to θ_y0|x0. Thus, we get the value θ_y1|x0 in those cases where x[m] = x0 and y[m] = y1. Let's denote the number of such cases by M[x0, y1]. Thus, we can rewrite the earlier equation as follows:

Π_{m : x[m]=x0} P(y[m] | x0; θ_Y|x0) = θ_y0|x0^M[x0, y0] · θ_y1|x0^M[x0, y1]
From our preceding discussion, we know that to maximize the likelihood, we can set the parameters to the relative counts:

θ_y1|x0 = M[x0, y1] / (M[x0, y1] + M[x0, y0])
Now, using this equation, we can find all the parameters of the Bayesian network by simply counting the occurrence of different states of variables in the data.
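Before turning to pgmpy, here is what this counting looks like directly in pandas for the small X → Y network (a sketch on made-up data):

```python
import pandas as pd

# Made-up complete observations of the network X -> Y
data = pd.DataFrame({
    'X': [0, 0, 0, 1, 1, 1, 1, 1],
    'Y': [0, 1, 1, 0, 0, 0, 1, 1],
})

# P(X): relative frequency of each state of X
p_x = data['X'].value_counts(normalize=True).sort_index()

# P(Y | X): counts of Y normalized within each state of X
p_y_given_x = (data.groupby('X')['Y']
                   .value_counts(normalize=True)
                   .unstack())

print(p_x)
print(p_y_given_x)
```

Each row of the resulting P(Y | X) table sums to 1, matching the CPDs that pgmpy computes below.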
Now, let's see some code examples of how to learn parameters using pgmpy:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from pgmpy.models import BayesianModel
In [4]: from pgmpy.estimators import MaximumLikelihoodEstimator

# Generating some random data
In [5]: raw_data = np.random.randint(low=0, high=2, size=(1000, 2))
In [6]: raw_data
Out[6]:
array([[1, 1],
       [1, 1],
       [0, 1],
       ...,
       [0, 0],
       [0, 0],
       [0, 0]])

In [7]: data = pd.DataFrame(raw_data, columns=['X', 'Y'])
In [8]: data
Out[8]:
     X  Y
0    1  1
1    1  1
2    0  1
3    1  0
..  .. ..
996  1  1
997  0  0
998  0  0
999  0  0

[1000 rows x 2 columns]

# Two-coin tossing model, assuming that the coins are dependent
In [9]: coin_model = BayesianModel([('X', 'Y')])
In [10]: coin_model.fit(data, estimator=MaximumLikelihoodEstimator)
In [11]: cpd_x = coin_model.get_cpds('X')
In [12]: print(cpd_x)
Out[12]:
╒═════╤══════╕
│ x_0 │ 0.46 │
├─────┼──────┤
│ x_1 │ 0.54 │
╘═════╧══════╛
Similarly, we can take the example of the late-for-school model:
In [13]: raw_data = np.random.randint(low=0, high=2, size=(1000, 6))
In [14]: data = pd.DataFrame(raw_data,
                             columns=['A', 'R', 'J', 'G', 'L', 'Q'])
In [15]: student_model = BayesianModel([('A', 'J'), ('R', 'J'), ('J', 'Q'),
                                        ('J', 'L'), ('G', 'L')])
In [16]: student_model.fit(data, estimator=MaximumLikelihoodEstimator)
In [17]: student_model.get_cpds()
Out[17]:
[<TabularCPD representing P(A: 2) at 0x7f9286b1fa113>,
 <TabularCPD representing P(R: 2) at 0x7f9283b12312>,
 <TabularCPD representing P(G: 2) at 0x7f9383b15114>,
 <TabularCPD representing P(J: 2 | A: 2, R: 2) at 0x7f9286bw3329>,
 <TabularCPD representing P(Q: 2 | J: 2) at 0x7f92863kj3294>,
 <TabularCPD representing P(L: 2 | G: 2, J: 2) at 0x7f9282kj49345>]
So, learning parameters from data is very easy in pgmpy and requires just a call to the fit method.