Predictions from the model using pgmpy

In the previous sections, we saw various algorithms for computing conditional distributions and learned how to run MAP queries on our models. A MAP query is essentially a way to predict the states of some variables, given the states of other variables. In a real-life problem, we are given some data from which we build a model of the problem. Then, using this trained model, we predict the states of variables for new data points. This is how we approach supervised learning problems in machine learning.
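
To make the idea of a MAP query concrete, here is a minimal sketch in plain Python that enumerates every assignment of the restaurant model and keeps the most probable one consistent with the evidence. All CPD values and the helper name map_query are hypothetical, invented only for illustration; this is brute-force enumeration, not the inference algorithm pgmpy uses.

```python
import itertools

# Hypothetical CPDs (values invented for illustration); every variable has states 0 and 1.
p_location = [0.6, 0.4]                      # P(location)
p_quality = [0.3, 0.7]                       # P(quality)
# P(cost | location, quality), indexed [location][quality][cost]
p_cost = [[[0.8, 0.2], [0.6, 0.4]],
          [[0.6, 0.4], [0.1, 0.9]]]
# P(no_of_people | location, cost), indexed [location][cost][no_of_people]
p_people = [[[0.6, 0.4], [0.9, 0.1]],
            [[0.3, 0.7], [0.2, 0.8]]]

NAMES = ['location', 'quality', 'cost', 'no_of_people']


def map_query(evidence):
    """Return the most probable full assignment that agrees with the evidence."""
    best, best_p = None, -1.0
    for states in itertools.product([0, 1], repeat=len(NAMES)):
        assignment = dict(zip(NAMES, states))
        # Skip assignments that contradict the observed values.
        if any(assignment[name] != value for name, value in evidence.items()):
            continue
        l, q, c, n = states
        # Joint probability from the chain rule of the network.
        joint = p_location[l] * p_quality[q] * p_cost[l][q][c] * p_people[l][c][n]
        if joint > best_p:
            best, best_p = assignment, joint
    return best, best_p


# Observe location and quality; cost and no_of_people are predicted jointly.
prediction, score = map_query({'location': 1, 'quality': 1})
```

Enumeration is exponential in the number of variables, so it is only practical for tiny models like this one.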

Now, to design a model, we need to create CPDs or factors, add them to the base model, create an inference object, and then run MAP queries over it to predict the variable states for new data points. This sequence of steps comes up so often in machine learning that pgmpy provides two direct methods, fit and predict, to simplify it. Let's look at some code to understand how this works. To keep it simple, we will once again work with the restaurant model, with each variable having two states.

# First, let's import the modules that we will need
In [1]: import numpy as np
In [2]: from pgmpy.models import BayesianModel

# Now let's create some random data over which we will train and 
# test the model. Here we are creating 1000 data points with each 
# value either 0 or 1.
In [3]: data = np.random.randint(low=0, high=2, size=(1000, 4))
In [4]: data
Out[4]: 
array([[0, 1, 0, 0],
       [1, 1, 1, 0],
       [1, 1, 0, 0],
        ..., 
       [1, 0, 0, 1],
       [1, 0, 1, 0],
       [1, 0, 0, 0]])

# In most machine learning problems, it doesn't matter which column 
# of the array represents which variable (as long as we use the same 
# order for both training and prediction), because all the columns 
# are treated symmetrically. In graphical models, however, each 
# variable is different (for example, in the way it is connected to 
# the other variables), so we need to specify which column of the 
# data corresponds to which variable. For that, we will use pandas.

In [5]: import pandas as pd
In [6]: data = pd.DataFrame(data, columns=['cost', 'quality',  
                                           'location', 
                                           'no_of_people'])
In [7]: data
Out[7]:
     cost  quality  location  no_of_people
0       0        1         0             0
1       1        1         1             0
2       1        1         0             0
3       0        1         1             1
4       1        1         1             0
5       1        0         1             0
6       0        0         0             0
7       0        0         1             0
..     ...      ...       ...           ...
993     0        0         1             1
994     0        0         0             0
995     0        0         0             0
996     1        0         0             0
997     1        0         0             1
998     1        0         1             0
999     1        0         0             0

In [8]: train = data[:750]

# We will try to predict the no_of_people from our model. So for 
# test data we will delete that column and then later on predict 
# those values.
In [9]: test = data[750:].drop('no_of_people', axis=1)
In [10]: test
Out[10]:
     cost  quality  location
750     0        0         1
751     0        1         1
752     0        1         1
753     1        0         0
754     1        0         1
755     1        0         1
756     0        1         0
757     1        0         0
 ..    ...      ...       ...
992     0        0         0
993     0        0         1
994     0        0         0
995     0        0         0
996     1        0         0
997     1        0         0
998     1        0         1
999     1        0         0

# Now we will need to create the base network structure for the 
# model.
In [11]: restaurant_model = BayesianModel(
                      [('location', 'cost'), 
                       ('quality', 'cost'),
                       ('location', 'no_of_people'),
                       ('cost', 'no_of_people')])
In [12]: restaurant_model.fit(train)

# fit computes the CPDs of all the variables from the training data 
# that we provided.
In [13]: restaurant_model.get_cpds()
Out[13]: 
[<pgmpy.factors.CPD.TabularCPD at 0x7fc01c029be0>,
 <pgmpy.factors.CPD.TabularCPD at 0x7fc01c029eb8>,
 <pgmpy.factors.CPD.TabularCPD at 0x7fc01c029e48>,
 <pgmpy.factors.CPD.TabularCPD at 0x7fc01c029e80>]

# Now for predicting the values of no_of_people using this model 
# we can simply call the predict method on our test data.
In [14]: restaurant_model.predict(test).values.ravel()
Out[14]:
array([1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 
 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0,   
 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 
 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 
 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0,
 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0,
 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,  
 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 
 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 
 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 
 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 
 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 
 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 
 0, 0, 0])
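
As a rough illustration of what fit and predict compute in this example, the sketch below reproduces the idea with pandas alone. It assumes maximum-likelihood estimation, that is, CPDs obtained from normalised counts, and it relies on the fact that no_of_people has only location and cost as parents, so with every other variable observed its prediction depends on those two columns only. The fixed random seed and the variable names counts, cpd, and most_likely are assumptions made for this sketch; pgmpy's own implementation may differ in details such as tie-breaking.

```python
import numpy as np
import pandas as pd

# Same kind of random data as in the session; the seed is fixed only for repeatability.
rng = np.random.RandomState(0)
data = pd.DataFrame(rng.randint(low=0, high=2, size=(1000, 4)),
                    columns=['cost', 'quality', 'location', 'no_of_people'])
train = data[:750]
test = data[750:].drop('no_of_people', axis=1)

# "fit": estimate P(no_of_people | location, cost) from normalised counts.
parents = ['location', 'cost']
counts = (train.groupby(parents)['no_of_people']
               .value_counts()
               .unstack(fill_value=0))
cpd = counts.div(counts.sum(axis=1), axis=0)

# "predict": for each test row, pick the most probable state given its parents.
most_likely = cpd.idxmax(axis=1).rename('no_of_people')
predicted = test.join(most_likely, on=parents)['no_of_people']
```

Each row of cpd is a distribution over no_of_people for one combination of parent states, and predicted holds one value per row of test.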

We can see here that using fit and predict saves a lot of work and simplifies things. In some cases, however, the training data we have might not represent the problem correctly. For example, let's say we know from prior knowledge that a restaurant is equally likely to be in a good location or a bad location (a probability of 0.5 each), but the training set happens to contain more data points for restaurants in good locations, which could bias our model. In such cases, we can manually adjust the probability values in the CPDs so that they represent the actual problem correctly.
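
To see why such an adjustment matters, the following sketch compares a posterior over location computed with a skewed learned prior against the same posterior computed with the adjusted prior of 0.5. All numbers and the helper name posterior_location are hypothetical, and the likelihood table ignores cost to keep the example small; it illustrates the effect of the prior rather than pgmpy's API.

```python
def posterior_location(prior, likelihood, observed_people):
    """P(location | no_of_people) via Bayes' rule for a two-state location."""
    unnormalised = [prior[loc] * likelihood[loc][observed_people]
                    for loc in (0, 1)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Hypothetical values; state 1 means "good location" / "many people".
learned_prior = [0.3, 0.7]     # skewed estimate from an unbalanced training set
adjusted_prior = [0.5, 0.5]    # what we believe from prior knowledge
# P(no_of_people | location), indexed [location][no_of_people]
likelihood = [[0.8, 0.2],
              [0.4, 0.6]]

biased = posterior_location(learned_prior, likelihood, observed_people=1)
corrected = posterior_location(adjusted_prior, likelihood, observed_people=1)
```

With these numbers, the probability of a good location after observing many people drops from 0.875 under the learned prior to 0.75 under the adjusted one, which shows how a skewed training set can overstate a conclusion.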
