Chapter 3. Learning Parameters

Building a probabilistic graphical model generally requires three steps: defining the random variables, which are also the nodes of the graph; defining the structure of the graph; and finally defining the numerical parameters of each local distribution. So far, we have performed the last step manually, giving numerical values to each local probability distribution by hand. In many cases, however, we have access to a wealth of data and can find the numerical values of those parameters with a method called parameter learning. In other fields, it is also called parameter fitting or model calibration.

Parameter learning is an important topic in machine learning. In this chapter, we will see how to use a dataset to learn the parameters of a given graphical model. We will go from the simple but common case, in which the data is fully observed, to a more complex case, in which the data is only partially observed and therefore requires more advanced techniques.

Learning parameters can be done with several approaches, and there is no single best solution, because the right choice depends on the goal the model's user wants to reach. Nevertheless, it is common to use the notions of maximum likelihood and maximum a posteriori estimation. As you are now used to the notions of prior and posterior distributions, you can already guess what a maximum a posteriori estimate could be.

In this chapter, we will use datasets. When we have many variables in a model, we can observe the values of those variables at any time. Many observations of all the variables taken together form a dataset. For example, say we have a model of a student's performance at university. In this model, we have several random variables such as the student's age, course, grade, gender, and year. A single observation could be {21, Statistics, B+, female, 2nd year}, and a dataset is a large collection of such observations.

Throughout this chapter, we will assume that the dataset is i.i.d., an acronym for independent and identically distributed. It means that each observation has been generated by the same probability distribution and that each observation is independent of all the others in the dataset. For the student example, this makes a lot of sense. But if we consider a time series dataset, such as the GDP of a country, then the dataset is no longer i.i.d. and different algorithms are necessary to learn the parameters. As a matter of fact, i.i.d. datasets cover a wide range of applications.

With all these notions in hand, we can now discuss the main topic of the chapter in a little more depth. Let's call D the dataset and θ the parameters of the graphical model. We call the function P(D | θ) the likelihood—in other words, the probability of observing (or generating) the dataset given the parameters. This is why probabilistic graphical models are sometimes called generative models.

Maximum likelihood estimation aims at finding the value of the parameter θ that maximizes the likelihood P(D | θ), which is written as θ̂ = arg max_θ P(D | θ). It is an optimization problem, in which one searches for the value of θ that maximizes P(D | θ).
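
As a concrete illustration, here is a minimal sketch of maximum likelihood as an optimization problem in R, using synthetic Gaussian data (the data and the variable names are purely illustrative, not part of this chapter's examples):

# synthetic i.i.d. Gaussian data (illustrative only)
set.seed(1)
D <- rnorm(100, mean = 2, sd = 1.5)

# i.i.d. assumption: the log-likelihood is the sum of the log-densities;
# the standard deviation is parameterized on the log scale to keep it positive
neg_log_lik <- function(theta) {
  -sum(dnorm(D, mean = theta[1], sd = exp(theta[2]), log = TRUE))
}

# minimizing the negative log-likelihood maximizes P(D | theta)
fit <- optim(c(0, 0), neg_log_lik)
c(mean = fit$par[1], sd = exp(fit$par[2]))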

If we want to be more precise about θ, we can adopt a Bayesian approach and also give a prior distribution P(θ) over the parameters. In this case, finding the parameter value boils down to finding the value of θ that maximizes P(D | θ)P(θ), which is proportional to the posterior distribution of θ. This is called maximum a posteriori (MAP) estimation.
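
For intuition, here is a minimal sketch of MAP estimation on a hypothetical coin-flip example (the counts and the Beta prior parameters are arbitrary values chosen only for illustration):

# hypothetical data: 7 heads out of 10 flips, with a Beta(2, 2) prior on theta
heads <- 7; n <- 10
a <- 2; b <- 2

# log of P(D | theta) * P(theta), up to an additive constant
log_posterior <- function(theta) {
  dbinom(heads, n, theta, log = TRUE) + dbeta(theta, a, b, log = TRUE)
}

# numerical MAP estimate; the closed form is (heads + a - 1) / (n + a + b - 2)
optimize(log_posterior, interval = c(0, 1), maximum = TRUE)$maximum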

In this chapter, we will start by looking at simple examples of parameter estimation by maximum likelihood and show how to implement them in R. Then we will look at the maximum likelihood estimation of a probabilistic graphical model. Finally, we will look at the harder estimation problem that arises when data is missing, either at random or because of hidden variables. This will give us the opportunity to introduce one of the most important algorithms in machine learning: the EM (Expectation-Maximization) algorithm.

The chapter will be structured as follows:

  • An introduction with a simple example
  • Learning as inference
  • Maximum likelihood
  • The EM algorithm

Introduction

In this chapter, we will learn how to make the computer learn the parameters of a model. Our examples will use datasets we build ourselves or datasets downloaded from various websites. Many datasets are available online, and we will use data from the UCI Machine Learning Repository, made available by the Center for Machine Learning and Intelligent Systems of the University of California, Irvine (UCI).

For example, one of the most famous datasets is the Iris dataset, where each data point represents the characteristics of an iris plant. Different attributes are recorded, such as the sepal length/width and petal length/width.

It is possible to download this dataset and store it in a data.frame in R, as we will do most of the time. Each variable is in a column, and we will assume the observations are i.i.d. to simplify the calculus and computations.

Let's load the dataset first:

x = read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
             col.names = c("sepal_length", "sepal_width",
                           "petal_length", "petal_width", "class"))
head(x)

  sepal_length sepal_width petal_length petal_width       class
1          4.9         3.0          1.4         0.2 Iris-setosa
2          4.7         3.2          1.3         0.2 Iris-setosa
3          4.6         3.1          1.5         0.2 Iris-setosa
4          5.0         3.6          1.4         0.2 Iris-setosa
5          5.4         3.9          1.7         0.4 Iris-setosa
6          4.6         3.4          1.4         0.3 Iris-setosa

As we can see, each observation is represented by one row of the data.frame. Using a data.frame will be very useful later on to simplify the computation of parameters.

We can do some simple estimation with this dataset. For example, if we consider only the first variable, sepal_length, and assume it follows a Gaussian distribution, then the maximum likelihood estimates of the two parameters of a Gaussian distribution (mean and variance) are simply the empirical mean and the empirical variance. In R, it is as simple as this:

mean(x$sepal_length)
[1] 5.848322

var(x$sepal_length)
[1] 0.6865681          
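
Strictly speaking, var() in R uses the unbiased estimator that divides by n − 1, while the maximum likelihood estimator of the Gaussian variance divides by n. For a dataset of this size the difference is negligible, but it can be computed explicitly:

# ML estimate of the variance: rescale the unbiased estimate by (n - 1) / n
n <- length(x$sepal_length)
(n - 1) / n * var(x$sepal_length)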

If we want to deal with discrete variables, as we will do in most of the chapter, we can use the well-known plyr package to simplify our computations:

library(plyr)

Now, we compute a distribution over the class variable in the data.frame by doing:

y = daply(x,.(class),nrow) / nrow(x)
y
      Iris-setosa      Iris-versicolor   Iris-virginica
      0.3288591       0.3355705       0.3355705

It is interesting to see that the distribution is approximately 33% for each class. What we have done here is simply count the number of occurrences of each value in the class column of the data.frame and divide it by the total number of rows. This gives a distribution that could also be used as the prior probability of each class. In this case, our distribution would be roughly uniform.
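
The same distribution can also be obtained with base R's table() function, which is a convenient cross-check of the plyr result:

# count the occurrences of each class and normalize by the number of rows
table(x$class) / nrow(x)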

If we go a bit further, we can also look at the distribution of another variable given a class. Let's assume that, within each class, sepal_length follows a Gaussian distribution with mean µ and variance σ². A simple joint distribution is then given by the following factorization: P(SepalLength, Class) = P(SepalLength | Class) P(Class).

Computing the conditional distribution P(SepalLength | Class) amounts to computing the mean and variance of sepal_length for each value of the class variable. It is done by running:

daply(x,.(class), function(n) mean(n$sepal_length))

    Iris-setosa       Iris-versicolor  Iris-virginica
       5.004082              5.936000        6.588000

And similarly, the variances of each distribution conditioned on the class variable are given by:

daply(x,.(class), function(n) var(n$sepal_length))

    Iris-setosa        Iris-versicolor  Iris-virginica
      0.1266497              0.2664327       0.4043429
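
Both conditional parameters can also be collected in a single pass; here is a small sketch using plyr's ddply with summarize (the column names mu and sigma2 are only illustrative):

# one row per class, with the conditional mean and variance of sepal_length
ddply(x, .(class), summarize,
      mu     = mean(sepal_length),
      sigma2 = var(sepal_length))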

It is therefore very easy to compute conditional distributions using simple R functions. If we want to compute the same thing for discrete distributions, we can use the following code. First, let's transform the sepal_width variable into a discrete variable by discretizing it. It represents a width, so let's say (for the sake of simplicity) that it takes three different values: {small, medium, large}. We can do that automatically with the following code:

q <- quantile(x$sepal_width,seq(0,1,.33))

This computes the 0%, 33%, 66%, and 99% quantiles of the variable sepal_width. Every value below the 33% quantile is small, every value between the 33% and 66% quantiles is medium, and every value above the 66% quantile is large.

q
   0%   33%   66%   99%
2.000 2.900 3.200 4.152

Then we create a new variable in the data.frame, the discretized version of sepal_width, by doing the following:

x$dsw[x$sepal_width < q['33%']] = "small"
x$dsw[x$sepal_width >= q['33%'] & x$sepal_width < q['66%']] = "medium"
x$dsw[x$sepal_width >= q['66%']] = "large"

For each interval defined by the quantiles, we assign the value small, medium, or large to a new column of x called dsw (for discrete sepal width).
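
The same discretization can also be written more compactly with base R's cut() function, using the same quantile breaks (the column name dsw2 is hypothetical, used here only to avoid overwriting dsw):

# [lo, hi) intervals reproduce the < and >= comparisons used above
x$dsw2 <- cut(x$sepal_width,
              breaks = c(-Inf, q['33%'], q['66%'], Inf),
              labels = c("small", "medium", "large"),
              right  = FALSE)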

And finally, we can learn the conditional probability distribution P(dsw | class), as before, by counting:

p1 <- daply(x, .(dsw, class), nrow)

p1
        class
dsw      Iris-setosa Iris-versicolor Iris-virginica
  large        36               5            13
  medium       12              18            18
  small         1              27            19

This gives us the count of each value of dsw when class has a specific value. If we want to transform it into probabilities, we need to divide each column by its sum, because each column represents a probability distribution by itself. Note that writing p1 / colSums(p1) would be wrong here: R recycles the vector of column sums down the rows, so each element would be divided by the wrong sum. A safe way is to use sweep(), which divides each column by its own sum:

p1 <- sweep(p1, 2, colSums(p1), "/")   # divide each column by its sum

And the result is finally:

        class
dsw      Iris-setosa Iris-versicolor Iris-virginica
  large    0.7346939       0.1000000      0.2600000
  medium   0.2448980       0.3600000      0.3600000
  small    0.0204082       0.5400000      0.3800000

Each column now sums to 1, as expected for a conditional distribution.
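
Equivalently, base R's prop.table() performs this column-wise normalization directly on the raw count table:

# margin = 2 normalizes each column (each class) so that it sums to 1
prop.table(daply(x, .(dsw, class), nrow), margin = 2)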

And by using the previous distribution over class, we now have a fully parameterized model of the joint distribution: P(dsw, class) = P(dsw | class) P(class).
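
To make this explicit, here is a small sketch that assembles the joint distribution from the two learned factors, p1 (the conditional table above) and y (the class distribution computed earlier); multiplying each column of p1 by the corresponding prior probability gives a table that sums to 1:

# P(dsw, class) = P(dsw | class) * P(class), applied column by column
joint <- sweep(p1, 2, y, "*")
joint
sum(joint)   # the joint distribution sums to 1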

If we analyze what has been done and try to extract a rule of thumb, we can say that the parameters have been found by counting the occurrences of each value of the discretized sepal_width (dsw) for each value of class. We can also say that we found the parameters of each factor of the distribution separately: once for P(dsw | class) and once for P(class).

In the next sections, we will learn in a more formal way how we can generalize this notion to learn probabilistic graphical models with discrete variables and why, from a theoretical point of view, it works all the time.
