Chapter 4. Bayesian Modeling – Basic Models

After learning how to represent graphical models, how to compute posterior distributions, how to learn parameters with maximum likelihood estimation, and even how to learn the same models when data is missing and variables are hidden, we are going to delve into the problem of modeling using the Bayesian paradigm. In this chapter, we will see that some simple problems are not easy to model and compute and will require specific solutions. First of all, inference is a difficult problem and the junction tree algorithm only solves specific cases. Second, the representation of the models has so far been based on discrete variables only.

In this chapter we will introduce simple, yet powerful, Bayesian models, and show how to represent them as probabilistic graphical models. We will see how their parameters can be learned efficiently, by using different techniques, and also how to perform inference on those models in the most efficient way. The algorithms we will see are adapted to these models and take into account the specificity of each.

And, for the first time, we will start to use variables with continuous support—that is, random variables that can take any value as a number—and not just a finite number of discrete values.

We will look at simple models that can be used as a basic component for more advanced solutions. These models are fundamental and we will go from very simple things to more advanced problems, such as Gaussian mixture models. All these models are heavily used and have a nice Bayesian representation that we will present throughout this chapter.

More specifically, we will be interested in the following models:

  • The Naive Bayes model and its extension, used mainly for classification
  • The Beta-Binomial model, which is one of the most fundamental Bayesian models
  • Gaussian mixture models, among the most widely used clustering models

The Naive Bayes model

The Naive Bayes model is one of the most well-known classification models used in machine learning. Despite its simple appearance, this model is very powerful and gives good results with little effort. Of course, when considering a classification problem, one should not always settle on a single model, such as Naive Bayes, but should try out several models to see which one works best on a particular dataset.

Classification is an important problem in machine learning, and it can be defined as the task of assigning each observation to a class. Let's say we have a dataset with n variables and we assign a class to each data point. The class could be {0,1}, {a,b,c,d}, {red, blue, green, yellow}, or {warm, cold}, and so on. We will see that it is sometimes easier to consider binary classification problems, where one has only two classes. But most classification models can be extended to more than two classes.

For example, given physiological characteristics, we can classify animals into mammals or reptiles. Given the words used in an email, we can classify it as a junk email or a legitimate email. Given a credit record and other financial data, we can classify a client as trusted for a loan or not.

Try the next little example to see a (not-so-)obvious classification problem.

library(MASS)                          # mvrnorm() comes from the MASS package

set.seed(1)                            # for reproducibility
Sigma <- matrix(c(10,3,3,2),2,2)       # common covariance matrix
x1 <- mvrnorm(100, c(1,2), Sigma)      # 100 points from the first class (red)
x2 <- mvrnorm(100, c(-1,-2), Sigma)    # 100 points from the second class (green)
plot(x1, col=2, xlim=c(-5,5), ylim=c(-5,5))
points(x2, col=3)
(Figure: scatter plot of the two simulated classes, red and green, in the plane defined by the two variables)

This example shows a bivariate classification problem with two classes, red and green. The two variables are represented on the x axis and the y axis. The problem seems obvious but it is not, because the boundary between the red class and the green class is not clearly defined. This is typical of real-world problems.

In this case, we can still draw a reasonably clear line through the middle to separate the two classes. But sometimes that is not possible and a straight line won't work. When a line can separate the two classes, we call it linear classification. When a curved boundary is needed, we call it non-linear separation.
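To make the idea of a linear separation concrete, the following short sketch reuses x1 and x2 from the previous snippet and overlays a straight line as a candidate decision boundary. The slope of -0.5 is a hand-picked, illustrative choice, not a line learned from the data.

plot(x1, col = 2, xlim = c(-5,5), ylim = c(-5,5))
points(x2, col = 3)
abline(a = 0, b = -0.5, lty = 2)   # a hand-picked candidate linear decision boundary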

The way we estimate the quality of a classifier is by looking at the error rate. We want the lowest possible error rate; that is, every time the classifier predicts a class for a data point, it should be right. However, depending on the classification problem, different errors can have different consequences. For example, in a medical classification problem, classifying a patient as ill when he or she is not is presumably less dangerous than classifying the patient as healthy and letting him or her go with an undetected illness.
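In R, the error rate can be estimated by comparing predicted and true labels. The following is a minimal sketch assuming two vectors of the same length, actual and predicted, which are hypothetical placeholders for the output of any classifier.

# Hypothetical vectors of true and predicted labels
actual    <- factor(c("red","red","green","green","red","green"))
predicted <- factor(c("red","green","green","green","red","red"))

table(predicted, actual)                 # confusion matrix
error_rate <- mean(predicted != actual)  # proportion of misclassified points
error_rate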

Obviously, we want the classifier to be as accurate as possible, and the general rule when building classifiers is to concentrate our effort on the difficult cases.

Representation

The Naive Bayes model is a probabilistic classification model with N random variables X1, …, XN as features and one random variable C as the class variable. The main (and very strong) assumption made in this model is that, given the class, the features are independent. This assumption seems too strong to be useful, yet surprisingly the model gives good results in many real situations.

The joint probability distribution in the Naive Bayes model is:

P(C, X_1, \dots, X_N) = P(C) \prod_{i=1}^{N} P(X_i \mid C)

It is represented by the following graphical model:

(Figure: the Naive Bayes graphical model—the class node C with a directed edge to each feature node X1, …, XN)

This is in fact a very simple graphical model, and you can see from the graph why conditioning on the class makes all the feature variables independent of each other.

Therefore, by using Bayes' rule, given a new data point X' we can compute the most probable class with:

c^* = \arg\max_{c} P(C = c \mid X') = \arg\max_{c} P(C = c) \prod_{i=1}^{N} P(X_i = x'_i \mid C = c)

To make the problem simpler, we will interpret all the Xi variables, as well as the class variable C, as binary variables. However, the theory stays the same if the variables have more than two possible values. In fact, the model is essentially unchanged even if we consider continuous features. For example, for real-valued features, we can use Gaussian distributions and have:

P(X_i = x \mid C = c) = \mathcal{N}(x \mid \mu_{ic}, \sigma_{ic}^2)

Here, N represents a Gaussian distribution, with a mean μic and a variance σ²ic specific to feature i and class c.
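As a minimal sketch of what this means in practice (the means and standard deviations below are made-up, illustrative values), the class-conditional density of a single Gaussian feature can be evaluated with dnorm:

# Hypothetical per-class parameters for one continuous feature
mu    <- c(setosa = 5.0, versicolor = 5.9, virginica = 6.6)     # class means
sigma <- c(setosa = 0.35, versicolor = 0.52, virginica = 0.64)  # class standard deviations

x <- 6.1                          # a new observed feature value
dnorm(x, mean = mu, sd = sigma)   # P(X = x | C = c) for each class c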

When the features are binary the result is the same except that one uses the Bernoulli distribution for the X features:

P(X_i = x \mid C = c) = \theta_{ic}^{x} (1 - \theta_{ic})^{1 - x}

Here, x can take values in {0,1} and θic is the probability of Xi being 1 given the class c.

Learning the Naive Bayes model

Learning a Naive Bayes model is extremely simple. Recalling what we saw in Chapter 3, Learning Parameters, it is easy to show that, in the case of binary features with a binary class variable, the maximum likelihood estimate for each θic is:

\hat{\theta}_{ic} = \frac{N_{ic}}{N_c}

Here, Nic is the number of data points for which Xi = 1 and C = c, and Nc is the number of data points for which C = c.

As for the class variable, it is even simpler: πc = Nc / N, where N is the total number of data points.
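The following is a minimal sketch of this counting estimation on synthetic binary data, followed by the computation of a class posterior for a new point. All variable names (C, X1, X2, theta, and so on) are illustrative choices, not part of any package.

# Synthetic binary data: a class variable and two binary features
set.seed(2)
n  <- 200
C  <- rbinom(n, 1, 0.4)                        # binary class variable
X1 <- rbinom(n, 1, ifelse(C == 1, 0.8, 0.2))   # binary features whose
X2 <- rbinom(n, 1, ifelse(C == 1, 0.3, 0.6))   # distribution depends on the class

# Maximum likelihood estimates: theta_ic = N_ic / N_c and pi_c = N_c / N
N_c   <- table(C)
pi_c  <- N_c / n
theta <- rbind(X1 = tapply(X1, C, mean),       # mean of a 0/1 vector = proportion of ones
               X2 = tapply(X2, C, mean))

# Class posterior (up to a constant) for a new point x' = (1, 0)
xnew  <- c(X1 = 1, X2 = 0)
score <- sapply(c("0", "1"), function(cl)
  pi_c[cl] * prod(theta[, cl]^xnew * (1 - theta[, cl])^(1 - xnew)))
score / sum(score)                             # normalized posterior P(C | x')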

The reason for that is the same as in the previous chapter. In order to understand why, we need to write the likelihood of this model. For one data point (x, c), the probability is:

P(x, c \mid \theta, \pi) = P(c \mid \pi) \prod_{i=1}^{N} P(x_i \mid c, \theta_i)

Knowing that the class can only take values in {0,1} in the case of a binary classifier, we therefore have:

P(x, c \mid \theta, \pi) = \prod_{c' \in \{0,1\}} \left[ \pi_{c'} \prod_{i=1}^{N} \theta_{ic'}^{x_i} (1 - \theta_{ic'})^{1 - x_i} \right]^{\mathbb{1}(c = c')}

And therefore, for a dataset of M data points, the log-likelihood is:

\ell(\theta, \pi) = \sum_{c} N_c \log \pi_c + \sum_{i=1}^{N} \sum_{c} \sum_{n : c^{(n)} = c} \left[ x_i^{(n)} \log \theta_{ic} + (1 - x_i^{(n)}) \log (1 - \theta_{ic}) \right]

In order to maximize this function, we see that we can optimize each term individually, leading to the simple form we obtained for each parameter. So, naturally, it gives exactly the same results as general graphical models.
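As a short worked step (standard maximum likelihood calculus, not specific to this book), maximizing the class part of the log-likelihood under the constraint that the πc sum to one recovers the counting estimate:

% Lagrangian for: maximize sum_c N_c log(pi_c) subject to sum_c pi_c = 1
\mathcal{L} = \sum_{c} N_c \log \pi_c + \lambda \Big(1 - \sum_{c} \pi_c\Big),
\qquad
\frac{\partial \mathcal{L}}{\partial \pi_c} = \frac{N_c}{\pi_c} - \lambda = 0
\;\Rightarrow\; \pi_c = \frac{N_c}{\lambda}
% Summing over c and using the constraint gives lambda = sum_c N_c = N, hence:
\hat{\pi}_c = \frac{N_c}{N}

The same argument, applied to each θic, gives Nic / Nc.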

Instead of implementing the model manually, we will use an R package named e1071. If you don't have it yet, you can install and load it by doing:

install.packages("e1071")
library(e1071)

This provides a full implementation of the Naive Bayes model. We can now load data and look at some results:

data(iris)
model <- naiveBayes(Species ~ ., data = iris)
model    # printing the model shows the estimated parameters

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
    setosa versicolor  virginica
 0.3333333  0.3333333  0.3333333

Conditional probabilities:
            Sepal.Length
Y             [,1]      [,2]
  setosa     5.006 0.3524897
  versicolor 5.936 0.5161711
  virginica  6.588 0.6358796

            Sepal.Width
Y             [,1]      [,2]
  setosa     3.428 0.3790644
  versicolor 2.770 0.3137983
  virginica  2.974 0.3224966


            Petal.Length
Y             [,1]      [,2]
  setosa     1.462 0.1736640
  versicolor 4.260 0.4699110
  virginica  5.552 0.5518947

            Petal.Width
Y             [,1]      [,2]
  setosa     0.246 0.1053856
  versicolor 1.326 0.1977527
  virginica  2.026 0.2746501

This example needs a bit of explanation. For each numerical feature, the conditional probabilities table gives the class-conditional mean (column [,1]) and standard deviation (column [,2]). The laplace parameter controls the Laplace smoothing of the counts, which helps the model when the data is not perfectly balanced or when some feature values are never observed together with some class. We will come back to this problem later, but it is one of the main issues one has to deal with in most classification problems.
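As a quick illustration (the variable name model_smoothed is ours), Laplace smoothing is requested simply by passing the laplace argument. Note that in e1071 it only affects categorical (factor) predictors, so on the purely numeric iris features it changes nothing.

# Request add-one (Laplace) smoothing of the count-based tables;
# it only matters for factor predictors, which iris does not have
model_smoothed <- naiveBayes(Species ~ ., data = iris, laplace = 1)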

By using this model and trying to predict (or infer) the class, we obtain the following:

p <- predict(model, iris)                       # predicted class for every data point
hitrate <- sum(p == iris$Species) / nrow(iris)  # proportion of correct predictions

And we obtain a hit rate of 0.96; that is, 96% of the data points were correctly classified. This is great, but bear in mind that we used the training dataset to compute this percentage. You cannot estimate the real performance of a classification model using only the data points you used to train it. Ideally, we should split the dataset in two; let's say we use 1/3 to test the model and 2/3 to train it. The split should be done at random:

ni <- sample(1:nrow(iris), 2*nrow(iris)/3)   # indices of the training points (2/3)
no <- setdiff(1:nrow(iris), ni)              # indices of the held-out test points (1/3)
model <- naiveBayes(Species ~ ., data = iris[ni,])
p <- predict(model, iris[no,])

Here, ni and no are vectors of data point indices drawn at random from the initial dataset.
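To close the loop, the held-out hit rate can be computed in the same way as before, this time on the test indices only (a straightforward sketch, using the objects defined just above):

# Hit rate on the held-out test set only
hitrate_test <- sum(p == iris$Species[no]) / length(no)
hitrate_test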

Bayesian Naive Bayes

This model, despite its name, is not a Bayesian model. To be fully Bayesian, we should place priors on the parameters. In the Naive Bayes model, the parameters are those of the class variable, πc, and those of the feature variables, θic. These parameters have been estimated with maximum likelihood, but what happens if the dataset is unbalanced? What happens to the parameters if the dataset lacks points for certain combinations of values? We end up with very poor estimators and, in the worst case, with zeros for the under-represented parameters. This is obviously something we don't want, because the results will be completely wrong, giving too much importance to some feature values or classes and none to the rest.

This problem is called over-fitting. One simple solution to over-fitting is to use a Bayesian approach and include extra information in the model that says: "If a configuration is not represented in the data, then assume it has a very small probability, but not a zero probability."

One elegant and simple solution to this problem is to use prior distributions on the parameters of the model and develop the model in a Bayesian way. Let's make a few assumptions in order to simplify the calculus. First of all, we will assume that all the feature variables take values in the same finite set of S possible values. You can easily generalize this to a different number of values for each feature, but it keeps the presentation simpler. Then we will assume a factored prior on the θ feature parameters, as follows:

p(\theta) = \prod_{i=1}^{N} \prod_{c} p(\theta_{ic})

Here, θ represents all the parameters. In order to make it clear, we use the following notation:

  • θ represents all the feature parameters and π the class parameters
  • θi represents all the parameters of the variable i—that is, the parameters of the conditional distribution p(Xi | C)
  • θic represents the parameters of the variable i for the class value c—that is, the parameters of the conditional distribution p(Xi | C = c)
  • θics represents the parameter of the probability p(Xi = s | C = c)—that is, because Xi is a discrete multinomial variable (in fact, it's more accurate to say categorical than multinomial here), the θics satisfy θic1 + … + θicS = 1

And because we just mentioned the multinomial distribution, it is important to note that the Dirichlet distribution is the conjugate distribution for the multinomial (and the categorical) distribution. If we consider all the θics to be random variables, and no longer just simple parameters, we need to give them an a priori probability distribution. We will assume they are Dirichlet distributed, for two reasons:

  • The Dirichlet distribution is a distribution over a vector of values that sum to 1, which matches the well-known constraint that θic1 + … + θicS = 1. So far, nothing new.
  • The Dirichlet distribution is the conjugate prior for the multinomial distribution. This means that, if a data point has a categorical or multinomial distribution and the prior distribution on the parameters is a Dirichlet (as in our case), then the posterior distribution on the parameters is also a Dirichlet, which simplifies all of our computations (see the short derivation sketch just after this list).
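As a quick reminder of why conjugacy helps (standard textbook material, written here with generic symbols rather than anything specific to this chapter), multiplying a Dirichlet prior by a multinomial or categorical likelihood yields another Dirichlet whose parameters are simply the prior parameters plus the observed counts:

p(\theta \mid \alpha) = \mathrm{Dir}(\theta \mid \alpha_1, \dots, \alpha_S) \propto \prod_{s=1}^{S} \theta_s^{\alpha_s - 1},
\qquad
p(\mathcal{D} \mid \theta) \propto \prod_{s=1}^{S} \theta_s^{N_s}

p(\theta \mid \mathcal{D}, \alpha) \propto \prod_{s=1}^{S} \theta_s^{\alpha_s + N_s - 1} = \mathrm{Dir}(\theta \mid \alpha_1 + N_1, \dots, \alpha_S + N_S)

Here, Ns is the number of observations taking the value s.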

In fact, conjugacy is a very powerful tool in Bayesian data analysis.

In practice it works as follows:

  • Let's say that α is the concentration parameter—that is, the parameter of the Dirichlet distribution Dir(α).
  • So we assume that the θ's are Dirichlet distributed—that is, p(θic | α) = Dir(α).
  • And, of course, we know that our feature variables have a categorical or multinomial distribution.
  • Therefore, the posterior distribution of the parameters of Xi, after counting the data (as we did before), will be a Dirichlet Dir(α + Nic), where Nic is the vector of counts we computed before! It's as simple as that, thanks to the conjugacy.

So finally, if we want to incorporate the Dirichlet prior into our computations, the posterior estimates of the parameters for the class variable become:

\bar{\pi}_c = \frac{N_c + \alpha_c}{N + \sum_{c'} \alpha_{c'}}

Here, Nc is, as before, the number of data points with class c, and N is the total number of data points.

And the prior distribution of π is a Dir(α), where α = (α1, …, αC).

For the parameters of the feature variables, the solution is exactly the same:

\bar{\theta}_{ics} = \frac{N_{ics} + \beta_s}{N_c + \sum_{s'} \beta_{s'}}

Here, Nics is the number of data points for which Xi = s and C = c.

And the prior distribution of θic is a Dir(β), where β = (β1, …, βS).

Wait! Is it really as simple as this? Well, yes it is, thanks to the conjugacy in this Bayesian model. If you look carefully at the formulas, you will see that none of the πc and θics can be equal to zero anymore, because of the values of α and β. Indeed, the definition of the Dirichlet distribution requires its parameters to be strictly positive.

So the last problem we need to solve is choosing values for α and β. One common choice is to take 1 for all of them. In terms of Dirichlet distributions, it means we choose a uniform prior for all the parameters of the class and feature variables: we allow our parameters to take any value with equal prior probability, except of course 0. Choosing different values for α and β will lead to different results. We can try to promote certain values by pushing the Dirichlet distribution in one direction or another or, on the contrary, try to keep all parameters at similar values.

If you choose 1 for all the Dirichlet parameters, you obtain what is called Laplace smoothing, which we saw before as the laplace argument of the naiveBayes function in the e1071 R package. It is sometimes also called a pseudo-count, because it can be seen as artificially adding one example of every possible case to your dataset.
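As a minimal sketch with made-up counts (the numbers below are illustrative only), the smoothed estimate simply adds one pseudo-count to the numerator and S pseudo-counts to the denominator:

# Hypothetical counts of X_i = s given a class c, for S = 3 possible values
N_ics <- c(12, 0, 3)        # the value s = 2 was never observed with this class
N_c   <- sum(N_ics)
S     <- length(N_ics)

theta_mle      <- N_ics / N_c               # maximum likelihood: a hard zero appears
theta_smoothed <- (N_ics + 1) / (N_c + S)   # Laplace smoothing (Dirichlet prior with all betas = 1)
theta_mle
theta_smoothed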

But the Dirichlet prior is not the only possible prior we can use. In the case of binary variables, another distribution of interest is the Beta distribution. In the next section we will present more formally the Beta-Binomial model and see its relation to the Dirichlet-Multinomial model we just saw. We will see that the results are similar and also how to play with the parameters of the Beta distribution in order to describe different types of prior for our class and feature variables.
