Beta-Binomial

The Beta-Binomial model is another well-known example in which we set a prior distribution on the parameter of the distribution of interest. Here we are going to use a Binomial distribution with a parameter θ. The parameter θ can be seen as the probability that an event will occur or not, or as the proportion of positive events in a sequence of experiments. Therefore, the parameter θ takes its values in [0,1].

Let's first review the Binomial distribution with a simple example: let's say we have a coin and we want to know whether it is fair or not when we play heads or tails. The game is to toss the coin N times and to estimate the probability θ of obtaining a head. This problem is very important because it is the basis of many other models. You can replace the game of heads or tails with the result of an experiment (positive or negative), the answer to a poll (yes or no), or any other binary outcome.

Historically, this model was studied by Thomas Bayes and generalized by Laplace, thus giving birth to the Bayes rule (or, more formally, the Bayes-Laplace rule), as we saw in Chapter 1, Probabilistic Reasoning.

In this problem we will again follow a Bayesian approach: we need a prior distribution on the parameter θ and a likelihood of the data given the parameter, p(x | θ), and we will compute the posterior distribution p(θ | x).

If we want to be complete, we can also compute the posterior predictive distribution p(x′ | x), which is the distribution of a new data point (a new toss) given the previous experiments, obtained by averaging over the parameter θ.

When we assume that all the observations (that is, the results of each toss) are independently and identically distributed we can again write that:

$$p(D \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$$

Here, D = {x1, …, xN} is the dataset.

Is this assumption true? From a theoretical point of view it is, because: (1) every time we toss the coin, the previous toss has no influence on the new one, and (2) we use the same coin for all the tosses, so the parameter θ doesn't change and the distribution of heads and tails stays the same. But is it true in real life? If we assume that each toss microscopically changes the air in the room and rips a few atoms of metal off the coin, then the distribution is, strictly speaking, neither independent nor identical. But the astute reader will have understood that those effects are so negligible that they have absolutely no influence on the results. One would probably have to toss the coin a few billion times to start seeing a difference.

However, when designing such an experiment, one has to be careful about the conditions of the experimentation and be sure that the data is indeed identically and independently distributed. For example, if the experiment is a poll and one puts the same question to people next to each other, one after the other, it is very likely that the second person will be influenced by the answer of the first one. And therefore the data will not be iid anymore.

Now we can start solving our Bayesian problem by assuming distributions for all the required components.

The Bernoulli distribution is a probability distribution that assigns probability θ to a random variable when it takes the value 1 and (1 − θ) when it takes the value 0. We write x ~ Ber(θ), that is, p(x | θ) = θ^x (1 − θ)^(1−x) with x ∈ {0,1}. If we repeat the same Bernoulli experiment many times (that is, if we toss the coin many times), we obtain the dataset D = {x1, …, xN} with the probability distribution:

$$p(D \mid \theta) = \prod_{i=1}^{N} \theta^{x_i} (1 - \theta)^{1 - x_i}$$

Because of the iid assumption and the fact that the product is commutative, if we call N1 the number of heads and N0 the number of tails, we can rewrite this likelihood as:

$$p(D \mid \theta) = \theta^{N_1} (1 - \theta)^{N_0}$$

From there, we can take the logarithm of this expression:

$$\log p(D \mid \theta) = N_1 \log \theta + N_0 \log (1 - \theta)$$
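
For instance, with a small made-up dataset of three heads and two tails, we can evaluate both expressions directly in R (the data and the candidate value θ = 0.6 are arbitrary illustrations):

D <- c(1, 1, 0, 1, 0)                    # made-up data: three heads, two tails
N1 <- sum(D == 1); N0 <- sum(D == 0)
theta <- 0.6                             # an arbitrary candidate value for theta
theta^N1 * (1 - theta)^N0                # likelihood: 0.6^3 * 0.4^2 = 0.03456
N1 * log(theta) + N0 * log(1 - theta)    # log-likelihood: log(0.03456), about -3.365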

Before going further, it's interesting to understand why we so often use logarithms in our computations. The first reason is historical: probabilities are numbers between 0 and 1, and multiplying probabilities is essential when computing the likelihood of iid data. In fact, in such a likelihood, it's not unusual to multiply hundreds or thousands of those probabilities. On early computers, multiplication was very slow compared to addition. Because log(a · b) = log(a) + log(b), it was useful to first transform all the values into logarithms and then add them, which was generally faster. Nowadays, this is not true anymore and multiplication can usually be (almost) as fast as addition; the cost of computing the logarithm can sometimes outweigh the small gain of doing additions instead of multiplications.

The second reason is that, when we multiply numbers between 0 and 1, the result gets smaller and smaller. Even modern computers have a limited capacity to represent numbers; because real values are discretized (usually following the IEEE-754 standard), very small products lose accuracy or underflow to zero, and errors accumulate throughout the computations, leading to very inaccurate if not plainly wrong results. With logarithms, instead, we add numbers between −∞ and 0, which keeps the computations in a well-represented range and preserves the contribution of small values.
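
To see the underflow problem concretely, here is a tiny sketch assuming, purely for illustration, 200 observations that each have probability 0.01:

p <- rep(0.01, 200)    # 200 illustrative probabilities of 0.01 each
prod(p)                # 1e-400 cannot be represented in double precision: the product underflows to 0
sum(log(p))            # the sum of logs is perfectly representable: 200 * log(0.01), about -921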

Usually we take the negative log-likelihood, so that we only have to deal with positive numbers. The other reason is that, when one wants to maximize the likelihood, one can equivalently minimize the negative log-likelihood; and many numerical routines are designed to minimize functions (typically by searching for a zero of their derivative), so the implementation is simpler.

To illustrate this, just plot the following in R:

x <- seq(0.05, 1, 0.05)          # start above 0, since -log(0) is infinite
plot(x, -log(x), t='b', col=1)   # the negative log decreases toward 0 at x = 1
(Figure: the curve of −log(x) for x between 0 and 1, decreasing toward 0 at x = 1)
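
To make the link with optimization explicit, here is a small sketch (the simulated data and the true value θ = 0.7 are only illustrative assumptions) that recovers θ by minimizing the negative log-likelihood with R's optimize function:

set.seed(42)                          # arbitrary seed, for reproducibility
tosses <- rbinom(100, 1, 0.7)         # simulate 100 tosses with an assumed true theta of 0.7
N1 <- sum(tosses)
N0 <- length(tosses) - N1
negloglik <- function(theta) -(N1 * log(theta) + N0 * log(1 - theta))
optimize(negloglik, interval = c(0.001, 0.999))

The minimizer returned in $minimum should be very close to the empirical proportion of heads, N1/100.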

After this digression, let's get back to the problem of a sequence of tosses. We assumed that the random variable representing the result of a toss has a Bernoulli distribution and we calculated the likelihood of a sequence of N tosses.

Now let's consider the problem from a different angle and assume we toss the coin N times, N being known in advance. The question is now: what is the probability of obtaining N1 heads out of N tosses?

On average, the answer will depend on how the coin is biased. And the bias of the coin is known too; it is the parameter θ. If θ = 0.5, then N1 should be close to N/2. We want a probability for each possible value that N1 could take, as a function of θ and N. In this case, it is important to ask the following: how many favorable cases do we have? By favorable cases, we mean: how many sequences of N tosses contain exactly N1 heads? In a small example with N = 3 and N1 = 2, we can have {HHT, HTH, THH}, that is, three possibilities. In general, we want the number of ways of choosing N1 positions out of N, which is the binomial coefficient C(N, N1) = N! / (N1! (N − N1)!). Now remember that, because the tosses are independent, each such sequence has the same probability θ^N1 (1 − θ)^(N−N1), and that the different sequences are mutually exclusive outcomes, so their probabilities add up: P(A ∪ B) = P(A) + P(B). So finally, the probability of obtaining N1 heads in N independent Bernoulli trials with parameter θ is:

$$P(N_1 \mid N, \theta) = \binom{N}{N_1} \theta^{N_1} (1 - \theta)^{N - N_1}$$

This probability distribution is very well-known and is called the Binomial distribution. In fact, the Binomial distribution is usually defined with two parameters: N, the total number of tosses, and θ. It is commonly written as:

$$n \sim \mathrm{Bin}(N, \theta) \quad\text{with}\quad p(n \mid N, \theta) = \binom{N}{n} \theta^{n} (1 - \theta)^{N - n}$$

In R, the binomial distribution is provided by default in the language and it can be used with the following functions:

  • dbinom: Density
  • pbinom: Cumulative
  • qbinom: Quantile
  • rbinom: Random generation
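
For instance, we can check the small example above (2 heads out of N = 3 tosses of a fair coin) directly with these functions:

choose(3, 2)                      # 3 sequences contain exactly 2 heads: {HHT, HTH, THH}
dbinom(2, size = 3, prob = 0.5)   # 3 * 0.5^2 * 0.5 = 0.375
pbinom(2, size = 3, prob = 0.5)   # probability of at most 2 heads: 1 - 0.5^3 = 0.875
rbinom(5, size = 3, prob = 0.5)   # five random draws of the number of heads in 3 tosses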

We can illustrate the distribution of the binomial in R with the little program that follows:

x <- seq(1, 20)
plot(x, dbinom(x, 20, 0.5), t='b', col=1, ylim=c(0, 0.3))   # theta = 0.5 (black)
lines(x, dbinom(x, 20, 0.3), t='b', col=2)                  # theta = 0.3 (red)
lines(x, dbinom(x, 20, 0.1), t='b', col=3)                  # theta = 0.1 (green)
(Figure: Binomial(20, θ) distributions for θ = 0.5, 0.3, and 0.1)

The figure shows three different distributions, with the parameter θ equal to 0.5, 0.3, and 0.1. When θ is small, the probability of obtaining many positive outcomes decreases quickly. When θ is 0.5, the black curve shows that the highest probability is given to 50% positive outcomes, that is, 10 heads out of 20 tosses.

The prior distribution

The next question that arises is: what prior distribution should we use on θ? The Beta distribution is a very common choice and has the nice property of being conjugate to the Binomial and Bernoulli distributions.

In fact the beta distribution has a very nice form, similar to the Binomial and Bernoulli distributions, as follows:

$$p(\theta \mid \alpha, \beta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$

We will add the normalization constant later. The good thing about the Beta distribution is that its domain, that is, the values that θ can take, is [0,1]. Therefore θ from the Beta distribution can also be interpreted as a proportion or a probability and used as a parameter of the Binomial or Bernoulli distributions. This makes the Beta distribution a perfect candidate for the prior distribution. To complete the formula, we have to remember that this distribution is a density, so its integral over its domain must be one. Therefore it is usual to write:

$$p(\theta \mid \alpha, \beta) = \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{\int_0^1 u^{\alpha - 1} (1 - u)^{\beta - 1} \, du}$$

The integral in the denominator is known as the Beta function. In general, we can simply write the density as:

$$\mathrm{Beta}(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)} \, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$

Here, the Gamma function is defined as:

$$\Gamma(x) = \int_0^{\infty} u^{x - 1} e^{-u} \, du$$

When x is an integer, Γ(x) = (x − 1)!.
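
We can check these relationships numerically in R (the values α = 3 and β = 5 are arbitrary illustrations):

a <- 3; b <- 5                                             # arbitrary parameter values
gamma(5)                                                   # Gamma(5) = 4! = 24
beta(a, b)                                                 # the Beta function B(3, 5), about 0.00952
gamma(a) * gamma(b) / gamma(a + b)                         # the same value through the Gamma function
integrate(function(u) u^(a - 1) * (1 - u)^(b - 1), 0, 1)   # the same value through the integral definition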

The posterior distribution with the conjugacy property

Now we need to combine the Binomial distribution with the Beta prior to obtain the posterior distribution. The posterior distribution is obtained by applying the Bayes rule as usual:

$$p(\theta \mid n) = \frac{p(n \mid \theta)\, p(\theta)}{p(n)} \propto p(n \mid \theta)\, p(\theta)$$

And finally, by replacing each distribution with its analytical form, we obtain:

$$p(\theta \mid n) \propto \binom{N}{n} \theta^{n} (1 - \theta)^{N - n} \cdot \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)} \, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$

And this is simply proportional to:

$$p(\theta \mid n) \propto \theta^{n + \alpha - 1} (1 - \theta)^{N - n + \beta - 1}$$

And in fact the last form is exactly the same as the initial form of the Beta distribution. It means that our posterior distribution on θ is also Beta-distributed.

Therefore, we have found our posterior and we can summarize this result as follows:

If n follows a Binomial distribution Binomial(N, θ) and the prior distribution over θ is Beta(α, β), then the posterior distribution on θ will also be a Beta distribution, Beta(α + n, β + N − n).

In this case, thanks to the conjugacy property, we've made a very efficient posterior computation, which boils down to a few additions only. The notion of conjugacy is extremely important in Bayesian reasoning mainly for this property.
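
We can verify this conjugacy result numerically; the prior Beta(2, 2) and the data (7 heads out of 10 tosses) below are made-up values for illustration:

a <- 2; b <- 2                          # assumed prior: Beta(2, 2)
N <- 10; n <- 7                         # made-up data: 7 heads out of 10 tosses
theta <- seq(0.001, 0.999, by = 0.001)
delta <- theta[2] - theta[1]
unnorm <- dbinom(n, N, theta) * dbeta(theta, a, b)   # likelihood times prior
grid_post <- unnorm / (sum(unnorm) * delta)          # brute-force numerical normalization
conjugate <- dbeta(theta, a + n, b + N - n)          # closed form: Beta(9, 5)
max(abs(grid_post - conjugate))                      # small: only the grid approximation error remains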

Which values should we choose for the Beta parameters?

This will depend on what type of information we want to include in the model. For example, we might decide that every value of θ is a priori acceptable and we want to give the same importance to all of them. This is what we did with the Dirichlet distribution in the previous section when adding pseudo-counts of 1.

With the Beta distribution, we obtain a uniform distribution with the parameters Beta(1,1). But we can also give more importance to extreme values close to 0 or 1, with, for example, a Beta(0.5,0.5) distribution. On the other hand, to force θ to stay more centered around 0.5, we can use Beta(2,2), Beta(3,3), and so on: the higher the values, the more probability mass is given to the center. The following R code shows several Beta distributions with different parameter values:

x <- seq(0, 1, length = 100)
par(mfrow = c(2, 2))                # 2 x 2 grid of plots

# each sub-list gives the (alpha, beta) pairs drawn in one panel
param <- list(
  list(c(2,1), c(4,2), c(6,3), c(8,4)),
  list(c(2,2), c(3,2), c(4,2), c(5,2)),
  list(c(1,1), c(2,2), c(3,3), c(4,4)),
  list(c(0.5,0.5), c(0.5,1), c(0.8,0.8)))
for (p in param)
{
  col  <- 1
  leg  <- character(0)
  fill <- integer(0)
  plot(0, 0, type = 'n', xlim = c(0,1), ylim = c(0,4))     # empty frame
  for (v in p)
  {
    lines(x, dbeta(x, v[1], v[2]), col = col)              # Beta(alpha, beta) density
    leg  <- c(leg, paste0("Beta(", v[1], ",", v[2], ")"))
    fill <- c(fill, col)
    col  <- col + 1
  }
  legend(0.65, 4, leg, fill, bty = "n")
}
(Figure: Beta densities for the parameter values listed in the code, shown in a 2 × 2 grid)