Learning by inference

In the introduction to this chapter, we saw that learning can be done in a frequentist way by counting data. In many cases this is sufficient, but it is also a narrow view of the notion of learning. More generally, learning is the problem of integrating data into domain knowledge in order to create a new model or improve an existing one. Learning can therefore be seen as an inference problem, in which one updates an existing model toward a better model.

Let's consider a simple problem: modeling the results of tossing a coin. We want to test whether the coin is fair or not. Let's call θ the probability that the coin lands on heads; a fair coin has θ = 0.5. By tossing the coin several times we want to estimate this probability. Let's say the i-th toss outcome is vi = 1 if the coin shows heads and 0 otherwise. We also assume the tosses do not depend on each other, which means the observations are i.i.d. And finally, we consider the parameter θ to be a random variable too. The joint distribution of a sequence is P(v1, v2, …, vn, θ). Each toss depends on the probability θ, so the model is:

P(v1, v2, …, vn, θ) = P(θ) ∏i P(vi | θ)
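As a quick illustration of the i.i.d. assumption, such a sequence of tosses can be simulated in R with rbinom; the value of theta and the sample size below are illustrative choices of ours, not values from the text:

```r
# Hypothetical sketch: simulating n i.i.d. tosses of a coin with a given theta
set.seed(42)                              # reproducible draws
theta <- 0.5                              # probability of heads
v <- rbinom(20, size = 1, prob = theta)   # v[i] = 1 for heads, 0 for tails
sum(v)                                    # number of heads in the sample
```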

As a graphical model, this can be represented as shown in the following figures:

[Figure: the graphical model of the coin tosses, with the θ node pointing to each vi node, drawn with explicit nodes (left) and in plate notation (right)]

And it's about time we introduced a new notation for graphical models: the plate notation. The left figure is our usual representation, in which we only draw the first and the last of the vi nodes to keep the figure simple. This can sometimes be confusing, and in many cases ambiguous or cumbersome when there are too many nodes. The right side of the figure represents exactly the same graph, where the box (the plate) means that the node(s) inside it are repeated N times.

In the previous section, we saw that learning integrates the data into the model, and in the first chapter we saw that, using the Bayes formula, it is possible to update a probability distribution given new information. Applying the same principle to our present problem, we want to estimate the following probability:

P(θ | v1, v2, …, vn) = P(v1, v2, …, vn | θ) P(θ) / P(v1, v2, …, vn)

This is a simple application, again, of the Bayes formula.

The next step is to specify the various factors of this formula, beginning with the prior P(θ). Intuitively, θ is a continuous variable because it can take any value between 0 and 1. However, we will simplify the problem and stay in the world of discrete variables only. Let's say θ can take three different values, corresponding to a coin biased either way or fair, that is θ ∈ {0.2, 0.5, 0.8}, and we give the prior probabilities:

P(θ = 0.2) = 0.2,  P(θ = 0.5) = 0.75,  P(θ = 0.8) = 0.05

It means we believe that, with 75%, the coin is fair; with 20% it is biased towards tails; and with 5% it is biased towards heads. The next step is to estimate the posterior distribution of θ as given earlier. Note that we will omit the denominator for the time being and use the symbol ∝, which means "is proportional to," instead of a pure equality (=).

Therefore the posterior distribution can be estimated by:

P(θ | v1, …, vn) ∝ P(θ) ∏i θ^I[vi=1] (1 − θ)^I[vi=0]

This formula is not as complex as it seems. First, we replace P(v1, …, vn, θ) by its decomposition as given by the graph. Then we replace each P(vi | θ) by its own expression, which is θ if vi = 1 and (1 − θ) if vi = 0. The function I[·] is equal to 1 if the condition inside the brackets is true, and 0 otherwise; here we use the fact that x^0 = 1.
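As a sanity check, the per-toss likelihood with its indicator exponents can be written directly in R; a logical comparison coerces to 0/1 when used as an exponent (the function name below is ours, not from the text):

```r
# theta^I[v = 1] * (1 - theta)^I[v = 0] for a single toss
lik_one <- function(v, theta) theta^(v == 1) * (1 - theta)^(v == 0)

lik_one(1, 0.8)   # probability of a head under theta = 0.8
lik_one(0, 0.8)   # probability of a tail under theta = 0.8
```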

I will let the reader finish the calculation, and we finally obtain:

P(θ | v1, …, vn) ∝ P(θ) θ^(Σi I[vi=1]) (1 − θ)^(Σi I[vi=0])

The sums in this expression are simply the number of heads (respectively tails) observed in our experiment. If we call these two sums Nhead and Ntail, we can simplify our posterior on θ to:
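In R, these two counts can be read directly off the toss vector (the example sequence below is made up for illustration):

```r
v <- c(1, 0, 0, 1, 1, 0, 1, 1)   # an illustrative sequence of tosses
nh <- sum(v == 1)                # Nhead: number of heads
nt <- sum(v == 0)                # Ntail: number of tails
c(nh, nt)                        # here: 5 heads and 3 tails
```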

P(θ | v1, …, vn) ∝ P(θ) θ^Nhead (1 − θ)^Ntail

So we can finally implement this formula in R and look at the results of this Bayesian learning by inference:

posterior <- function(prob, nh, nt, Theta = c(0.2, 0.5, 0.8))
{
        x <- numeric(3)
        for (i in 1:3)
                # unnormalized posterior: prior times likelihood
                x[i] <- prob[i] * (Theta[i]^nh) * ((1 - Theta[i])^nt)

        norm <- sum(x)
        return(x / norm)
}

In this little function, prob is the vector of prior probabilities for each value of θ, nh and nt are as defined just before, and Theta is the vector of possible values for θ, with the previously mentioned values as defaults. This code could be optimized but we prefer to keep it simple. The most important line is the one implementing our posterior formula. Then the normalization factor is computed and the posterior probability distribution on θ is returned.

Let's play with this function to see what happens when different prior probabilities are given:

posterior(c(0.2,0.75,0.05),2,8)
[1] 6.469319e-01 3.530287e-01 3.948559e-05

posterior(c(0.2,0.75,0.05),8,2)
[1] 0.0003067321 0.6855996202 0.3140936477

posterior(c(0.2,0.75,0.05),5,5)
[1] 0.027643708 0.965445364 0.006910927

posterior(c(0.2,0.75,0.05),10,10)
[1] 0.0030626872 0.9961716410 0.0007656718

posterior(c(0.2,0.75,0.05),50,50)
[1] 5.432096e-11 1.000000e+00 1.358024e-11

We run the following experiments: 2 heads and 8 tails, 8 heads and 2 tails, 5 of each, 10 of each, and 50 of each. Note that the last experiment has a distribution that may not sum exactly to 1, due to rounding errors in the exponentiation. This is something the reader should always test to debug his or her programs: a probability distribution, of course, always sums to 1. In this case, it means we reached the machine precision limit and rounding errors occurred.
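One way to catch this automatically is to compare the sum to 1 up to machine tolerance rather than exactly. The snippet below restates the posterior function in vectorized form (our own variant, not the book's listing) and applies the check with all.equal:

```r
# Vectorized restatement of the posterior function above
posterior <- function(prob, nh, nt, Theta = c(0.2, 0.5, 0.8)) {
  x <- prob * Theta^nh * (1 - Theta)^nt   # prior times likelihood
  x / sum(x)                              # normalize
}

p <- posterior(c(0.2, 0.75, 0.05), 50, 50)
isTRUE(all.equal(sum(p), 1))   # TRUE: equal up to numerical tolerance
```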

Let's analyze the results and see how we can solve this precision problem:

  • 2 heads and 8 tails: the coin is biased toward the small value θ = 0.2 with about 65% probability. It means the coin may well be biased towards tails. But the probability that the coin is fair is still 35%, which is not small either.
  • 8 heads and 2 tails: we obtain the opposite results, but because our prior on the coin (being biased towards heads) was low (P(θ = 0.8) = 0.05), the result is still in favor of a fair coin with 68%.
  • If we obtain an equal number of heads and tails, then the results are strongly in favor of a fair coin, with the probability increasing as the number of experiments increases.

Finally, here is a trick you must know when dealing with such probability computations. When you have to multiply many small numbers like these, use logarithms and additions instead of raw values and multiplications. The new algorithm relies on the equality log(x·y) = log(x) + log(y), and therefore computes x[i] with the following change:

x[i] = exp ( log(prob[i]) + nh*log(Theta[i]) + nt*log(1-Theta[i]) )
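Putting the trick together, a log-space version of the whole function might look like the sketch below; subtracting the maximum log value before exponentiating is a standard extra safeguard against underflow (our addition, not from the text), and the normalization cancels the shift:

```r
posterior_log <- function(prob, nh, nt, Theta = c(0.2, 0.5, 0.8)) {
  # log prior + log likelihood, computed with sums instead of products
  logx <- log(prob) + nh * log(Theta) + nt * log(1 - Theta)
  logx <- logx - max(logx)   # shift so the largest term becomes exp(0) = 1
  x <- exp(logx)
  x / sum(x)                 # normalization cancels the shift
}

posterior_log(c(0.2, 0.75, 0.05), 500, 500)   # safe even for large counts
```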

The last test we do is with a uniform prior distribution, that is, we give equal chances to each possible bias of the coin. Using our function we obtain:

posterior(c(1/3,1/3,1/3),2,8,c(0.2,0.5,0.8))
[1] 0.8727806225 0.1270062963 0.0002130812
posterior(c(1/3,1/3,1/3),8,2,c(0.2,0.5,0.8))
[1] 0.0002130812 0.1270062963 0.8727806225
posterior(c(1/3,1/3,1/3),5,5,c(0.2,0.5,0.8))
[1] 0.08839212 0.82321576 0.08839212

And we can see that, with a uniform prior, the conclusion is driven by the data alone, with the posterior concentrating each time on the value of θ favored by the observations.
