Basic sampling algorithms

We start our study of sampling algorithms by looking at some basic algorithms that can be used as subroutines in more advanced schemes. In all the algorithms in this book, we assume that we can generate random numbers uniformly from the interval [0,1]. The problem of generating random numbers on a computer is vast and complex. In most cases, random numbers are produced by a deterministic algorithm and are therefore called pseudo-random numbers. They are not truly random, but their distribution and properties are close enough to those of a real random source that they can be used as if they were. Pseudo-random number generators are usually based on the evaluation of chaotic functions that are extremely sensitive to their initial condition, called the seed. Changing the seed, even slightly, produces a completely different sequence of numbers. Nowadays, we also have electronic devices that can generate random numbers from physical phenomena such as thermal noise, photoelectric effects, or quantum phenomena.
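As a small illustrative sketch (not part of the original example), the seed of R's pseudo-random generator can be fixed with set.seed; re-using the same seed reproduces exactly the same sequence:

set.seed(42)   # fix the seed of the pseudo-random generator
runif(3)       # a first sequence of three numbers
set.seed(42)   # reset the generator to the same seed
runif(3)       # the exact same three numbers again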

Standard distributions

In R, it is possible to generate random numbers from all the standard distributions directly. However, for the sake of understanding, we will review simple techniques for generating random numbers from other distributions when only a uniform generator is available.

In R, random numbers can be generated with a family of functions beginning with the letter r, such as runif, rnorm, rbeta, rbinom, rcauchy, rgamma, rgeom, rhyper, rlogis, and so on. Similarly, the densities can be evaluated with the functions beginning with d, and the cumulative distribution functions with the functions beginning with the letter p.

For example:

runif(1)
[1] 0.593396

Here, the parameter of the function is the number of random values one wants:

runif(10)
 [1] 0.7334754 0.2519494 0.7332522 0.9194623 0.5867712 0.3880692 0.2869559
 [8] 0.7379801 0.4886681 0.5329107

Of course, these are (pseudo) random numbers, so your results will be different from the ones shown here.
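In the same spirit, the d and p functions evaluate the density and the cumulative distribution function; for example, for the standard normal distribution (a small illustrative sketch):

dnorm(0)
[1] 0.3989423
pnorm(0)
[1] 0.5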

rnorm(1,10,1)
[1] 9.319718

This generates a normally distributed number, with mean 10 and standard deviation 1. If we generate many of these numbers and plot the running mean, we will see the mean converging little by little to its true value. This is the main property we will use throughout this chapter and in sampling algorithms in general.

x<-rnorm(1000,10,1)    # 1,000 samples from N(10,1)
y<-cumsum(x)/(1:1000)  # running mean after each new sample
plot(y,t='l')
abline(h=10)           # true mean
[Figure: the running mean of the 1,000 normal samples, converging to the true mean of 10]

Generating a random number from a simple distribution is the basis of all sampling algorithms. Here, we assume we know how to generate a random number from a uniform distribution. In R, this is done with runif(1,0,1); by default, the min and max parameters are already 0 and 1, so runif(1) will do.

Suppose we transform the uniformly distributed values x with a function f such that y = f(x). The distribution of y will be:

p(y) = p(x) |dx/dy|

We need a function f(x) such that y is distributed according to the desired distribution p(y). Integrating p(y), we have:

x = h(y) = ∫_{−∞}^{y} p(u) du

Therefore, y = h⁻¹(x), which is the inverse function of the indefinite integral of the desired distribution. Let's take a simple example with the exponential distribution. This distribution is continuous, with a density function p(y) = λe^(−λy) and a support on [0, +∞[. Integrating p(y) gives h(y) = 1 − e^(−λy), which is the cumulative distribution function of the exponential distribution. Inverting it, x = 1 − e^(−λy) gives e^(−λy) = 1 − x, and therefore y = −(1/λ)ln(1 − x).

As a side note, the exponential distribution is useful to describe the lengths of inter-arrival times in a Poisson process. It can be viewed as the continuous version of the geometric distribution.

Therefore, if we transform our uniformly distributed variable x with the function y = h⁻¹(x) = −(1/λ)ln(1 − x), then y will be exponentially distributed.

We can check it experimentally by plotting the distribution of our function and comparing it with the distribution of an exponential distribution. We take lambda=2 in this example:

x<-runif(20000)                                   # 20,000 points from U(0,1)
inv_h<-function(x,lambda) -(1/lambda)*log(1-x)    # inverse of the exponential CDF
hist(inv_h(x,2),breaks=100,freq=F)                # histogram of the transformed points, as densities
t<-seq(0,4,0.01)
lines(t,dexp(t,2),lwd=2)                          # true exponential density for comparison

We first generate 20,000 points from a uniform U(0,1) distribution. Then, inv_h is the function defined earlier and we plot a histogram of the transformed values. Note the parameter freq=F, which draws densities instead of frequencies. Finally, we draw the density function of an exponential distribution with the same parameter (the thick black line in the following graph) and see that the two distributions, the empirical one and the analytic one, are a very good match.

The following screenshot shows the result:

[Figure: histogram of the transformed samples with the exponential density dexp(t, 2) overlaid as a thick black line]
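As an additional check (a minimal sketch, not part of the original example), we can compare the empirical mean with the theoretical mean 1/lambda = 0.5 and run a Kolmogorov-Smirnov test against the exponential distribution:

mean(inv_h(x,2))                    # should be close to 1/lambda = 0.5
ks.test(inv_h(x,2),"pexp",rate=2)   # test the sample against the Exp(2) distribution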

The problem with this technique is the evaluation of the indefinite integral and its inverse. In simple cases, they are readily available, but this is not always so. In such cases, we need another strategy, and the key idea is to approximate the complex distribution with a simpler distribution that we can sample from. There are two basic techniques to do that. One is called rejection sampling: it draws samples from the simple distribution and accepts a sample only if it falls under the more complex distribution; otherwise, the sample is rejected. The other technique is called importance sampling: in this case, samples from the approximate distribution are weighted to take into account how much they differ from the original distribution one wants to sample from.
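To make rejection sampling more concrete, here is a minimal sketch; the Beta(2,2) target, the uniform proposal, the constant k, and the function names are illustrative choices, not taken from the text. We draw a candidate from the proposal, draw a uniform height under the envelope k*q(x), and keep the candidate only if that height falls under the target density:

target<-function(x) dbeta(x,2,2)      # density we want to sample from
k<-1.5                                # envelope constant: the maximum of dbeta(x,2,2) is 1.5 at x=0.5
rejection_sample<-function(n){
  samples<-numeric(0)
  while(length(samples)<n){
    x<-runif(1)                       # candidate from the proposal q(x)=1 on [0,1]
    u<-runif(1)                       # uniform height under the envelope k*q(x)
    if(u < target(x)/k) samples<-c(samples,x)   # accept if the point falls under the target density
  }
  samples
}
hist(rejection_sample(5000),breaks=50,freq=F)
t<-seq(0,1,0.01)
lines(t,dbeta(t,2,2),lwd=2)           # compare with the true Beta(2,2) density

On average, the fraction of accepted candidates is 1/k, so a tighter envelope wastes fewer samples.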

Both techniques are important and are used as a basis for more advanced techniques, such as Markov chain Monte Carlo (MCMC), which we will see in the second part of this chapter.

In all cases, one of the main ideas is to use a proposal distribution, whose goal is to approximate, even roughly, the distribution we want to sample from. In the following sections, we will call q(x) the proposal distribution and p(x) the distribution we want to sample from.
