Chapter 17

Introduction to Monte Carlo and Simulation Methods

17.1 Introduction

The Monte Carlo method is a numerical technique that allows scientists to analyze various natural phenomena and compute complicated quantities by means of repeated generation of random numbers. For example, if you make several thousand tosses of a fair (i.e., balanced) coin, then you may notice that the long-run ratio of a count of heads to the total number of tosses is approaching one half. If the limiting ratio is not close to one half, then you may conclude that the coin is not balanced. Similarly, researchers can compute other more complicated quantities, although in reality nobody tosses coins and throws dice. Instead, computer simulations are used. First, a computer experiment that produces some quantitative information about a random phenomenon of interest needs to be designed. After that, a computer is used to perform independent repeated random simulations and then to calculate averages of results. Such averages are used to approximate the quantity of interest. In particular, the two fundamental theorems, namely, the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT), allow researchers to construct a valid approximation of the quantity of interest and estimate the approximation error.

The term “Monte Carlo method” was coined in the 1940s by John von Neumann, Stanislaw Ulam, and Nicholas Metropolis while they were working on a nuclear weapons project at the Los Alamos National Laboratory. It was named after a famous casino in Monte Carlo, where Ulam’s uncle often gambled away his money. The range of problems that can be analyzed by Monte Carlo methods is vast. Many quantities such as the probability of default of a financial company, the distribution of galaxies in the universe, and characteristics of a nuclear reactor can be computed. Monte Carlo methods are especially useful for simulating complex systems with coupled degrees of freedom and with significant uncertainty in inputs, such as the calculation of risk in business. Monte Carlo methods are also used to evaluate multidimensional integrals and to numerically solve large systems of equations. The success and popularity of Monte Carlo methods are partly explained by the enormous progress in computing: powerful computers are now inexpensive.

The Monte Carlo method is a very popular computational tool in financial economics. Many typical problems such as the optimal allocation of financial assets, pricing of derivative contracts, and evaluation of business risk can be numerically solved by means of repeated generation of possible market scenarios. For example, we can calculate the no-arbitrage value of an option written on the asset by repeatedly generating sample paths of the underlying asset price process. The simplest possible model is the binomial price model. At each time step, the price may go up or down by a constant factor. Every possible scenario can be represented by a random walk on the binomial tree. An estimate of the option price is computed by averaging payoff values calculated for independent realizations of such a random walk.

17.1.1 The “Hit-or-Miss” Method

A typical application of the Monte Carlo method is the computation of areas and volumes of objects of complex shape and geometry. This type of Monte Carlo computation is based on two ideas: geometric probability and the frequentist definition of probability. The latter tells us that the likelihood of some event E can be calculated as a long-run ratio of the number of successful trials to the total number of trials. By a success we mean here a trial where the event E occurs. Count the number of times, $N_{E} (n)$, the event E occurs in the first n trials that are performed independently and under identical conditions. The probability ℙ(E) is approximated as

$ℙ (E) \approx \frac{N_{E} (n)}{n} .$

Now let us link the notion of probability to the area of a figure (or the volume of a multidimensional domain in the general case). Consider a planar figure D contained completely within a unit square $S = {[0, 1]}^{2}$. Select a point at random uniformly on the square. This can be done by sampling two independent identically distributed (i.i.d.) Cartesian coordinates uniformly in the interval [0, 1]. According to the principle of geometric probability, the likelihood that such a randomly chosen point belongs to D is equal to the ratio of the area of D, denoted |D|, to the area of the unit square (which equals one). Choose at random n points independently and uniformly on the square. Let $N_{D} (n)$ points fall on the figure D. On the one hand, the ratio $N_{D} (n) / n$ is approximately equal to the probability that a random point chosen uniformly on the square lies on D. On the other hand, this probability equals the ratio of areas of D and S. Thus, we obtain the following approximation of the area of D:

$| D | \approx \frac{N_{D} (n)}{n} \cdot | S | .$
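The approximation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the figure D is taken to be the quarter of the unit disk lying in the unit square (so that |D| = π/4 is known), and the function names `hit_or_miss_area` and `in_quarter_disk` are hypothetical, not taken from the text.

```python
import math
import random

def hit_or_miss_area(inside, n, seed=0):
    """Estimate the area of a figure D contained in the unit square [0, 1]^2.

    `inside(x, y)` must return True when the point (x, y) belongs to D.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        # Two i.i.d. Unif(0, 1) coordinates give a uniform point on the square.
        x, y = rng.random(), rng.random()
        if inside(x, y):
            hits += 1
    # |D| is approximated by N_D(n) / n times |S|, and |S| = 1 here.
    return hits / n

def in_quarter_disk(x, y):
    """Assumed example figure: the quarter disk x^2 + y^2 <= 1 in the unit square."""
    return x * x + y * y <= 1.0

estimate = hit_or_miss_area(in_quarter_disk, n=200_000)
exact = math.pi / 4
print(estimate, exact)
```

With a few hundred thousand points the estimate typically agrees with π/4 to two or three decimal places; the exact digits depend on the seed.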

17.1.2 The Law of Large Numbers

The Monte Carlo method is based on two fundamental laws, namely, the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT). There are several versions of the LLN. Here are some of them.

Theorem 17.1

(Borel’s LLN). Suppose that an experiment with an uncertain outcome is repeated a large number of times independently and under identical conditions. Then, for any event E we have

$ℙ (E) = \lim_{n \to \infty} \frac{N_{E} (n)}{n},$

where $N_{E} (n)$ is the number of times the event E occurs in the first n trials.

Borel’s LLN provides us with a theoretical foundation for the frequentist interpretation of probability. On the other hand, the probability of an event E can be expressed as a mathematical expectation of an indicator function of E: $ℙ (E) = E [I_{E}]$ . Thus, Borel’s LLN is a special case of a general LLN provided just below.

Let $\{X_{k}\}_{k \geq 1}$ be a sequence of i.i.d. random variables with common finite mean $μ_{X}$, which are all defined on some probability space (Ω, ℱ, ℙ). Let ${\bar{X}}_{n} = \frac{1}{n} \sum_{k = 1}^{n} X_{k}$ denote the arithmetic average of the first n variables.

Theorem 17.2

(Chebyshev’s Weak LLN). It is true that ${\bar{X}}_{n} \overset{p}{\to} μ_{X}$ , as n → ∞. That is, $\forall ε > 0 : \ ℙ (| {\bar{X}}_{n} - μ_{X} | > ε) \to 0$ , as n → ∞.

Theorem 17.3

(Kolmogorov’s Strong LLN). It is true that ${\bar{X}}_{n} \overset{a . s .}{\to} μ_{X}$ , as n → ∞. That is, $ℙ (\{ ω : {\bar{X}}_{n} (ω) \to μ_{X} \text{ as } n \to \infty \}) = 1$ .

The LLN provides us with a recipe for estimation of quantities of interest. For a given unknown Q, we first construct a random variable X so that Q = μX = E[X], i.e., we find the probabilistic representation of the quantity Q. The random variable X is called an estimator of the quantity Q. It is an unbiased estimator of Q, meaning that E[X] = Q. Let X1, X2, . . . , Xn be i.i.d. variates all having the same probability distribution as that of X. Following the LLN, construct a sample estimator of Q, ${\bar{X}}_{n} = \frac{1}{n} \sum_{k = 1}^{n} X_{k}$ , which converges almost surely to Q, as n → ∞. That is, the quantity Q can be approximated by an average of n independent sample values:

$Q \approx {\bar{x}}_{n} = \frac{1}{n} \sum_{k = 1}^{n} x_{k},$

where xk = Xk(ω), 1 ≤ k ≤ n, for some outcome ω ∊ Ω. To obtain independent sample values {xk}k≥1, it suffices to construct a sequence of statistically independent numbers (or almost statistically independent pseudorandom numbers) sampled from the target distribution of X as follows. First, we sample from the uniform distribution Unif(0, 1). After that, the independent sample values (also called realizations or draws) xk, 1 ≤ k ≤ n, can be generated from a sequence of uniform random numbers by using a special transformation algorithm. One example of such algorithms is the inverse cumulative distribution function (CDF) method, which is based on the representation $X \overset{d}{=} F_{X}^{- 1} (U)$ , where $F_{X}^{- 1}$ is the (generalized) inverse of the CDF of X and U ~ Unif(0, 1).

In what follows, we will distinguish the sample mean estimator ${\bar{X}}_{n}$ , which is a random quantity equal to an average of n i.i.d. variates, and a sample mean estimate ${\bar{x}}_{n}$ , which is a nonrandom quantity equal to an average of n statistically independent realizations of the estimator X. While the former is important for the theoretical analysis (e.g., to construct a confidence interval), the latter is used in practice to approximate Q.
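The recipe above (represent Q as an expectation, sample by the inverse CDF method, and average) can be illustrated with a minimal sketch. The choice of the exponential distribution with rate λ = 2, for which $F_{X}^{- 1} (u) = - \ln (1 - u) / λ$ and Q = E[X] = 1/λ, is an assumption made only for this example; the helper names are hypothetical.

```python
import math
import random

def sample_exponential(rate, rng):
    """Draw from Exp(rate) by inverting the CDF F(x) = 1 - exp(-rate * x)."""
    u = rng.random()                  # u lies in [0, 1), so 1 - u lies in (0, 1]
    return -math.log(1.0 - u) / rate  # F^{-1}(u)

def sample_mean(draw, n, seed=0):
    """Sample mean estimate: the average of n independent realizations."""
    rng = random.Random(seed)
    return sum(draw(rng) for _ in range(n)) / n

rate = 2.0
x_bar = sample_mean(lambda rng: sample_exponential(rate, rng), n=200_000)
print(x_bar, 1.0 / rate)  # the estimate should be close to Q = 1 / rate
```

Here `x_bar` plays the role of the nonrandom sample mean estimate ${\bar{x}}_{n}$ for one particular sequence of draws; changing the seed gives another realization of the estimator.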

17.1.3 Approximation Error and Confidence Interval

The next goal is to estimate the approximation error. The Central Limit Theorem (CLT) provides a solution to this problem.

Theorem 17.4

(The CLT). Consider a sequence $\{X_{k}\}_{k \geq 1}$ of i.i.d. variates with finite common variance $σ_{X}^{2}$ and expected value $μ_{X}$. Then,

$\frac{{\bar{X}}_{n} - μ_{X}}{σ_{X} / \sqrt{n}} \overset{d}{\to} \mathrm{Norm} (0, 1), \text{ as } n \to \infty .$

That is, $ℙ \left( \frac{{\bar{X}}_{n} - μ_{X}}{σ_{X} / \sqrt{n}} \leq z \right) \to N (z)$ , as n → ∞, for all z ∊ ℝ, where $N$ denotes the CDF of the standard normal distribution Norm(0, 1).

With the help of the CLT, we can construct a confidence interval for the mathematical expectation μX. We have that

$ℙ \left( \left| \frac{{\bar{X}}_{n} - μ_{X}}{σ_{X} / \sqrt{n}} \right| \leq z \right) \to ℙ (| Z | \leq z) = 2 N (z) - 1 \text{ for all } z \geq 0, \text{ where } Z \sim \mathrm{Norm} (0, 1) .$

We wish to make this probability as close to one as possible. Let us fix a confidence level 1 − α ∊ (0, 1) with α ≪ 1. Solve the equation

$2 N (z) - 1 = 1 - α \Leftrightarrow N (z) = 1 - \frac{α}{2}$

for z. Since $N$ is a strictly increasing function of z, the solution is unique. It is the so-called (1 − α/2)-quantile of Norm(0, 1), denoted by $z_{α / 2}$. Table 17.1 contains commonly used normal quantiles.

Table 17.1

Commonly used normal quantiles $z_{α / 2}$ that solve $1 - N (z_{α / 2}) = \frac{α}{2}$ .

Confidence level (%)	α	$z_{α / 2}$
90	0.1	1.645
95	0.05	1.960
95.46	0.0454	2.000
99	0.01	2.576
99.74	0.0026	3.000
99.9	0.001	3.29
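The entries of the table can be reproduced numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (available since Python 3.8) to solve $N (z) = 1 - α / 2$; the helper name `normal_quantile` is hypothetical.

```python
from statistics import NormalDist

def normal_quantile(alpha):
    """Return z_{alpha/2}, the (1 - alpha/2)-quantile of Norm(0, 1)."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for alpha in (0.1, 0.05, 0.01, 0.001):
    level = 100.0 * (1.0 - alpha)
    print(f"{level:g}%  alpha={alpha:g}  z={normal_quantile(alpha):.3f}")
```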

For large values of n, the distribution of ${\bar{X}}_{n}$ is approximately normal. By the CLT we obtain

$| {\bar{X}}_{n} - μ_{X} | \leq \frac{z_{α / 2} σ_{X}}{\sqrt{n}} \text{ with probability } \approx 1 - α .$

Hence, $ℙ (({\bar{X}}_{n} - \frac{z_{α / 2} σ_{X}}{\sqrt{n}}, {\bar{X}}_{n} + \frac{z_{α / 2} σ_{X}}{\sqrt{n}}) ∍ μ_{X}) \approx 1 - α$ . Replacing the average ${\bar{X}}_{n}$ by its sample value ${\bar{x}}_{n}$ gives us a confidence interval for μX:

$\left( {\bar{x}}_{n} - \frac{z_{α / 2} σ_{X}}{\sqrt{n}}, {\bar{x}}_{n} + \frac{z_{α / 2} σ_{X}}{\sqrt{n}} \right) ∍ μ_{X} \text{ with the confidence level of } (1 - α) .$

Typically, the variance $σ_{X}^{2}$ is unknown. However, it can be approximated by the sample variance:

$σ_{X}^{2} \approx s_{n}^{2} : = \frac{1}{n} \sum_{k = 1}^{n} {(x_{k} - {\bar{x}}_{n})}^{2} = \frac{1}{n} \sum_{k = 1}^{n} x_{k}^{2} - {\bar{x}}_{n}^{2} .$

As is seen from the formula of the confidence interval, the relative error $\frac{z_{α / 2} σ_{X}}{\sqrt{n} μ_{X}}$ is of order $O (n^{- 0.5})$ , as n → ∞. This fact points out the main drawback of the Monte Carlo method: a slow rate of convergence. For example, to decrease the error by a factor of 10, the number of sample values needs to be increased by a factor of 100. The quantity $\frac{s_{n} / \sqrt{n}}{{\bar{x}}_{n}}$ is often referred to as an accuracy measure for the sample estimate ${\bar{x}}_{n}$ .
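A minimal sketch of the confidence interval computation follows. It assumes draws from Unif(0, 1) (so that the true mean 1/2 is known) and uses the sample variance with the 1/n normalization from the formula above; the function name `mc_confidence_interval` is hypothetical. Comparing 1,000 draws with 100,000 draws illustrates the $O (n^{- 0.5})$ rate: the half-width shrinks by a factor of about 10.

```python
import math
import random
from statistics import NormalDist

def mc_confidence_interval(draws, alpha=0.05):
    """Return the sample mean and the half-width z_{alpha/2} * s_n / sqrt(n)."""
    n = len(draws)
    x_bar = sum(draws) / n
    # Sample variance s_n^2 = (1/n) * sum(x_k^2) - x_bar^2.
    s2 = sum(x * x for x in draws) / n - x_bar ** 2
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(max(s2, 0.0) / n)
    return x_bar, half_width

rng = random.Random(1)
draws = [rng.random() for _ in range(100_000)]  # assumed target: Unif(0, 1)

x_bar, half_width = mc_confidence_interval(draws)
x_bar_small, half_width_small = mc_confidence_interval(draws[:1000])
ratio = half_width_small / half_width  # roughly sqrt(100) = 10

print(f"n=100000: {x_bar:.4f} +/- {half_width:.4f}")
print(f"n=1000:   {x_bar_small:.4f} +/- {half_width_small:.4f}")
```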

17.1.4 Parallel Monte Carlo Methods

One of the main advantages of the Monte Carlo method is the ease of its parallelization. In the simplest case, independent CPUs can calculate partial expectations and then a head CPU computes the final average and the confidence interval. Consider a computational cluster, where CPUs are numbered from 1 through ℓ. Let CPU #i generate $n_{i}$ independent draws, $\{x_{1}^{i}, \ldots, x_{n_{i}}^{i}\}$ . The partial averages

${\bar{x}}^{(i)} = \frac{x_{1}^{i} + x_{2}^{i} + \dots + x_{n_{i}}^{i}}{n_{i}} \quad \text{and} \quad {\bar{y}}^{(i)} = \frac{{(x_{1}^{i})}^{2} + {(x_{2}^{i})}^{2} + \dots + {(x_{n_{i}}^{i})}^{2}}{n_{i}}$

are calculated and then their values are sent to the head CPU. The total number of draws is $n = n_{1} + n_{2} + \cdots + n_{ℓ}$. To construct a confidence interval for the mean value, we only need to compute the sample mean ${\bar{x}}_{n}$ and the sample variance $s_{n}^{2}$ as follows:

$\begin{array}{l} {\bar{x}}_{n} = \frac{1}{n} \sum_{i = 1}^{ℓ} n_{i} {\bar{x}}^{(i)} = \frac{1}{n} \sum_{i = 1}^{ℓ} \sum_{j = 1}^{n_{i}} x_{j}^{i} \\ s_{n}^{2} = \frac{1}{n} \sum_{i = 1}^{ℓ} n_{i} {\bar{y}}^{(i)} - {\bar{x}}_{n}^{2} = \frac{1}{n} \sum_{i = 1}^{ℓ} \sum_{j = 1}^{n_{i}} {(x_{j}^{i})}^{2} - {\bar{x}}_{n}^{2} . \end{array}$

Note that the numbers $n_{i}$, i = 1, 2, . . . , ℓ, can be chosen so that all CPUs finish their jobs almost simultaneously. If the CPUs have similar performance, then we can simply set $n_{1} = n_{2} = \cdots = n_{ℓ}$.
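The combination step performed by the head CPU can be sketched as follows. This is a sequential simulation only: the hypothetical `worker` function stands in for one CPU, the draws are assumed to come from a normal distribution with mean 1 and standard deviation 2, and giving each worker a different seed is an illustrative shortcut rather than a guarantee of independent streams.

```python
import random

def worker(n_i, seed):
    """Simulate one CPU: return (n_i, mean of draws, mean of squared draws)."""
    rng = random.Random(seed)
    draws = [rng.gauss(1.0, 2.0) for _ in range(n_i)]
    x_bar_i = sum(draws) / n_i
    y_bar_i = sum(x * x for x in draws) / n_i
    # The raw draws are returned only so that the result can be checked below.
    return (n_i, x_bar_i, y_bar_i), draws

def combine(partials):
    """Head CPU: merge the partial averages into x_bar_n and s_n^2."""
    n = sum(n_i for n_i, _, _ in partials)
    x_bar = sum(n_i * xb for n_i, xb, _ in partials) / n
    s2 = sum(n_i * yb for n_i, _, yb in partials) / n - x_bar ** 2
    return n, x_bar, s2

results = [worker(n_i, seed) for seed, n_i in enumerate([1000, 2500, 4000])]
partials = [p for p, _ in results]
n, x_bar, s2 = combine(partials)

# Check against the statistics of the pooled sample.
pooled = [x for _, d in results for x in d]
pooled_mean = sum(pooled) / len(pooled)
pooled_var = sum((x - pooled_mean) ** 2 for x in pooled) / len(pooled)
print(n, x_bar, s2)
```

Only three numbers per CPU travel to the head node, so the communication cost does not grow with the number of draws.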

17.1.5 One Monte Carlo Application: Numerical Integration

Suppose that we wish to evaluate a definite integral of an integrable function g : ℝ → ℝ on a finite interval [a, b]: $I = \int_{a}^{b} g (x) d x$ . Let us consider several Monte Carlo methods of approximating I.

The “hit-or-miss” method is based on the geometric interpretation of a definite integral. Suppose that the function g is nonnegative and bounded: 0 ≤ g ≤ M. If a point with Cartesian coordinates (X, Y) is chosen uniformly on the rectangle R = [a, b] × [0, M], then the probability of the event {Y ≤ g(X)} (i.e., the event that (X, Y) belongs to the domain D bounded by the graph of g and the lines x = a, x = b, and y = 0) is equal to the ratio of two areas, $\frac{| D |}{| R |}$ . Since |D| = I and |R| = (b − a)M, we obtain the following approximation:

$I \approx (b - a) M \frac{N_{D} (n)}{n},$

where n statistically independent random points $(x_{k}, y_{k})$, 1 ≤ k ≤ n, are sampled uniformly on the rectangle R, and $N_{D} (n) / n$ is the fraction of points belonging to D. This result can be viewed as an application of the LLN. Let us introduce an indicator function $I_{D} (X, Y)$ , which equals 1 if (X, Y) ∊ D and equals 0 otherwise. Then, the expected value of this indicator is $E [I_{D} (X, Y)] = ℙ (Y \leq g (X)) = \frac{| D |}{| R |}$ . Therefore, $E [| R | \cdot I_{D} (X, Y)] = | D | = I$ and applying the LLN gives

$\frac{1}{n} \sum_{k = 1}^{n} | R | I_{D} (x_{k}, y_{k}) = (b - a) M \frac{N_{D} (n)}{n} \to I, \text{ as } n \to \infty .$

For the sample mean method, the representation of the integral I in the form of a mathematical expectation w.r.t. a uniform probability distribution is used:

$I = (b - a) \int_{a}^{b} g (x) \frac{1}{b - a} d x = E [(b - a) g (X)],$

where X ~ Unif(a, b). Therefore, by the LLN we have $I \approx \frac{1}{n} \sum_{k = 1}^{n} (b - a) g (x_{k})$ , where {xk}k≥1 are independent draws from Unif(a, b). In comparison with the “hit-or-miss” method, g is only required to be integrable on the interval (a, b).
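Both estimators can be compared on a simple test integral. The sketch below assumes g(x) = sin x on [0, π] with the bound M = 1, so that I = 2 exactly; the function names are hypothetical.

```python
import math
import random

def hit_or_miss_integral(g, a, b, M, n, rng):
    """Hit-or-miss estimate of the integral of g over [a, b], with 0 <= g <= M."""
    hits = 0
    for _ in range(n):
        x = a + (b - a) * rng.random()  # X ~ Unif(a, b)
        y = M * rng.random()            # Y ~ Unif(0, M)
        if y <= g(x):
            hits += 1
    return (b - a) * M * hits / n

def sample_mean_integral(g, a, b, n, rng):
    """Sample mean estimate: the average of (b - a) * g(X) with X ~ Unif(a, b)."""
    total = sum(g(a + (b - a) * rng.random()) for _ in range(n))
    return (b - a) * total / n

rng = random.Random(0)
n = 200_000
est_hit = hit_or_miss_integral(math.sin, 0.0, math.pi, 1.0, n, rng)
est_mean = sample_mean_integral(math.sin, 0.0, math.pi, n, rng)
print(est_hit, est_mean)  # both should be close to 2
```

The sample mean method needs no bound M and uses one uniform draw per point instead of two.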

The weighted method generalizes the previous approach. Suppose that there exists a probability density function (PDF) f whose support is (a, b), where (a, b) can be a finite, semi-infinite, or infinite interval. Then, the integral I can be represented in the form of a mathematical expectation w.r.t. the PDF f:

$I = \int_{a}^{b} \frac{g (x)}{f (x)} f (x) d x = E \left[ \frac{g (X)}{f (X)} \right], \text{ where } X \sim f,$

provided that the ratio $\frac{g (x)}{f (x)}$ is defined for all x ∊ (a, b) and is an integrable function of x.

The random variable $\frac{g (X)}{f (X)}$ is the estimator of I. Thus, we approximate $I \approx \frac{1}{n} \sum_{k = 1}^{n} \frac{g (x_{k})}{f (x_{k})}$ where {xk}k≥1 are independent draws from the PDF f.

Clearly, there are many PDFs that can be used in the weighted method. The importance sampling principle, which is discussed in Section 17.5, explains how to choose f. The optimal PDF f that minimizes the variance of the random estimator $\frac{g (X)}{f (X)}$ , where X ~ f, i.e., the PDF that solves the minimization problem

$Var (\frac{g (X)}{f (X)}) \to \min_{f},$

is a function proportional to |g|:

$f (x) = \frac{| g (x) |}{\int_{a}^{b} | g (t) | d t}, x \in (a, b) .$
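The effect of the choice of f can be seen in a small sketch. As an assumed example, take $g (x) = x^{2}$ on (0, 1), so that I = 1/3, and compare two densities sampled by inversion: f(x) = 2x (with $X = \sqrt{U}$) and the optimal f(x) = 3x² proportional to |g| (with $X = U^{1 / 3}$). All function names are hypothetical.

```python
import math
import random

def weighted_estimate(g, f, draw, n, rng):
    """Average of g(X) / f(X) over n independent draws X from the PDF f."""
    total = 0.0
    for _ in range(n):
        x = draw(rng)
        total += g(x) / f(x)
    return total / n

def g(x):
    return x * x

# Density f(x) = 2x on (0, 1): CDF x^2, inverse CDF sqrt(u).
def f_linear(x):
    return 2.0 * x

def draw_linear(rng):
    return math.sqrt(1.0 - rng.random())  # 1 - u lies in (0, 1], so x > 0

# Optimal density f(x) = 3x^2, proportional to |g|: CDF x^3, inverse u^(1/3).
def f_optimal(x):
    return 3.0 * x * x

def draw_optimal(rng):
    return (1.0 - rng.random()) ** (1.0 / 3.0)

rng = random.Random(0)
est_linear = weighted_estimate(g, f_linear, draw_linear, 100_000, rng)
est_optimal = weighted_estimate(g, f_optimal, draw_optimal, 100_000, rng)
print(est_linear, est_optimal)
```

With the optimal density the ratio g(X)/f(X) is the constant 1/3, so the estimator has zero variance. In practice this choice is not directly usable because normalizing |g| requires the very integral being computed.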

17.2 Generation of Uniformly Distributed Random Numbers

A central part of every stochastic simulation algorithm is a generator of random numbers, which produces a sequence of statistically independent samples (or draws) from a given distribution. As we demonstrate in the following sections, the sampling from a nonuniform probability distribution can be reduced to the sampling from the uniform distribution on (0, 1), denoted Unif(0, 1), by applying certain procedures such as transformation methods and acceptance-rejection techniques. Our main priority is to have available a reliable method of sampling from Unif(0, 1). Therefore, we first concentrate on methods of generating statistically independent (pseudo-)random numbers uniformly distributed on (0, 1).

The process of obtaining truly random numbers can be quite complicated. There exist different types of physical generators of random numbers. The simplest ones are balanced coins, dice, and even a roulette wheel. More advanced generators are based on physical phenomena such as thermal noise. However, such hardware generators share common drawbacks: slow speed, lack of portability, possible nonuniformity of the numbers generated, and the impossibility of reproducing the same sequence of draws. Algorithmic (software) generators of random numbers solve most of these issues, although they create new ones. Numbers generated by such algorithms only mimic truly random numbers; hence the generated draws are called pseudo-random numbers (PRNs). However, the statistical properties of software generators have been scrutinized extensively, so we can trust the numbers obtained.

17.2.1 Uniform Probability Distributions

As is well known, the distribution of a continuous random variable X can be characterized by its PDF, denoted fX, or a cumulative distribution function (CDF), denoted FX. For a continuous random variable U uniformly distributed on (0, 1),

$f_{U} (x) = I_{(0, 1)} (x) = \begin{cases} 1 & \text{if } 0 < x < 1, \\ 0 & \text{otherwise,} \end{cases} \quad F_{U} (x) = \int_{- \infty}^{x} f_{U} (t) d t = \begin{cases} 0 & x \leq 0, \\ x & 0 < x < 1, \\ 1 & 1 \leq x . \end{cases}$

The mathematical expectation and variance of U are, respectively,

$E [U] = \int_{- \infty}^{\infty} x f_{U} (x) d x = \int_{0}^{1} x d x = \frac{1}{2}, \quad Var (U) = E [U^{2}] - {(E [U])}^{2} = \frac{1}{3} - \frac{1}{4} = \frac{1}{12} .$

The continuous uniform distribution on (a, b) with a < b, denoted Unif(a, b), reduces to the case with the interval (0, 1) as follows. Let X ~ Unif(a, b) and U ~ Unif(0, 1). Then, we have

$X \overset{d}{=} a + (b - a) U, F_{X} (x) = F_{U} (\frac{x - a}{b - a}), f_{X} (x) = \frac{1}{b - a} f_{U} (\frac{x - a}{b - a}) = \frac{1}{b - a} I_{(a, b)} (x) .$

Now consider the multidimensional case. Suppose that a point X is chosen at random in a domain $D ⊂ ℝ^{m}$ with a finite volume |D| so that X is equally likely to lie anywhere in D. Then, the random vector $X = {[X_{1}, X_{2}, \ldots, X_{m}]}^{⊤}$ is said to have a multivariate uniform distribution in D; its multivariate PDF is

$f_{X} (x) = \frac{1}{| D |} I_{D} (x) = \begin{cases} \frac{1}{| D |} & x \in D, \\ 0 & \text{otherwise,} \end{cases} \quad x = {[x_{1} \ x_{2} \ \dots \ x_{m}]}^{⊤} \in ℝ^{m} .$

In the case of a hyperparallelepiped $D = (a_{1}, b_{1}) × (a_{2}, b_{2}) × \cdots × (a_{m}, b_{m})$ with $a_{i} < b_{i}$, 1 ≤ i ≤ m, we have

$f_{X} (x) = \prod_{i = 1}^{m} f_{X_{i}} (x_{i}) = \prod_{i = 1}^{m} \frac{1}{b_{i} - a_{i}} I_{(a_{i}, b_{i})} (x_{i}) .$

Since the multivariate PDF fX is a product of m univariate PDFs $f_{X_{i}} = \frac{1}{b_{i} - a_{i}} I_{(a_{i}, b_{i})}$ , the entries of the vector X are independent uniformly distributed random variables. Therefore, the simulation of a vector uniformly distributed in a hyperparallelepiped reduces to sampling from Unif(0, 1).
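The reduction to Unif(0, 1) can be sketched by applying the transformation X = a + (b − a)U coordinate by coordinate. The box bounds below are arbitrary values chosen for illustration, and the function name `sample_box` is hypothetical.

```python
import random

def sample_box(bounds, rng):
    """Draw one point uniformly from the box (a_1, b_1) x ... x (a_m, b_m).

    Each coordinate is an independent a_i + (b_i - a_i) * U with U ~ Unif(0, 1).
    """
    return [a + (b - a) * rng.random() for a, b in bounds]

bounds = [(0.0, 2.0), (-1.0, 1.0), (10.0, 20.0)]  # assumed example box
rng = random.Random(0)
points = [sample_box(bounds, rng) for _ in range(50_000)]

# Coordinate-wise averages should be close to the midpoints 1, 0, and 15.
means = [sum(p[i] for p in points) / len(points) for i in range(len(bounds))]
print(means)
```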

17.2.2 Linear Congruential Generator

Many algorithms for generating pseudo-random numbers (PRNs for short) uniformly distributed in [0, 1] have the form of an iterative rule, $u_{t + 1} = F (u_{t})$, t = 0, 1, . . ., where both the range and the domain of the function F are [0, 1] and the initial seed $u_{0} \in [0, 1]$ is given. Suppose that a sequence of PRNs $\{u_{t}\}_{t \geq 0}$ is generated by such a rule. Combine the numbers in pairs to obtain points $(u_{t - 1}, u_{t}) \in {[0, 1]}^{2}$, t = 1, 3, 5, . . . . On the one hand, these points are situated on the curve y = F(x). On the other hand, they have to be uniformly distributed in the unit square ${[0, 1]}^{2}$, so the points should fill out the square without leaving gaps. Therefore, the plot of F should provide a sufficiently dense filling of the square. One example of a function that has such a property is the mapping y = {Mx}, where {x} denotes the fractional part of x, i.e., $\{x\} = x - ⌊ x ⌋$ , where $⌊ \cdot ⌋$ is the floor function, and M is a large positive number called a multiplier. The plot of y = {Mx} consists of parallel segments and the distance between them goes to zero as M → ∞ (see Figure 17.2).

Figure 17.2: The plot of y = {Mx} with M = 15.

Proposition 17.5

(Voitishek and Mikhaĭlov (2006)). Consider the transformation y = {Mx + a}, where {x} denotes the fractional part of x, and M ∊ ℤ and a ∊ ℝ are positive constants.

If U ~ Unif(0, 1), then {MU + a} ~ Unif(0, 1).
Let $U_{0}$ ~ Unif(0, 1) and let the sequence $\{U_{k}\}_{k \geq 1}$ be generated from the rule $U_{k + 1} = \{M U_{k}\}$. Then, $U_{k}$ ~ Unif(0, 1) and $Corr (U_{0}, U_{k}) = Corr (U_{n}, U_{n + k}) = M^{- k}$ for all n ≥ 0 and k ≥ 0.

Proof. First, we prove the fact that X := {MU} ~ Unif(0, 1). By the definition of the fractional part function, we have X ∊ [0, 1). Moreover, X = 0 iff MU is an integer, which happens with probability zero. Hence ℙ(X ∊ (0, 1)) = 1. For x ∊ (0, 1), we have

$ℙ (X \leq x) = \sum_{k = 0}^{M - 1} ℙ (k \leq M U \leq k + x) = \sum_{k = 0}^{M - 1} ℙ (\frac{k}{M} \leq U \leq \frac{k}{M} + \frac{x}{M}) = \sum_{k = 0}^{M - 1} \frac{x}{M} = x .$

Here, we use the fact that 0 ≤ MU ≤ M and {z} ∊ [0, x] iff $z \in \cup_{k \in ℤ} [k, k + x]$ . Thus, the CDF of X is FX(x) = x for x ∊ (0, 1). Therefore, X ~ Unif(0, 1).

Second, let us prove that {MU + a} ~ Unif(0, 1). Note that

$\{M U + a\} = \{\{M U\} + \{a\}\} \overset{d}{=} \{U + b\},$

where b = {a} ∊ [0, 1). Show that ℙ({U + b} ≤ x) = x for x ∊ (0, 1). Consider two cases. (x ≤ b): Since {U + b} ≤ x iff 1 ≤ U + b ≤ 1 + x, we have

$ℙ (\{U + b\} \leq x) = ℙ (\underbrace{1 - b}_{\in (0, 1)} \leq U \leq \underbrace{1 + x - b}_{\in (0, 1)}) = (1 + x - b) - (1 - b) = x .$

(b < x): Since {U + b} ≥ x iff x ≤ U + b ≤ 1, we have

$ℙ (\{U + b\} \geq x) = ℙ (\underbrace{x - b}_{\in (0, 1)} \leq U \leq \underbrace{1 - b}_{\in (0, 1)}) = (1 - b) - (x - b) = 1 - x .$

Therefore, ℙ({U + b} ≤ x) = 1 − ℙ({U + b} ≥ x) = 1 − (1 − x) = x.

Finally, let us prove the last assertion. By induction, all $U_{k}$ ~ Unif(0, 1), k = 0, 1, 2, . . . . Clearly, for any k ≥ 1, the pairs $(U_{n}, U_{n + k})$, n ≥ 0, all have the same distribution. Hence, we only need to prove that $Corr (U_{0}, U_{k}) = M^{- k}$ for all k ≥ 0. Denote $r_{k} = Corr (U_{0}, U_{k})$. Let us show that $r_{k} = \frac{1}{M} r_{k - 1}$ for k ≥ 1. Since $r_{0} = Corr (U_{0}, U_{0}) = 1$, the assertion will then follow by induction. A uniform random variable MU ~ Unif(0, M) can be expressed as a sum of its integer and fractional parts: $M U = ⌊ M U ⌋ + \{M U\}$ . Clearly, $⌊ M U ⌋ \sim \mathrm{Unif} \{0, 1, \ldots, M - 1\}$ (since $ℙ (⌊ M U ⌋ = k) = ℙ (M U \in [k, k + 1)) = \frac{1}{M}$ for 0 ≤ k ≤ M − 1) and {MU} ~ Unif(0, 1) (as proved above). For any k = 0, 1, . . . , M − 1 and any x ∊ (0, 1), we have

$ℙ (⌊ M U ⌋ = k, \ \{M U\} \leq x) = ℙ (M U \in [k, k + x]) = \frac{x}{M} = ℙ (⌊ M U ⌋ = k) \, ℙ (\{M U\} \leq x) .$

Therefore, the fractional and integer parts are independent random variables. It is also true that $E [⌊ M U ⌋] = \frac{M - 1}{2}$ and $Var (⌊ M U ⌋) = \frac{M^{2} - 1}{12}$ . Moreover, we have

$\begin{matrix} \frac{U - E [U]}{\sqrt{Var (U)}} = & \frac{U - \frac{1}{2}}{\frac{1}{\sqrt{12}}} = \frac{M U - \frac{M}{2}}{\frac{M}{\sqrt{12}}} \\ = & \frac{⌊ M U ⌋ + \{M U\} - \frac{(M - 1)}{2} - \frac{1}{2}}{\frac{M}{\sqrt{12}}} \\ = & \sqrt{\frac{M^{2} - 1}{M^{2}}} \left( \frac{⌊ M U ⌋ - \frac{(M - 1)}{2}}{\sqrt{\frac{(M^{2} - 1)}{12}}} \right) + \frac{1}{M} \left( \frac{\{M U\} - \frac{1}{2}}{\frac{1}{\sqrt{12}}} \right) . \end{matrix}$

Now we are ready to calculate rk:

$r_{k} = Corr (U_{0}, U_{k}) = \sqrt{\frac{M^{2} - 1}{M^{2}}} Corr (⌊ M U_{0} ⌋, U_{k}) + \frac{1}{M} Corr (\{M U_{0}\}, U_{k}) .$

We proved that $⌊ M U_{0} ⌋$ and {MU0} = U1 are independent random variables. Moreover, since Uk = {MUk−1} = {M {MUk−2}} = · · · = g(U1) for some g : ℝ → ℝ, the random variable Uk as a function of U1 is independent of $⌊ M U_{0} ⌋$ as well. Thus, $Corr (⌊ M U_{0} ⌋, U_{k}) = 0$ and

$\begin{array}{rl} r_{k} = & Corr (U_{0}, U_{k}) = \frac{1}{M} Corr (\{M U_{0}\}, U_{k}) = \frac{1}{M} Corr (U_{1}, U_{k}) \\ = & \frac{1}{M} Corr (U_{0}, U_{k - 1}) = \frac{1}{M} r_{k - 1} . \end{array}$

According to Proposition 17.5, a sequence {Ut, t = 0, 1, . . .} of random numbers uniformly distributed in (0, 1) can be generated from one random number U0 ~ Unif(0, 1) by using the rule Ut = {MUt−1} for t ≥ 1. We shall call this method a multiplicative method. Although the numbers obtained are dependent, the correlation between any two members of the sequence is negligible and rapidly goes to zero as the distance between the numbers in the sequence increases. Thus, the transformation y = {Mx + a} with a suitable choice of M and a can be used to generate pseudo-random numbers having good statistical properties such as uniformity and independence of samples. A drawback of the multiplicative method is that multidimensional tuples formed from the sequence {Ut}t≥0 lie on a family of multidimensional hyperplanes in the unit hypercube. For example, the points (U0, U1), (U2, U3), . . . lie on a family of M parallel lines.
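The correlation formula of Proposition 17.5 can be checked empirically. The sketch below assumes the small multiplier M = 3 (so that the correlations 1/3 and 1/9 are easy to see) and uses floating-point arithmetic, which is adequate for a couple of iterations but not for long sequences; the helper names `frac` and `corr` are hypothetical.

```python
import math
import random

def frac(x):
    """Fractional part {x} = x - floor(x)."""
    return x - math.floor(x)

def corr(xs, ys):
    """Sample correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / math.sqrt(vx * vy)

M = 3  # assumed small multiplier, chosen so the correlation is visible
rng = random.Random(0)
u0 = [rng.random() for _ in range(200_000)]  # U_0 ~ Unif(0, 1)
u1 = [frac(M * u) for u in u0]               # U_1 = {M U_0}
u2 = [frac(M * u) for u in u1]               # U_2 = {M U_1}

r1, r2 = corr(u0, u1), corr(u0, u2)
print(r1, 1 / M)       # expected to be close to M^{-1}
print(r2, 1 / M ** 2)  # expected to be close to M^{-2}
```

For the large multipliers used in practice the same correlations are far below the sampling noise of such an experiment, which is the sense in which they are negligible.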

The linear congruential generator (LCG) of PRNs is one of the oldest and simplest methods. It was proposed by Lehmer (1951). This algorithm is based on the mapping y = {Mx + a} but to avoid round-off errors the generating rule is written in a different form. As an example, let us consider the function y = {Mx} and suppose that x is a ratio of two integers: $x = \frac{s}{m}$ . Then,

$\left\{ M \frac{s}{m} \right\} = \frac{M s}{m} - ⌊ \frac{M s}{m} ⌋ = \frac{M s - ⌊ \frac{M s}{m} ⌋ m}{m} = \frac{(M s) \mod m}{m} .$

The operation (s mod m) returns the remainder of an integer s after division by m:

$s \mod m : = s - ⌊ s / m ⌋ \cdot m$

This operation is also called the reduction of s modulo m; the result is called the residue of s modulo m.

The LCG works as follows. First, a sequence {st, t = 0, 1, . . .} of integers is generated. The initial number s0 (called the initial seed) is chosen by the user and all subsequent numbers are calculated from the rule

$s_{t} = (M s_{t - 1} + a) \mod m, \quad t = 1, 2, \ldots \qquad (17.1)$

By construction, all integers st lie between 0 and m − 1. Then, the pseudo-random numbers in [0, 1) are obtained by dividing these integers by m:

$u_{t} = \frac{s_{t}}{m}, \quad t = 0, 1, 2, \ldots \qquad (17.2)$

The LCG is equivalent to the recurrence $u_{t} = \{M u_{t - 1} + a / m\}$ (that is, the mapping y = {Mx + a} with the real shift a/m), where all $u_{t}$ are of the form $\frac{s}{m}$ with integers s and m, but it avoids round-off errors. Here the multiplier M, the increment a, and the modulus m are integers such that 1 ≤ M < m and 0 ≤ a < m. The initial seed $s_{0}$ is an integer from {0, 1, . . . , m − 1}. If a = 0, then the generator is called a multiplicative congruential generator (MCG). It operates on the multiplicative group of integers modulo m. The generating rule of the MCG is

$s_{t} = M s_{t - 1} \mod m, \quad u_{t} = \frac{s_{t}}{m}, \quad t = 1, 2, \ldots \qquad (17.3)$

To guarantee that numbers produced from the rule (17.3) are all nonzero, the initial seed s0 has to be nonzero. Otherwise, all st = 0, t ≥ 1, if s0 = 0.

Example 17.1.

Construct PRNs generated by the LCG with M = 11, m = 16, a = 5, and s0 = 0.

Solution. The LCG rule is $u_{t} = \frac{s_{t}}{16}$ , where st = (11st−1 + 5) mod 16. We first obtain the sequence of integers st, t ≥ 0:

$\begin{matrix} s_{1} = (11 \cdot 0 + 5) \mod 16 = 5 \mod 16 = 5, & s_{2} = (11 \cdot 5 + 5) \mod 16 = 60 \mod 16 = 12, \\ s_{3} = (11 \cdot 12 + 5) \mod 16 = 137 \mod 16 = 9, & s_{4} = (11 \cdot 9 + 5) \mod 16 = 104 \mod 16 = 8, \\ s_{5} = (11 \cdot 8 + 5) \mod 16 = 93 \mod 16 = 13, & s_{6} = (11 \cdot 13 + 5) \mod 16 = 148 \mod 16 = 4, \\ s_{7} = (11 \cdot 4 + 5) \mod 16 = 49 \mod 16 = 1, & s_{8} = (11 \cdot 1 + 5) \mod 16 = 16 \mod 16 = 0, \\ s_{9} = (11 \cdot 0 + 5) \mod 16 = 5 \mod 16 = 5, & \ldots \end{matrix}$

As is seen, the sequence obtained is periodic. The numbers repeat themselves after eight steps. The PRNs $u_{t} = \frac{s_{t}}{m}, t \geq 0$ , are

$0, \frac{5}{16}, \frac{12}{16}, \frac{9}{16}, \frac{8}{16}, \frac{13}{16}, \frac{4}{16}, \frac{1}{16}, 0, \frac{5}{16}, ...$
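The rules (17.1) and (17.2) translate directly into integer arithmetic. The following minimal sketch reproduces the numbers of Example 17.1; the function name `lcg` and its parameter names are hypothetical.

```python
def lcg(M, a, m, seed, count):
    """Return the first `count` states s_0, s_1, ... of s_t = (M * s_{t-1} + a) mod m."""
    states = [seed]
    for _ in range(count - 1):
        states.append((M * states[-1] + a) % m)
    return states

# Parameters of Example 17.1: M = 11, a = 5, m = 16, s_0 = 0.
states = lcg(M=11, a=5, m=16, seed=0, count=10)
prns = [s / 16 for s in states]  # u_t = s_t / m

print(states)  # [0, 5, 12, 9, 8, 13, 4, 1, 0, 5]
print(prns)
```

Because all intermediate values are integers, the states are computed exactly; the only rounding occurs in the final division by m.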

The quality of the LCG depends on the choice of parameters M, a, and m. Often the modulus m is chosen as a prime number. Then, all calculations are done in the finite field $ℤ_{m}$. Preferred moduli are Mersenne primes of the form $2^{r} - 1$, e.g., $2^{31} - 1$ = 2,147,483,647. Sometimes, m is a power of 2 since calculations can be done faster by exploiting the binary structure of computer arithmetic. Here are some choices of m and M for the MCG:

$m = 2^{31} - 1$ and $M = 16807$ (Park and Miller (1988));
$m = 2^{40}$ and $M = 5^{17}$ (Ermakov and Mikhaĭlov (1982));
$m = 2^{128}$ and $M = 5^{100109} \mod 2^{128}$ (Dyadkin and Kenneth (2000)).

Since the set {0, 1, . . . , m − 1} is finite, the sequence generated by an LCG is periodic. That is, st+r = st holds for all t and some integer r > 0 called a period of the sequence. Let ℓ be the least possible period, which is called the length of period. Because there are only m possible different values of st, the maximum possible period of an LCG is m (or m − 1 if a = 0).

Proposition 17.6

(Ermakov (1975)). The maximum possible length of period for the sequence $s_{t} = M s_{t - 1} \mod 2^{p}$, t ≥ 1, with p ≥ 3 is $ℓ = 2^{p - 2}$. The maximum period length is achieved if M ≡ 3 mod 8 or M ≡ 5 mod 8 and the seed $s_{0}$ is odd.
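The statement can be verified by brute force for a small modulus. The sketch below assumes p = 6 (modulus 64), for which the maximum period length is $2^{p - 2} = 16$; the function name `mcg_period` is hypothetical, and the multiplier is required to be odd so that the state is guaranteed to return to the seed.

```python
def mcg_period(M, p, seed):
    """Length of period of s_t = M * s_{t-1} mod 2^p started from `seed`."""
    if M % 2 == 0:
        raise ValueError("M must be odd; otherwise the map is not invertible")
    m = 1 << p  # modulus 2^p
    s = (M * seed) % m
    steps = 1
    while s != seed:
        s = (M * s) % m
        steps += 1
    return steps

p = 6
print(mcg_period(5, p, seed=1))  # M = 5 (5 mod 8), odd seed
print(mcg_period(3, p, seed=1))  # M = 3 (3 mod 8), odd seed
print(mcg_period(9, p, seed=1))  # M = 9 (1 mod 8): shorter period
print(mcg_period(5, p, seed=2))  # even seed: shorter period
```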

There are many ways of improving a classical linear congruential generator. Let us consider some of them (a more comprehensive review of modern PRNGs can be found in L’Ecuyer (2012)). To increase the maximum period, one can combine two (or more) LCG methods as follows. Let $\{u_{t}^{(1)}\}_{t \geq 1}$ and $\{u_{t}^{(2)}\}_{t \geq 1}$ be the outputs of two LCGs. A new sequence $\{u_{t}\}_{t \geq 1}$ is given by

$u_{t} : = \{u_{t}^{(1)} + u_{t}^{(2)}\}, \quad t \geq 1 .$

Another generalization of the classical LCG is a multiple recursive generator defined by:

$s_{t} = (M_{1} s_{t - 1} + \dots + M_{k} s_{t - k}) \mod m, u_{t} = \frac{s_{t}}{m}, t \geq k,$

where $M_{1}, \ldots, M_{k}$ are multipliers selected from $S : = \{0, 1, \ldots, m - 1\}$ ; k ≥ 2 is the order of recursion, and $(s_{0}, s_{1}, \ldots, s_{k - 1})$ is the seed sequence with $s_{i} \in S$ for 0 ≤ i ≤ k − 1. The maximum period length for this generator is $m^{k} - 1$.

One important property of the MCG is the ease of its parallelization. Suppose that a sequence $\{s_{t}\}_{t \geq 0}$ is generated by (17.3). To share this sequence among several independent processors, we split it into several disjoint subsequences. This can be achieved by using different seeds situated far apart along the original sequence. The seeds $s_{0}^{(j)}$, j ≥ 0, that start the respective subsequences of length K are generated by using the leaping-frog generator:

$s_{0}^{(j)} = A s_{0}^{(j - 1)} \mod m, \text{ where } s_{0}^{(0)} \equiv s_{0} \text{ and } A \equiv M^{K} \mod m .$

Once the seeds are generated, the original generator (17.3) is used to obtain disjoint subsequences starting from these seeds:

$s_{t}^{(j)} = M s_{t - 1}^{(j)} \mod m, u_{t}^{(j)} = \frac{s_{t}^{(j)}}{m}, t = 1, 2, ..., K, j \geq 0.$

The maximum number of processors that can be served by such a leaping-frog generator cannot exceed the ratio $\frac{ℓ}{K}$ of the length of period to the length of an individual subsequence.
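The seed construction can be sketched with modular exponentiation. The example assumes the Park and Miller parameters $m = 2^{31} - 1$ and M = 16807, a block length K = 1000, and four processors; these values and the function names are illustrative choices only. The check confirms that the j-th seed coincides with the state $s_{j K}$ of the original sequence, so the blocks are consecutive and do not overlap.

```python
def block_seeds(M, m, s0, K, num_blocks):
    """Seeds s_0^{(j)} = A * s_0^{(j-1)} mod m with A = M^K mod m."""
    A = pow(M, K, m)  # modular exponentiation; no huge intermediate numbers
    seeds = [s0]
    for _ in range(num_blocks - 1):
        seeds.append((A * seeds[-1]) % m)
    return seeds

def mcg_states(M, m, seed, count):
    """States s_0, s_1, ..., s_{count-1} of s_t = M * s_{t-1} mod m."""
    states = [seed]
    for _ in range(count - 1):
        states.append((M * states[-1]) % m)
    return states

M, m, s0, K = 16807, 2**31 - 1, 1, 1000  # assumed illustrative parameters
seeds = block_seeds(M, m, s0, K, num_blocks=4)

# Reference: the original sequence generated on a single processor.
full = mcg_states(M, m, s0, 4 * K + 1)

# Block j holds s_1^{(j)}, ..., s_K^{(j)}, i.e., states jK+1, ..., jK+K.
blocks = [mcg_states(M, m, seed, K + 1)[1:] for seed in seeds]
print(seeds)
```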

17.3 Generation of Nonuniformly Distributed Random Numbers

Suppose that we have a good PRN generator for the uniform distribution Unif(0, 1). However, our ultimate goal is to be able to sample from any given probability distribution. This can be achieved by transforming uniform PRNs into nonuniform random numbers. We are interested in general transformation methods that work for large classes of distributions. The transformation methods should be fast and efficient and should not use too much memory. In the sequel we are going to consider three main groups of sampling algorithms, namely, inversion methods, composition methods, and acceptance-rejection methods.

17.3.1 Transformations of Random Variables

Recall that a (univariate) random variable X defined on a probability space (Ω, ℱ, ℙ) is a measurable function from Ω to ℝ. The support of X is defined as the smallest closed set $S$ whose complement $S^{c}$ has probability zero: $ℙ (X \in S^{c}) = 0$ . The cumulative distribution function (CDF) of a random variable X is a function FX from ℝ to [0, 1] that is defined by FX(x) = ℙ(X ≤ x) for x ∊ ℝ. A CDF F satisfies the following properties:

F is a nondecreasing, right-continuous (i.e., F (x+) = F (x)) function;
F (−∞) = 0 and F (+∞) = 1.

A random variable X (and its CDF FX) is said to be

Discrete, if there exists a nonnegative function pX called a probability mass function (PMF for short) with a countable support such that

$F_{X} (x) = \sum_{y \leq x : p_{X} (y) \neq 0} p_{X} (y) \quad \text{for } x \in ℝ .$

The CDF FX is a piecewise constant function with jumps.

(Absolutely) continuous, if there exists a nonnegative integrable function fX called a probability density function (PDF for short) such that

$F_{X} (x) = \int_{- \infty}^{x} f_{X} (y) d y \quad \text{for } x \in ℝ .$

The CDF FX is a continuous function (without jumps).

Since FX(∞) = 1, a PMF p satisfies $\sum_{x \in S} p (x) = 1$ , and a PDF f ≥ 0 satisfies $\int_{- \infty}^{\infty} f (x) d x = 1$ . For a discrete random variable, the support is the smallest countable collection of points $S$ such that $ℙ (X \in S) = 1$ . It is defined by $S = \{x \in ℝ : p_{X} (x) \neq 0\}$ . Typically, the support $S$ of a univariate continuous probability distribution is an interval of the real line. Since for a continuous random variable X the mass probability ℙ(X = x) is zero for any x ∊ ℝ, the support may be chosen to be an open interval.

Often, one random variable can be expressed as a function of another variate, say Y = f(X). Then sample values of Y can be obtained by applying the mapping f to sample values of X: yi = f(xi), i ≥ 1. Let us consider several useful examples.

A linear (or affine) mapping f(x) = α + βx with β ≠ 0. The density of Y = α + βX is given by $f_{Y} (x) = \frac{1}{| β |} f_{X} (\frac{x - α}{β})$ . A normal random variable X ~ Norm(μ, σ²) can be expressed in terms of a standard normal variate Z ~ Norm(0, 1) as X = μ + σZ. Similarly, a uniform variable X ~ Unif(a, b) with a < b is a function of U ~ Unif(0, 1) given by X = a + (b − a)U.

A power mapping $f (x) = x^{c}$ for x > 0, with c ≠ 0. Let X be a positive random variable with density $f_{X}$ . The density of $Y = X^{c}$ is then $f_{Y} (x) = \frac{1}{| c |} x^{\frac{1}{c} - 1} f_{X} (x^{\frac{1}{c}})$ for x > 0. For example, the density of $U^{c}$ with c > 0, where U ~ Unif(0, 1), is $f (x) = \frac{1}{c} x^{\frac{1}{c} - 1} I_{(0, 1)} (x)$ . A special case (c = −1) is the reciprocal of a nonzero variate: $Y = \frac{1}{X}$ . The PDF is $f_{Y} (x) = \frac{1}{x^{2}} f_{X} (\frac{1}{x})$ .

An exponential mapping f(x) = exp(x). The PDF of $Y = e^{X}$ is $f_{Y} (x) = \frac{1}{x} f_{X} (\ln x)$ for x > 0. For example, the log-normal random variable is an exponential of a normal variate: $Y = e^{μ + σ Z}$ , where Z ~ Norm(0, 1).

17.3.2 Inversion Method

The inversion method of generating nonuniform random numbers is based on the analytical or numerical inversion of the CDF. This method is preferable when it is used with low-discrepancy (quasi-random) sequences of numbers. It is also compatible with some variance reduction techniques such as the antithetic variate method. The inversion method can be applied to both discrete and continuous distributions; however, its efficiency (in comparison with other methods) depends on how fast the inverse CDF can be evaluated.

17.3.2.1 Inverse Distribution Function

Consider a random variable X whose CDF F is continuous and strictly increasing on its support. Thus, the inverse function F−1 is well-defined and is also a strictly increasing function. Let us find the distribution function of Y := F(X). By using the definitions of a CDF and an inverse function, we obtain

$F_{Y} (y) = ℙ (F (X) \leq y) = ℙ (F^{- 1} (F (X)) \leq F^{- 1} (y)) = ℙ (X \leq F^{- 1} (y)) = F (F^{- 1} (y)) = y$

for all y ∊ (0, 1). As is seen, the function FY is the CDF of a continuous random variable uniformly distributed on (0, 1). That is, F (X) ~ Unif(0, 1). This result provides us with a simple algorithm for sampling from continuous strictly increasing CDFs. Algorithm 17.1 is very simple and transparent. However, it relies on the knowledge of the inverse CDF F−1 in closed form (or the ability to compute F−1 in an efficient manner). Figure 17.3 illustrates the method.

Figure 17.3

The inverse CDF method.

Algorithm 17.1

The Inverse CDF Method.

(1) Obtain a draw u from the uniform distribution Unif(0, 1).
(2) A draw x from a CDF F is given by x = F−1(u).

Example 17.2.

Using the inverse CDF method, find generating formulae for

(a) the uniform distribution Unif(a, b), a < b;
(b) the exponential distribution Exp(λ), λ > 0;
(c) the power distribution with the PDF $f (x) = c x^{c - 1} I_{(0, 1)} (x), c > 0$ ;
(d) the Weibull distribution with the PDF $f (x) = \frac{α}{β} x^{α - 1} e^{- x^{α} / β} I_{ℝ_{+}} (x), α, β > 0$ .

Solution.

(a) The CDF of Unif(a, b) is $F (x) = \frac{x - a}{b - a}$ for x ∊ (a, b). Solve $\frac{x - a}{b - a} = u$ with u ∊ (0, 1) for x to find the inverse CDF: F−1(u) = a + (b − a)u. So the generating formula is X = a + (b − a)U.
(b) The CDF of the exponential distribution with rate λ > 0 is $F (x) = 1 - e^{- λ x}$ , x > 0. Solve $1 - e^{- λ x} = u$ for x to obtain $x = - \frac{1}{λ} \ln (1 - u)$ . If U ~ Unif(0, 1), then 1 − U ~ Unif(0, 1). So the generating formula for X ~ Exp(λ) simplifies: $X = - \frac{\ln U}{λ}$ .
(c) Integrate $f (x) = c x^{c - 1} I_{(0, 1)} (x)$ on (0, x) for 0 < x < 1, to obtain

$F (x) = \int_{0}^{x} f (x) d x = x^{c} .$

Thus, $F^{- 1} (u) = u^{1 / c}$ and $X = U^{1 / c}$ .
(d) The CDF is

$F (x) = \int_{0}^{x} \frac{α}{β} x^{α - 1} e^{- x^{α} / β} d x = 1 - e^{- x^{α} / β} for x > 0.$

Its inverse is $F^{- 1} (u) = {(- β \ln (1 - u))}^{1 / α}$ . Thus, we obtain

$X = F^{- 1} (1 - U) = {(- β \ln U)}^{1 / α} .$
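The four generating formulae of Example 17.2 translate directly into code. The sketch below is a minimal illustration; the function names are hypothetical. Since Python's `random.random()` returns values in [0, 1), the logarithm is applied to 1 − u rather than to u so that log(0) cannot occur; this is distributionally equivalent because 1 − U ~ Unif(0, 1).

```python
import math
import random


def sample_uniform(a, b, u):
    """Unif(a, b): F^{-1}(u) = a + (b - a) u."""
    return a + (b - a) * u


def sample_exponential(lam, u):
    """Exp(lam): F^{-1}(u) = -ln(1 - u) / lam."""
    return -math.log(1.0 - u) / lam


def sample_power(c, u):
    """Power distribution with PDF c x^(c-1) on (0, 1): F^{-1}(u) = u^(1/c)."""
    return u ** (1.0 / c)


def sample_weibull(alpha, beta, u):
    """Weibull with CDF 1 - exp(-x^alpha / beta): F^{-1}(u) = (-beta ln(1 - u))^(1/alpha)."""
    return (-beta * math.log(1.0 - u)) ** (1.0 / alpha)


rng = random.Random(2024)                      # seed chosen arbitrarily
draws = [sample_exponential(2.0, rng.random()) for _ in range(5)]
```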

To generalize the inverse CDF sampling method to the case of any probability distribution, we define the generalized inverse CDF F−1 : (0, 1) → ℝ by

$F^{- 1} (u) = \inf \{x \in ℝ : u \leq F (x)\} . (17.4)$

To justify this formula, let us consider two special cases. First, suppose that a CDF F of a random variable X has a jump discontinuity at x0, i.e., F(x0−) < F(x0). Therefore, ℙ(X = x0) = F(x0) − F(x0−) ≠ 0. The formula in (17.4) gives that

$\begin{array}{ll} F^{- 1} (u) = x_{0} & \text{for } F (x_{0} -) < u \leq F (x_{0}), \\ F^{- 1} (u) < x_{0} & \text{for } u < F (x_{0} -), \\ F^{- 1} (u) > x_{0} & \text{for } u > F (x_{0}) . \end{array}$

Thus, the random variable F−1(U) has a nonzero mass probability at x0 and

$ℙ (F^{- 1} (U) = x_{0}) = F (x_{0}) - F (x_{0} −)$

as expected. Now, suppose that F has a flat section on [x0, x1] with x0 < x1. There exists u0 ∊ [0, 1] such that F(x) = u0 for x ∊ (x0, x1), F(x) ≤ u0 for x ≤ x0, and F(x) ≥ u0 for x ≥ x1. In this case, we have ℙ(x0 < X < x1) = F(x1−) − F(x0) = 0. Then, the generalized inverse has a jump at u0: F−1(u0) ≤ x0 and F−1(u0+) ≥ x1 (recall that the generalized inverse is left-continuous). Thus, with probability zero the random variable F−1(U) takes on a value in (x0, x1) as expected. Now, as we can see, the same sampling formula X = F−1(U), U ~ Unif(0, 1), with a generalized inverse CDF in (17.4), works for any CDF F, including those having a jump discontinuity or a flat section.

Figure 17.4

The plot of a CDF (the left plot) and its generalized inverse (the right plot). A mixture of a discrete distribution and a continuous distribution is considered. Note that a CDF is a right-continuous function and a generalized inverse CDF is a left-continuous function.

Example 17.3.

Using the inverse CDF method, find a generating formula for

(a) a Bernoulli random variable $X = \begin{cases} 0 & \text{with probability } 1 - p, \\ 1 & \text{with probability } p; \end{cases}$
(b) a random variable with the PMF

$f (x) = p_{1} I_{\{x_{1}\}} (x) + p_{2} I_{\{x_{2}\}} (x) + p_{3} I_{\{x_{3}\}} (x),$

where pi, i = 1, 2, 3, are positive probabilities so that p1 + p2 + p3 = 1.

Solution.

(a) Sample U ~ Unif(0, 1). If U < p, then set X = 1; otherwise set X = 0. Verify that X has the Bernoulli distribution: ℙ(X = 1) = ℙ(U < p) = p and ℙ(X = 0) = ℙ(U ≥ p) = 1 − p.
(b) Suppose that x1 < x2 < x3. The CDF of X ~ f and the generalized inverse CDF are, respectively,

$F_{X} (x) = \begin{cases} 0 & \text{if } x < x_{1}, \\ p_{1} & \text{if } x_{1} \leq x < x_{2}, \\ p_{1} + p_{2} & \text{if } x_{2} \leq x < x_{3}, \\ 1 & \text{if } x_{3} \leq x, \end{cases} \qquad F_{X}^{- 1} (u) = \begin{cases} x_{1} & \text{if } u \leq p_{1}, \\ x_{2} & \text{if } p_{1} < u \leq p_{1} + p_{2}, \\ x_{3} & \text{if } p_{1} + p_{2} < u, \end{cases}$

for x ∊ ℝ and u ∊ (0, 1). Sample U ~ Unif(0, 1) and set

$X = F_{X}^{- 1} (U) = \begin{cases} x_{1} & \text{if } U \leq p_{1}, \\ x_{3} & \text{if } U > p_{1} + p_{2}, \\ x_{2} & \text{otherwise} . \end{cases}$

Example 17.4.

Justify the following method of sampling from the discrete uniform distribution on a set of N distinct numbers x1, x2, . . . , xN:

(i) generate U ~ Unif(0, 1);
(ii) set X = xK, where $K = ⌊ N \cdot U + 1 ⌋$ .

Solution. We need to show that $ℙ (X = x_{k}) = \frac{1}{N}$ for any k = 1, 2, . . . , N. Indeed,

$\begin{array}{rl} ℙ (X = x_{k}) = & ℙ (⌊ N \cdot U + 1 ⌋ = k) = ℙ (k \leq N \cdot U + 1 < k + 1) \\ = & ℙ (\frac{k - 1}{N} \leq U < \frac{k}{N}) = \frac{k}{N} - \frac{k - 1}{N} = \frac{1}{N} . \end{array}$
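The rule K = ⌊N·U + 1⌋ of Example 17.4 and the Bernoulli rule of Example 17.3(a) can be sketched as follows; the function names are hypothetical. The index is computed as `int(N * u) + 1`, which equals ⌊N·u + 1⌋ for u ∊ [0, 1) and avoids a rare rounding issue when 1 is added before truncation in floating-point arithmetic.

```python
import random


def sample_bernoulli(p, u):
    """Return 1 if u < p and 0 otherwise, so that P(X = 1) = p."""
    return 1 if u < p else 0


def sample_discrete_uniform(xs, u):
    """Pick x_K with K = floor(N*u + 1), K in {1, ..., N}, for u in [0, 1)."""
    N = len(xs)
    K = int(N * u) + 1          # same value as floor(N*u + 1) for u in [0, 1)
    return xs[K - 1]            # Python lists are 0-based


rng = random.Random(7)          # seed chosen arbitrarily
values = [10, 20, 30, 40]
counts = {x: 0 for x in values}
for _ in range(40000):
    counts[sample_discrete_uniform(values, rng.random())] += 1
```

With 40000 draws, each of the four counts should be close to 10000.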

17.3.2.2 The Chop-Down Search Method

Consider a general discrete random variable with a countable support $S = \{x_{j}\}_{j \geq 1}$ and mass probabilities {pj}j≥1, where pj = ℙ(X = xj) > 0 and Σj≥1pj = 1. The CDF F of such a discrete probability distribution is a piecewise-constant function. Let us assume that the mass points xj are sorted in increasing order: x1 < x2 < . . .. Then the CDF is given by $F (x) = \sum_{j : x_{j} \leq x} p_{j}$ . Hence the generalized inverse CDF is calculated as follows:

$F^{- 1} (u) = \inf \{x_{k} \in S : u \leq \sum_{j = 1}^{k} p_{j}\} = x_{k}, \quad \text{where } k \text{ is such that } \sum_{j = 1}^{k - 1} p_{j} < u \leq \sum_{j = 1}^{k} p_{j} . (17.5)$

The requirement that mass points xj are sorted in increasing order is not necessary for the application of the inversion method. We can consider any arrangement for {xj}j≥1, since the sampling of X is equivalent to the sampling of a random index $K \in ℕ$ with probabilities ℙ(K = j) = pj, j ≥ 1. First, generate U ~ Unif(0, 1). Second, find the index K ≥ 1 such that

$\sum_{j = 1}^{K - 1} p_{j} < U \leq \sum_{j = 1}^{K} p_{j} . (17.6)$

Finally, set X = xK. Note that the probability of the event that U satisfies the above double inequality is exactly pK. One possible implementation of this approach is the chop-down search (CDS) algorithm (see Algorithm 17.2).

The expected number of cycles in the chop-down search method is $\sum_{j \geq 1} j p_{j}$ . Indeed, if U ∊ (0, p1], then the algorithm stops after one cycle, and this happens with probability p1 = ℙ(U ∊ (0, p1]). If U ∊ (p1, p1 + p2], then the algorithm stops after two cycles, and this happens with probability p2 = ℙ(U ∊ (p1, p1 + p2]), and so on. Let cU be the computational cost of the generation of one draw from Unif(0, 1) and cI the computational cost of one cycle of the method. Then, the expected total cost is

$c_{U} + c_{I} \sum_{j \geq 1} j p_{j} .$

Algorithm 17.2 The Chop-Down Search (CDS) Method.

input: the mass points {xj}j≥1 and probabilities {pj}j≥1
generate U ← Unif(0, 1)
set K ← 0
repeat
  set K ← K + 1
  set U ← U − pK
until U ≤ 0
return X = xK
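For a finite list of mass points, the chop-down search can be sketched as below. The function name `chop_down_search` is hypothetical, and the extra stopping condition at the last index is an added safeguard (an assumption of this sketch) against the stored probabilities summing to slightly less than one in floating-point arithmetic.

```python
import random


def chop_down_search(xs, ps, u):
    """Return x_K, where K is the first index at which u - p_1 - ... - p_K <= 0."""
    k = 0
    u -= ps[0]
    # Keep subtracting ("chopping down") probabilities until u is used up.
    while u > 0.0 and k < len(ps) - 1:
        k += 1
        u -= ps[k]
    return xs[k]


rng = random.Random(11)                     # seed chosen arbitrarily
xs, ps = [1, 2, 3], [0.2, 0.5, 0.3]
sample = [chop_down_search(xs, ps, rng.random()) for _ in range(30000)]
freq_2 = sample.count(2) / len(sample)      # should be close to p_2 = 0.5
```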

Example 17.5.

Find the computational cost of the CDS method for

(a) the geometric distribution Geom(p) with ℙ(X = j) = (1 − p)j−1p, j ≥ 1, 0 < p < 1;
(b) the Poisson distribution Pois(λ) with $ℙ (X = j) = \frac{λ^{j}}{j!} e^{- λ}, j \geq 0, λ > 0$ .

Solution. Compute the mathematical expectation $ε = \sum_{j = 1}^{\infty} j p_{j}$ of the number of cycles for both distributions. The total cost is then cU + cIε.

$(a) ε = \sum_{j = 1}^{\infty} j {(1 - p)}^{j - 1} p = \frac{1}{p}; (b) ε = \sum_{j = 0}^{\infty} (j + 1) \frac{λ^{j}}{j!} e^{- λ} = λ + 1.$

The computational cost can be reduced by rearranging elements of {(xj, pj)}j≥1. Consider an arrangement {ji}i≥1 of the integers 1, 2, 3, . . .. Then, the sequence ${(x_{j_{i}}, p_{j_{i}})}_{i \geq 1}$ defines another discrete probability distribution equivalent to the original one.

Proposition 17.7.

The computational cost of the chop-down search method attains its minimal value iff the mass probabilities are arranged in the decreasing order

$p_{1} \geq p_{2} \geq p_{3} \geq \dots (17.7)$

Proof. Suppose that there exists another arrangement of the mass probabilities, {pj′}j≥1, for which (17.7) is violated and the expected value $ε^{'} = \sum_{j \geq 1} j p_{j}^{'}$ is minimal. Then there are two indices k and l, k < l, so that pk′ < pl′. Let us construct another arrangement $\{p_{j}^{″}\}_{j \geq 1}$ , which is obtained from {pj′}j≥1 by swapping pl′ and pk′, i.e., pl″ = pk′, pk″ = pl′, and pj″ = pj′ for $j \notin \{k, l\}$ . Let $ε^{″} = \sum_{j \geq 1} j p_{j}^{″}$ . We have

$ε^{'} - ε^{″} = l p_{l}^{'} + k p_{k}^{'} - l p_{l}^{″} - k p_{k}^{″} = (l - k) (p_{l}^{'} - p_{k}^{'}) > 0,$

since k < l and pk′ < pl′. Hence ε″ < ε′. We arrive at a contradiction.

According to Proposition 17.7, the mass probabilities {pj}j≥1 should be arranged in decreasing order before applying the chop-down search method. Another way to speed up calculations is to use recurrence relations for mass probabilities. This allows us to reduce the parameter cI—the cost of one cycle of the CDS method. For example, the mass probabilities of a geometric random variable X ~ Geom(p) satisfy

$ℙ (X = j + 1) = ℙ (X = j) \cdot (1 - p), j = 1, 2, ...$

For a Poisson random variable X ~ Pois(λ), we have

$ℙ (X = j) = ℙ (X = j - 1) \cdot \frac{λ}{j}, j = 1, 2, ...$
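The Poisson recurrence can be combined with the chop-down search so that no factorials or powers are recomputed in each cycle. The following is a sketch under stated assumptions: the name `poisson_cds` is hypothetical, λ is assumed moderate (so that exp(−λ) does not underflow), and the check `p > 0.0` is an added safeguard that stops the loop if the probabilities underflow far in the tail.

```python
import math
import random


def poisson_cds(lam, u):
    """Chop-down search for Pois(lam) using P(X = j) = P(X = j - 1) * lam / j."""
    j = 0
    p = math.exp(-lam)            # P(X = 0)
    while u > p and p > 0.0:
        u -= p                    # chop off the current mass probability
        j += 1
        p *= lam / j              # update to P(X = j) by the recurrence
    return j


rng = random.Random(3)            # seed chosen arbitrarily
draws = [poisson_cds(3.0, rng.random()) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)   # should be close to lam = 3
```

Each cycle now costs one subtraction, one multiplication, and one division, which is the reduction in cI mentioned above.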

Algorithm 17.3 The Binary Search Method.

input: the mass points {xj}1≤j≤N and mass probabilities {pj}1≤j≤N
calculate F0 = 0, $F_{k} = \sum_{j = 1}^{k} p_{j}$ for 1 ≤ k ≤ N − 1, FN = 1
generate U ← Unif(0, 1)
set L ← 0 and R ← N
repeat
  set $K \leftarrow ⌊ \frac{L + R}{2} ⌋$
  if FK < U then
    set L ← K
  else
    set R ← K
  end if
until R − L = 1
return X = xR

17.3.2.3 The Binary Search Method

The formula (17.5) of the generalized inverse CDF requires the cumulative probabilities $\sum_{j = 1}^{k} p_{j}, k \geq 1$ . Therefore, the inversion method can be sped up if these cumulative probabilities are precalculated and stored in memory. After that, to find K that satisfies (17.6), we can employ a fast search procedure such as the binary search method (see Algorithm 17.3). Suppose that a random variable X takes on one of N distinct values x1 < x2 < · · · < xN with the respective mass probabilities p1, p2, . . . , pN , i.e., ℙ(X = xj) = pj, 1 ≤ j ≤ N. Note that a random variable with a countably infinite support can be reduced to the case with N mass points by suitable truncation of the support such that the total probability of removed mass points is very small. Suppose that $N = 2^{r}$ ; then $r = \log_{2} N$ cycles are required to find K. For general N, the computational cost is proportional to $⌈ \log_{2} N ⌉$ , where $⌈ \cdot ⌉$ denotes the ceiling function.
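A binary search over precomputed cumulative probabilities might look as follows. This is a sketch, not a definitive implementation: the names `make_cdf`, `binary_search_sample`, and `library_sample` are hypothetical. The loop keeps the invariant F_L < U ≤ F_R and stops when R − L = 1, at which point x_R is the index required by (17.6). The standard-library function `bisect.bisect_left` performs the same search and is shown for comparison.

```python
import bisect
import itertools


def make_cdf(ps):
    """Cumulative probabilities F_1, ..., F_N stored at positions 0, ..., N-1."""
    F = list(itertools.accumulate(ps))
    F[-1] = 1.0                      # force F_N = 1 to absorb rounding error
    return F


def binary_search_sample(xs, F, u):
    """Find K with F_{K-1} < u <= F_K (F_0 = 0) by bisection and return x_K."""
    L, R = 0, len(xs)                # 1-based bounds; F_k is stored in F[k - 1]
    while R - L > 1:
        K = (L + R) // 2
        if F[K - 1] < u:
            L = K
        else:
            R = K
    return xs[R - 1]


def library_sample(xs, F, u):
    """Same search using the standard library."""
    return xs[bisect.bisect_left(F, u)]


xs = [1, 2, 3, 4]
F = make_cdf([0.1, 0.2, 0.3, 0.4])
```

Precomputing `F` costs O(N) once; each subsequent draw then needs about log2 N comparisons instead of an expected Σ j pj cycles of the chop-down search.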

17.3.3 Composition Methods

It is a well-known fact that a linear combination (a mixture) of CDFs, PDFs, or PMFs with positive weights summing up to one is again a CDF, PDF, or PMF, respectively. The composition method aims to represent the probability distribution of interest as a mixture of simpler distributions. The sampling from a mixture distribution is a two-step procedure. First, one of the distributions which appear in the composition is selected at random; second, a sample is drawn from the distribution selected (e.g., by using an inversion algorithm). In comparison with the inverse CDF method that requires one draw from Unif(0, 1), a composition method needs at least two uniform random numbers. However, the composition method allows us to express a probability distribution in a form that simplifies the sampling algorithm.

17.3.3.1 Mixture of PDFs

Consider a continuous random variable X with PDF f. Suppose that f can be represented as a linear combination of m PDFs f1, . . . , fm with m positive weights w1, . . . , wm so that w1 + · · · + wm = 1:

$f (x) = \sum_{j = 1}^{m} w_{j} f_{j} (x), x \in ℝ .$

The PDF f is called a mixture PDF. The support of f is a union of supports of fj, 1 ≤ j ≤ m. If all pairwise intersections of supports of fj are empty sets, then such a mixture is called a stratification.

Algorithm 17.4 The Composition Sampling Method.

input: {wj}j≥1 and {fj}j≥1
generate K from the probabilities ℙ(K = j) = wj, j ≥ 1
generate X from the PDF fK
return X

Proof of Algorithm 17.4. Let us find the distribution function of X generated by the algorithm. Applying the total probability law gives

$\begin{array}{l} ℙ (X \leq x) = & \sum_{j = 1}^{m} ℙ (X \leq x; K = j) = \sum_{j = 1}^{m} ℙ (K = j) ℙ (X \leq x | K = j) = \sum_{j = 1}^{m} w_{j} \int_{- \infty}^{x} f_{j} (x) d x \\ = & \int_{- \infty}^{x} \sum_{j = 1}^{m} w_{j} f_{j} (x) d x = \int_{- \infty}^{x} f (x) d x = F_{X} (x) \end{array}$

for all x ∊ ℝ.

Example 17.6.

Develop a stratification method for sampling from the PDF

$f (x) = \begin{cases} \frac{2}{3} x & \text{if } 0 < x \leq 1, \\ \frac{2}{3} & \text{if } 1 < x < 2, \\ 0 & \text{otherwise} . \end{cases}$

Solution. As is seen, f is a piecewise linear function; it is linear on [0, 1] and constant on [1, 2]. Introduce two new PDFs: f1(x) ∝ x for x ∊ (0, 1] and f2(x) ∝ 1 for x ∊ (1, 2). The second function is a PDF for Unif(1, 2). Hence, $f_{2} (x) = I_{(1, 2)} (x)$ . Calculate a normalizing constant for f1 to obtain

$f_{1} (x) = \frac{x}{\int_{0}^{1} x d x} I_{(0, 1]} (x) = 2 x I_{(0, 1]} (x) .$

Now, the PDF f(x) can be decomposed as follows:

$\begin{array}{l} f (x) = & \frac{2}{3} x I_{(0, 1]} (x) + \frac{2}{3} I_{(1, 2)} (x) = \frac{1}{3} \cdot (2 x I_{(0, 1]} (x)) + \frac{2}{3} \cdot I_{(1, 2)} (x) \\ = & \frac{1}{3} f_{1} (x) + \frac{2}{3} f_{2} (x) : = w_{1} f_{1} (x) + w_{2} f_{2} (x) . \end{array}$

Sampling from f2 is easy (see Example 17.2): X = 1 + U with U ~ Unif(0, 1). To sample from f1, we apply the inverse CDF method. First, find the CDF:

$F_{1} (x) = \int_{0}^{x} f_{1} (x) d x = x^{2}, x \in [0, 1] .$

The inverse CDF is $F_{1}^{- 1} (u) = \sqrt{u}, u \in [0, 1]$ . As a result, we obtain the following sampling algorithm:

(1) Generate two independent uniform random numbers U1, U2 ← Unif(0, 1).
(2) Sample K ∊ {1, 2} as follows. If $U_{1} < \frac{1}{3}$ , then K = 1, else K = 2.
(3) Generate X from fK as follows:
(i) if K = 1, then set $X \leftarrow \sqrt{U_{2}}$ ;
(ii) if K = 2, then set X ← 1 + U2.
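The stratified sampler for the PDF of Example 17.6 can be sketched as follows; the function name is hypothetical. The component f1(x) = 2x on (0, 1] is selected with probability w1 = 1/3 and sampled by inversion as √U2, while the component f2 = Unif(1, 2) is selected with probability w2 = 2/3 and sampled as 1 + U2.

```python
import math
import random


def sample_example_17_6(rng):
    """One draw from f = (1/3) f1 + (2/3) f2 by composition."""
    u1, u2 = rng.random(), rng.random()
    if u1 < 1.0 / 3.0:
        return math.sqrt(u2)      # K = 1: inverse CDF of F1(x) = x^2 on [0, 1]
    return 1.0 + u2               # K = 2: Unif(1, 2)


rng = random.Random(5)            # seed chosen arbitrarily
draws = [sample_example_17_6(rng) for _ in range(30000)]
frac_below_1 = sum(x <= 1.0 for x in draws) / len(draws)   # close to w1 = 1/3
sample_mean = sum(draws) / len(draws)                      # close to 11/9
```

The reference value 11/9 is the mean of f, obtained from ∫ x f(x) dx = 2/9 + 1.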

In general, the stratification method can be used with any partition of the support of a probability distribution. Suppose that A is the support of a PDF f. Let

$A = \cup_{j = 1}^{m} A_{j}, \quad \text{where } i \neq j \Rightarrow A_{i} \cap A_{j} = \emptyset .$

Then, the PDF f admits the following representation:

$f (x) = f (x) I_{A} (x) = f (x) \sum_{j = 1}^{m} I_{A_{j}} (x) = \sum_{j = 1}^{m} w_{j} f_{j} (x),$

where $w_{j} = \int_{A_{j}} f (x) d x$ and $f_{j} (x) = \frac{1}{w_{j}} f (x) I_{A_{j}} (x), 1 \leq j \leq m, x \in ℝ$ .

17.3.3.2 Randomized Gamma Distributions

Probability distributions whose PDFs contain special functions such as Bessel and other hypergeometric functions present a real challenge for sampling random numbers. By replacing a special function by its integral or series representation in terms of simpler functions, the PDF can be expressed as a mixture of more regular densities with known sampling algorithms.

For example, consider the noncentral chi-square distribution with κ > 0 degrees of freedom and noncentrality parameter λ > 0. Its PDF is

$f (x; κ, λ) = \frac{1}{2} e^{- (x + λ) / 2} {(\frac{x}{λ})}^{\frac{κ}{4} - \frac{1}{2}} I_{\frac{κ}{2} - 1} (\sqrt{λ x}), x > 0. (17.8)$

The modified Bessel function Iμ of the first kind (of order μ) admits the following series expansion:

$I_{μ} (x) = {(\frac{x}{2})}^{μ} \sum_{j = 0}^{\infty} \frac{{(x^{2} / 4)}^{j}}{j! Γ (μ + j + 1)} .$

By using this expansion for $I_{\frac{κ}{2} - 1}$ in (17.8), we can represent the noncentral chi-square PDF as a mixture of gamma densities with Poisson weights:

$\begin{array}{rl} f (x; κ, λ) = & \frac{1}{2} e^{- (x + λ) / 2} {(\frac{x}{λ})}^{\frac{κ}{4} - \frac{1}{2}} {(\frac{\sqrt{λ x}}{2})}^{\frac{κ}{2} - 1} \sum_{j = 0}^{\infty} \frac{{(λ x / 4)}^{j}}{j! Γ (\frac{κ}{2} + j)} \\ = & \sum_{j = 0}^{\infty} \underbrace{e^{- λ / 2} \frac{{(λ / 2)}^{j}}{j!}}_{= p_{j} \text{ (Poisson prob.)}} \cdot \underbrace{\frac{1}{2} {(\frac{x}{2})}^{κ / 2 + j - 1} \frac{e^{- x / 2}}{Γ (\frac{κ}{2} + j)}}_{= f_{j} (x) \text{ (gamma density)}} . \end{array} (17.9)$

Recall that a random variable X is said to be gamma-distributed with shape parameter α and rate parameter θ (i.e., scale parameter 1/θ), denoted by Gamma(α, θ), if its PDF is

$f_{X} (x) = \frac{θ}{Γ (α)} {(θ x)}^{α - 1} e^{- θ x}, x > 0. (17.10)$

In (17.9), the PDF f(x; κ, λ) is expressed as a mixture $\sum_{j = 0}^{\infty} p_{j} f_{j} (x)$ , where {pj}j≥0 are mass probabilities of the Poisson distribution with intensity $\frac{λ}{2}$ and fj is a gamma PDF of the form (17.10) with parameters $α = \frac{κ}{2} + j$ and $θ = \frac{1}{2}$ for all j = 0, 1, 2, . . . . Such a mixture probability distribution is called a randomized gamma distribution, denoted $\mathrm{Gamma} (Y_{1} + \frac{κ}{2}, \frac{1}{2})$ , where $Y_{1} \sim \mathrm{Pois} (\frac{λ}{2})$ .

In general, we can consider a mixture gamma distribution Gamma(Y + ν + 1, θ) with parameters ν > −1 and θ > 0, where the randomizer Y is a discrete random variable taking its values in the set of nonnegative integers with probabilities ℙ(Y = j) = pj, j = 0, 1, 2, . . . . The PDF f of such a randomized gamma distribution admits the form of a series expansion:

$f (y) = \sum_{j = 0}^{\infty} p_{j} \frac{θ}{Γ (ν + j + 1)} {(θ y)}^{ν + j} e^{- θ y} .$

Let us consider three choices for the randomizer Y. The resulting probability distributions are called the randomized gamma distribution of the first, second, and third types, respectively.

Let Y1 ~ Pois(α) be a Poisson random variable with mean α > 0. The randomized gamma distribution of the first type is Gamma(Y1 + ν + 1, θ) with the PDF

$f_{1} (y) = θ {(\frac{θ y}{α})}^{ν / 2} e^{- α - θ y} I_{ν} (2 \sqrt{α θ y}), y > 0. (17.11)$

So the noncentral chi-square distribution with parameters κ > 0 and λ > 0 is the randomized gamma distribution of the first type with $ν = \frac{κ}{2} - 1, θ = \frac{1}{2}$ , and $α = \frac{λ}{2}$ .
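The mixture representation (17.9) suggests a two-step sampler for the noncentral chi-square distribution: draw the Poisson index and then draw from the selected gamma density. The sketch below uses only the Python standard library. The function names are hypothetical, the Poisson draw is done by a simple chop-down search (suitable for moderate λ only), and `random.gammavariate(alpha, beta)` takes a shape and a scale parameter, so the rate θ = 1/2 corresponds to scale 2.

```python
import math
import random


def poisson_by_inversion(mean, rng):
    """Pois(mean) via chop-down search with the recurrence p_j = p_{j-1} * mean / j."""
    u, j, p = rng.random(), 0, math.exp(-mean)
    while u > p and p > 0.0:      # p > 0.0 guards against underflow in the tail
        u -= p
        j += 1
        p *= mean / j
    return j


def noncentral_chi2(kappa, lam, rng):
    """One draw from the noncentral chi-square law via the Poisson-gamma mixture."""
    j = poisson_by_inversion(lam / 2.0, rng)          # Y1 ~ Pois(lam / 2)
    return rng.gammavariate(kappa / 2.0 + j, 2.0)     # shape kappa/2 + j, scale 2


rng = random.Random(17)           # seed chosen arbitrarily
draws = [noncentral_chi2(3.0, 2.0, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)   # the exact mean is kappa + lam = 5
```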

A discrete random variable Y2 is said to have the Bessel probability distribution, denoted Bes(ν, b), with parameters ν > −1 and b > 0 if

$ℙ (Y_{2} = j) = \frac{{(b / 2)}^{2 j + ν}}{I_{ν} (b) j! Γ (j + ν + 1)}, j = 0, 1, 2, ... (17.12)$

This distribution is related to many other distributions in which the modified Bessel function I is involved in the density, including the squared Bessel bridge distribution. The randomized gamma distribution of the second type is a mixture distribution Gamma(Y1 + 2Y2 + ν + 1, θ), where $Y_{1} \sim \mathrm{Pois} (\frac{a + b}{4 θ})$ and $Y_{2} \sim \mathrm{Bes} (ν, \frac{\sqrt{a b}}{2 θ})$ are independent Poisson and Bessel random variables, respectively. For any positive numbers θ, a, b, and ν > −1, the PDF is

$f_{2} (y) = θ e^{- θ y - (a + b) / (4 θ)} \frac{I_{ν} (\sqrt{a y}) I_{ν} (\sqrt{b y})}{I_{ν} (\sqrt{a b} / (2 θ))}, y > 0. (17.13)$

A discrete random variate Y3 is said to follow an incomplete gamma probability distribution, which we simply denote by IΓ(ν, λ) with parameters λ > 0 and ν > 0, if

$ℙ (Y_{3} = j) = \frac{e^{- λ} λ^{j + ν}}{Γ (j + ν + 1)} \frac{Γ (ν)}{γ (ν, λ)}, j = 0, 1, 2, ..., (17.14)$

where $γ (a, x) : = \int_{0}^{x} t^{a - 1} e^{- t} d t$ is the lower incomplete gamma function. Note that if ν is a positive integer, then the distribution of Y3 is simply a truncated and shifted Poisson distribution thanks to the property

$\frac{γ (m, x)}{Γ (m)} = 1 - (1 + x + \dots + \frac{x^{m - 1}}{(m - 1)!}) e^{- x}, m = 1, 2, ...$

We call a mixture probability distribution Gamma(Y3 + 1, θ), Y3 ~ IΓ(ν, λ), the randomized gamma distribution of the third type. The PDF is

$f_{3} (y) = \frac{θ Γ (ν)}{γ (ν, λ)} {(\frac{θ y}{λ})}^{- ν / 2} e^{- λ - θ y} I_{ν} (\sqrt{4 λ θ y}), y > 0. (17.15)$

As we will see in the next chapter, randomized gamma distributions play a significant role in simulation of the so-called constant elasticity of variance diffusion model (see also Makarov and Glew (2010)).

17.3.3.3 The Alias Method by Walker

Let us study the case with discrete random variables. A mixture of PMFs is defined in the same manner as a mixture of PDFs. Consider m PMFs pj(x) and m weights wj > 0, 1 ≤ j ≤ m, such that w1 + w2 + · · · + wm = 1. The function p defined by $p (x) = \sum_{j = 1}^{m} w_{j} p_{j} (x)$ , x ∊ ℝ, is also a PMF called the mixture of the PMFs pj, 1 ≤ j ≤ m. To sample from p, we can apply Algorithm 17.4, where the sampling from the PDF fK is replaced by sampling from the PMF pK.

The alias method proposed by Walker (1977) allows us to represent any discrete probability distribution with m mass points $S : = \{x_{1}, x_{2}, ..., x_{m}\}$ as an equally weighted mixture of m two-point distributions. That is, there exist m two-point PMFs

$p_{j} (x) = q_{j} I_{\{x_{j}\}} (x) + (1 - q_{j}) I_{\{a_{j}\}} (x) \quad \text{with } a_{j} \in S \text{ and } q_{j} \in [0, 1], 1 \leq j \leq m,$

such that

$p (x) = \frac{1}{m} \sum_{j = 1}^{m} p_{j} (x) for x \in ℝ . (17.16)$

Since all weights in (17.16) are equal to $\frac{1}{m}$ , Algorithm 17.4 is simplified and we obtain Algorithm 17.5.

Algorithm 17.5 The Alias Sampling Method.

input: m, {xj}1≤j≤m, {aj}1≤j≤m, and {qj}1≤j≤m
generate i.i.d. U1, U2 ← Unif(0, 1)
set $K \leftarrow ⌊ m \cdot U_{1} + 1 ⌋$
if U2 ≤ qK then
  set X = xK
else
  set X = aK
end if
return X

To obtain such a decomposition of the PMF p, we need to construct two lists, namely, the list of probabilities Q = (q1, q2, . . . , qm) and the list of aliases A = (a1, a2, . . . , am). This can be achieved by using the “leveling the histogram” procedure, which is described below. During this procedure the original histogram {wj = pj : 1 ≤ j ≤ m} is transformed into an equally weighted histogram $\{w_{j} = \frac{1}{m} : 1 \leq j \leq m\}$ ; the lists Q and A are generated in the course of this process.

Step 1: Start with wj = pj, qj = 1, and aj = xj for all j = 1, 2, . . . , m.

Step 2: Find two indices ℓ, u ∊ {1, 2, . . . , m} (ℓ and u stand for “lower” and “upper,” respectively) such that

$w_{ℓ} = \min_{1 \leq j \leq m} \{w_{j}\} \quad \text{and} \quad w_{u} = \max_{1 \leq j \leq m} \{w_{j}\} .$

Step 3: If $w_{ℓ} = w_{u} = \frac{1}{m}$ , then the histogram is levelled and the algorithm is stopped; otherwise (i.e., $w_{ℓ} < \frac{1}{m} < w_{u}$ ) we proceed with Step 4.

Step 4: Set qℓ ← m wℓ and aℓ ← xu. Change $w_{u} \leftarrow w_{u} - (\frac{1}{m} - w_{ℓ})$ and $w_{ℓ} \leftarrow \frac{1}{m}$ . Note that wu can become less than $\frac{1}{m}$ . Go back to Step 2.

As is seen, the alias method requires at most m − 1 iterations until all columns of the histogram (i.e., the weights wj, 1 ≤ j ≤ m) become the same height.

Example 17.7.

Apply the alias method to the probability distribution with

$p (x) = 0.1 I_{\{1\}} (x) + 0.2 I_{\{2\}} (x) + 0.3 I_{\{3\}} (x) + 0.4 I_{\{4\}} (x) .$

Solution. This is a four-point distribution with the list of mass points (x1, x2, x3, x4) = (1, 2, 3, 4) and the list of mass probabilities (p1, p2, p3, p4) = (0.1, 0.2, 0.3, 0.4). Initially, we set (a1, a2, a3, a4) = (1, 2, 3, 4) and (w1, w2, w3, w4) = (0.1, 0.2, 0.3, 0.4); all qi are equal to 1.

(i) The smallest weight is w1 = 0.1; the largest is w4 = 0.4. Set

$\begin{array}{l} w_{4} \leftarrow w_{4} - (0.25 - w_{1}) = 0.4 - (0.25 - 0.1) = 0.25, q_{1} \leftarrow 4 w_{1} = 0.4, \\ w_{1} \leftarrow 0.25, a_{1} \leftarrow x_{4} = 4. \end{array}$
(ii) Now, (w1, w2, w3, w4) = (0.25, 0.2, 0.3, 0.25). The smallest weight is w2 = 0.2; the largest is w3 = 0.3. Set

$\begin{array}{l} w_{3} \leftarrow w_{3} - (0.25 - w_{2}) = 0.3 - (0.25 - 0.2) = 0.25, q_{2} \leftarrow 4 w_{2} = 0.8, \\ w_{2} \leftarrow 0.25, a_{2} \leftarrow x_{3} = 3. \end{array}$
(iii) Now, (w1, w2, w3, w4) = (0.25, 0.25, 0.25, 0.25). All weights are equal to 0.25. Stop the algorithm.

As a result, we obtained the following two lists:

$(a_{1}, a_{2}, a_{3}, a_{4}) = (4, 3, 3, 4) and (q_{1}, q_{2}, q_{3}, q_{4}) = (0.4, 0.8, 1, 1) .$
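The levelling procedure (Steps 1 to 4) and the sampling step of Algorithm 17.5 can be sketched together as follows. The function names `build_alias_tables` and `alias_sample` are hypothetical, and the tolerance `tol` used to decide that the histogram is level is an assumption of this sketch, needed because the weights are floating-point numbers.

```python
import random


def build_alias_tables(xs, ps, tol=1e-12):
    """Level the histogram; return the probability list q and the alias list a."""
    m = len(xs)
    w, q, a = list(ps), [1.0] * m, list(xs)        # Step 1
    for _ in range(m):                             # at most m - 1 levelling steps
        lo = min(range(m), key=w.__getitem__)      # Step 2: smallest weight
        hi = max(range(m), key=w.__getitem__)      #         largest weight
        if w[hi] - w[lo] <= tol:                   # Step 3: histogram is level
            break
        q[lo] = m * w[lo]                          # Step 4
        a[lo] = xs[hi]
        w[hi] -= 1.0 / m - w[lo]
        w[lo] = 1.0 / m
    return q, a


def alias_sample(xs, q, a, rng):
    """Pick a column uniformly, then choose its own point or its alias."""
    k = int(len(xs) * rng.random())                # 0-based column index
    return xs[k] if rng.random() <= q[k] else a[k]


xs, ps = [1, 2, 3, 4], [0.1, 0.2, 0.3, 0.4]        # data of Example 17.7
q, a = build_alias_tables(xs, ps)
rng = random.Random(23)                            # seed chosen arbitrarily
draws = [alias_sample(xs, q, a, rng) for _ in range(40000)]
freq = {x: draws.count(x) / len(draws) for x in xs}
```

For the data of Example 17.7 the tables reproduce the lists obtained by hand, and each draw costs two uniform numbers and one comparison regardless of m.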

17.3.4 Acceptance-Rejection Methods

To generate realizations from some probability distribution, an acceptance-rejection method makes use of realizations of another random variable whose probability distribution is similar to the target one. The distribution from which the independent samples are generated is called a proposal distribution. Each sample can be accepted or rejected. The realizations being accepted have the target probability distribution. The computational cost of such a method is proportional to the average number of draws generated before one is accepted. Since the number of trials (draws) up to and including the first success (acceptance of a draw) follows the geometric distribution, the average number of draws generated until an acceptance occurs is the reciprocal of the probability of the acceptance of a proposal draw.

Example 17.8.

Propose an acceptance-rejection method of sampling a point uniformly distributed in the circle $C = \{(x, y) : x^{2} + y^{2} \leq 1\}$ .

Solution. The circle C is contained in the square $S = [- 1, 1]^{2}$ . To obtain a draw from the uniform distribution Unif(C), proceed as follows.

(1) Sample a random point (X, Y) uniformly distributed in S as follows: X = 2U1 − 1 and Y = 2U2 − 1, where U1 and U2 are i.i.d. Unif(0, 1)-distributed random variables.
(2) Accept the point if (X, Y) ∊ C, i.e., $X^{2} + Y^{2} \leq 1$ . Otherwise, the point is rejected and we return to (1).
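The two steps of Example 17.8 can be sketched as a short loop; the function name is hypothetical. The function also returns the number of proposals used, so that the average cost can be compared with the reciprocal of the acceptance probability, which equals the area ratio π/4 here.

```python
import math
import random


def sample_unit_disk(rng):
    """Return (x, y, trials): a uniform point in the unit disk and the proposals used."""
    trials = 0
    while True:
        trials += 1
        x = 2.0 * rng.random() - 1.0      # proposal uniform in [-1, 1]^2
        y = 2.0 * rng.random() - 1.0
        if x * x + y * y <= 1.0:          # accept if the point lies in C
            return x, y, trials


rng = random.Random(29)                   # seed chosen arbitrarily
results = [sample_unit_disk(rng) for _ in range(20000)]
mean_trials = sum(t for _, _, t in results) / len(results)
expected_trials = 4.0 / math.pi           # reciprocal of the acceptance probability
```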

The formal justification of this example and of the general acceptance-rejection method is based on Propositions 17.8–17.10, which follow.

Proposition 17.8.

Let a random vector X be uniformly distributed in a domain $D \subset ℝ^{d}$ of a finite (d-dimensional) volume |D| < ∞. Let Ω be a subdomain of D. The distribution of X conditional on X ∊ Ω is uniform in Ω.

Proof. Let dΩ be an arbitrary subdomain of Ω. Then,

$ℙ (X \in d Ω | X \in Ω) = \frac{ℙ (X \in d Ω; X \in Ω)}{ℙ (X \in Ω)} = \frac{ℙ (X \in d Ω)}{ℙ (X \in Ω)} = \frac{| d Ω | / | D |}{| Ω | / | D |} = \frac{| d Ω |}{| Ω |} .$

Thus, the assertion is proved.

In Example 17.8 we deal with a case covered by Proposition 17.8. Proposed points are sampled uniformly on the square S. A point (X, Y) is accepted if it lies in the circle C. According to Proposition 17.8, the probability distribution of (X, Y) conditional on (X, Y) ∊ C is uniform in C. As a result, accepted points are uniformly distributed in C.

The simplest acceptance-rejection algorithm is the so-called Neumann method. Consider a bounded PDF f ≤ M with a support contained in a finite interval [a, b]. The plot of f on [a, b] is contained in the rectangle [a, b] × [0, M]. To sample from f we proceed as follows. First, sample (X, Y) uniformly in the rectangle [a, b] × [0, M]. This point is accepted if Y ≤ f(X) and rejected otherwise. According to Proposition 17.8, the distribution of accepted points is uniform in the region bounded by the plot of y = f(x) and the x-axis. Moreover, we can show that the distribution of the x-coordinate of an accepted point has the PDF f. That is, X conditional on Y ≤ f(X) has the distribution with the PDF f. This result will be proved in Proposition 17.9 in a more general setting.

Definition 17.1.

Consider a nonnegative integrable function g with support $D \subseteq ℝ^{d}$, that is, g(x) ≥ 0 for x ∊ D, g(x) = 0 for x ∉ D, and $I_{D} (g) := \int_{ℝ^{d}} g (x) \, d x = \int_{D} g (x) \, d x < \infty$ hold. The region

$B_{D} (g) := \{(x, y) : x \in D, \ 0 < y < g (x)\} \subseteq ℝ^{d + 1}$

is called the body of the function g.

Proposition 17.9.

Suppose that the (d + 1)-dimensional point (X, Y), with $X \in ℝ^{d}$ and $Y \in ℝ$, is uniformly distributed in the body $B_{D} (g)$ of an integrable function g : D → [0, ∞) defined on $D \subseteq ℝ^{d}$. Then, the random vector X is distributed in D with the PDF proportional to g:

$p_{X} (x) = \frac{1}{I_{D} (g)} g (x), \quad x \in D .$

Proof. The joint PDF of (X, Y) is

$f_{X, Y} (x, y) = \frac{1}{| B_{D} (g) |} I_{B_{D} (g)} (x, y) = \frac{1}{| B_{D} (g) |} I_{D} (x) I_{(0, g (x))} (y) .$

The marginal PDF of X is

$p_{X} (x) = \int_{- \infty}^{\infty} f_{X, Y} (x, y) d y = \int_{0}^{g (x)} \frac{1}{| B_{D} (g) |} I_{D} (x) d y = \frac{1}{I_{D} (g)} g (x) I_{D} (x),$

since $I_{D} (g) = | B_{D} (g) |$.

In the Neumann method the proposed sample values are drawn from the uniform PDF $p (x) = \frac{1}{b - a} I_{(a, b)} (x)$ . This choice is explained by the fact that the PDF p majorizes the target PDF f up to a multiplicative constant: f(x) ≤ C p(x) with C = M(b − a) (provided that f(x) ≤ M for all x ∊ [a, b]). A draw X is accepted if Y ≤ f(X) where Y ~ Unif(0, M). Therefore, the Neumann method can be generalized to the case with an arbitrary PDF f as long as we can find a majorizing function for f.

In what follows, we will require the following proposition that explains how to sample a point uniformly distributed in a body of a nonnegative function.

Proposition 17.10.

Let p be a multivariate PDF with support D. Suppose that X ~ p and (Y | X = x) ~ Unif(0, C p(x)) for some constant C > 0. Then, the point (X, Y) is uniformly distributed in the domain $B_{D} (C p)$.

Proof. The joint PDF of (X, Y) is

$\begin{matrix} f_{X, Y} (x, y) = f_{X} (x) f_{Y | X} (y | x) = p (x) \frac{1}{C p (x)} I_{(0, C p (x))} (y) \\ = \frac{1}{C} I_{D} (x) I_{(0, C p (x))} (y) = \frac{1}{| B_{D} (C p) |} I_{B_{D} (C p)} (x, y), \end{matrix}$

since $| B_{D} (C p) | = \int_{D} C p (x) \, d x = C \int_{D} p (x) \, d x = C$.

Consider an n-variate PDF f with support $D \subseteq ℝ^{n}$. Suppose that there exists another PDF p, called a proposal PDF, and a constant C > 0 such that f(x) ≤ C p(x) for all $x \in ℝ^{n}$. Often p is chosen to be a simple function, such as a piecewise-linear function, so that sampling from p is feasible.

Algorithm 17.6 The Acceptance-Rejection Method (Version 1).

(1) Sample X from the proposal PDF p.
(2) Generate U ~ Unif(0, 1) independent of X.
(3) Accept X if $U \leq \frac{f (X)}{C p (X)}$ . Otherwise return to (1).

The acceptance-rejection method can be visualized as choosing a subsequence of draws from a sequence of i.i.d. realizations from the PDF p in such a way that the resulting subsequence consists of i.i.d. realizations from the target PDF f.

i.i.d. draws from p	${\tilde{X}}_{1}$	${\tilde{X}}_{2}$	${\tilde{X}}_{3}$	${\tilde{X}}_{4}$	${\tilde{X}}_{5}$	${\tilde{X}}_{6}$	. . .
Accept?	no	yes	no	no	yes	yes	. . .
i.i.d. draws from f		$X_{1}$			$X_{2}$	$X_{3}$	. . .

The proof of the acceptance-rejection method is based on Propositions 17.8–17.10. The steps of Algorithm 17.6 can be reformulated as follows. First, sample two random variables X ~ p and Y ~ Unif(0, C p(X)). As is proved in Proposition 17.10, the point (X, Y) is uniformly distributed in $B_{D} (C p)$. If (X, Y) ∊ $B_{D} (f)$, then this point is accepted. In accordance with Proposition 17.8, the accepted point is uniformly distributed in $B_{D} (f)$. Finally, as follows from Proposition 17.9, the X-coordinate is distributed with the PDF f.
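
Algorithm 17.6 can be sketched as a small generic routine. Everything named below is an assumption made for illustration: the helper `accept_reject`, the target f(x) = 3x² on (0, 1), and the proposal p(x) = 2x with C = 3/2. The acceptance test is written as U · C p(X) ≤ f(X), which is equivalent to step (3) but avoids dividing by p(X).

```python
import math
import random

def accept_reject(f, p_pdf, p_sample, C, rng):
    """Return (x, trials): one draw from f and the number of proposals it took.

    Requires f(x) <= C * p_pdf(x) for all x; p_sample(rng) must return a draw from p.
    """
    trials = 0
    while True:
        trials += 1
        x = p_sample(rng)             # step (1): X ~ p
        u = rng.random()              # step (2): U ~ Unif(0, 1), independent of X
        if u * C * p_pdf(x) <= f(x):  # step (3): U <= f(X) / (C p(X))
            return x, trials

# Illustrative pair (an assumption of this sketch): f(x) = 3x^2 <= (3/2) * 2x on (0, 1).
target = lambda x: 3.0 * x * x
proposal_pdf = lambda x: 2.0 * x
proposal_sample = lambda rng: math.sqrt(rng.random())  # inverse CDF of p

rng = random.Random(17)
out = [accept_reject(target, proposal_pdf, proposal_sample, 1.5, rng) for _ in range(40000)]
draws = [x for x, _ in out]
mean_draw = sum(draws) / len(draws)                  # exact mean of f is 3/4
acceptance_rate = len(out) / sum(t for _, t in out)  # theoretical value is 1/C = 2/3
```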

While justifying Algorithm 17.6, we did not use the fact that f is a normalized PDF. The acceptance-rejection method is often applied to complicated multivariate densities only known up to a multiplicative constant. Thus, the following generalization of Algorithm 17.6 is quite useful in dealing with such cases. Consider the sampling from a PDF proportional to some nonnegative integrable function f. Suppose that there exists another integrable function g so that it majorizes f, i.e., f(x) ≤ g(x) for all x. Let f and g have the same support D. The sampling algorithm is as follows.

Algorithm 17.7 The Acceptance-Rejection Method (Version 2).

(1) Sample X from the PDF p ∝ g, that is, $p (x) = {(\int_{ℝ^{n}} g (x) d x)}^{- 1} g (x)$ .
(2) Generate U ~ Unif(0, 1) independent of X.
(3) If $U < \frac{f (X)}{g (X)}$ , then accept X, otherwise return to step 1.

A proposed draw X is accepted if the point (X, Y) being sampled uniformly in the body of g belongs to the body of f. Therefore, the probability of accepting X equals the ratio of the volumes of $B_{D} (f)$ and $B_{D} (g)$:

$ℙ (Accept) = \frac{| B_{D} (f) |}{| B_{D} (g) |} = \frac{\int_{D} f (x) dx}{\int_{D} g (x) dx} .$

To maximize this probability, we need to choose g as close to f as possible (see Figure 17.5). The average number of trials per one accepted draw (the computational cost of the acceptance-rejection method) is

$Cost = ℙ {(Accept)}^{- 1} = \frac{\int_{D} g (x) dx}{\int_{D} f (x) dx} .$

Figure 17.5

The acceptance-rejection method.

Example 17.9.

Develop an acceptance-rejection method for the PDF

$f (x) = \frac{2 \arcsin x}{π - 2}, \quad 0 < x < 1.$

Solution. By using the property that $\arcsin x \leq \frac{π}{2} x$ for 0 < x < 1, we obtain that

$f (x) \leq g (x) := \frac{π}{π - 2} x \quad \text{for } 0 < x < 1.$

The ratio of f and g is $\frac{\arcsin x}{(π / 2) x}$ . The proposal PDF p ∝ g is given by $p (x) = 2 x I_{(0, 1)} (x)$ . The sampling algorithm is as follows.

(1) Generate two i.i.d. U1, U2 ~ Unif(0, 1).
(2) Sample X ~ p by using the inverse CDF method: $X = \sqrt{U_{1}}$ .
(3) Accept X if $U_{2} \leq \frac{\arcsin X}{(π / 2) X}$ . Otherwise, return to (1).
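
The algorithm of Example 17.9 can be sketched in Python as follows. The function name, seed, and sample size are assumptions of this illustration. Two reference values follow from the formulas above: the acceptance probability is $\int f / \int g = 2 - 4/π ≈ 0.727$, and the mean of f is $\frac{π/4}{π - 2} ≈ 0.688$.

```python
import math
import random

def sample_arcsin_pdf(rng):
    """Return (x, trials) for the PDF f(x) = 2 arcsin(x) / (pi - 2) on (0, 1)."""
    trials = 0
    while True:
        trials += 1
        u1, u2 = rng.random(), rng.random()  # step (1)
        x = math.sqrt(u1)                    # step (2): X ~ p(x) = 2x by inversion
        # Step (3): U2 <= arcsin(X) / ((pi/2) X), rearranged to avoid dividing by X.
        if u2 * (math.pi / 2.0) * x <= math.asin(x):
            return x, trials

rng = random.Random(314)
out = [sample_arcsin_pdf(rng) for _ in range(40000)]
draws = [x for x, _ in out]
mean_draw = sum(draws) / len(draws)
acceptance_rate = len(out) / sum(t for _, t in out)
```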

Example 17.10.

Develop an acceptance-rejection method for the standard normal distribution using the double-exponential distribution as the proposal distribution. Find the computational cost.

Solution. The standard normal PDF n is proportional to $f (x) = e^{- x^{2} / 2}$ . We have the following upper bound for f:

$\exp (- \frac{x^{2}}{2}) = \exp (- \frac{x^{2} - 2 | x | + 1}{2} - | x | + \frac{1}{2}) = \exp (- \frac{{(| x | - 1)}^{2}}{2}) \sqrt{e} e^{- | x |} \leq \sqrt{e} e^{- | x |} .$

So the majorizing function is $g (x) = \sqrt{e} e^{- | x |}$ . Therefore, the proposal probability distribution is the double-exponential distribution with the PDF $p (x) = \frac{1}{2} e^{- | x |}$ , which can be expressed as a mixture:

$p (x) = \frac{1}{2} e^{- x} I_{[0, \infty)} (x) + \frac{1}{2} e^{x} I_{(- \infty, 0)} (x) .$

To sample from p, we first obtain a draw from Exp(1) and then assign a random sign to it:

$X = \begin{cases} Y & \text{with probability } \frac{1}{2}, \\ - Y & \text{with probability } \frac{1}{2}, \end{cases} \quad \text{where } Y \sim Exp (1) .$

As a result, we obtain the following algorithm.

(1) Generate three i.i.d. U1, U2, U3 ~ Unif(0, 1).
(2) Sample X ~ p by using the composition method: $X = sgn (U_{1} - \frac{1}{2}) \ln U_{2}$ .
(3) Accept X if $U_{3} \leq \frac{f (X)}{g (X)} = \exp (- \frac{{(| X | - 1)}^{2}}{2})$ . Otherwise, return to (1).

The probability of acceptance PA is

$P_{A} = \frac{\int_{- \infty}^{\infty} e^{- x^{2} / 2} d x}{\sqrt{e} \int_{- \infty}^{\infty} e^{- | x |} d x} = \frac{\sqrt{2 π}}{2 \sqrt{e}} = \sqrt{\frac{π}{2 e}} ≅ 0.7602.$

Therefore, the computational cost is $E [\text{number of trials per acceptance}] = \frac{1}{P_{A}} ≅ 1.3155$ .
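
The sampler of Example 17.10 can be sketched in Python as shown below. The function name and seed are assumptions of this illustration. One small deviation from step (2) is made on purpose: the exponential variate is computed as −ln(1 − U2) rather than −ln U2, because Python's `random()` can return 0 but never 1. The observed acceptance rate can be compared with $P_{A} ≅ 0.7602$.

```python
import math
import random

def sample_std_normal(rng):
    """Return (x, trials): a standard normal draw via a double-exponential proposal."""
    trials = 0
    while True:
        trials += 1
        u1, u2, u3 = rng.random(), rng.random(), rng.random()
        y = -math.log(1.0 - u2)        # Y ~ Exp(1); 1 - u2 lies in (0, 1]
        x = y if u1 < 0.5 else -y      # random sign gives X ~ p(x) = exp(-|x|) / 2
        # Accept with probability f(X) / g(X) = exp(-(|X| - 1)^2 / 2).
        if u3 <= math.exp(-0.5 * (abs(x) - 1.0) ** 2):
            return x, trials

rng = random.Random(1711)
out = [sample_std_normal(rng) for _ in range(50000)]
draws = [x for x, _ in out]
mean_draw = sum(draws) / len(draws)
var_draw = sum((x - mean_draw) ** 2 for x in draws) / len(draws)
acceptance_rate = len(out) / sum(t for _, t in out)
```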

17.3.5 Multivariate Sampling

17.3.5.1 Sampling by Conditioning

A d-variate joint PDF fX of the random vector X = [X1, X2, . . . , Xd]⊤ can be represented as a product of univariate conditional densities:

$f_{X} (x) = f_{X_{1}} (x_{1}) f_{X_{2} | X_{1}} (x_{2} | x_{1}) \cdots f_{X_{d} | X_{1}, \ldots, X_{d - 1}} (x_{d} | x_{1}, \ldots, x_{d - 1}),$

where x = [x1, x2, . . . , xd]⊤. The sampling procedure is as follows:

Step 1. Generate $X_{1} ~ f_{X_{1}}$ .

Step 2. Generate X2 conditional on X1 from $f_{X_{2} | X_{1}}$ .

⋮

Step d. Generate $X_{d}$ conditional on $X_{1}, \ldots, X_{d - 1}$ from $f_{X_{d} | X_{1}, \ldots, X_{d - 1}}$ .

This method is simplified if the components Xj, 1 ≤ j ≤ d, are independent random variables. The joint PDF is then a product of marginal PDFs:

$f_{X} (x) = \prod_{j = 1}^{d} f_{X_{j}} (x_{j}) .$

Example 17.11.

Construct a sampling algorithm for a random vector X uniformly distributed in a hyperparallelepiped $D = \prod_{j = 1}^{d} (a_{j}, b_{j}), a_{j} < b_{j}, 1 \leq j \leq d$ .

Solution. The joint PDF is a product of d marginal uniform densities:

$\begin{array}{l} f_{X} (x) = \frac{1}{| D |} I_{D} (x) = \frac{1}{(b_{1} - a_{1}) \cdots (b_{d} - a_{d})} I_{(a_{1}, b_{1})} (x_{1}) \cdots I_{(a_{d}, b_{d})} (x_{d}) \\ = \prod_{j = 1}^{d} \frac{1}{b_{j} - a_{j}} I_{(a_{j}, b_{j})} (x_{j}) \equiv \prod_{j = 1}^{d} f_{X_{j}} (x_{j}) . \end{array}$

Therefore, the vector X is formed of d independent uniformly distributed random variables:

$X_{j} = a_{j} + (b_{j} - a_{j}) U_{j}, \quad U_{j} \sim Unif (0, 1) \text{ i.i.d.}, \quad 1 \leq j \leq d .$

17.3.5.2 The Box–Müller method

A pair of independent standard normal random variables Z1 and Z2 can be generated from two independent Unif(0, 1)-distributed random variables by using the following steps.

(1) Define the random variables R and Θ implicitly by

$Z_{1} = R \cos Θ \quad \text{and} \quad Z_{2} = R \sin Θ . \quad (17.17)$
(2) One can show that R and Θ are independent random variables. Moreover, they can be simulated by the following formulae:

$R = \sqrt{- 2 \ln U_{1}} \quad \text{and} \quad Θ = 2 π U_{2}, \quad (17.18)$

where U1 and U2 are independent Unif(0, 1)-distributed random variables.
(3) Therefore, Z1 and Z2 can be expressed in terms of U1 and U2 as follows:

$Z_{1} = \sqrt{- 2 \ln U_{1}} \cos (2 π U_{2}) \quad \text{and} \quad Z_{2} = \sqrt{- 2 \ln U_{1}} \sin (2 π U_{2}) . \quad (17.19)$

To justify the Box–Müller method we apply the following theorem.

Theorem 17.11

(Bivariate Transformation Theorem; see, e.g., Gut (2009)). Consider jointly continuous random variables X and Y, and a one-to-one continuously differentiable bivariate transformation defined on the support of (X, Y) by u = g(x, y) and v = h(x, y). The joint PDF of U := g(X, Y) and V := h(X, Y) is $f_{U, V} (u, v) = \frac{1}{| J (x, y) |} f_{X, Y} (x, y)$ , where (x, y) is the unique solution to $\begin{cases} g (x, y) = u \\ h (x, y) = v \end{cases}$ and J(x, y) is the Jacobian determinant of the transformation defined by

$J (x, y) = \det [\begin{array}{cc} \frac{\partial g}{\partial x} (x, y) & \frac{\partial g}{\partial y} (x, y) \\ \frac{\partial h}{\partial x} (x, y) & \frac{\partial h}{\partial y} (x, y) \end{array}] .$

Since Z1 and Z2 are independent standard normal random variables, their joint PDF is

$f_{Z_{1}, Z_{2}} (z_{1}, z_{2}) = n (z_{1}) n (z_{2}) = \frac{1}{2 π} e^{- \frac{1}{2} (z_{1}^{2} + z_{2}^{2})} .$

The bivariate transformation theorem allows us to obtain a joint PDF of the pair (R, Θ). The Jacobian determinant of the transformation in (17.17) is equal to r. Thus,

$f_{Z_{1}, Z_{2}} (r \cos θ, r \sin θ) = \frac{1}{r} f_{R, Θ} (r, θ) \Rightarrow f_{R, Θ} (r, θ) = r e^{- r^{2} / 2} \frac{1}{2 π},$

for r > 0 and 0 < θ < 2π. The joint PDF of R and Θ is a product of the marginal PDFs $f_{R} (r) = r e^{- r^{2} / 2} I_{[0, \infty)} (r)$ and $f_{Θ} (θ) = \frac{1}{2 π} I_{[0, 2 π)} (θ)$ . Therefore, R and Θ are independent random variables. Let us apply the inverse CDF method to generate R:

$F_{R} (r) = 1 - e^{- r^{2} / 2} \Rightarrow R = F_{R}^{- 1} (1 - U_{1}) = \sqrt{- 2 \ln U_{1}},$

where U1 ~ Unif(0, 1). Moreover, Θ ~ Unif(0, 2π), hence Θ = 2πU2 with U2 ~ Unif(0, 1). So, (17.18) is proved and (17.19) follows.
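
Formula (17.19) translates directly into code. The sketch below is an illustration only; the function name and seed are assumptions, and U1 is taken as 1 − `random()` so that it lies in (0, 1] and the logarithm is always finite. The sample moments give a quick check that the two outputs behave like uncorrelated standard normals.

```python
import math
import random

def box_muller(rng):
    """Return a pair (Z1, Z2) of independent standard normals built from two uniforms."""
    u1 = 1.0 - rng.random()                 # in (0, 1], so log(u1) is finite
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))      # R from (17.18)
    theta = 2.0 * math.pi * u2              # Theta from (17.18)
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(1958)
pairs = [box_muller(rng) for _ in range(50000)]
n = len(pairs)
mean1 = sum(z1 for z1, _ in pairs) / n
mean2 = sum(z2 for _, z2 in pairs) / n
var1 = sum((z1 - mean1) ** 2 for z1, _ in pairs) / n
var2 = sum((z2 - mean2) ** 2 for _, z2 in pairs) / n
cov12 = sum((z1 - mean1) * (z2 - mean2) for z1, z2 in pairs) / n
```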

17.3.5.3 Simulation of Multivariate Normals

Consider a multivariate normal vector X = [X1, X2, . . . , Xd]⊤ ~ Normd(μ, Σ). If the covariance matrix Σ is a diagonal matrix diag $(σ_{1}^{2}, . . ., σ_{d}^{2})$ , then Xj, 1 ≤ j ≤ d, are all independent normals which can be expressed in terms of independent standard normal variables as follows:

$X_{j} = μ_{j} + σ_{j} Z_{j}, \quad Z_{j} \sim Norm (0, 1), \quad 1 \leq j \leq d .$

Independent standard normals can be generated by the Box–Müller method, by the acceptance-rejection method, or by the inverse CDF method. In the last case we set $Z = N^{- 1} (U)$ with U ~ Unif(0, 1), where the inverse normal CDF $N^{- 1}$ can be calculated numerically. One interesting application of standard normal random variables is the sampling of an isotropic vector in d dimensions.

(1) Sample i.i.d. Z1, . . . , Zd ~ Norm(0, 1)
(2) Define the vector X = [X1, X2, . . . , Xd]⊤ by $X_{j} = Z_{j} / R_{d}$, 1 ≤ j ≤ d, where $R_{d}^{2} = Z_{1}^{2} + \cdots + Z_{d}^{2}$ . [Note that $R_{d}^{2}$ is a chi-square random variable with d degrees of freedom.]
(3) As a result, X is uniformly distributed on a unit d-dimensional sphere.
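
The three steps above can be sketched as follows. The function name, the dimension d = 3, and the seed are assumptions of this illustration. By symmetry each coordinate has mean 0 and second moment 1/d, which the sketch estimates as a sanity check.

```python
import math
import random

def sample_on_sphere(d, rng):
    """Return a point uniformly distributed on the unit sphere in R^d."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]  # step (1): i.i.d. standard normals
    r = math.sqrt(sum(v * v for v in z))         # R_d, the Euclidean norm of Z
    return [v / r for v in z]                    # step (2): normalize

rng = random.Random(3)
d = 3
points = [sample_on_sphere(d, rng) for _ in range(20000)]
mean_first = sum(p[0] for p in points) / len(points)
mean_sq_first = sum(p[0] ** 2 for p in points) / len(points)  # should be near 1/d
```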

Suppose that the covariance matrix Σ has nonzero off-diagonal elements. Consider the following two general methods of sampling X:

(1) using the Cholesky factorization of the covariance matrix;
(2) using the conditional normal distribution.

Sampling by the Cholesky Factorization.

Let L be a lower-triangular matrix from the Cholesky factorization of Σ, i.e., Σ = L L⊤.

Let Z be a d-dimensional vector formed by i.i.d. standard normals Zj ~ Norm(0, 1), j = 1, 2, . . . , d. Then, we set

$X := μ + L Z \sim Norm_{d} (μ, Σ) .$
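
A minimal NumPy sketch of the Cholesky approach is given below. The function name, the mean vector, and the 2 × 2 covariance matrix are hypothetical values chosen for illustration. The sketch relies on `numpy.linalg.cholesky`, which returns the lower-triangular factor L, and generates many draws at once by storing each vector Z as a row, so that μ + L Z becomes `mu + Z @ L.T`.

```python
import numpy as np

def sample_mvn_cholesky(mu, sigma, n, rng):
    """Return an (n, d) array whose rows are draws from Norm_d(mu, sigma)."""
    mu = np.asarray(mu, dtype=float)
    L = np.linalg.cholesky(np.asarray(sigma, dtype=float))  # sigma = L @ L.T
    Z = rng.standard_normal((n, mu.size))                   # rows of i.i.d. Norm(0, 1)
    return mu + Z @ L.T                                     # row-wise version of mu + L Z

rng = np.random.default_rng(7)
mu = [1.0, -2.0]                     # illustrative mean vector
sigma = [[4.0, 1.2], [1.2, 1.0]]     # illustrative covariance matrix
X = sample_mvn_cholesky(mu, sigma, 100000, rng)
sample_mean = X.mean(axis=0)
sample_cov = np.cov(X, rowvar=False)
```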

Conditional Normal.

Let us split the vector X into two parts:

$X = [\begin{matrix} X_{1} \\ X_{2} \end{matrix}], \quad \text{where } X_{1} \in ℝ^{m} \text{ and } X_{2} \in ℝ^{d - m},$

for some 1 ≤ m < d. Split also the vector μ and matrix Σ to represent them in block form:

$μ = [\begin{matrix} μ_{1} \\ μ_{2} \end{matrix}] \quad \text{and} \quad Σ = [\begin{matrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{matrix}],$

where $μ_{1} \in ℝ^{m}$ and $μ_{2} \in ℝ^{d - m}$ are vectors; $Σ_{11} \in ℝ^{m \times m}$, $Σ_{12} \in ℝ^{m \times (d - m)}$, $Σ_{21} \in ℝ^{(d - m) \times m}$, and $Σ_{22} \in ℝ^{(d - m) \times (d - m)}$ are matrices. Then, the conditional distribution of X1 given the value of X2 is normal:

$X_{1} | \{X_{2} = x_{2}\} \sim Norm_{m} (μ_{1} + Σ_{12} Σ_{22}^{- 1} (x_{2} - μ_{2}), \ Σ_{11} - Σ_{12} Σ_{22}^{- 1} Σ_{21}) . \quad (17.20)$

Example 17.12.

Construct two methods of sampling from the trivariate normal distribution

$Norm_{3} ([\begin{matrix} 3 \\ 2 \\ 4 \end{matrix}], [\begin{array}{ccc} 9 & 0 & 0 \\ 0 & 4 & 2 \\ 0 & 2 & 3 \end{array}]),$

using the Cholesky factorization and the conditional sampling approach, respectively.

Solution.

Method 1 (Cholesky factorization). Let us solve the matrix equation Σ = L L⊤ to find the lower-triangular matrix L:

$[\begin{array}{ccc} 9 & 0 & 0 \\ 0 & 4 & 2 \\ 0 & 2 & 3 \end{array}] = [\begin{array}{ccc} ℓ_{11} & 0 & 0 \\ ℓ_{21} & ℓ_{22} & 0 \\ ℓ_{31} & ℓ_{32} & ℓ_{33} \end{array}] [\begin{array}{ccc} ℓ_{11} & ℓ_{21} & ℓ_{31} \\ 0 & ℓ_{22} & ℓ_{32} \\ 0 & 0 & ℓ_{33} \end{array}] \Rightarrow L = [\begin{array}{ccc} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 1 & \sqrt{2} \end{array}] .$

Thus, we obtain the following sampling formulae:

$[\begin{matrix} X_{1} \\ X_{2} \\ X_{3} \end{matrix}] = [\begin{matrix} 3 \\ 2 \\ 4 \end{matrix}] + [\begin{array}{ccc} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 1 & \sqrt{2} \end{array}] [\begin{matrix} Z_{1} \\ Z_{2} \\ Z_{3} \end{matrix}] \Rightarrow \begin{cases} X_{1} = 3 + 3 Z_{1}, \\ X_{2} = 2 + 2 Z_{2}, \\ X_{3} = 4 + Z_{2} + \sqrt{2} Z_{3}, \end{cases}$

where Z1, Z2, Z3 are i.i.d. standard normals.

Method 2 (conditional sampling). First, sample X1 ~ Norm(3, 9): X1 = 3 + 3Z1. Second, sample X2 conditional on X1. Recall that two jointly normal variables are independent iff they are uncorrelated. Hence X2 is independent of X1 since Cov(X1, X2) = 0. Therefore, $(X_{2} | X_{1}) \overset{d}{=} X_{2} \sim Norm (2, 4)$ and X2 = 2 + 2Z2. Third, sample X3 conditional on X1 and X2. Again, X3 and X1 are uncorrelated, hence $(X_{3} | X_{1}, X_{2}) \overset{d}{=} (X_{3} | X_{2})$ . We have

$[\begin{matrix} X_{3} \\ X_{2} \end{matrix}] \sim Norm_{2} ([\begin{matrix} 4 \\ 2 \end{matrix}], [\begin{matrix} 3 & 2 \\ 2 & 4 \end{matrix}]) \Rightarrow X_{3} | X_{2} \sim Norm (3 + \frac{X_{2}}{2}, 2) .$

Thus, we have $X_{3} = 3 + \frac{X_{2}}{2} + \sqrt{2} Z_{3} = 4 + Z_{2} + \sqrt{2} Z_{3}$ .
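
Both sampling schemes of Example 17.12 can be compared numerically. The sketch below is an illustration with assumed function names and seed; it uses `random.gauss(mean, std)`, whose second argument is the standard deviation rather than the variance. Either sampler should reproduce Cov(X2, X3) = 2, Var(X3) = 3, and Cov(X1, X3) = 0 up to Monte Carlo error.

```python
import math
import random

def sample_by_cholesky(rng):
    """One draw via X = mu + L Z with the factor L found in the solution."""
    z1, z2, z3 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (3.0 + 3.0 * z1, 2.0 + 2.0 * z2, 4.0 + z2 + math.sqrt(2.0) * z3)

def sample_by_conditioning(rng):
    """One draw via X1, then X2, then X3 given X2."""
    x1 = rng.gauss(3.0, 3.0)                        # X1 ~ Norm(3, 9)
    x2 = rng.gauss(2.0, 2.0)                        # X2 ~ Norm(2, 4), independent of X1
    x3 = rng.gauss(3.0 + x2 / 2.0, math.sqrt(2.0))  # X3 | X2 ~ Norm(3 + X2/2, 2)
    return (x1, x2, x3)

def sample_cov(xs, i, j):
    """Sample covariance between coordinates i and j."""
    n = len(xs)
    mi = sum(x[i] for x in xs) / n
    mj = sum(x[j] for x in xs) / n
    return sum((x[i] - mi) * (x[j] - mj) for x in xs) / n

rng = random.Random(31)
a = [sample_by_cholesky(rng) for _ in range(50000)]
b = [sample_by_conditioning(rng) for _ in range(50000)]
cov23 = (sample_cov(a, 1, 2), sample_cov(b, 1, 2))
var3 = (sample_cov(a, 2, 2), sample_cov(b, 2, 2))
cov13 = (sample_cov(a, 0, 2), sample_cov(b, 0, 2))
```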

17.4 Simulation of Random Processes

A typical problem that requires simulation of sample paths of a stochastic process $\{X (t)\}_{t \geq 0}$ is the estimation of a mathematical expectation of the form

$E [g (\{X (t) : 0 \leq t \leq T\})] \quad (17.21)$

with some function g of an X-path. There are several possible cases.

(1) The function g depends on a discretely monitored skeleton of the process X:

$g = g (X (t_{1}), X (t_{2}), \ldots, X (t_{m})), \quad 0 \leq t_{1} < t_{2} < \cdots < t_{m} \leq T .$

One special case is when g = g(X(T)). For example, the estimation of E[g(X(T))] is required to price a European-style option.
(2) The function g depends on path-dependent quantities such as the running maximum/minimum of the process and the first passage time. It may be possible to sample such path-dependent quantities directly from their distributions rather than calculate them from a sample path.
(3) The function g depends on a full sample path of the process X on [0, T]. Since it may not be feasible to generate a complete sample path of a continuous-time process (unless we deal with a Poisson process or a similar process with piecewise-constant paths that change at a finite number of time points), such a full path can only be obtained by applying an interpolation algorithm to a path skeleton.

So our goal is to sample a path skeleton

$X (t_{1}), X (t_{2}), \ldots, X (t_{m}) \quad \text{for } 0 \leq t_{1} < t_{2} < \cdots < t_{m} \leq T .$

The skeleton can be generated from its exact multivariate distribution. In this case, the problem (17.21) can be reduced to the estimation of a multivariate integral of the form

$\int_{ℝ^{m}} g (x_{1}, x_{2}, \ldots, x_{m}) f_{X (t_{1}), X (t_{2}), \ldots, X (t_{m})} (x_{1}, x_{2}, \ldots, x_{m}) \, d x_{1} \, d x_{2} \cdots d x_{m},$

where f is a joint PDF of X(t1), X(t2), . . . , X(tm). Another approach is to sample an approximation path by applying some discretization scheme. Note that Brownian motion and other Gaussian processes as well as some jump processes can be sampled exactly from their path distributions. General diffusions can be simulated approximately by using, for example, the Euler approximation scheme.

17.4.1 Simulation of Brownian Processes

17.4.1.1 Sequential Sampling

The sequential sampling of Brownian motion (BM) and geometric Brownian motion is based on the property that Brownian increments on nonoverlapping intervals are independent. Consider a scaled Brownian motion with drift, $W_{x_{0}}^{(μ, σ)} (t) : = x_{0} + μ t + σ W (t)$ . The standard BM is recovered from the process $W_{x_{0}}^{(μ, σ)}$ if we take x0 = 0, μ = 0, and σ = 1. Suppose that the process $W_{x_{0}}^{(μ, σ)}$ is to be sampled at a set of time points 0 = t0 < t1 < t2 < · · · < tm. For all j ≥ 1, we have

$W_{x_{0}}^{(μ, σ)} (t_{j}) = x_{0} + μ t_{j} + σ W (t_{j}) = W_{x_{0}}^{(μ, σ)} (t_{j - 1}) + μ (t_{j} - t_{j - 1}) + σ (W (t_{j}) - W (t_{j - 1})) .$

Since the increment W(tj) − W(tj−1) ~ Norm(0, tj − tj−1) is independent of $W_{x_{0}}^{(μ, σ)} (t_{j - 1})$ , we obtain the following simple algorithm.

Algorithm 17.8 Sequential Simulation of a Scaled BM with Drift.

input: x0, μ, σ, 0 = t0 < t1 < t2 < · · · < tm

set $W_{x_{0}}^{(μ, σ)} (0) = x_{0}$

for j from 1 to m do

generate Zj ← Norm(0, 1)

set $W_{x_{0}}^{(μ, σ)} (t_{j}) \leftarrow W_{x_{0}}^{(μ, σ)} (t_{j - 1}) + μ (t_{j} - t_{j - 1}) + σ \sqrt{t_{j} - t_{j - 1}} Z_{j}$

end for

return $\{W_{x_{0}}^{(μ, σ)} (t_{j})\}_{0 \leq j \leq m}$

The sample path of a geometric Brownian motion $S (t) = S_{0} e^{μ t + σ W (t)} = e^{\ln S_{0} + μ t + σ W (t)}$ can be obtained by taking the exponential function of a sample path of the scaled BM with drift $W_{x_{0}}^{(μ, σ)} (t)$ that starts at x0 = ln S0, i.e., $S (t) = \exp (W_{\ln S_{0}}^{(μ, σ)} (t))$ .
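
Algorithm 17.8 and the exponential map to geometric Brownian motion can be sketched in Python as follows. The function names, parameter values, time grid, and seed are assumptions chosen for illustration. The terminal value of the scaled BM with drift has mean x0 + μT and variance σ²T, which the sketch estimates from repeated paths.

```python
import math
import random

def simulate_bm_with_drift(x0, mu, sigma, times, rng):
    """Values of x0 + mu*t + sigma*W(t) at times[0] = 0 < times[1] < ... (Algorithm 17.8)."""
    path = [x0]
    for j in range(1, len(times)):
        dt = times[j] - times[j - 1]
        z = rng.gauss(0.0, 1.0)  # Z_j ~ Norm(0, 1)
        path.append(path[-1] + mu * dt + sigma * math.sqrt(dt) * z)
    return path

def simulate_gbm(s0, mu, sigma, times, rng):
    """GBM skeleton obtained by exponentiating a scaled BM started at ln(s0)."""
    log_path = simulate_bm_with_drift(math.log(s0), mu, sigma, times, rng)
    return [math.exp(w) for w in log_path]

rng = random.Random(8)
times = [0.0, 0.25, 0.5, 1.0]
ends = [simulate_bm_with_drift(1.0, 0.5, 2.0, times, rng)[-1] for _ in range(20000)]
end_mean = sum(ends) / len(ends)                              # theory: 1 + 0.5 * 1 = 1.5
end_var = sum((e - end_mean) ** 2 for e in ends) / len(ends)  # theory: 2^2 * 1 = 4
gbm_path = simulate_gbm(100.0, 0.05, 0.2, times, rng)
```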

Now we consider a multidimensional BM $W (t) = {[W_{1} (t), W_{2} (t), \ldots, W_{d} (t)]}^{\top}$ . Each component of W(t) is a standard Brownian motion. Suppose that the processes $W_{j}$, 1 ≤ j ≤ d, are correlated. For 1 ≤ i, j ≤ d, the correlation coefficient between $W_{i} (t)$ and $W_{j} (t)$ is

$ρ_{i j} = Corr (W_{i} (t), W_{j} (t)) = \frac{E [W_{i} (t) W_{j} (t)] - E [W_{i} (t)] E [W_{j} (t)]}{\sqrt{E [W_{i}^{2} (t)] E [W_{j}^{2} (t)]}} = \frac{E [W_{i} (t) W_{j} (t)]}{t} .$

Let $R = [ρ_{i j}]_{1 \leq i, j \leq d}$ be the correlation matrix. R is a positive definite matrix with ones on the main diagonal. If we deal with independent Brownian motions, then R = I. Apply the Cholesky factorization to find a lower-triangular matrix L so that R = L L⊤. For example, for the two-dimensional case we have

$R = [\begin{matrix} 1 & ρ_{12} \\ ρ_{12} & 1 \end{matrix}] = L L^{\top} \Rightarrow L = [\begin{matrix} 1 & 0 \\ ρ_{12} & \sqrt{1 - ρ_{12}^{2}} \end{matrix}] .$

Algorithm 17.9 allows us to obtain a realization of W at time points t0, t1, . . . , tm with 0 = t0 < t1 < · · · < tm.

17.4.1.2 Bridge Sampling

Previously, we derived the probability distribution of Brownian motion pinned at the endpoints of a time interval. Recall that Brownian motion conditional on W (0) = a and W (T) = b is called a Brownian bridge from a to b on [0, T]. There exist several applications of the Brownian bridge. First, the bridge distribution can be used to refine a sample skeleton. Second, it can be used as an alternative to the sequential simulation method for sampling a Brownian trajectory.

Algorithm 17.9 Sequential Simulation of a Standard d-Dimensional BM.

input: L and 0 = t0 < t1 < t2 < · · · < tm

set W(0) = 0

for j from 1 to m do

generate d i.i.d. variates $Z_{i}^{j} \leftarrow Norm (0, 1)$ , 1 ≤ i ≤ d

set $W (t_{j}) \leftarrow W (t_{j - 1}) + \sqrt{t_{j} - t_{j - 1}} \, L Z^{j}$ , where $Z^{j} = {[Z_{1}^{j}, Z_{2}^{j}, \ldots, Z_{d}^{j}]}^{\top}$

end for

return $\{W (t_{j})\}_{0 \leq j \leq m}$
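
Algorithm 17.9 can be sketched in plain Python as shown below. The function name, the correlation ρ = 0.6, the time grid, and the seed are assumptions of this illustration. In two dimensions the Cholesky factor of the correlation matrix has rows (1, 0) and (ρ, √(1 − ρ²)). At the terminal time T = 2 each component should have variance T and the two components should have covariance ρT.

```python
import math
import random

def simulate_correlated_bm(L, times, rng):
    """Skeleton of a d-dimensional BM with correlation matrix L L^T (Algorithm 17.9)."""
    d = len(L)
    w = [0.0] * d
    path = [w]
    for j in range(1, len(times)):
        root_dt = math.sqrt(times[j] - times[j - 1])
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # (L z)_i only involves the lower triangle of L.
        w = [w[i] + root_dt * sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(d)]
        path.append(w)
    return path

rho = 0.6  # illustrative correlation coefficient
L = [[1.0, 0.0], [rho, math.sqrt(1.0 - rho * rho)]]
rng = random.Random(99)
times = [0.0, 0.5, 1.0, 2.0]
ends = [simulate_correlated_bm(L, times, rng)[-1] for _ in range(20000)]
n = len(ends)
m1 = sum(e[0] for e in ends) / n
m2 = sum(e[1] for e in ends) / n
var1 = sum((e[0] - m1) ** 2 for e in ends) / n
var2 = sum((e[1] - m2) ** 2 for e in ends) / n
cov12 = sum((e[0] - m1) * (e[1] - m2) for e in ends) / n
```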

Suppose that a standard BM is sampled at m time moments 0 = t0 < t1 < · · · < tm = T and we wish to sample W (s) at some additional time moment s ∊ (0, T) conditional on these values. Let s ∊ (tj, tj+1) for some j ∊ {0, 1, . . . , m−1}. It follows from the Markov property of BM that

$(W (s) | \{W (t_{i}) = x_{i}, 0 \leq i \leq m\}) \overset{d}{=} (W (s) | \{W (t_{j}) = x_{j}, W (t_{j + 1}) = x_{j + 1}\}) .$

Hence W(s) can be sampled from the distribution of a Brownian bridge on $[t_{j}, t_{j + 1}]$. By applying this procedure, a sample path can be refined without re-sampling its values at t1, t2, . . . , tm.
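
The refinement step can be sketched as follows, using the standard fact that a Brownian bridge from $x_{j}$ at time $t_{j}$ to $x_{j + 1}$ at time $t_{j + 1}$ is normal at time s with mean $x_{j} + \frac{s - t_{j}}{t_{j + 1} - t_{j}} (x_{j + 1} - x_{j})$ and variance $\frac{(s - t_{j}) (t_{j + 1} - s)}{t_{j + 1} - t_{j}}$. The function name and the numerical values are assumptions chosen for illustration.

```python
import math
import random

def sample_bridge_point(s, t_left, x_left, t_right, x_right, rng):
    """Sample W(s) given W(t_left) = x_left and W(t_right) = x_right, t_left < s < t_right."""
    weight = (s - t_left) / (t_right - t_left)
    mean = x_left + weight * (x_right - x_left)            # linear interpolation
    var = (s - t_left) * (t_right - s) / (t_right - t_left)
    return rng.gauss(mean, math.sqrt(var))

rng = random.Random(64)
# Illustrative pinned values: W(1) = 0.5 and W(3) = -1.5, refined at s = 1.5.
draws = [sample_bridge_point(1.5, 1.0, 0.5, 3.0, -1.5, rng) for _ in range(50000)]
draw_mean = sum(draws) / len(draws)                             # theory: 0.0
draw_var = sum((x - draw_mean) ** 2 for x in draws) / len(draws)  # theory: 0.375
```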

Consider the so-called dyadic partition of the time interval [0, T] into $m = 2^{k}$ subintervals by the points $t_{j} = \frac{j}{m} T$ , where j = 0, 1, . . . , m and k ≥ 1. Let a realization of Brownian motion be sampled as follows:

$\begin{array}{ll} \text{Step 1:} & \text{sample } W (t_{m}) \text{ conditional on } W (t_{0}) = 0, \\ \text{Step 2:} & \text{sample } W (t_{m / 2}) \text{ conditional on } W (t_{0}), W (t_{m}), \\ \text{Step 3:} & \text{sample } W (t_{m / 4}) \text{ conditional on } W (t_{0}), W (t_{m / 2}), \\ \text{Step 4:} & \text{sample } W (t_{3 m / 4}) \text{ conditional on } W (t_{m / 2}), W (t_{m}), \\ \vdots & \\ \text{Step } m \text{:} & \text{sample } W (t_{m - 1}) \text{ conditional on } W (t_{m - 2}), W (t_{m}) . \end{array}$

In other words, first we sample $W (t_{m})$ and after that for each $t_{j}$, 1 ≤ j ≤ m − 1, $W (t_{j})$ is sampled conditionally on the previously generated values $W (t_{ℓ})$ and $W (t_{r})$, where the indices ℓ and r satisfy 0 ≤ ℓ < j < r ≤ m and $j = \frac{ℓ + r}{2}$ . As a result, a trajectory of BM is sampled at the time points in the following order of generation:

$\underbrace{t_{m}}, \ \underbrace{t_{m / 2}}, \ \underbrace{t_{m / 4}, t_{3 m / 4}}, \ \underbrace{t_{m / 8}, t_{3 m / 8}, t_{5 m / 8}, t_{7 m / 8}}, \ \ldots, \ \underbrace{t_{2}, t_{6}, t_{10}, \ldots, t_{m - 2}}, \ \underbrace{t_{1}, t_{3}, \ldots, t_{m - 1}} \quad (17.22)$

Bridge sampling with m = 8 time points is illustrated in Figure 17.6.

Figure 17.6

The bridge sampling. Here Wj denotes W(tj) for 0 ≤ j ≤ m.

The bridge sampling algorithm is useful in pricing path-dependent financial instruments. When applied with (randomized) low-discrepancy numbers, i.e., when the (randomized) quasi-Monte Carlo method is used, it allows us to reduce the variance of a path-dependent estimator. Another advantage is that the bridge sampling algorithm can be easily parallelized.

Algorithm 17.10 Brownian Bridge Sampling for a Dyadic Time Partition.

input: the time points $t_{j} = \frac{j}{m} T$, j = 0, 1, . . . , m, where $m = 2^{k}$

generate Z ← Norm(0, 1)

set W(t0) ← 0, $W (t_{m}) \leftarrow \sqrt{T} Z$ , and h ← T

for ℓ from 1 to k do

set h ← h/2

for j from 1 to $2^{ℓ - 1}$ do

generate Z ← Norm(0, 1)

set $W (t_{(2 j - 1) 2^{k - ℓ}}) \leftarrow \frac{1}{2} (W (t_{(j - 1) 2^{k - ℓ + 1}}) + W (t_{j 2^{k - ℓ + 1}})) + \sqrt{h / 2} \, Z$

end for

end for

return $\{W (t_{j})\}_{0 \leq j \leq m}$
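
A Python sketch of dyadic bridge sampling is given below. The function name, T = 1, k = 3, and the seed are assumptions of this illustration. At level ℓ the intervals being bisected have length 2h with h = T/2^ℓ, and the midpoint of a Brownian bridge over an interval of length 2h has variance h/2, so the noise term is scaled by $\sqrt{h / 2}$. Since a standard BM satisfies E[W(s)W(t)] = s ∧ t, the sampled paths can be checked against this covariance.

```python
import math
import random

def brownian_bridge_path(T, k, rng):
    """Sample W at t_j = j*T/m, j = 0..m, with m = 2**k, in bridge (dyadic) order."""
    m = 2 ** k
    w = [0.0] * (m + 1)
    w[m] = math.sqrt(T) * rng.gauss(0.0, 1.0)  # terminal value first
    h = T
    for level in range(1, k + 1):
        h /= 2.0                    # half the length of the intervals being bisected
        step = 2 ** (k - level)     # index distance from an endpoint to the midpoint
        for j in range(1, 2 ** (level - 1) + 1):
            left, right = (j - 1) * 2 * step, j * 2 * step
            mid = (2 * j - 1) * step
            # Bridge midpoint: average of the endpoints plus Norm(0, h/2) noise.
            w[mid] = 0.5 * (w[left] + w[right]) + math.sqrt(h / 2.0) * rng.gauss(0.0, 1.0)
    return w

rng = random.Random(5)
paths = [brownian_bridge_path(1.0, 3, rng) for _ in range(20000)]

def second_moment(i, j):
    """Monte Carlo estimate of E[W(t_i) W(t_j)], which equals min(t_i, t_j)."""
    return sum(p[i] * p[j] for p in paths) / len(paths)
```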

17.4.2 Simulation of Gaussian Processes

In accordance with the definition of a Gaussian process $\{X (t)\}_{t \geq 0}$, for any sequence of time points 0 < t1 < t2 < · · · < tm, the vector $X = {[X (t_{1}), X (t_{2}), \ldots, X (t_{m})]}^{\top}$ has a multivariate normal distribution $Norm_{m} (μ_{m}, Σ_{m})$ with

$μ_{m} = [\begin{matrix} m_{X} (t_{1}) \\ m_{X} (t_{2}) \\ ⋮ \\ m_{X} (t_{m}) \end{matrix}] \quad \text{and} \quad Σ_{m} = [\begin{array}{cccc} c_{X} (t_{1}, t_{1}) & c_{X} (t_{1}, t_{2}) & \dots & c_{X} (t_{1}, t_{m}) \\ c_{X} (t_{2}, t_{1}) & c_{X} (t_{2}, t_{2}) & \dots & c_{X} (t_{2}, t_{m}) \\ ⋮ & ⋮ & ⋱ & ⋮ \\ c_{X} (t_{m}, t_{1}) & c_{X} (t_{m}, t_{2}) & \dots & c_{X} (t_{m}, t_{m}) \end{array}],$

where $m_{X} (t) = E [X (t)]$ and $c_{X} (t, s) = Cov (X (t), X (s))$ are, respectively, the mean and covariance functions. Therefore, the realization of X at time points t1, t2, . . . , tm can be constructed by sampling from the m-variate normal distribution $Norm_{m} (μ_{m}, Σ_{m})$ as follows.

(1) Apply the Cholesky factorization to find a lower-triangular matrix $L_{m}$ so that

$Σ_{m} = L_{m} L_{m}^{\top} .$
(2) Sample m i.i.d. standard normals Z1, Z2, . . . , Zm.
(3) Set $X = μ_{m} + L_{m} Z$ where Z = [Z1, Z2, . . . , Zm]⊤.

This generic algorithm can be applied to any Gaussian process including those listed below.

Brownian Motion $W_{x_{0}}^{(μ, σ)} (t) := x_{0} + μ t + σ W (t)$ with $m (t) = x_{0} + μ t$ and $c (t, s) = σ^{2} (t \land s)$ for s, t ≥ 0.

Itô Processes $X (t) := \int_{0}^{t} μ (s) \, d s + \int_{0}^{t} σ (s) \, d W (s)$ with deterministic coefficients μ and σ, $m (t) = \int_{0}^{t} μ (u) \, d u$ and $c (t, s) = \int_{0}^{t \land s} σ^{2} (u) \, d u$ for s, t ≥ 0.

Fractional Brownian Motion $W^{(H)} (t)$ with $c (t, s) = (t^{2 H} + s^{2 H} - | t - s |^{2 H}) / 2$ and m(t) = 0. Here, H ∊ (0, 1) is the so-called Hurst parameter. Note that $W^{(1 / 2)}$ is a standard BM with c(t, s) = (t + s − |t − s|)/2 = t ∧ s for s, t ≥ 0.

Standard Brownian Bridge from a to b on [0, T], denoted $\{B_{[0, T]}^{a, b} (t)\}_{0 \leq t \leq T}$ , with $m (t) = \frac{a (T - t) + b t}{T}$ and $c (t, s) = \frac{(t \land s) (T - t \lor s)}{T}$ for 0 ≤ s, t ≤ T.
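
The generic three-step algorithm can be sketched with NumPy as shown below, here applied to fractional Brownian motion. The function names, the Hurst parameter H = 0.7, the time grid, and the seed are assumptions of this illustration. The time points must be strictly positive and distinct so that the covariance matrix is positive definite and `numpy.linalg.cholesky` succeeds.

```python
import numpy as np

def sample_gaussian_process(mean_fn, cov_fn, times, n_paths, rng):
    """Rows of the result are skeletons [X(t_1), ..., X(t_m)] of a Gaussian process."""
    t = np.asarray(times, dtype=float)
    mu = np.array([mean_fn(ti) for ti in t])
    sigma = np.array([[cov_fn(ti, sj) for sj in t] for ti in t])
    L = np.linalg.cholesky(sigma)               # step (1): sigma = L @ L.T
    Z = rng.standard_normal((n_paths, t.size))  # step (2): i.i.d. standard normals
    return mu + Z @ L.T                         # step (3), applied to each row of Z

def fbm_cov(H):
    """Covariance function of fractional Brownian motion with Hurst parameter H."""
    return lambda t, s: 0.5 * (t ** (2 * H) + s ** (2 * H) - abs(t - s) ** (2 * H))

rng = np.random.default_rng(42)
times = [0.25, 0.5, 0.75, 1.0]
H = 0.7  # illustrative Hurst parameter
paths = sample_gaussian_process(lambda t: 0.0, fbm_cov(H), times, 50000, rng)
sample_cov = np.cov(paths, rowvar=False)
exact_cov = np.array([[fbm_cov(H)(t, s) for s in times] for t in times])
```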

17.4.3 Diffusion Processes: Exact Simulation Methods

A diffusion process $\{X (t)\}_{t \geq 0}$ is a solution to an initial value problem for a stochastic differential equation (SDE):

$\begin{cases} d X (t) = μ (t, X (t)) \, d t + σ (t, X (t)) \, d W (t), & t \geq 0, \\ X (0) = X_{0} . \end{cases} \quad (17.23)$

The process can also be written in the integral form

$X (t) = X_{0} + \int_{0}^{t} μ (s, X (s)) d s + \int_{0}^{t} σ (s, X (s)) d W (s) . (17.24)$

The functions μ and σ are called the drift coefficient and the diffusion coefficient, respectively.

The problem (17.23) (or (17.24)) can be solved analytically or numerically. One approach is to represent the process X as an explicit function of the underlying Brownian motion W :

$X (t) = f (t, \{W (s) : 0 \leq s \leq t\}) . \quad (17.25)$

Such an explicit representation is called a strong solution to (17.23). As a result of (17.25), a sample path of the X-process is obtained by transforming a Brownian trajectory. Another approach is to find the transition PDF for X by solving the Kolmogorov equation. The strong solution and/or the transition density can be used to generate a realization of the diffusion X from its exact finite-dimensional distribution. As usual, our goal is to generate a path skeleton for an arbitrary sequence of time points. Alternatively, being unable to analytically solve (17.23), one can apply a numerical scheme to find an approximate realization of the diffusion process. The Euler scheme, which is the simplest and most popular simulation method, is considered in Section 17.4.4.

17.4.3.1 The Stochastic Calculus Approach

Most of the SDEs for which we can find a strong solution are SDEs with linear (w.r.t. the space variable) drift and diffusion coefficients. Let us consider some examples of such SDEs and derive transition probability distributions of their solutions.

Geometric Brownian motion is a solution to

$d X (t) = μ X (t) d t + σ X (t) d W (t), X (0) = X_{0} .$

Applying the Itô formula gives $X (t) = X_{0} e^{(μ - σ^{2} / 2) t + σ W (t)}$ . To model the transition X(s) → X(t) for 0 ≤ s < t, we use the following representation:

$X (t) = X (s) e^{(μ - σ^{2} / 2) (t - s) + σ (W (t) - W (s))} \overset{d}{=} X (s) e^{(μ - σ^{2} / 2) (t - s) + σ \sqrt{t - s} Z},$

where Z ~ Norm(0, 1) is independent of X(s).
The solution to

$d X (t) = μ (t) d t + σ (t) d W (t), X (0) = X_{0},$

is a Gaussian process given by $X (t) = X_{0} + \int_{0}^{t} μ (u) \, d u + \int_{0}^{t} σ (u) \, d W (u)$ . By representing X(t) in terms of X(s) for 0 ≤ s < t and using the fact that the Itô integral $\int_{s}^{t} σ (u) \, d W (u)$ is normally distributed with mean 0 and variance $\int_{s}^{t} σ^{2} (u) \, d u$ , we obtain that the increments of X are normal:

$X (t) - X (s) \sim Norm (\int_{s}^{t} μ (u) \, d u, \ \int_{s}^{t} σ^{2} (u) \, d u) .$

Since the increments of X are independent, we are able to model the transition X(s) → X(t) for all 0 ≤ s < t.
The Ornstein–Uhlenbeck process is a solution to the SDE with a constant diffusion coefficient and linear drift:

$d X (t) = α (b - X (t)) d t + σ d W (t), X (0) = X_{0} .$

The strong solution is

$X (t) = e^{- α t} X_{0} + α b \int_{0}^{t} e^{- α (t - u)} d u + σ \int_{0}^{t} e^{- α (t - u)} d W (u) .$

The conditional distribution of X(t) given X(s) for 0 ≤ s < t is normal:

$(X (t) | X (s)) \sim Norm (e^{- α (t - s)} X (s) + α b \int_{s}^{t} e^{- α (t - u)} \, d u, \ σ^{2} \int_{s}^{t} e^{- 2 α (t - u)} \, d u) .$
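
For α > 0 the two integrals in the conditional distribution evaluate in closed form: with Δ = t − s, the mean is $e^{- α Δ} X (s) + b (1 - e^{- α Δ})$ and the variance is $\frac{σ^{2}}{2 α} (1 - e^{- 2 α Δ})$. The sketch below uses these expressions to sample an Ornstein–Uhlenbeck skeleton exactly. The function name, parameter values, time grid, and seed are assumptions chosen for illustration.

```python
import math
import random

def simulate_ou(x0, alpha, b, sigma, times, rng):
    """Exact skeleton of dX = alpha*(b - X) dt + sigma dW at times[0] = 0 < times[1] < ..."""
    path = [x0]
    for j in range(1, len(times)):
        dt = times[j] - times[j - 1]
        decay = math.exp(-alpha * dt)
        mean = decay * path[-1] + b * (1.0 - decay)
        var = sigma ** 2 * (1.0 - decay ** 2) / (2.0 * alpha)  # requires alpha > 0
        path.append(rng.gauss(mean, math.sqrt(var)))
    return path

rng = random.Random(1930)
times = [0.0, 0.5, 1.0, 2.0]
ends = [simulate_ou(2.0, 1.5, 0.5, 0.8, times, rng)[-1] for _ in range(20000)]
end_mean = sum(ends) / len(ends)
end_var = sum((e - end_mean) ** 2 for e in ends) / len(ends)
# Exact values at T = 2 for comparison.
exact_mean = 0.5 + 1.5 * math.exp(-3.0)
exact_var = 0.64 * (1.0 - math.exp(-6.0)) / 3.0
```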

17.4.3.2 The PDF Approach

Consider a Markov stochastic process $\{X (t)\}_{t \geq 0}$ starting at $X_{0}$ so that its transition PDF p given by p(s, t; y, x) dx = ℙ(X(t) ∊ dx | X(s) = y), 0 ≤ s < t, x, y ∊ ℝ, is known in closed form. Moreover, suppose that it is feasible to sample from the PDF p(s, t; y, ·). The general simulation algorithm is as follows.

Algorithm 17.11 Sequential Simulation of a Stochastic Process from Its Transition PDF.

input: the PDF p, $X_{0}$, 0 = t0 < t1 < t2 < · · · < tm

set X(0) = $X_{0}$

for j from 1 to m do

generate $X (t_{j}) \leftarrow p (t_{j - 1}, t_{j}; X (t_{j - 1}), \cdot)$

end for

return $\{X (t_{j})\}_{0 \leq j \leq m}$

For example, due to the time- and space-homogeneity property, the Brownian motion transition PDF p(s, t; y, x) reduces to a two-variable function p0 given by

$p_{0} (t; x) = \frac{1}{\sqrt{2 π t}} \exp (- \frac{x^{2}}{2 t})$

as follows: p(s, t; y, x) = p0(t − s; x − y). The latter solves the PDE $\partial_{t} p_{0} = \frac{1}{2} \partial_{x x} p_{0}$ . To generate (W(tj)|{W(tj−1) = y}) ~ p(tj−1, tj; y, x) = p0(tj − tj−1; x − y), we first sample Z ~ Norm(0, 1) and then set $W (t_{j}) \leftarrow y + \sqrt{t_{j} - t_{j - 1}} Z$ .

To sample from a transition PDF, we can use a whole arsenal of sampling techniques such as the inversion method, composition approach, and acceptance-rejection technique. The main example considered in the current section is the exact simulation of a family of Bessel diffusions. We start with a squared Bessel (SQB) process and show that its transition PDF reduces to a randomized gamma distribution. Moreover, as demonstrated in Chapter 16, other processes such as the Cox–Ingersoll–Ross (CIR) process and the constant elasticity of variance (CEV) diffusion model can be obtained from the SQB process by means of a scale and time transformation and change of variable.

Let us consider a λ0-dimensional squared Bessel (SQB) process {X(t) ∊ ℝ+}t≥0 obeying the stochastic differential equation (SDE)

$d X (t) = λ_{0} d t + ν \sqrt{X (t)} d W (t), (17.26)$

with constant parameters λ0 and ν > 0. For simplicity of presentation, we assume here that ν = 2. The process X is a time-homogeneous Markov process and its transition PDF is given by (16.15).

The transition PDF (16.15) looks very similar to the PDF (17.8) of the noncentral chi-square distribution. In fact, for the case when μ ≥ 0 or μ ∊ (−1, 0) and x = 0 is a reflecting boundary (so there is no absorption at the origin), the transition PDF p(t; y, x) of the SQB process reduces to the PDF f(x; κ, λ) in (17.8) as follows:

$p (t; y, x) = \frac{1}{t} f (\frac{x}{t}; κ = 2 μ + 2, λ = \frac{y}{t}) .$

Following the composition method, the noncentral chi-square PDF (17.8) can be represented as a randomized gamma distribution (17.10). The value of X(t) conditional on X(s) = y has the randomized gamma distribution

$\mathrm{Gamma} (Y + μ + 1, 2 (t - s)), \text{ where } Y \sim \mathrm{Pois} (\frac{y}{2 (t - s)}), (17.27)$

for 0 ≤ s < t, y > 0. Therefore, we have the following sampling algorithm for the SQB process without absorption.

Algorithm 17.12 Sampling of an SQB Process without Absorption (Variant 1).

input X(0) = X0 > 0, 0 = t0 < t1 < · · · < tm, μ > −1

for j = 1 → m do

generate $Y_{j} \leftarrow \mathrm{Pois} (\frac{X (t_{j - 1})}{2 (t_{j} - t_{j - 1})})$

generate X(tj) ← Gamma(Yj + μ + 1, 2(tj − tj−1))

end for

return {X(tj)}0≤j≤m
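A minimal Python sketch of Algorithm 17.12 follows. It assumes ν = 2 as in the text and treats the second argument of Gamma as a scale parameter, which is what `random.gammavariate` expects. The helper `sample_poisson` is a hypothetical stand-in for a library Poisson generator (the standard library has none); it counts unit-rate exponential arrivals and is only practical for moderate means.

```python
import random

def sample_poisson(mean, rng):
    """Pois(mean) draw: count unit-rate exponential arrivals in [0, mean]."""
    count, total = 0, rng.expovariate(1.0)
    while total <= mean:
        count += 1
        total += rng.expovariate(1.0)
    return count

def sqb_path(x0, times, mu, rng=None):
    """Exact SQB sampling without absorption (requires mu > -1, x0 > 0)."""
    rng = rng or random.Random()
    path = [x0]
    for t_prev, t_next in zip(times, times[1:]):
        dt = t_next - t_prev
        y = sample_poisson(path[-1] / (2.0 * dt), rng)         # Y_j ~ Pois(X/(2 dt))
        path.append(rng.gammavariate(y + mu + 1.0, 2.0 * dt))  # Gamma(shape, scale)
    return path

# Illustrative (hypothetical) parameters
path = sqb_path(1.0, [0.0, 0.5, 1.0], mu=0.5, rng=random.Random(3))
```

A quick sanity check on such a sampler is the mean: from the mixture representation, E[X(t) | X(s) = y] = 2(t − s)(y/(2(t − s)) + μ + 1) = y + (2μ + 2)(t − s).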

Now consider the case when μ < 0 and x = 0 is an exit or a killing boundary. The stochastic process X admits absorption at the origin. It can be shown that the transition PDF p given by (16.15) with $\tilde{μ} = | μ |, μ < 0$ does not integrate to one. Let us define the probability Psv of surviving before time t and the probability Pab of absorption before time t for the SQB process starting at x0 > 0:

$P_{sv} (x_{0}; t) = \int_{0}^{\infty} p (t; x_{0}, x) d x > 0 \text{ and } P_{ab} (x_{0}; t) = 1 - P_{sv} (x_{0}; t) > 0.$

The probabilities of surviving and absorption of the SQB process before time t are

$P_{sv} (x; t) = ℙ \{τ_{0} > t\} = \frac{γ (| μ |, \frac{x}{2 t})}{Γ (| μ |)} \text{ and } P_{ab} (x; t) = ℙ \{τ_{0} \leq t\} = \frac{Γ (| μ |, \frac{x}{2 t})}{Γ (| μ |)},$

respectively, where τ0 is the first hitting time (FHT) at zero. Here,

$γ (a, x) = \int_{0}^{x} t^{a - 1} e^{- t} d t \text{ and } Γ (a, x) := Γ (a) - γ (a, x)$

are, respectively, the lower and upper incomplete gamma functions.
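The survival and absorption probabilities above only require the regularized incomplete gamma function γ(a, x)/Γ(a). The sketch below evaluates it with the standard power series γ(a, x)/Γ(a) = x^a e^{−x} Σ_{n≥0} x^n/Γ(a + n + 1). The function names are hypothetical, and the fixed number of series terms is an assumption that suits moderate x only; a library routine would be preferable in production code.

```python
import math

def reg_lower_gamma(a, x, terms=200):
    """gamma(a, x) / Gamma(a) by the power series; intended for moderate x."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / math.gamma(a + 1.0)   # n = 0 term: 1 / Gamma(a + 1)
    total = term
    for n in range(1, terms):
        term *= x / (a + n)            # x^n / Gamma(a + n + 1), built recursively
        total += term
    return math.exp(a * math.log(x) - x) * total

def survival_and_absorption(x, t, mu):
    """Return (P_sv, P_ab) for an SQB process with mu < 0 started at x > 0."""
    p_sv = reg_lower_gamma(abs(mu), x / (2.0 * t))
    return p_sv, 1.0 - p_sv

# Illustrative (hypothetical) values: x = 1, t = 0.5, mu = -0.5
p_sv, p_ab = survival_and_absorption(1.0, 0.5, -0.5)
```

Two closed-form cases are convenient for checking such a routine: for a = 1 the ratio equals 1 − e^{−x}, and for a = 1/2 it equals erf(√x).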

Observe that the actual transition probability distribution is then a mixture of continuous and discrete probability distributions with the following generalized density:

$f (X (s) \to X (t)) = P_{sv} (X (s); t - s) \cdot (\frac{p (t - s; X (s), X (t))}{P_{sv} (X (s); t - s)}) + P_{ab} (X (s); t - s) \cdot δ (X (t)),$

for 0 ≤ s < t. Here, δ denotes the Dirac delta function, which can be viewed as a generalized density of the discrete distribution with a single mass point at zero. With probability Pab, the process is absorbed at zero. With the complementary probability Psv, the process survives. The normalized transition PDF of the SQB process conditioned on the survival of the process before time t is

$\frac{p (t; x_{0}, x)}{P_{sv} (x_{0}; t)} = \frac{Γ (| μ |)}{γ (| μ |, \frac{x_{0}}{2 t})} {(\frac{x}{x_{0}})}^{\frac{μ}{2}} \frac{e^{- (x + x_{0}) / (2 t)}}{2 t} I_{| μ |} (\frac{\sqrt{x x_{0}}}{t}) . (17.28)$

As is seen, the function on the right-hand side of (17.28) reduces to the form of (17.15) with ν = |μ|, λ = x0/(2t), and θ = 2t. Thus, the above normalized transition PDF follows the randomized gamma distribution of the third kind, Gamma(Y + 1, 2t), where Y ~ IΓ(|μ|, x0/(2t)). As a result, we obtain a sampling algorithm (Algorithm 17.13) that returns a sample path [X(t1), X(t2), . . . , X(tm)] and an approximation ${\tilde{T}}_{0} \in \{t_{1}, ..., t_{m}, \infty\}$ of the FHT τ0.

17.4.4 Diffusion Processes: Approximation Schemes

Approximation schemes can be used whenever an exact method is not available or not feasible. In particular, approximation methods are efficient for the numerical solution of multidimensional stochastic differential equations. Computational schemes for SDEs are based on the same ideas as the numerical methods for solving deterministic differential equations. A discrete-time approximation solution is calculated on a time grid with small time steps. To illustrate the main idea, let us first consider a Cauchy problem for a system of ordinary differential equations written in a vector form:

$\frac{dx (t)}{d t} = a (t, x (t)), t \in [0, T]; x (0) = x_{0} .$

The solution admits the following integral representation on a small time interval [t, t + h]:

$x (t + h) = x (t) + \int_{t}^{t + h} a (s, x (s)) d s . (17.29)$

By applying a rectangle quadrature rule to the integral in (17.29), we derive the so-called Euler approximation scheme:

$x (t + h) \approx x (t) + a (t, x (t)) h .$

Algorithm 17.13 Sampling of an SQB Process with Absorption (Variant 2).

input X0 > 0, 0 = t0 < t1 < · · · < tm, μ < 0

set X(0) ← X0, ${\tilde{T}}_{0} \leftarrow \infty$

for j = 1 → m do

if ${\tilde{T}}_{0} = \infty$ then

set $p_{a} \leftarrow Γ (| μ |, \frac{X (t_{j - 1})}{2 (t_{j} - t_{j - 1})}) / Γ (| μ |)$

generate Uj ← Unif(0, 1)

if Uj < pa then ${\tilde{T}}_{0} \leftarrow t_{j}$

end if

if $t_{j} < {\tilde{T}}_{0}$ then

generate $Y_{j} \leftarrow I Γ (| μ |, \frac{X (t_{j - 1})}{2 (t_{j} - t_{j - 1})})$

generate $X (t_{j}) \leftarrow G a m m a (Y_{j} + 1, 2 (t_{j} - t_{j - 1}))$

else

set X(tj) ← 0

end if

end for

return {X(tj)}0≤j≤m and ${\tilde{T}}_{0}$

A discrete-time numerical solution ${X_{k}^{h}}_{k = 0, 1, 2, ...}$ on the time grid {tk := kh}k=0,1,2,... is given by

$X_{0}^{h} = X_{0}, X_{k + 1}^{h} = X_{k}^{h} + a (t_{k}, X_{k}^{h}) h, k = 0, 1, 2, ...$

The numerical solution approximates the genuine solution: $X_{k}^{h} \approx X (t_{k}), k \geq 1$ ; it converges to the exact one as the time step h goes to zero. Applying a similar approach to stochastic differential equations, we can construct a discrete-time approximate realization (a skeleton) of a sample path. Since we deal with stochastic processes, there are different types of convergence of the numerical solution to the exact one.

17.4.4.1 Types of Convergence

Consider a continuous-time d-dimensional stochastic process {X(t)}t∊[0,T] and its discrete-time approximation ${X_{k}^{h}}_{0 \leq k \leq m}$ defined on a time grid 0 = t0 < t1 < · · · < tm = T with the maximum step size h = max{tk − tk−1 : k = 1, 2, . . . , m}. The simplest time grid is an equally spaced grid with tk = kh with $h = \frac{T}{m}$ and k = 0, 1, . . . , m. Let us analyze the convergence of the approximate solution Xh to the genuine solution X, as h → 0, in the Euclidean vector norm ${|| x ||}_{2} = \sqrt{x_{1}^{2} + ... + x_{d}^{2}}$ , where x = [x1, x2, . . . , xd]⊤ ∊ ℝd. We say that the approximation Xh has:

the strong order of convergence α > 0 if there exists C > 0 so that

$E [|| X_{k}^{h} - X (t_{k}) ||] \leq C h^{α}, \forall k = 0, 1, 2, ..., m;$
the mean-square order of convergence β > 0 if there exists C > 0 so that

${(E [{|| X_{k}^{h} - X (t_{k}) ||}^{2}])}^{1 / 2} \leq C h^{β}, \forall k = 0, 1, 2, ..., m;$
the weak order of convergence γ > 0 if for any real-valued function f selected from some large class of functions (usually, f is a sufficiently smooth function) $\exists C_{f} > 0$ so that

$| E [f (X_{m}^{h}) - f (X (T))] | \leq C_{f} h^{γ} .$

17.4.4.2 The Euler Scheme

Consider a multivariate SDE

$d X (t) = a (t, X (t)) d t + b (t, X (t)) dW (t), (17.30)$

where W(t) is a d-dimensional standard Brownian motion with independent components, and a: ℝ × ℝd → ℝd and b: ℝ × ℝd → ℝd×d are, respectively, the drift and diffusion coefficient functions. The SDE (17.30) can be written in the integral form on [t, t + h], t ≥ 0, h > 0:

$X (t + h) = X (t) + \int_{t}^{t + h} a (s, X (s)) d s + \int_{t}^{t + h} b (s, X (s)) d W (s) . (17.31)$

Application of the rectangular approximation formula to each integral in (17.31) gives

$X (t + h) \approx X (t) + a (t, X (t)) h + b (t, X (t)) (W (t + h) - W (t)) . (17.32)$

So, the Euler discrete-time approximation on a time grid 0 = t0 < t1 < · · · < tm = T is given by

$X_{0}^{h} = X_{0}, X_{k + 1}^{h} = X_{k}^{h} + a (t_{k}, X_{k}^{h}) (t_{k + 1} - t_{k}) + b (t_{k}, X_{k}^{h}) \sqrt{t_{k + 1} - t_{k}} Z_{k}, 0 \leq k \leq m - 1, (17.33)$

where {Zk}k≥0 is a sequence of i.i.d. multivariate Normd(0, I)-distributed vectors.
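The Euler scheme can be sketched for a scalar SDE (d = 1) as follows. This is an illustrative sketch rather than a reference implementation: the function name `euler_path` is hypothetical, the step on each subinterval is taken as t_{k+1} − t_k in both the drift and the diffusion term, and the Ornstein-Uhlenbeck parameters in the usage example are made up.

```python
import math
import random

def euler_path(a, b, x0, times, rng=None):
    """Euler approximation of dX = a(t, X) dt + b(t, X) dW on a time grid."""
    rng = rng or random.Random()
    xs = [x0]
    for t_k, t_next in zip(times, times[1:]):
        h = t_next - t_k                  # step size on [t_k, t_{k+1}]
        z = rng.gauss(0.0, 1.0)           # Z_k ~ Norm(0, 1)
        x = xs[-1]
        xs.append(x + a(t_k, x) * h + b(t_k, x) * math.sqrt(h) * z)
    return xs

# Hypothetical Ornstein-Uhlenbeck example: dX = alpha (b_bar - X) dt + sigma dW
alpha, b_bar, sigma = 2.0, 1.0, 0.3
grid = [k / 100 for k in range(101)]
ou_path = euler_path(lambda t, x: alpha * (b_bar - x),
                     lambda t, x: sigma,
                     0.0, grid, rng=random.Random(42))
```

Setting the diffusion coefficient to zero recovers the deterministic Euler scheme for the ODE, which gives a simple way to test the code.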

Theorem 17.12

The Euler scheme has strong order $\frac{1}{2}$ and weak order 1. Moreover, the weak error admits the following expansion:

$E [f (X_{m}^{h}) - f (X (T))] = C_{f} h + O (h^{2}) .$

For a proof, see Kloeden and Platen (2011).

17.4.4.3 Extrapolation

The process of extrapolation uses two approximations computed from the same formula but with different step sizes to obtain a higher-order approximation. The Richardson extrapolation method allows us to achieve second-order accuracy for a first-order scheme. For the Euler method with a constant time step h, we have

$E [f (X_{m}^{h})] = E [f (X (T))] + C_{f} h + O (h^{2}) .$

for some constant Cf that depends on f. Suppose that the number of steps m is even. Apply the Euler method with steps $h = \frac{T}{m}$ and $2 h = \frac{T}{m / 2}$ to obtain approximations $X_{m}^{h}$ and $X_{m / 2}^{2 h}$ of the time-T realization X(T), respectively. By combining

$E [f (X_{m}^{h})] = E [f (X (T))] + C_{f} h + O (h^{2}) \text{ and } E [f (X_{m / 2}^{2 h})] = E [f (X (T))] + C_{f} 2 h + O (h^{2}),$

we can eliminate the leading term of the error expansion:

$E [2 f (X_{m}^{h}) - f (X_{m / 2}^{2 h})] = E [f (X (T))] + O (h^{2}) .$

The error of the combined estimate is of (weak) order 2. The variance of the combined estimator is

$Var (2 f (X_{m}^{h}) - f (X_{m / 2}^{2 h})) = 4 Var (f (X_{m}^{h})) + Var (f (X_{m / 2}^{2 h})) - 4 Cov (f (X_{m}^{h}), f (X_{m / 2}^{2 h})) .$

By making $f (X_{m}^{h})$ and $f (X_{m / 2}^{2 h})$ positively correlated, we can reduce the variance. This can be achieved by using consistent Brownian increments in simulating paths of Xh and X2h. Suppose that $(X_{m}^{h})$ is constructed from m i.i.d. normal vectors

$\sqrt{h} Z_{0}, ..., \sqrt{h} Z_{m - 1} \sim \mathrm{Norm}_{d} (0, h I) .$

The same normal increments can be used for sampling X2h. For example, to construct $X_{m / 2}^{2 h}$ we use the normally distributed vectors

$\sqrt{h} (Z_{0} + Z_{1}), ..., \sqrt{h} (Z_{m - 2} + Z_{m - 1}) \sim \mathrm{Norm}_{d} (0, 2 h I) .$

Here, the property that a sum of two independent normal random variables is again normally distributed is applied.
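The coupling of fine and coarse paths can be sketched as follows. The example is an assumption made for illustration: it uses a scalar geometric Brownian motion dX = μX dt + σX dW with f(x) = x, for which E[X(T)] = X0 e^{μT} is known, so the bias of both estimators can be inspected. The function names and parameter values are hypothetical.

```python
import math
import random

def coupled_euler_gbm(x0, mu, sigma, T, m, rng):
    """Return (X_m^h, X_{m/2}^{2h}) driven by the same normal draws (m even)."""
    h = T / m
    x_fine, x_coarse = x0, x0
    for _ in range(m // 2):
        z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # two fine steps of size h with increments sqrt(h) Z0 and sqrt(h) Z1
        x_fine += mu * x_fine * h + sigma * x_fine * math.sqrt(h) * z0
        x_fine += mu * x_fine * h + sigma * x_fine * math.sqrt(h) * z1
        # one coarse step of size 2h with the summed increment sqrt(h) (Z0 + Z1)
        x_coarse += mu * x_coarse * 2.0 * h + sigma * x_coarse * math.sqrt(h) * (z0 + z1)
    return x_fine, x_coarse

def richardson_gbm_mean(n, x0, mu, sigma, T, m, rng):
    """Return (plain Euler estimate, extrapolated estimate) of E[X(T)]."""
    plain = extrap = 0.0
    for _ in range(n):
        x_fine, x_coarse = coupled_euler_gbm(x0, mu, sigma, T, m, rng)
        plain += x_fine
        extrap += 2.0 * x_fine - x_coarse   # 2 f(X^h) - f(X^{2h}) with f(x) = x
    return plain / n, extrap / n

plain_est, extrap_est = richardson_gbm_mean(20000, 1.0, 1.0, 0.2, 1.0, 8, random.Random(3))
```

With these (made-up) parameters the plain Euler mean is biased toward X0(1 + μh)^m, and the extrapolated estimate is expected to land noticeably closer to X0 e^{μT}.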

17.4.4.4 Error Analysis

Suppose that we wish to approximate some quantity Q by using a biased estimator Yh where h is a discretization parameter approaching zero. For example,

$Q : = E [f (X (T))] \approx E [f (X_{m}^{h})],$

where Xh is the Euler approximation of a diffusion X and $h = \frac{T}{m}$ is a time step size. Introduce the approximation error ε(h) = Q − E[Yh]. Suppose that

$ε (h) \approx C_{1} h^{β} (17.34)$

holds for some positive constants C1 and β. Taking the logarithm of both parts of (17.34) gives

$\log ε (h) \approx \log C_{1} + β \log h .$

As is seen from the above equation, we can use the linear regression method to calculate C1 and β. First, we compute the sample estimate of Q for a decreasing sequence of values of h:

$Q \approx {\bar{y}}_{n}^{h_{k}} = \frac{1}{n} \sum_{j = 1}^{n} y_{j}^{h_{k}}, k = 1, 2, ...,$

where $y_{j}^{h_{k}}$ are i.i.d. samples of $Y^{h_{k}}$ and h1 > h2 > h3 > . . .. For example, we set $h_{k} = 2^{- k}$. Assuming that the number of draws, n, is sufficiently large so that the statistical error is negligible in comparison with the approximation error, we obtain

$\bar{ε} (h_{k}) := Q - {\bar{y}}_{n}^{h_{k}} \approx Q - E [Y^{h_{k}}] \approx C_{1} h_{k}^{β} \Rightarrow \log \bar{ε} (h_{k}) \approx \log C_{1} + β \log h_{k} .$

Now, we plot $\log \bar{ε} (h_{k})$ versus log hk for all hk and then perform a linear regression. As a result, the slope of the regression line gives us the order of approximation β.
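The regression step can be sketched without any Monte Carlo noise by choosing a case where the Euler bias is known exactly. For the (assumed, illustrative) geometric Brownian motion dX = μX dt + σX dW with f(x) = x, the Euler mean is X0(1 + μh)^m and the exact mean is X0 e^{μT}, so ε(h) can be computed directly and the fitted slope should be close to the weak order 1. The function name `fit_order` and the parameter values are hypothetical.

```python
import math

def fit_order(hs, errors):
    """Least-squares fit of log(error) = log(C1) + beta * log(h); returns (beta, C1)."""
    xs = [math.log(h) for h in hs]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sxy / sxx
    return beta, math.exp(y_bar - beta * x_bar)

x0, mu, T = 1.0, 0.5, 1.0
hs = [2.0 ** (-k) for k in range(3, 9)]                 # h_k = 2^{-k}
errors = [x0 * math.exp(mu * T) - x0 * (1.0 + mu * h) ** round(T / h) for h in hs]
beta, c1 = fit_order(hs, errors)
```

In a real experiment the errors would be sample estimates, so they must be positive for the logarithm to be defined and the number of draws n must be large enough for the statistical noise not to distort the slope.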

In fact, the error $\bar{ε} (h_{k}) = Q - {\bar{y}}_{n}^{h_{k}}$ includes two components: the approximation bias and the statistical (Monte Carlo) error. To optimize the method, it is helpful to separate these two errors:

$Q - {\bar{y}}_{n}^{h} = \underbrace{Q - E [Y^{h}]}_{\approx C_{1} h^{β}} + E [Y^{h}] - {\bar{y}}_{n}^{h} .$

Define the mean-square (statistical) error as

$MSE (h, n) = E [{(Q - {\bar{Y}}_{n}^{h})}^{2}],$

where ${\bar{Y}}_{n}^{h} = \frac{1}{n} \sum_{j = 1}^{n} Y_{j}^{h}$ is the sample estimator. The value of MSE(h, n) is

$\begin{matrix} E [{(Q - {\bar{Y}}_{n}^{h})}^{2}] = & E [{((Q - E [Y^{h}]) + (E [Y^{h}] - {\bar{Y}}_{n}^{h}))}^{2}] \\ = & E [{(Q - E [Y^{h}])}^{2}] + 2 E [(Q - E [Y^{h}]) (E [Y^{h}] - {\bar{Y}}_{n}^{h})] + E [{(E [Y^{h}] - {\bar{Y}}_{n}^{h})}^{2}] \\ \approx & {(C_{1} h^{β})}^{2} + 2 (Q - E [Y^{h}]) E [E [Y^{h}] - {\bar{Y}}_{n}^{h}] + Var ({\bar{Y}}_{n}^{h}) \\ = & {(C_{1} h^{β})}^{2} + Var ({\bar{Y}}_{n}^{h}) = C_{1}^{2} h^{2 β} + \frac{1}{n} Var (Y^{h}) \approx C_{1}^{2} h^{2 β} + \frac{1}{n} Var (Y^{0 +}) \\ = & C_{1}^{2} h^{2 β} + \frac{C_{2}}{n} . \end{matrix}$

Here the constant C2 = Var(Y0+) is defined as the limiting value of the variance Var(Yh), as $h ↘ 0$ . To find the optimal values of the sample volume n and the discretization parameter h, we will minimize the computational cost for a given level of error: MSE = ε². Clearly, the computational cost (i.e., the number of operations) is directly proportional to m and n. The number of steps $m = \frac{T}{h}$ is inversely proportional to h. Hence, the computational cost (or the runtime) is given by C3n/h, where C3 is another positive constant. Let us minimize the computational cost under the constraint that the mean-square error is fixed and equal to ε². The optimization problem takes the following form:

$\begin{cases} \frac{C_{3} n}{h} \to \min_{n, h}, \\ MSE (h, n) = C_{1}^{2} h^{2 β} + \frac{C_{2}}{n} = ε^{2} . \end{cases} (17.35)$

The problem (17.35) is easy to solve. As a result, we obtain that the computational cost is proportional to $ε^{- 2 - 1 / β}$. Alternatively, we can solve the problem of minimizing MSE subject to a fixed computational cost s (see Exercise 17.23).
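One way to see where the exponent comes from is the following sketch, which eliminates n through the constraint and minimizes over h alone; the starred symbols h∗ and n∗ are introduced here only for illustration. Solving the constraint for n gives

$n = \frac{C_{2}}{ε^{2} - C_{1}^{2} h^{2 β}}, \text{ so that } \frac{C_{3} n}{h} = \frac{C_{2} C_{3}}{h (ε^{2} - C_{1}^{2} h^{2 β})} .$

Minimizing the cost is therefore equivalent to maximizing $φ (h) = h (ε^{2} - C_{1}^{2} h^{2 β})$. Setting $φ' (h) = ε^{2} - (2 β + 1) C_{1}^{2} h^{2 β} = 0$ yields

$h_{*} = {(\frac{ε}{C_{1} \sqrt{2 β + 1}})}^{1 / β}, n_{*} = \frac{(2 β + 1) C_{2}}{2 β ε^{2}},$

and hence

$\frac{C_{3} n_{*}}{h_{*}} = \frac{{(2 β + 1)}^{1 + 1 / (2 β)} C_{1}^{1 / β} C_{2} C_{3}}{2 β} ε^{- 2 - 1 / β} .$

In particular, at the optimum the squared bias accounts for the fraction 1/(2β + 1) of ε² and the variance for the remaining 2β/(2β + 1).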

17.4.5 Simulation of Processes with Jumps

17.4.5.1 Poisson Processes

Recall that a Poisson process is a continuous-time stochastic process {Nλ(t)}t≥0 with independent, stationary, Poisson distributed increments that starts at zero. In other words,

(a) Nλ(0) = 0;
(b) for all m and 0 ≤ t0 ≤ t1 ≤ · · · ≤ tm, the increments Nλ(tk) − Nλ(tk−1), 1 ≤ k ≤ m, are independent random variables;
(c) Nλ(t) − Nλ(s) ~ Pois(λ(t − s)) for 0 ≤ s < t.

The parameter λ is called the intensity of the process.

Every realization of a Poisson process is a step function that starts at the origin. It stays at each level k ≥ 0 for a random time period and then jumps to the next level k + 1. The occurrence time Tk is the time when the process jumps from k − 1 to k for k ≥ 1. Set τ1 = T1 and τk = Tk − Tk−1 for k ≥ 2. These variables {τk}k≥1 are called durations. We can express {Tk}k≥1 in terms of {τk}k≥1 as

$T_{m} = \sum_{k = 1}^{m} τ_{k} .$

Thus, a Poisson process can be defined via durations as follows:

$N_{λ} (t) = \sup \{m : \sum_{k = 1}^{m} τ_{k} \leq t\} . (17.36)$

A realization of a Poisson process can be generated from the sample values of the durations. The probability distribution of the random variables {τk}k≥1 is given by the following result.

Proposition 17.13.

Consider a Poisson process with occurrence times Tk, k ≥ 1, and durations τ1 = T1, τk = Tk − Tk−1, k ≥ 2. Then,

(a) τk, k ≥ 1, are jointly independent, Exp(λ)-distributed random variables;
(b) Tk ~ Gamma(k, 1/λ), k ≥ 1, i.e., Tk has the gamma distribution with shape k and scale 1/λ.

For a complete proof of Propositions 17.13 and 17.14, see Gut (2009).

By using Proposition 17.13 and Equation 17.36, we come up with the following sampling algorithm of the Poisson process on [0, T].

Algorithm 17.14 Sampling a Poisson Process (Variant 1).

Input: λ > 0, T > 0.

(1) Simulate i.i.d. τ1, τ2, . . . ~ Exp(λ) until their running sum exceeds T; let m = sup{k : τ1 + · · · + τk ≤ T}.
(2) Set $T_{k} = \sum_{j = 1}^{k} τ_{j}$ for k = 1, 2, . . . , m.
(3) Define $N_{λ} (t) = \sum_{k = 1}^{m} I_{\{T_{k} \leq t\}}$ for 0 ≤ t ≤ T.
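Algorithm 17.14 translates almost directly into code. The sketch below uses hypothetical function names and relies on `random.expovariate`, whose argument is the rate λ (so the mean duration is 1/λ).

```python
import random

def poisson_occurrence_times(lam, T, rng=None):
    """Occurrence times T_1 < T_2 < ... <= T built from Exp(lam) durations."""
    rng = rng or random.Random()
    times = []
    t = rng.expovariate(lam)          # T_1 = tau_1
    while t <= T:
        times.append(t)
        t += rng.expovariate(lam)     # T_{k+1} = T_k + tau_{k+1}
    return times

def count_at(times, t):
    """N_lambda(t): the number of occurrence times T_k <= t."""
    return sum(1 for s in times if s <= t)

# Illustrative (hypothetical) intensity and horizon
occ = poisson_occurrence_times(lam=3.0, T=2.0, rng=random.Random(2))
n_at_one = count_at(occ, 1.0)
```

The first duration that pushes the running sum past T is discarded, so the list contains exactly m = Nλ(T) occurrence times.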

Another algorithm for sampling a Poisson process is based on conditioning on the number of occurrences in a time interval. As it turns out, the joint distribution of the occurrence times conditional on the number of occurrences is the same as that of the order statistics of a sample from a uniform distribution.

Proposition 17.14.

The joint density of T1, T2, . . . , Tm conditional on Nλ(T) = m is

$f_{T_{1}, ..., T_{m} | N_{λ} (T) = m} (t_{1}, ..., t_{m}) = \begin{cases} \frac{m!}{T^{m}} & \text{for } 0 < t_{1} < \dots < t_{m} < T, \\ 0 & \text{otherwise.} \end{cases}$

In other words,

$(T_{1}, T_{2}, ..., T_{m} | N_{λ} (T) = m) \overset{d}{=} (U_{(1)}, U_{(2)}, ..., U_{(m)}),$

where U(1), U(2), . . . , U(m) are the order statistics defined by sorting m i.i.d. Unif(0, T)-distributed random variables in increasing order.

Algorithms 17.14 and 17.15 allow us to sample a Poisson path on [0, T]. If we want to continue sampling on (T, T + T ′], T ′ > 0, then the sample path on [0, T] can be reused thanks to the following property.

Proposition 17.15.

If {N(t)}t≥0 is a Poisson process, then so are

{N(t + s) − N(s)}t≥0 for every fixed s > 0;
{N(t+Tk)−N(Tk)}t≥0 for every fixed k ≥ 1, where Tk is the time of the kth occurrence of the original Poisson process N.

This result, along with the property of independence of Poisson increments, allows us to simulate a Poisson process individually on disjoint intervals. Suppose we have generated the process Nλ on [0, T]. To continue the sample path on (T, T + T ′], we first generate another Poisson process ${{\tilde{N}}_{λ} (t)}_{t \in [0, T^{'}]}$ independently of {Nλ(t)}t∊[0,T]. Second, we set $N_{λ} (t) = N_{λ} (T) + {\tilde{N}}_{λ} (t - T)$ for all t ∊ (T, T + T′].

The jumps of a Poisson process all have size one. A compound Poisson process is defined in such a way that the size of each of its jumps is random (see Section 16.2). Let Y1, Y2, . . . be i.i.d. random variables which are all independent of the Poisson process Nλ(t). A compound Poisson process with jump sizes Yk, k ≥ 1, is

$X (t) = \sum_{k = 1}^{N_{λ} (t)} Y_{k}, t \geq 0. (17.37)$

The simulation of a compound Poisson process on [0, T] is straightforward. Applying Algorithm 17.14 or 17.15 gives us a sequence of occurrence times Tk, 1 ≤ k ≤ m, where m = Nλ(T) ~ Pois(λT). After that we set

$X (t) = \sum_{k = 1}^{m} I_{\{T_{k} \leq t\}} Y_{k} .$

Algorithm 17.15 Sampling a Poisson Process (Variant 2).

Input: λ > 0, T > 0.

(1) Simulate Nλ(T) ~ Pois(λT). Set m = Nλ(T).
(2) Simulate i.i.d. U1, U2, . . . , Um ~ Unif(0, T). Sort them in increasing order:

$0 \leq U_{(1)} \leq U_{(2)} \leq ... \leq U_{(m)} .$
(3) Set Tk = U(k) for k = 1, 2, . . . , m.
(4) Define $N_{λ} (t) = \sum_{k = 1}^{m} I_{\{T_{k} \leq t\}}$ for 0 ≤ t ≤ T.
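Algorithm 17.15 and the compound Poisson sum (17.37) can be sketched together as follows. Two assumptions are made for illustration: the helper `sample_poisson` is a hypothetical stand-in for a library Poisson generator (it counts unit-rate exponential arrivals), and the normal jump sizes in the usage example are made up.

```python
import bisect
import random

def sample_poisson(mean, rng):
    """Pois(mean) draw: count unit-rate exponential arrivals in [0, mean]."""
    count, total = 0, rng.expovariate(1.0)
    while total <= mean:
        count += 1
        total += rng.expovariate(1.0)
    return count

def poisson_times_by_order_stats(lam, T, rng):
    """Draw m ~ Pois(lam T), then return m sorted Unif(0, T) occurrence times."""
    m = sample_poisson(lam * T, rng)
    return sorted(rng.uniform(0.0, T) for _ in range(m))

def compound_poisson_value(times, jumps, t):
    """X(t): the sum of the jump sizes Y_k over all occurrence times T_k <= t."""
    k = bisect.bisect_right(times, t)   # number of T_k <= t (times are sorted)
    return sum(jumps[:k])

rng = random.Random(21)
occ = poisson_times_by_order_stats(2.0, 3.0, rng)
jumps = [rng.gauss(1.0, 0.5) for _ in occ]      # hypothetical jump distribution
x_T = compound_poisson_value(occ, jumps, 3.0)
```

Since the jump sizes are independent of the Poisson process, they can be drawn after the occurrence times are fixed, one per occurrence.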

17.4.5.2 Subordinated Processes

The variance gamma (VG) process is a three-parameter generalization of the Brownian motion model for the dynamics of the logarithm of the stock price. It is obtained by evaluating a scaled Brownian motion with drift at a random time given by a gamma process (see Madan and Seneta (1990)). The gamma process G(t; μ, υ) with mean rate μ > 0 and variance rate υ > 0 is a random process with independent gamma increments over nonoverlapping intervals of time.

The VG process X(t; σ, υ, θ) is defined in terms of the scaled Brownian motion with drift, B(t) ≡ W(θ,σ)(t) = θt + σW (t), and the gamma process with unit mean rate, denoted G(t) ≡ G(t; 1, υ), as

$X (t; σ, υ, θ) : = B (G (t)), t \geq 0.$

The PDF of the VG process at time t can be expressed as a normal density function conditional on the realization of the gamma time change. The risk-neutral process for the asset price is given by

$S (t) : = S_{0} \exp ((r - q - ω) t + X (t; σ, υ, θ)), t \geq 0, (17.38)$

where r and q are, respectively, the risk-neutral interest rate and dividend yield. The parameter ω = −ln(1 − θυ − σ²υ/2)/υ is chosen so that the discounted asset price process $e^{- (r - q) t} S (t)$ is a true martingale.

Modelling of the variance gamma process relies on sampling from the normal and gamma probability distributions. One needs first to generate the gamma process and then to sample the Brownian motion conditional on the obtained values of the stochastic time process. Note that both the gamma process and Brownian motion are random processes with independent stationary increments. Thus, to sample a path of the variance gamma process at a discrete sequence 0 = t0 < t1 < t2 < · · · < tN, it is sufficient to generate the gamma increment G(ti) − G(ti−1) and then the Brownian increment B(gi) − B(gi−1) conditional on G(ti) = gi and G(ti−1) = gi−1 for all i = 1, 2, . . . , N. The sample values of G(ti) and X(ti) = B(G(ti)) can then be obtained by calculating cumulative sums of the respective increments. The increments of the gamma process G with mean rate one and variance rate υ and of the Brownian motion B with drift θ and volatility σ can be simulated as stated below:

G(t2) − G(t1) has the Gamma((t2 − t1)/υ, υ) distribution for any 0 ≤ t1 < t2;

B(g2) − B(g1) has the Norm(θ(g2 − g1), σ²(g2 − g1)) distribution for any 0 ≤ g1 < g2.

By repeating the above procedure N times and taking the cumulative sums of the increments, we can obtain the values of the variance gamma process at a discrete sequence of time moments. Algorithm 17.16 is used for sampling paths of the variance gamma process.

Algorithm 17.16 Simulation of the Gamma and Variance Gamma Processes.

input X0, 0 = t0 < t1 < t2 < · · · < tN , θ, σ, υ

G(0) ← 0, X(0) ← X0

for k from 1 to N do

ΔGk ~ Gamma((tk − tk−1)/υ, υ)

ΔXk ~ Norm(θ ΔGk, σ² ΔGk)

G(tk) ← G(tk−1) + ΔGk

X(tk) ← X(tk−1) + ΔXk

end for

return {G(tk)}1≤k≤N and {X(tk)}1≤k≤N
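A minimal Python sketch of this sampling procedure is given below. It follows the notation of the definition X(t; σ, υ, θ), that is, drift θ, volatility σ, and variance rate υ of the gamma time change; the function name `vg_path` and the parameter values are hypothetical. The gamma increment is drawn with `random.gammavariate(shape, scale)`.

```python
import math
import random

def vg_path(times, theta, sigma, upsilon, x0=0.0, rng=None):
    """Sample the gamma clock G and the VG process X = B(G) on a time grid."""
    rng = rng or random.Random()
    g, x = [0.0], [x0]
    for t_prev, t_next in zip(times, times[1:]):
        # Gamma increment with mean (t_next - t_prev) and variance upsilon * (t_next - t_prev)
        dg = rng.gammavariate((t_next - t_prev) / upsilon, upsilon)
        # Brownian increment over the random time dg: Norm(theta * dg, sigma^2 * dg)
        dx = rng.gauss(theta * dg, sigma * math.sqrt(dg))
        g.append(g[-1] + dg)
        x.append(x[-1] + dx)
    return g, x

# Illustrative (hypothetical) parameters
times = [0.0, 0.25, 0.5, 0.75, 1.0]
g_path, x_path = vg_path(times, theta=-0.1, sigma=0.2, upsilon=0.3, rng=random.Random(8))
```

Useful sanity checks are E[G(t)] = t and E[X(t)] = X0 + θt, both of which follow from the unit mean rate of the gamma clock.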

17.5 Variance Reduction Methods

Consider the evaluation of a mathematical expectation E[h(X)], where h is a real-valued function of a random variable X having a PDF f. We call H ≡ h(X) a direct Monte Carlo estimator as opposed to an estimator with some variance reduction techniques embedded. The direct Monte Carlo sample estimator of E[h(X)] is

${\bar{H}}_{n} = \frac{1}{n} \sum_{i = 1}^{n} h (X_{i}),$

where {Xi}i≥1 are i.i.d. random variables with the common PDF f. A direct sample estimate is the following average with statistically independent realizations x1, x2, . . . , xn of X:

${\bar{h}}_{n} = \frac{1}{n} \sum_{i = 1}^{n} h (x_{i}) .$

A large variance of the estimator h(X) results in slow convergence of the sample estimate to E[h(X)]. Any modification of the direct Monte Carlo method that results in a decrease of the variance is called a variance reduction method. One example of variance reduction techniques is the importance sampling method, which is discussed below. The goal of this section is to summarize various techniques used to improve the direct Monte Carlo method. Since a modification of the original estimator may result in an increase in computing time, we compare methods on the basis of their computational costs.

17.5.1 Numerical Integration by a Direct Monte Carlo Method

Consider a multidimensional integral $I (g) = \int_{ℝ^{d}} g (x) d x$ . If the number of dimensions d is small (say, d ≤ 3) then deterministic quadrature rules can be successfully applied to evaluate I(g). For larger dimensions, it is more beneficial to use stochastic methods. As is pointed out in the beginning of this chapter, to apply the Monte Carlo method (MCM), the quantity of interest, say Q, needs to be represented in the form of a mathematical expectation of a random variable called the estimator of Q. Select a d-variate PDF f such that f(x) ≠ 0 if g(x) ≠ 0 for all x ∊ ℝd. If the integrand g has an integrable singularity at some point x0, then the PDF f should also be singular at x0 so that $\lim_{x \to x_{0}} \frac{g (x)}{f (x)}$ exists and is finite. Moreover, it is reasonable to select f having the same support as that of g. If the integral is taken on a manifold, e.g., a sphere, then the support of the PDF f has to be the same manifold. Rewrite the integral I(g) as follows:

$I (g) = \int_{ℝ^{d}} g (x) dx = \int_{ℝ^{d}} \frac{g (x)}{f (x)} f (x) d x = E [h (X)], (17.39)$

where X ~ f; $h (x) : = \frac{g (x)}{f (x)}$ if g(x) ≠ 0 and g(x) ≠ ∞, h(x) := 0 if g(x) = 0, and $h (x_{0}) : = \lim_{x \to x_{0}} \frac{g (x)}{f (x)} if g (x_{0}) = \infty$ . The integral I(g) = E[h(X)] is estimated by the Monte Carlo method as follows.

(1) Generate n independent sample values ${x_{i}}_{i = 1}^{n}$ drawn from the PDF f.
(2) Construct a sample estimate of the integral I(g):

$I (g) \approx {\bar{h}}_{n} = \frac{1}{n} \sum_{i = 1}^{n} h (x_{i}) .$
(3) For 0 < α < 1, construct an asymptotically valid 100(1 − α)% confidence interval for I(g):

$({\bar{h}}_{n} - \frac{z_{α / 2} s_{n}}{\sqrt{n}}, {\bar{h}}_{n} + \frac{z_{α / 2} s_{n}}{\sqrt{n}}) ∍ I (g),$

where $s_{n}^{2} = \frac{1}{n} \sum_{i = 1}^{n} h {(x_{i})}^{2} - {\bar{h}}_{n}^{2}$ is a sample variance for h(X), and zα/2 is a (1 − α/2)-quantile of the standard normal distribution.
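The three steps above can be sketched as a small routine. The names `mc_integrate` and `sampler` are hypothetical; the caller supplies the ratio h = g/f and a function that draws from f. The normal quantile is taken from `statistics.NormalDist`, available in Python 3.8 and later. The usage example, the integral of e^x over (0, 1) with a uniform sampling density, is an assumption chosen because its exact value e − 1 is known.

```python
import math
import random
from statistics import NormalDist

def mc_integrate(h, sampler, n, alpha=0.05, rng=None):
    """Return the sample mean of h(X) and a 100(1 - alpha)% confidence interval."""
    rng = rng or random.Random()
    values = [h(sampler(rng)) for _ in range(n)]        # step (1): draws from f
    mean = sum(values) / n                              # step (2): sample estimate
    s2 = sum(v * v for v in values) / n - mean * mean   # sample variance s_n^2
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)         # z_{alpha/2}
    half_width = z * math.sqrt(max(s2, 0.0) / n)        # step (3): interval half-width
    return mean, (mean - half_width, mean + half_width)

# Hypothetical test integral: g(x) = e^x on (0, 1), f = Unif(0, 1), so h = g
estimate, (ci_low, ci_high) = mc_integrate(math.exp, lambda r: r.random(),
                                           100000, rng=random.Random(5))
```

The interval is only asymptotically valid, so for small n or heavy-tailed h(X) its actual coverage may differ from the nominal level.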

As is seen, the Monte Carlo method allows us to simultaneously construct an approximation of I(g) and an upper bound for the error:

$| I (g) - {\bar{h}}_{n} | \leq \frac{z_{α / 2} \sqrt{Var (h (X))}}{\sqrt{n}}, (17.40)$

which is valid with confidence (1 − α)100%. The variance Var(h(X)) can be approximated by the sample variance $s_{n}^{2}$ . As is seen from (17.40), the approximation error is of order $O (n^{- 1 / 2})$ , as n → ∞. Since the number of sample values n is equal to the number of times the integrand g is evaluated, the Monte Carlo method can be compared with deterministic quadrature rules for which the approximation error is known in terms of n. For example, the midpoint quadrature rule provides the approximation error of order $O (n^{- 2 / d})$ , as n → ∞. In contrast to the Monte Carlo method, the error of a quadrature rule depends on the dimensionality d. The Monte Carlo method outperforms the midpoint rule for problems with d > 4. In general, the Monte Carlo method outperforms deterministic methods if d ≥ 10. For lower dimensionality (say, d ≤ 3), it is better to use a deterministic quadrature rule. For the case with 3 < d < 10, a combination of stochastic and deterministic methods may be a more efficient approach to computing the integral. The Monte Carlo method is said to beat the curse of dimensionality since its performance is not considerably affected by the dimensionality of the problem. Another advantage of the Monte Carlo method is that it can be applied to integrals of nonsmooth functions, whereas deterministic quadrature rules require the smoothness of higher-order derivatives of integrands.

The computational time required to calculate the estimate ${\bar{h}}_{n}$ is a product of the number of samples, n, and the time tH required to calculate one sample value of H := h(X). Let δ be a given error level. Using (17.40), we obtain

$\frac{z_{α / 2} \sqrt{Var (h (X))}}{\sqrt{n}} = δ \Leftrightarrow n = \frac{z_{α / 2}^{2} Var (h (X))}{δ^{2}} .$

Therefore, the total computational time tH · n is proportional to $\frac{t_{H} Var (h (X))}{δ^{2}}$ . The product tH ·Var(h(X)) is called the computational cost of the stochastic estimator h(X). Clearly, we can accelerate the Monte Carlo method by reducing the variance of the estimator. Another approach is to parallelize computations.

17.5.2 Importance Sampling Method

Clearly, there are many choices for the PDF f in (17.39) that are suitable for the estimation of I(g) by the Monte Carlo method. One obvious requirement on the density f is that it should include all singularities of the integrand to guarantee that the variance of the random estimator is finite. For example, let us estimate the value of the integral

$I = \int_{0}^{1} \frac{g (x)}{\sqrt{x}} d x, \text{ where } g \in C [0, 1] \text{ and } g (0) \neq 0, (17.41)$

by the Monte Carlo method. Consider the following two choices for the PDF f.

Choice 1: Let $f (x) = I_{(0, 1)} (x)$ , i.e., X ~ Unif(0, 1). It is easy to verify that the variance of the random estimator $h (X) : = \frac{g (X)}{\sqrt{X}}$ is infinite.

Choice 2: Let $f (x) \propto \frac{1}{\sqrt{x}}$ for x ∊ (0, 1). Since f integrates to one, we set $f (x) = \frac{1}{2 \sqrt{x}} I_{(0, 1)} (x)$ . Then, the random estimator

$h (X) = \frac{g (X)}{\sqrt{X} f (X)} = 2 g (X)$

is bounded and hence has a finite variance.
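Choice 2 is easy to implement because the CDF of f(x) = 1/(2√x) on (0, 1) is F(x) = √x, so the inversion method gives X = U² with U ~ Unif(0, 1), and the ratio g(X)/(√X f(X)) simplifies to 2g(X). The sketch below uses the hypothetical test function g(x) = 1 + x, for which the integral (17.41) equals 2 + 2/3 = 8/3; the function name is also an assumption.

```python
import random

def singular_integral_estimate(g, n, rng=None):
    """Estimate the integral of g(x)/sqrt(x) over (0, 1) by sampling f(x) = 1/(2 sqrt(x))."""
    rng = rng or random.Random()
    total = 0.0
    for _ in range(n):
        x = rng.random() ** 2     # inversion: F(x) = sqrt(x), hence X = U^2
        total += 2.0 * g(x)       # h(X) = g(X) / (sqrt(X) f(X)) = 2 g(X)
    return total / n

# Hypothetical test function with known integral 8/3
estimate = singular_integral_estimate(lambda x: 1.0 + x, 100000, rng=random.Random(6))
```

Because the singular factor 1/√x is absorbed into the sampling density, the estimator never evaluates the integrand near its singularity.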

Example 17.13.

Construct a Monte Carlo estimator of the integral

$I = \int_{0}^{\infty} \dots \int_{0}^{\infty} \frac{e^{- x_{1} - x_{2} - \dots - x_{8}}}{{(1 + x_{1} \dots x_{8})}^{2}} d x_{1} \dots d x_{8} .$

Solution. Notice that the integrand is a product of the function $e^{- x_{1} - x_{2} - \dots - x_{8}}$ , which is a product of exponential PDFs, and the positive, bounded function (1 + x1 · · · x8)−2. Let us select the PDF

$f (x_{1}, x_{2}, ..., x_{8}) = e^{- x_{1} - x_{2} - \dots - x_{8}} = e^{- x_{1}} e^{- x_{2}} \dots e^{- x_{8}}, x_{1}, x_{2}, ..., x_{8} > 0.$

That is, the entries of the random vector X are independent exponentially distributed variables Xj ~ Exp(1), 1 ≤ j ≤ 8. Set the random estimator to be

$h (X) = \frac{1}{{(1 + X_{1} X_{2} \dots X_{8})}^{2}} .$

To find an upper bound of Var(h(X)), use the following property (see Exercise 17.24). Suppose that Y is a bounded random variable such that 0 ≤ m1 ≤ Y ≤ m2 for some constants m1 and m2. Then,

$Var (Y) \leq \frac{{(m_{2} - m_{1})}^{2}}{4} .$

Applying this property to Y = h(X) ∊ [0, 1], we obtain $Var (h (X)) \leq \frac{1}{4}$ .
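The estimator of Example 17.13 can be sketched in a few lines; the function names and the sample size are hypothetical, and `math.prod` requires Python 3.8 or later. The sample variance is returned as well, so that it can be compared with the bound 1/4.

```python
import math
import random

def h(x):
    """Estimator h(x) = 1 / (1 + x_1 x_2 ... x_8)^2 for a point x in R^8."""
    return 1.0 / (1.0 + math.prod(x)) ** 2

def estimate_example(n, rng=None):
    """Return the sample mean and sample variance of h(X), X_j ~ Exp(1) i.i.d."""
    rng = rng or random.Random()
    values = [h([rng.expovariate(1.0) for _ in range(8)]) for _ in range(n)]
    mean = sum(values) / n
    var = sum(v * v for v in values) / n - mean * mean
    return mean, var

mean_est, var_est = estimate_example(20000, rng=random.Random(13))
```

Since h takes values in [0, 1], the sample variance computed this way cannot exceed 1/4 either, which mirrors the bound derived above.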

As is seen from the above examples, it is reasonable to select the PDF f as close to the integrand g as possible. This suggestion is confirmed by the importance sampling principle, which is presented just below. Since the computational cost is proportional to the variance of the estimator, the optimal PDF f solves the following optimization problem:

$Var (\frac{g (X)}{f (X)}) = \int_{ℝ^{d}} \frac{g^{2} (x)}{f (x)} dx - I^{2} (g) \to \min_{f} . (17.42)$

Let us prove that the variance attains its minimum value if f ∝ |g|. The proof is based on the Cauchy–Schwarz–Bunyakovsky inequality

${(\int_{ℝ^{d}} | u (x) υ (x) | dx)}^{2} \leq \int_{ℝ^{d}} u^{2} (x) dx \cdot \int_{ℝ^{d}} υ^{2} (x) dx, (17.43)$

for u, υ ∊ L2(ℝd) (here L2 is the set of square-integrable functions). Let us set $u (x) = \frac{g (x)}{\sqrt{f (x)}}$ and $υ (x) = \sqrt{f (x)}$ to obtain

$\begin{matrix} {(\int_{ℝ^{d}} | g (x) | dx)}^{2} = {(\int_{ℝ^{d}} | \frac{g (x)}{\sqrt{f (x)}} \sqrt{f (x)} | dx)}^{2} \\ \leq \int_{ℝ^{d}} \frac{g^{2} (x)}{f (x)} dx \cdot \underbrace{\int_{ℝ^{d}} f (x) dx}_{= 1} = \int_{ℝ^{d}} \frac{g^{2} (x)}{f (x)} dx . \end{matrix}$

Therefore, we obtain a lower bound of the variance

$Var (\frac{g (X)}{f (X)}) \geq {(\int_{ℝ^{d}} | g (x) | dx)}^{2} - I^{2} (g) .$

Now, it remains to show that the variance attains the lower bound when the PDF is proportional to |g|, that is, when $f (x) = \frac{1}{c} | g (x) |$ with $c = \int_{ℝ^{d}} | g (x) | d x$ . Indeed, for $f = \frac{1}{c} | g |$ , we have

$\int_{ℝ^{d}} \frac{g^{2} (x)}{f (x)} dx = \int_{ℝ^{d}} \frac{c {| g (x) |}^{2}}{| g (x) |} dx = c \int_{ℝ^{d}} | g (x) | dx = {(\int_{ℝ^{d}} | g (x) | dx)}^{2} .$

In practice, it may be impossible to use the “best” PDF since the numerical evaluation of the normalizing constant c = I(|g|) is equivalent to the original problem. Alternatively, we can use an acceptance-rejection method that does not require the normalizing constant to be calculated. If sampling from f ∝ |g| is not feasible or is computationally expensive, then one can use any approximation PDF that is close to the optimal PDF. For example, a piecewise approximation of |g| can be used to construct the sampling density.

Example 17.14.

Approximate the integral $I = \int_{0}^{π / 2} \sin x d x = 1$ by the Monte Carlo method where the sampling PDF is

(a) $f_{1} (x) = \frac{2}{π} I_{(0, π / 2)} (x)$ (a constant approximation of g);
(b) $f_{2} (x) = \frac{8 x}{π^{2}} I_{(0, π / 2)} (x)$ (a linear approximation of g).

Solution. Let us compare the variances $σ_{i}^{2} = Var (H_{i})$ of the random estimators $H_{i} = \frac{g (X_{i})}{f_{i} (X_{i})}$ with $X_{i} ~ f_{i}$ for i = 1, 2. We have

$\begin{array}{l} σ_{1}^{2} = \int_{0}^{π / 2} \frac{\sin^{2} x}{2 / π} d x - 1 = \frac{π^{2}}{8} - 1 ≅ 0.2337; \\ σ_{2}^{2} = \int_{0}^{π / 2} \frac{\sin^{2} x}{8 x / π^{2}} d x - 1 ≅ 0.0168. \end{array}$

By using the PDF f2 instead of f1, we can reduce the computational cost by approximately $\frac{σ_{1}^{2}}{σ_{2}^{2}} ≅ 14$ times.
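The two variances of Example 17.14 can be checked empirically. The sketch below is a minimal illustration, not a reference implementation; the helper names (`estimate`, `sample_f1`, `sample_f2`) are hypothetical. The linear density has the CDF 4x²/π², so it is sampled by the inverse CDF formula X = (π/2)√U.

```python
import math
import random
import statistics

HALF_PI = math.pi / 2

def estimate(sample_x, pdf, n, rng):
    """Return the sample mean and sample variance of H = sin(X)/f(X), X ~ f."""
    values = []
    for _ in range(n):
        x = sample_x(rng)
        values.append(math.sin(x) / pdf(x))
    return statistics.fmean(values), statistics.variance(values)

# (a) constant density 2/pi on (0, pi/2)
def sample_f1(rng):
    return HALF_PI * (1.0 - rng.random())    # 1 - U lies in (0, 1]

def pdf_f1(x):
    return 2 / math.pi

# (b) linear density 8x/pi^2 on (0, pi/2); inverse CDF gives X = (pi/2) sqrt(U)
def sample_f2(rng):
    return HALF_PI * math.sqrt(1.0 - rng.random())   # avoids X = 0

def pdf_f2(x):
    return 8 * x / math.pi ** 2

rng = random.Random(12345)
mean1, var1 = estimate(sample_f1, pdf_f1, 20000, rng)
mean2, var2 = estimate(sample_f2, pdf_f2, 20000, rng)
print("constant density:", mean1, var1)
print("linear density:  ", mean2, var2)
```

Both sample means should be close to 1, and the ratio of the two sample variances should be roughly the factor of 14 found analytically.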

17.5.3 Change of Probability Measure

Let us consider another PDF $\hat{f}$ such that $\hat{f} (x) > 0$ for all x ∊ ℝ with f(x)h(x) ≠ 0. Apply the change of measure method to obtain:

$Q = E [h (X)] = E [h (\hat{X}) \frac{f (\hat{X})}{\hat{f} (\hat{X})}] (17.44)$

where the last mathematical expectation in (17.44) is relative to $\hat{X} ~ \hat{f}$ . The weight function $f / \hat{f}$ is called the likelihood ratio. According to the importance sampling principle, the optimal PDF $\hat{f}$ is proportional to the product f · |h|. However, sampling from $\hat{f} \propto f \cdot | h |$ may not be feasible; therefore, one may use one of the following simple methods to “improve” the sampling density.

(a) Shifting the PDF: ${\hat{f}}_{c} (x) : = f (x - c)$ for some c ∊ ℝ. The point c may be chosen in accordance with the maximum principle so that ${\hat{f}}_{c}$ and f · |h| attain their maximum at the same point. For example, consider a normal density f which reaches its maximum value at the mean μ. In this case, set c = ν − μ, where ν = arg max{f(x) · |h(x)|}.
(b) Reshaping the PDF: ${\hat{f}}_{c} (x) : = \frac{1}{c} f (x / c)$ for some c > 0.
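The shifting rule (a) can be illustrated on a rare-event probability. The sketch below is an assumed example that does not appear in the text: it estimates ℙ(X > a) for X ~ Norm(0, 1), so the function under the expectation is the indicator of {x > a}. The product of f and this indicator is maximal at x = a, hence the shift c = a, and the weight is f(x)/f(x − c) = exp(−cx + c²/2). The function name `tail_prob_shifted` is hypothetical.

```python
import math
import random

def tail_prob_shifted(a, n, rng):
    """Estimate P(X > a), X ~ Norm(0, 1), by sampling from the shifted density Norm(c, 1)."""
    c = a                                  # shift chosen by the maximum principle
    total = 0.0
    for _ in range(n):
        x = rng.gauss(c, 1.0)              # draw from the shifted density f(x - c)
        if x > a:                          # indicator of the event {x > a}
            total += math.exp(-c * x + 0.5 * c * c)   # weight f(x) / f(x - c)
    return total / n

a = 4.0
estimate = tail_prob_shifted(a, 20000, random.Random(1))
exact = 0.5 * math.erfc(a / math.sqrt(2.0))   # closed form of the normal tail
print(estimate, exact)
```

Without the shift, about three samples in 100,000 would fall in the region of interest, so the direct estimator would need a far larger sample for comparable relative accuracy.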

17.5.4 Control Variate Method

The main idea of the control variate method is to represent the unknown quantity as a sum of two parts. One part can be calculated analytically and the other part is to be estimated by the Monte Carlo method. Such a splitting is expected to decrease the variance. For example, consider the evaluation of $I (g) = \int_{ℝ^{d}} g (x) d x$ . Suppose that there exists another function g0, which is close to g and for which the integral $I (g_{0}) = \int_{ℝ^{d}} g_{0} (x) d x$ can be calculated analytically. Write I(g) = I(g0) + I(g − g0). To approximate the integral I(g − g0) we apply the Monte Carlo method with a PDF f:

$I (g) = I (g_{0}) + E [\frac{g (X) - g_{0} (X)}{f (X)}], X ~ f . (17.45)$

If g0 is close to g, then the difference g − g0 is close to zero. So we expect that the Monte Carlo estimator of I(g − g0) has a smaller variance than that of the original integral I(g). Rewrite (17.45) as follows:

$I (g) = E [\frac{g (X)}{f (X)} - (\frac{g_{0} (X)}{f (X)} - I (g_{0}))] = E [Y - (Z - I (g_{0}))], (17.46)$

where $Y : = \frac{g (X)}{f (X)}$ and $Z : = \frac{g_{0} (X)}{f (X)}$ are, respectively, unbiased estimators of I(g) and I(g0).

Let us apply this method to the estimation of an arbitrary mathematical expectation Q = E[Y] of some random variable Y . Suppose that there exists another random variable Z with known mathematical expectation μZ such that Y and Z can be sampled simultaneously. Construct a new parametric family of random variables:

$Y (b) : = Y - b (Z - μ_{Z}), b \in ℝ .$

Clearly, the mathematical expectation of Y(b) is equal to Q for all b ∊ ℝ:

$E [Y (b)] = E [Y] - b (E [Z] - μ_{Z}) = E [Y] = Q .$

Thus, for every b ∊ ℝ, Y (b) is an unbiased estimator of Q. The random variable Z is called a control variate; Y (b), b ∊ ℝ, is called a controlled estimator. The sample estimate ${\bar{y}}_{n} (b)$ is constructed as usual.

Algorithm 17.17 The Control Variate Method.

(1) Generate n independent realizations $(y_{j}, z_{j})$, j = 1, 2, . . . , n.

(2) Set ${\bar{y}}_{n} (b) = \frac{1}{n} \sum_{j = 1}^{n} (y_{j} - b \cdot (z_{j} - μ_{Z}))$ .

Proposition 17.16.

The optimal value of b chosen so that Var(Y (b)) reaches its minimum value is given by

$b_{opt} = \frac{Cov (Y, Z)}{Var (Z)} . (17.47)$

The minimum variance of the controlled estimator is

$Var (Y (b_{opt})) = Var (Y) \cdot (1 - Corr {(Y, Z)}^{2}) .$

Proof. The variance of the controlled estimator is a quadratic function of b:

$Var (Y (b)) = Var (Z) b^{2} - 2 Cov (Y, Z) b + Var (Y) .$

Differentiate it with respect to b and equate the derivative obtained to zero:

$\frac{d Var (Y (b))}{d b} = 2 Var (Z) b - 2 Cov (Y, Z) = 0 \Rightarrow b = \frac{Cov (Y, Z)}{Var (Z)} .$

Since the second derivative is positive, the variance attains its minimum value at the point bopt given by (17.47). The variance of the controlled estimator Y (b) for b = bopt is

$\begin{matrix} Var (Y (b_{opt})) = & Var (Z) \frac{{Cov}^{2} (Y, Z)}{{Var}^{2} (Z)} - 2 Cov (Y, Z) \frac{Cov (Y, Z)}{Var (Z)} + Var (Y) \\ = & Var (Y) - \frac{{Cov}^{2} (Y, Z)}{Var (Z)} = Var (Y) \cdot (1 - \frac{{Cov}^{2} (Y, Z)}{Var (Y) Var (Z)}) \\ = & Var (Y) \cdot (1 - Corr {(Y, Z)}^{2}) . \end{matrix}$

Thus, $Var (Y (b_{opt})) \leq Var (Y)$ since $0 \leq Corr {(Y, Z)}^{2} \leq 1$.

In practice, Var(Z) and/or Cov(Y, Z) are often not available in closed form. In that case, the optimal value of b can be approximated by the ratio of the sample covariance of Y and Z to the sample variance of Z. The variance reduction factor

$\frac{Var (Y)}{Var (Y (b_{opt}))} = \frac{1}{1 - ρ_{Y Z}^{2}}, (17.48)$

increases very rapidly as $| ρ_{Y Z} | \to 1$, where $ρ_{Y Z} = Corr (Y, Z)$. For example, $\frac{1}{1 - ρ^{2}} = \frac{4}{3} \approx 1.3$ for $| ρ | = \frac{1}{2}$ , but if |ρ| = 0.99 then $\frac{1}{1 - ρ^{2}} \approx 50$ . This observation implies that a very high degree of correlation between the original estimator and a control variate is required to yield a substantial variance reduction.
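The control variate algorithm with b estimated from the sample can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `control_variate_estimate` is hypothetical, and the demonstration uses Y = e^U with the control variate Z = U, μ_Z = 1/2, so that Q = e − 1. Estimating b from the same sample introduces a small bias that vanishes as n grows.

```python
import math
import random
import statistics

def control_variate_estimate(ys, zs, mu_z):
    """Return the controlled sample mean and the estimated coefficient b."""
    n = len(ys)
    y_bar = statistics.fmean(ys)
    z_bar = statistics.fmean(zs)
    # sample covariance of Y and Z, and sample variance of Z
    cov_yz = sum((y - y_bar) * (z - z_bar) for y, z in zip(ys, zs)) / (n - 1)
    var_z = statistics.variance(zs)
    b_hat = cov_yz / var_z
    return y_bar - b_hat * (z_bar - mu_z), b_hat

rng = random.Random(7)
us = [rng.random() for _ in range(5000)]
q_hat, b_hat = control_variate_estimate([math.exp(u) for u in us], us, 0.5)
print(q_hat, b_hat)
```

For this choice of Y and Z the estimated coefficient should be close to the value 6(3 − e) ≈ 1.69 that formula (17.47) gives.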

If a nonoptimal value of b is used, then the controlled estimator may have a larger variance than the original one, as is demonstrated in the following example. Let Z be a control variate for Y so that $Cov (Y, Z) > \frac{1}{2} Var (Z)$ . The controlled estimator $Y (1) = Y - (Z - μ_{Z})$ has a smaller variance than Y:

$Var (Y (1)) = Var (Y) - 2 Cov (Y, Z) + Var (Z) < Var (Y) - Var (Z) + Var (Z) = Var (Y) .$

However, the variance of $Y (- 1) = Y + (Z - μ_{Z})$ (i.e., replace Z by −Z) is larger than that of Y:

$Var (Y (- 1)) = Var (Y) + 2 Cov (Y, Z) + Var (Z) > Var (Y) + Var (Z) + Var (Z) > Var (Y) .$

Example 17.15.

Apply the control variate method to the integral

$I = \int_{{(0, 1)}^{d}} (e^{x_{1} + x_{2} + \dots + x_{d}} - 1) d x_{1} d x_{2} \dots d x_{d} = {(e - 1)}^{d} - 1.$

Assume that a uniform sampling distribution is used in the direct Monte Carlo method. Calculate the variance reduction factor for the optimal control variate when d = 1, 5, 10.

Solution. The PDF of the uniform distribution in the unit hypercube ${(0, 1)}^{d}$ is $f (x) = I_{{(0, 1)}^{d}} (x)$ . The direct estimator of the integral is $Y = e^{U_{1} + U_{2} + \dots + U_{d}} - 1$, where $U_{1}, U_{2}, \dots, U_{d}$ are i.i.d. Unif(0, 1)-distributed random variables. Applying Taylor’s formula to the integrand gives us the following approximation: $e^{x_{1} + x_{2} + \dots + x_{d}} - 1 \approx x_{1} + x_{2} + \dots + x_{d}$ . Let us use $Z = U_{1} + U_{2} + \dots + U_{d}$ as a control variate. The expected value of Z is

$E [Z] = \int_{{[0, 1]}^{d}} (x_{1} + x_{2} + \dots + x_{d}) d x_{1} d x_{2} \dots d x_{d} = d \int_{0}^{1} x d x = \frac{d}{2} .$

So, the controlled estimator is

$Y (b) = Y - b (Z - E [Z]) = e^{U_{1} + U_{2} + \dots + U_{d}} - 1 - b (U_{1} + U_{2} + \dots + U_{d} - d / 2) .$

To find the optimal value of b and the respective variance reduction, we calculate Cov(Y, Z), Var(Y), and Var(Z):

$\begin{matrix} E [Y Z] = & E [(e^{U_{1} + U_{2} + \dots + U_{d}} - 1) (U_{1} + U_{2} + \dots + U_{d})] \\ = & d \int_{0}^{1} x e^{x} d x {(\int_{0}^{1} e^{y} d y)}^{d - 1} - d \int_{0}^{1} x d x = d {(e - 1)}^{d - 1} - d / 2, \\ Cov (Y, Z) = & E [Y Z] - E [Y] E [Z] = d {(e - 1)}^{d - 1} - \frac{d}{2} - ({(e - 1)}^{d} - 1) \cdot \frac{d}{2} \\ = & d {(e - 1)}^{d - 1} \cdot \frac{3 - e}{2}, \\ Var (Y) = & E [Y^{2}] - E {[Y]}^{2} = {(\int_{0}^{1} e^{2 x} d x)}^{d} - 2 {(\int_{0}^{1} e^{x} d x)}^{d} + 1 - {({(e - 1)}^{d} - 1)}^{2} \\ = & {(\frac{e^{2} - 1}{2})}^{d} - 2 {(e - 1)}^{d} + 1 - {(e - 1)}^{2 d} + 2 {(e - 1)}^{d} - 1 \\ = & {(\frac{e^{2} - 1}{2})}^{d} - {(e - 1)}^{2 d}, \\ Var (Z) = & Var (U_{1} + U_{2} + \dots + U_{d}) = d Var (U_{1}) = \frac{d}{12} . \end{matrix}$

Therefore, the optimal value of b is $b_{opt} = \frac{Cov (Y, Z)}{Var (Z)} = 6 {(e - 1)}^{d - 1} (3 - e)$ . The results for d = 1, 5, 10 are summarized in Table 17.7. As is seen from the table, the efficiency of this control variate method decreases as d grows.

Table 17.7

Efficiency of the control variate method.

d	1	5	10
bopt	1.69	14.73	220.71
Corr(Y, Z)	99.18%	91.38%	82.02%
Var(Y)/ Var(Y (bopt))	61.43	6.06	3.06
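The entries of Table 17.7 follow directly from the closed-form expressions for Cov(Y, Z), Var(Y), and Var(Z) derived above. The short Python sketch below (the function name `control_variate_table_row` is hypothetical) evaluates them for d = 1, 5, 10 and should reproduce the table up to rounding.

```python
import math

def control_variate_table_row(d):
    """Closed-form b_opt, Corr(Y, Z), and variance reduction factor for dimension d."""
    e = math.e
    cov_yz = d * (e - 1) ** (d - 1) * (3 - e) / 2
    var_y = ((e * e - 1) / 2) ** d - (e - 1) ** (2 * d)
    var_z = d / 12
    b_opt = cov_yz / var_z
    corr = cov_yz / math.sqrt(var_y * var_z)
    factor = 1 / (1 - corr ** 2)          # formula (17.48)
    return b_opt, corr, factor

for d in (1, 5, 10):
    b_opt, corr, factor = control_variate_table_row(d)
    print(f"d={d:2d}  b_opt={b_opt:8.2f}  Corr={100 * corr:6.2f}%  factor={factor:6.2f}")
```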

The above example suggests a general approach to the construction of a control variate. Consider the estimation of Q = E[Y] with Y = h(X), X ~ f. Suppose that h: ℝ → ℝ has continuous derivatives up to order k and can be approximated by its Taylor polynomial about $x_{0}$:

$h (x) \approx h (x_{0}) + \sum_{j = 1}^{k} \frac{h^{(j)} (x_{0})}{j!} {(x - x_{0})}^{j} .$

Suppose that we are able to exactly calculate the moments $E [{(X - x_{0})}^{j}]$, 1 ≤ j ≤ k. Then, the random variable $Z : = \sum_{j = 1}^{k} \frac{h^{(j)} (x_{0})}{j!} {(X - x_{0})}^{j}$ can be used as a control variate (the constant term $h (x_{0})$ is dropped because it affects neither the variance nor the covariance). The point $x_{0}$ should be selected in a way such that |Corr(Y, Z)| is maximized.

17.5.5 Antithetic Variate

The antithetic variate method attempts to reduce the variance by exploiting a symmetry of a probability distribution. Consider the estimation of Q = E[h(U)], where U ~ Unif(0, 1). For example, $Q = \int_{a}^{b} g (x) d x = E [h (U)]$ with h(U) = (b − a)g(a + (b − a)U). Since U and 1 − U have the same distribution, the estimators h(U) and h(1 − U) are identically distributed as well. The antithetic estimator is defined as $Y_{anti} = \frac{1}{2} (h (U) + h (1 - U))$ . Compare the variances of the direct estimator Y = h(U) and the antithetic estimator $Y_{anti}$:

$\begin{matrix} Var (Y_{anti}) = \frac{1}{4} (Var (h (U)) + 2 Cov (h (U), h (1 - U)) + Var (h (1 - U))) \\ = \frac{1}{2} Var (h (U)) + \frac{1}{2} Cov (h (U), h (1 - U)) . \end{matrix}$

If h(U) and h(1 − U) are negatively correlated, then $Var (Y_{anti}) \leq \frac{1}{2} Var (Y)$ . Since every realization of $Y_{anti}$ requires two evaluations of h, the computational time is doubled when we apply the antithetic variate method. However, we can still expect a reduction in the computational cost if $Var (Y_{anti})$ is less than one half of Var(Y). This is the case if h is a monotone function, as is proved in the following proposition.

Proposition 17.17.

Let h be a monotone function defined on [0, 1]; let Cov(h(U), h(1−U)), where U ~ Unif(0, 1), be finite. Then, Cov(h(U), h(1 − U)) ≤ 0.

Proof. Without loss of generality, assume that the function h is nondecreasing. We need to show that

$E [h (U) h (1 - U)] \leq E {[h (U)]}^{2} .$

Since h is nondecreasing on [0, 1], the value Q lies between the values of h at the endpoints on the unit interval, i.e., h(0) ≤ Q ≤ h(1). Consider the function $f (y) : = \int_{0}^{y} h (1 - x) d x - Q y$ defined on the unit interval, [0, 1]. The function f is zero at the endpoints, f(0) = f(1) = 0. Its derivative, f′(y) = h(1 − y) − Q, is a nonincreasing function. Since f′(0) = h(1) − Q ≥ 0 and f′(1) = h(0) − Q ≤ 0, the function f is nonnegative everywhere on [0, 1]. Therefore, $\int_{0}^{1} f (y) h^{'} (y) d y \geq 0$ . Integration by parts gives

$\int_{0}^{1} f (y) h^{'} (y) d y = f (y) h (y) |_{0}^{1} - \int_{0}^{1} f^{'} (y) h (y) d y = - \int_{0}^{1} f^{'} (y) h (y) d y .$

Therefore, we have

$\int_{0}^{1} f^{'} (y) h (y) d y = \int_{0}^{1} (h (y) h (1 - y) - h (y) Q) d y = \int_{0}^{1} h (y) h (1 - y) d y - Q^{2} \leq 0.$

Hence, $\int_{0}^{1} h (y) h (1 - y) d y \leq Q^{2}$ .

In summary, the antithetic variate method attempts to reduce the variance by introducing negative correlation between pairs of realizations. Here are some examples of antithetic pairs.

U and 1 − U are both Unif(0, 1)-distributed. Corr(U, 1 − U) = −1.
Let F be a CDF and $F^{-1}$ its generalized inverse. The random variables $F^{-1} (U)$ and $F^{-1} (1 - U)$ have the same CDF F, and they are antithetic to each other (F and $F^{-1}$ are both monotone functions).
Z and −Z are both standard normal variables, and Corr(Z, −Z) = −1.
Z and 2μ − Z, where Z ~ Norm(μ, $σ^{2}$), form an antithetic pair.
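The antithetic pairing of U and 1 − U can be sketched as follows. This is an illustrative example only: the monotone test function h(u) = e^u and the helper name `direct_and_antithetic` are assumptions made for the demonstration. For this h, the formula for Var(Y_anti) given earlier in this subsection evaluates to about 0.0039, compared with Var(h(U)) ≈ 0.2420.

```python
import math
import random
import statistics

def direct_and_antithetic(h, n_pairs, rng):
    """Return (mean, variance) of h(U) and (mean, variance) of the antithetic average."""
    direct_values = []
    pair_values = []
    for _ in range(n_pairs):
        u = rng.random()
        direct_values.append(h(u))
        pair_values.append(0.5 * (h(u) + h(1.0 - u)))   # one realization of Y_anti
    return (statistics.fmean(direct_values), statistics.variance(direct_values),
            statistics.fmean(pair_values), statistics.variance(pair_values))

m_dir, v_dir, m_anti, v_anti = direct_and_antithetic(math.exp, 10000, random.Random(3))
print("direct:    ", m_dir, v_dir)
print("antithetic:", m_anti, v_anti)
```

Since each antithetic realization costs two evaluations of h, the fair comparison is between the antithetic variance and one half of the direct variance.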

17.5.6 Conditional Sampling

Consider the estimation of E[h(X)] where h is a function of a random variable X. Suppose that there exists another random variable V correlated with X so that the conditional expectation $\hat{h} (V) = E [h (X) | V]$ can be calculated exactly. Applying the double expectation formula gives us

$E [h (X)] = E [E [h (X) | V]] = E [\hat{h} (V)] .$

As is well known from a standard course on probability theory, for any pair of square-integrable random variables (Y, V), we have Var(Y) = E[Var(Y | V)] + Var(E[Y | V]). Setting Y = h(X) gives that

$Var (h (X)) = E [Var (h (X) | V)] + Var (E [h (X) | V]) \geq Var (E [h (X) | V]) .$

Therefore, using the estimator $\hat{h} (V) = E [h (X) | V]$ instead of h(X) never increases the variance, and it reduces the variance whenever E[Var(h(X) | V)] > 0.

Algorithm 17.18 The Conditional Sampling Method.

(1) Generate n independent samples v1, v2, . . . , vn.
(2) Calculate $\hat{h} (υ_{i}) = E [h (X) | V = υ_{i}], i = 1, 2, ..., n$ .
(3) Estimate $Q = E [h (X)] \approx \frac{1}{n} \sum_{i = 1}^{n} \hat{h} (υ_{i})$ .

For this algorithm to be of practical use, the following conditions must be met.

(a) It is easy to generate V .
(b) E[h(X) | V = v] is readily computable for all v.
(c) E[Var(h(X) | V)] is large relative to Var(E[h(X) | V]).

Example 17.16.

Let $\{X_{i}\}_{i \geq 1}$ be i.i.d. random variables with a common CDF F; let R be a positive-integer-valued random variable independent of all $X_{i}$. Apply the conditional sampling method to estimate

$q (x) = ℙ (S_{R} \leq x) = E [I_{\{S_{R} \leq x\}}], \text{ where } S_{R} = \sum_{i = 1}^{R} X_{i} .$

Solution. We can try to improve the direct Monte Carlo estimator by isolating X1:

$\begin{matrix} ℙ (S_{R} \leq x | R = r) = ℙ (X_{1} \leq x - \sum_{i = 2}^{r} X_{i}) \\ = E [ℙ (X_{1} \leq x - \sum_{i = 2}^{r} X_{i} | X_{2}, X_{3}, ...)] = E [F (x - \sum_{i = 2}^{r} X_{i})], \end{matrix}$

where the mathematical expectation is relative to X2, X3, . . . . Therefore,

$q (x) = E [ℙ (S_{R} \leq x | R)] = E [E [F (x - \sum_{i = 2}^{R} X_{i})]],$

where the external expectation is relative to R and the internal expectation is taken with respect to X2, X3, . . .. This method is efficient if ℙ(R = 1) is large.
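The conditional estimator of Example 17.16 can be compared with the direct indicator estimator in a short simulation. The sketch below is a hedged illustration: the specific choices R ~ Geom(p) on {1, 2, . . .} and X_i ~ Exp(λ) are assumptions made here (the example itself leaves F and the distribution of R general), and all function names are hypothetical.

```python
import math
import random

def sample_geometric(p, rng):
    """Geometric variate on {1, 2, ...} via floor(ln(1 - U) / ln(1 - p)) + 1."""
    u = rng.random()                       # U in [0, 1), so 1 - U is in (0, 1]
    return int(math.log(1.0 - u) / math.log(1.0 - p)) + 1

def exp_cdf(t, lam):
    """CDF F of the Exp(lam) distribution."""
    return 1.0 - math.exp(-lam * t) if t > 0 else 0.0

def estimate_q(x, p, lam, n, rng):
    """Return the direct and the conditional estimates of q(x) = P(S_R <= x)."""
    direct_sum = 0.0
    conditional_sum = 0.0
    for _ in range(n):
        r = sample_geometric(p, rng)
        xs = [rng.expovariate(lam) for _ in range(r)]
        direct_sum += 1.0 if sum(xs) <= x else 0.0        # indicator of {S_R <= x}
        conditional_sum += exp_cdf(x - sum(xs[1:]), lam)  # F(x - X_2 - ... - X_R)
    return direct_sum / n, conditional_sum / n

q_direct, q_cond = estimate_q(2.0, 0.5, 1.0, 20000, random.Random(11))
print(q_direct, q_cond)
```

Both estimators target the same quantity; the conditional one replaces the 0/1 indicator by a value in [0, 1], which is what removes part of the variance.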

17.5.7 Stratified Sampling

Stratified sampling is a special case of conditional sampling. Consider again the estimation of Q = E[h(X)]. We construct V such that Q = E[E[h(X) | V]] by stratifying the support of X as follows. Introduce mutually disjoint subsets $A_{1}, A_{2}, \dots, A_{k}$ so that $ℙ (X \in \cup_{i = 1}^{k} A_{i}) = 1$. Define a discrete random variable V ∊ {1, 2, . . . , k} so that V = i if X ∊ $A_{i}$, 1 ≤ i ≤ k. Set $p_{i} = ℙ (V = i) = ℙ (X \in A_{i})$, 1 ≤ i ≤ k. The quantity of interest Q is given as a double expectation:

$Q = E [h (X)] = \sum_{i = 1}^{k} p_{i} E [h (X) | X \in A_{i}] = E [E [h (X) | V]] .$

The stratified sample estimator of Q is

${\bar{H}}_{n}^{strat} = \sum_{i = 1}^{k} \frac{p_{i}}{n_{i}} \sum_{j = 1}^{n_{i}} h (X_{i, j}),$

where $X_{i, 1}, \dots, X_{i, n_{i}}$ are i.i.d. samples of X conditional on V = i, for 1 ≤ i ≤ k. The total sample size is n = n1 + n2 + · · · + nk. A crucial assumption is that sampling of X conditional on V is possible. The variance of the stratified estimator is

$Var ({\bar{H}}_{n}^{strat}) = \sum_{i = 1}^{k} \frac{p_{i}^{2}}{n_{i}^{2}} \sum_{j = 1}^{n_{i}} σ_{i}^{2} = \sum_{i = 1}^{k} \frac{p_{i}^{2} σ_{i}^{2}}{n_{i}}, (17.49)$

where $σ_{i}^{2} = Var (h (X) | V = i)$ .

The most important question is how to allocate the n sample points among the strata. The following two results provide a simple method and an optimal method for choosing the sample sizes $n_{i}$, 1 ≤ i ≤ k, respectively.

Proposition 17.18.

Let the sample sizes $n_{i}$ be proportional to the probabilities $p_{i}$, i.e., $n_{i} = n \cdot p_{i}$, for i = 1, 2, . . . , k. Then, the variance of the stratified sample estimator ${\bar{H}}_{n}^{strat}$ does not exceed that of the direct sample estimator ${\bar{H}}_{n}^{direct}$ :

$Var (\sum_{i = 1}^{k} \frac{p_{i}}{n_{i}} \sum_{j = 1}^{n_{i}} h (X_{i, j})) \leq Var (\frac{1}{n} \sum_{j = 1}^{n} h (X_{j})),$

where $X_{1}, \dots, X_{n}$ are i.i.d. copies of X and n = n1 + n2 + · · · + nk.

Proof.

$\begin{matrix} n \cdot Var ({\bar{H}}_{n}^{strat}) = n \cdot \sum_{i = 1}^{k} \frac{p_{i}^{2} σ_{i}^{2}}{n_{i}} = \sum_{i = 1}^{k} p_{i} σ_{i}^{2} (n p_{i} = n_{i} \Rightarrow n p_{i}^{2} / n_{i} = p_{i}) \\ = E [Var (h (X) | V)] \leq Var (h (X)) = n \cdot Var ({\bar{H}}_{n}^{direct}) . \end{matrix}$

Proposition 17.19.

The optimal allocations are $n_{i}^{*} = n \frac{p_{i} σ_{i}}{\sum_{j = 1}^{k} p_{j} σ_{j}}$ , which give

$Var ({\bar{H}}_{n}^{strat}) = \frac{1}{n} {(\sum_{i = 1}^{k} p_{i} σ_{i})}^{2} .$

Proof. The proof is left as an exercise for the reader (see Exercise 17.35).

The standard deviations σi, 1 ≤ i ≤ k, are usually unknown. In practice, one may estimate them from “pilot” runs and then estimate the optimal sample sizes $n_{1}^{*}, n_{2}^{*}, . . ., n_{k}^{*}$ .
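The two allocation rules can be compared numerically through formula (17.49). The sketch below is illustrative only: the stratum probabilities and standard deviations are made-up numbers, the function names are hypothetical, and the allocations are left real-valued (in practice they must be rounded to positive integers).

```python
import math

def proportional_allocation(n, p):
    """Sample sizes n_i = n * p_i."""
    return [n * p_i for p_i in p]

def optimal_allocation(n, p, sigma):
    """Sample sizes n_i proportional to p_i * sigma_i."""
    total = sum(p_i * s_i for p_i, s_i in zip(p, sigma))
    return [n * p_i * s_i / total for p_i, s_i in zip(p, sigma)]

def stratified_variance(p, sigma, alloc):
    """Variance of the stratified estimator, formula (17.49)."""
    return sum(p_i ** 2 * s_i ** 2 / n_i for p_i, s_i, n_i in zip(p, sigma, alloc))

# Hypothetical strata: probabilities and conditional standard deviations.
p = [0.5, 0.3, 0.2]
sigma = [1.0, 2.0, 4.0]
n = 100

var_prop = stratified_variance(p, sigma, proportional_allocation(n, p))
var_opt = stratified_variance(p, sigma, optimal_allocation(n, p, sigma))
print(var_prop, var_opt)
```

When the σ_i come from pilot runs, the computed allocation is only an estimate of the optimal one, so the realized gain may be somewhat smaller.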

A typical problem where the stratified sampling method is efficient is numerical integration. Consider an integral of some function g on D ⊆ ℝd. Introduce a partition of D into k disjoint subdomains:

$D = \cup_{i = 1}^{k} D_{i}, \quad D_{i} \cap D_{j} = \emptyset \text{ for } i \neq j .$

Apply the stratified sampling method to the integral to obtain

$\begin{matrix} \int_{D} g (x) dx = \sum_{i = 1}^{k} \int_{D_{i}} g (x) dx = \sum_{i = 1}^{k} \int_{D_{i}} h (x) f_{X} (x) dx (where h (x) : = \frac{g (x)}{f_{X} (x)}) \\ = \sum_{i = 1}^{k} p_{i} \int_{D_{i}} h (x) f_{i} (x) dx =E [E [h (X) | V]], \end{matrix}$

where V = i with probability $p_{i} = \int_{D_{i}} f_{X} (x) d x$ for 1 ≤ i ≤ k, and $f_{i} (x) : = \frac{f_{X} (x)}{p_{i}} I_{D_{i}} (x)$ is the conditional density of X given V = i for 1 ≤ i ≤ k.

Example 17.17.

Construct a direct sample estimator and a stratified sample estimator with two equal subintervals for the integral $I = \int_{0}^{1} e^{x} d x$ . Use 10 sample points for both estimators. Compare the variances of the estimators obtained.

Solution. Let us construct an estimator using a uniform sampling distribution, i.e., $I = E [e^{U}]$ with U ~ Unif(0, 1). The direct Monte Carlo estimator with n = 10 sample points is

${\bar{H}}_{10}^{direct} = \frac{1}{10} (e^{U_{1}} + \dots + e^{U_{10}}),$

where U1, . . . , U10 are i.i.d. Unif(0, 1)-distributed random variables. The variance of the direct MC estimator is

$Var ({\bar{H}}_{10}^{direct}) = \frac{1}{10} Var (e^{U_{1}}) = \frac{1}{10} (\int_{0}^{1} e^{2 x} d x - I^{2}) = \frac{4 e - e^{2} - 3}{20} ≅ 0.02420.$

Divide the integration interval into two subintervals (k = 2): (0, 1) = (0,1/2) ∪ [1/2, 1). Now for each interval we need to find the conditional distribution of U and the probability, as is discussed above. The distribution of U ~ Unif(0, 1) conditional on {U ∊ (0, 1/2)} (or {U ∊ [1/2, 1)}) is uniform:

$(U | \{U \in (0, \frac{1}{2})\}) \overset{d}{=} \frac{U}{2} ~ Unif (0, \frac{1}{2}), \quad (U | \{U \in [\frac{1}{2}, 1)\}) \overset{d}{=} \frac{U + 1}{2} ~ Unif (\frac{1}{2}, 1) .$

The probabilities are p1 = ℙ(U ∊ (0, 1/2)) = 1/2 and p2 = ℙ(U ∊ [1/2, 1)) = 1/2. Let n1 and n2 be sample sizes so that n1 + n2 = 10. The stratified sample estimator is

${\bar{H}}_{10}^{strat} = \frac{1}{2 n_{1}} (e^{U_{1} / 2} + \dots + e^{U_{n_{1}} / 2}) + \frac{1}{2 n_{2}} (e^{(V_{1} + 1) / 2} + \dots + e^{(V_{n_{2}} + 1) / 2}),$

where $U_{1}, ..., U_{n_{1}}$ and $V_{1}, ..., V_{n_{2}}$ are i.i.d. Unif(0, 1)-distributed random variables. The variance of the stratified estimator is

$\begin{matrix} Var ({\bar{H}}_{10}^{strat}) = \frac{1}{4 n_{1}} Var (e^{U_{1} / 2}) + \frac{e}{4 n_{2}} Var (e^{V_{1} / 2}) \\ = (\frac{1}{4 n_{1}} + \frac{e}{4 n_{2}}) Var (e^{U_{1} / 2}) = (\frac{1}{4 n_{1}} + \frac{e}{4 n_{2}}) (\int_{0}^{1} e^{x} d x - {(\int_{0}^{1} e^{x / 2} d x)}^{2}) \\ = (\frac{1}{4 n_{1}} + \frac{e}{4 n_{2}}) (8 \sqrt{e} - 3 e - 5) ≅ \frac{0.008731}{n_{1}} + \frac{0.02373}{n_{2}} . \end{matrix}$

If $n_{1} = n_{2} = 5$ , then $Var ({\bar{H}}_{10}^{strat}) ≅ 0.006493 < 0.02420 ≅ Var ({\bar{H}}_{10}^{direct})$ .
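The variances of the two 10-point estimators of Example 17.17 can also be checked by brute force: repeat each estimator many times and compute the sample variance of the results. The sketch below is a minimal illustration; the function names and the number of replications are arbitrary choices.

```python
import math
import random
import statistics

def direct_estimator(n, rng):
    """Direct Monte Carlo estimate of the integral of e^x over (0, 1)."""
    return sum(math.exp(rng.random()) for _ in range(n)) / n

def stratified_estimator(n1, n2, rng):
    """Stratified estimate with strata (0, 1/2) and [1/2, 1), p1 = p2 = 1/2."""
    low = sum(math.exp(rng.random() / 2) for _ in range(n1)) / n1
    high = sum(math.exp((rng.random() + 1) / 2) for _ in range(n2)) / n2
    return 0.5 * low + 0.5 * high

def replicate_variance(estimator, replications):
    """Sample variance of an estimator over independent replications."""
    return statistics.variance([estimator() for _ in range(replications)])

rng = random.Random(2024)
var_direct = replicate_variance(lambda: direct_estimator(10, rng), 4000)
var_strat = replicate_variance(lambda: stratified_estimator(5, 5, rng), 4000)
print(var_direct, var_strat)
```

The two printed values should be close to the analytical variances of the direct and stratified estimators obtained in the example.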

17.6 Exercises

Exercise 17.1. In the “hit-or-miss” method, the value of π is estimated as follows. A point is sampled uniformly in the square [−1, 1] × [−1, 1], i.e., sample two independent Cartesian coordinates X, Y ~ Unif(−1, 1). A point (X, Y) is accepted if it lies inside the circle inscribed in the square, i.e., $X^{2} + Y^{2} < 1$. The experiment is repeated N times, and the number of accepted points, $N_{H}$, is recorded. The ratio $\frac{N_{H}}{N}$ converges to a limiting value whose expression involves π.
(a) Construct a random estimator of π based on the experiment described above.
(b) Construct a sample estimate of π with N trials. Express it in terms of the ratio $\frac{N_{H}}{N}$ .
(c) Construct a 99% confidence interval for π. Find the number of trials, N, required to obtain the confidence interval of length less than $10^{-3}$.
Exercise 17.2. Let U ~ Unif(0, 1) admit the following binary representation:

$U = {(0. B_{1} B_{2} ... B_{k} ...)}_{2} = \sum_{k = 1}^{\infty} B_{k} \cdot 2^{- k} .$

Show that the digits $B_{k}$, k = 1, 2, . . ., are independent Bernoulli random variables uniformly distributed in {0, 1}.
Exercise 17.3. Propose an algorithm for simulating the occurrence of two dependent events A and B that uses only one uniform random variable U ~ Unif(0, 1). Assume that the probabilities ℙ(A) = pA, ℙ(B) = pB, and ℙ(A ∪ B) = q are given.
Exercise 17.4. Using the inverse CDF method, find generating formulae for the following PDFs:
(a) $f (x) = \frac{3 e^{- x}}{(e^{- x} + 3) \sqrt{9 - e^{- 2 x}}}, - \ln 3 < x < + \infty$ ;
(b) $f (x) = \frac{2 \cos x}{\sin^{3} x}, \frac{π}{4} \leq x \leq \frac{π}{2}$ ;
(c) $f (x) = \frac{8 x}{{(x + 1)}^{3}}, 0 \leq x \leq 1$ ;
(d) $f (x) = x \sin x^{2}, 0 \leq x \leq \sqrt{π}$ ;
(e) $f (x) = \frac{3}{2 {(2 x + 1)}^{3 / 2}}, 0 \leq x \leq 4$ ;
(f) $f (x) = \frac{1}{\cos^{2} x}, 0 \leq x \leq \frac{π}{4}$ .
Exercise 17.5. Using the inverse CDF method, find generating formulae for the Rayleigh distribution with the CDF $F (x) = 1 - e^{- 2 x (x - b)}$, x > b.
Exercise 17.6. Let X be an absolutely continuous random variable such that the inverse function of its distribution function F is well-defined. Consider the problem of sampling X conditional on a < X < b, with F (a) < F (b).
(a) Find the CDF for such a conditional distribution.
(b) Show that the generating formula $X = F^{-1} (F (a) + (F (b) - F (a)) U)$ with U ~ Unif(0, 1) produces the desired conditional distribution.
Exercise 17.7. Justify the following sampling formula for the geometric random variable X with parameter p ∊ (0, 1) (i.e., $ℙ (X = k) = {(1 - p)}^{k - 1} p$ for k = 1, 2, . . .):

$X = ⌊ \frac{\ln (1 - U)}{\ln (1 - p)} ⌋ + 1, U ~ Unif (0, 1) .$

[Hint: Evaluate the probability ℙ(X = k), k = 1, 2, . . ..]
Exercise 17.8. Consider a random variable with the piecewise-constant PDF

$f (x) = \begin{cases} \frac{1}{2}, & 0 < x \leq 1, \\ \frac{1}{8}, & 1 < x \leq 3, \\ \frac{1}{12}, & 3 < x < 6. \end{cases}$

Design a simulation algorithm using the inverse-transform method.
Exercise 17.9. Consider a random variable with a PDF f, which is proportional to the piecewise function

$g (x) = \begin{cases} \frac{x}{3}, & 0 < x \leq 2, \\ \frac{1}{3}, & 2 < x \leq 4, \\ 1 - \frac{x}{6}, & 4 < x < 6. \end{cases}$

Find the constant c such that f(x) = cg(x) is a probability density. Design a simulation algorithm using the composition method.
Exercise 17.10. Present an acceptance-rejection method for a random variable on (0, 1) with the PDF of the form

$f (x) = g (x) / x^{a}, 0 < g (x) \leq M, 0 < a < 1.$
Exercise 17.11. Construct the Neumann method (and estimate its computational cost) for sampling a random variable X whose density is proportional to
(a) $f (u) = 3 - \sqrt[3]{2 u}, 0 \leq u \leq 1$ ;
(b) $f (u) = - u^{2} + 2 u + 3, 0 < u < 3$ ;
(c) $f (u) = u^{5 / 3} {(1 - u)}^{3 / 2}, 0 \leq u \leq 1$ ;
(d) $f (u) = u^{5 / 3} e^{- u}, u > 0$ .
For each case, compute the probability of acceptance.
Exercise 17.12. Consider the triangular probability distribution with the PDF

$f (x) = \frac{x}{2} I_{(0, 1]} (x) + \frac{4 - x}{6} I_{(1, 4)} (x) .$
(a) Obtain the CDF F and then its inverse $F^{-1}$. Describe the inverse CDF sampling method.
(b) Develop the decomposition method using the strata {(0, 1], (1, 4)}.
(c) Determine which of the following majorizing functions provides the largest acceptance probability in the acceptance-rejection method:
  
  $g_{1} (x) = \frac{x}{2}, g_{2} (x) = \frac{4 - x}{6}, g_{3} (x) = \frac{1}{2}, x \in (0, 4) .$
  
  Find the acceptance probability for each case.
Exercise 17.13. Let X be a positive random variable with the probability density function $f (x) = \sqrt{\frac{2}{π σ^{2}}} e^{- x^{2} / (2 σ^{2})}$ , i.e., X = |Z| where Z ~ Norm(0, σ2). Develop an acceptance-rejection algorithm for generating a random variable from the PDF f using the exponential distribution Exp(λ) as the majorizing distribution. Which λ gives the largest acceptance probability (give the answer in terms of σ)?
Exercise 17.14. Consider a bivariate distribution with the joint density

$f (x, y) = \begin{cases} c x y & \text{if } x > 0, y < 1, y - x > 0, \\ 0 & \text{otherwise} . \end{cases}$
(a) Find the constant c.
(b) Represent the joint density as a product of marginal and conditional univariate densities.
(c) Construct the exact simulation algorithm based on the inverse CDF method. Consider the following approach: first model X and then Y.
Exercise 17.15. The density of a random variable X is represented in the integral form:

$f_{X} (u) = c \int_{1}^{\infty} υ^{- c} e^{- u υ} d υ, u > 0, c > 0.$
(a) Show that $f_{X}$ is a density function.
(b) For the joint density
  
  $f_{X, Y} (u, υ) = c υ^{- c} e^{- u υ}, u > 0, υ > 1,$
  
  find a representation of the form $f_{X, Y} (u, υ) = f_{Y} (υ) f_{X | Y} (u | υ)$.
(c) Based on the above representation of the marginal density
  
  $f_{X} (u) = \int_{- \infty}^{\infty} f_{Y} (υ) f_{X | Y} (u | υ) d υ,$
  
  construct an algorithm for generating X, first by sampling Y from $f_{Y} (υ)$, and then by sampling from the conditional density $f_{X | Y} (u | υ)$.
Exercise 17.16. Demonstrate how three independent tosses of a balanced coin can be modelled by two rolls of a balanced die (with six faces). In other words, develop an exact algorithm for sampling a vector [X1, X2, X3] formed by three independent discrete random variables uniformly distributed in {0, 1} by using two independent discrete random variables Y1 and Y2 uniformly distributed in {1, 2, 3, 4, 5, 6}.
Exercise 17.17. Develop an algorithm that allows you to “replace” a fair roulette wheel (with 37 pockets numbered from 0 to 36) by a balanced die (with six faces labelled from 1 to 6). In other words, develop an exact algorithm for sampling a discrete random variable X uniformly distributed in {0, 1, 2, . . . , 36} by using a sequence $(Y_{k})_{k \geq 0}$ of i.i.d. discrete random variables uniformly distributed in {1, 2, 3, 4, 5, 6}.
Exercise 17.18. Consider acceptance-rejection sampling from a standard normal distribution.
(a) Find the range for λ > 0 and c > 0 so that the function
  
  $g (x) = \begin{cases} 1 & \text{if } | x | < c, \\ e^{- λ (| x | - c)} & \text{if } | x | \geq c, \end{cases}$
  
  majorizes $e^{- x^{2} / 2}$ on $(- \infty, \infty)$ .
(b) Calculate the acceptance probability (as a function of c and λ).
(c) Let us set $c = 1 / \sqrt{2}$ and $λ = 2 c = \sqrt{2}$ . Show that this choice of parameters maximizes the acceptance probability, and that the maximum value is $\frac{\sqrt{π}}{2}$ .
(d) Show that the proposal CDF G is given by
  
  $G (x) = \begin{cases} \frac{1}{4} e^{\sqrt{2} x + 1} & \text{if } x < - \frac{\sqrt{2}}{2}, \\ \frac{\sqrt{2} x + 2}{4} & \text{if } | x | \leq \frac{\sqrt{2}}{2}, \\ 1 - \frac{1}{4} e^{- \sqrt{2} x + 1} & \text{if } x > \frac{\sqrt{2}}{2} . \end{cases}$
(e) Develop computational algorithms
  (i) for sampling from G by the inverse CDF method;
  (ii) for sampling from the standard normal distribution by using the acceptance-rejection method with the proposal CDF G.
Exercise 17.19. Construct two methods of sampling from the trivariate normal distribution

$Norm_{3} ([\begin{matrix} 1 \\ 0 \\ - 1 \end{matrix}], [\begin{matrix} 1 & 1 & 1 \\ 1 & 4 & 1 \\ 1 & 1 & 9 \end{matrix}])$

using the Cholesky factorization and the conditional sampling approach, respectively.
Exercise 17.20. By applying the conditioning formula for a multivariate normal distribution, find the conditional (bridge) distribution of a Brownian motion with drift

$B (t) = μ t + σ W (t), t \geq 0.$

In particular, show that for every 0 ≤ s < u < t, B(u) conditional on values of B(s) and B(t) has a normal distribution. Find the mean and variance of this conditional distribution.
Exercise 17.21. Consider a time-homogeneous random process $\{X (t)\}_{t \geq 0}$ with the transition PDF p(t; y, x) defined by

$p (t; y, x) d x = ℙ (X (t + s) \in [x, x + d x) | X (s) = y) \text{ for } t, s > 0$

with infinitesimally small dx. Using the notion of conditional probability, show that for every $0 \leq t_{1} < t < t_{2}$ the bridge density $b (t; x | t_{1}, t_{2}; x_{1}, x_{2})$ of X(t) conditional on $X (t_{1}) = x_{1}$ and $X (t_{2}) = x_{2}$ is given by

$b (t; x | t_{1}, t_{2}; x_{1}, x_{2}) = \frac{p (t - t_{1}; x_{1}, x) p (t_{2} - t; x, x_{2})}{p (t_{2} - t_{1}; x_{1}, x_{2})} .$
Exercise 17.22. To simplify the Euler scheme, the Brownian increments ΔW = W (t + h) − W (t) can be replaced by other random variables $\hat{Δ W}$ with moments up to order 5 that are within $O (h^{3})$ of those of ΔW. Find the moments of the three point distribution

$ℙ (\hat{Δ W} = \pm \sqrt{3 h}) = \frac{1}{6}, \quad ℙ (\hat{Δ W} = 0) = \frac{2}{3}$

and compare them with those of ΔW.
Exercise 17.23. Solve the problem of minimizing the mean-square error subject to a computational budget s as follows:

$\frac{C_{2}}{n} + C_{1}^{2} h^{2 β} \to \min \text{ subject to } \frac{n C_{3}}{h} = s,$

where h is the discretization parameter, n is the sample size, and C1, C2, C3 are positive constants. Find the optimal values of h and n. Find the order (w.r.t. s) of the optimal MSE.
Exercise 17.24. Let a random variable Y satisfy 0 ≤ m1 ≤ Y ≤ m2 < ∞ a.s. Show that the variance of Y satisfies

$Var (Y) \leq {(m_{2} - m_{1})}^{2} / 4.$

[Hint: Consider $E [{(Y - \frac{m_{1} + m_{2}}{2})}^{2}]$ .]
Exercise 17.25. Construct an importance sampling Monte Carlo method for evaluating the integral

$I = \int_{0}^{π} \dots \int_{0}^{π} \cos (1 + \sin (u_{1} ... u_{6})) \sin u_{1} ... \sin u_{6} d u_{1} \dots d u_{6} .$

Choose a nonconstant density function f so that it is close enough to the integrand and, on the other hand, can be easily sampled from.
Exercise 17.26. Suppose that the integral $I = \int_{0}^{1} e^{x} d x$ is evaluated by the Monte Carlo method using the density f(x) ∝ (1 + cx), c > 0. Find the optimal value of c so that the variance of the standard estimator $e^{X} / f (X)$, X ~ f, attains its minimum value.
Exercise 17.27. Show the advantage of using the control variate method for evaluating the integral

$I = \int_{0}^{1} \dots \int_{0}^{1} \sqrt{1 + u_{1}} ... \sqrt{1 + u_{8}} d u_{1} \dots d u_{8} .$

Use the first two terms of a Taylor series of $\sqrt{1 + u}$ to construct the control variate.
Exercise 17.28. Let the integral $I = \int_{0}^{1} \sin (\frac{π x}{2}) d x$ be evaluated using the Monte Carlo method.
1. (a) Suppose that a combination of stratified random sampling and an antithetic variate method of the form
  
  $Y = \frac{1}{2} (f (\frac{U}{2}) + f (1 - \frac{U}{2})), U ~ U n i f (0, 1),$
  
  where f(x) = sin(πx/2), is applied to evaluate I = E[Y]. What is the efficiency gain of this estimator as compared to the direct Monte Carlo estimator X = f(U), U ~ Unif(0, 1).
2. (b) How large a sample size do you need if you use the direct Monte Carlo or the antithetic stratified method of (a), respectively, in order to estimate the above integral, correct to four decimal places (i.e., the error does not exceed $10^{-4}/2$) with a confidence level of 95%?
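
The two estimators of Exercise 17.28 can be compared empirically before working out the variances by hand. The following sketch is illustrative only: it simulates X and Y from the same uniforms, reports sample means and variances, and converts each variance into a sample size through the usual CLT-based rule z·sqrt(variance/n) ≤ ε with z = 1.96 for a 95% level. The helper names are hypothetical.

```python
import math
import random
import statistics

def f(x):
    return math.sin(math.pi * x / 2.0)

def compare_estimators(n, seed=0):
    """Simulate X = f(U) and Y = (f(U/2) + f(1 - U/2)) / 2 from the same uniforms."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        u = rng.random()
        xs.append(f(u))
        ys.append(0.5 * (f(u / 2.0) + f(1.0 - u / 2.0)))
    return {
        "direct": (statistics.fmean(xs), statistics.variance(xs)),
        "combined": (statistics.fmean(ys), statistics.variance(ys)),
    }

def required_sample_size(variance, half_width, z=1.96):
    """Smallest n with z * sqrt(variance / n) <= half_width (CLT-based interval)."""
    return math.ceil(z * z * variance / (half_width * half_width))

results = compare_estimators(50_000)
for name, (mean, var) in results.items():
    n_needed = required_sample_size(var, 0.5e-4)
    print(f"{name:8s} mean = {mean:.5f}  variance = {var:.6f}  n for 1e-4/2 at 95% = {n_needed}")
```

The raw variance ratio ignores that each draw of Y costs two evaluations of f; whether to account for this depends on the definition of efficiency being used.
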
Exercise 17.29. Suppose the integral

$I = \int_{0}^{2} x^{m} \, d x \quad \text{for } m = 2, 3, \dots$

is approximated by the Monte Carlo method.
1. (a) Find the variance of the direct Monte Carlo estimator X = 2f(U), U ~ Unif(0, 2), where $f(x) = x^{m}$.
2. (b) Suppose that a combination of stratified random sampling and an antithetic variate method of the form Y = f(U) + f(2 − U), U ~ Unif(0, 2), is applied to evaluate the integral.
  1. (i) Show that I = E[Y].
  2. (ii) Find the variance Var(Y).
  3. (iii) What is the efficiency gain of this estimator as compared to the direct estimator?
  4. (iv) How does the efficiency gain behave with the increase of m?
3. (c) Suppose Z = U is used as a control variate. Give the formula of the controlled estimator. Find the variance reduction factor for the optimal controlled estimator in comparison with the direct Monte Carlo estimator. How does the efficiency gain behave with the increase of m? [Note: You are not required to compute the optimal value of the control variate parameter.]
4. (d) How large a sample size do you need if you use the direct Monte Carlo or the control variate method of (c), respectively, to approximate the above integral, correct to two decimal places (i.e., the absolute error does not exceed $10^{-2}/2$) with a confidence level of 95%?
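
Part (c) of Exercise 17.29 can be explored numerically. The sketch below is an illustration rather than a solution: it uses the controlled estimator in the form X + c(Z − E[Z]) with Z = U and E[Z] = 1, and it estimates the coefficient c = −Cov(X, Z)/Var(Z) from the same sample, which introduces a small bias that vanishes as n grows. The function name and the choice m = 2, 3, 5 are assumptions made for the demonstration.

```python
import random
import statistics

def control_variate_demo(m, n, seed=0):
    """Compare X = 2 U^m with X + c (U - 1), U ~ Unif(0, 2), using an estimated c."""
    rng = random.Random(seed)
    us = [rng.uniform(0.0, 2.0) for _ in range(n)]
    xs = [2.0 * u ** m for u in us]
    x_bar, u_bar = statistics.fmean(xs), statistics.fmean(us)
    cov = sum((x - x_bar) * (u - u_bar) for x, u in zip(xs, us)) / (n - 1)
    c = -cov / statistics.variance(us)  # sample version of -Cov(X, Z) / Var(Z)
    controlled = [x + c * (u - 1.0) for x, u in zip(xs, us)]
    return {
        "direct": (x_bar, statistics.variance(xs)),
        "controlled": (statistics.fmean(controlled), statistics.variance(controlled)),
    }

for m in (2, 3, 5):
    out = control_variate_demo(m, 50_000)
    ratio = out["direct"][1] / out["controlled"][1]
    print(f"m = {m}: exact I = {2 ** (m + 1) / (m + 1):.4f}, "
          f"direct = {out['direct'][0]:.4f}, controlled = {out['controlled'][0]:.4f}, "
          f"variance ratio = {ratio:.2f}")
```

Running the loop for several m gives an empirical view of how the variance reduction changes with m.
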
Exercise 17.30. Let R ~ Geom(p), and let $X_{1}, X_{2}, \dots$ be i.i.d. Exp(λ) random variables. Find the distribution of $S_{R} = \sum_{i = 1}^{R} X_{i}$.
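
A candidate answer to Exercise 17.30 can be tested by simulation. The sketch below makes two assumptions that the exercise does not state explicitly: R takes values in {1, 2, ...} (the number of Bernoulli(p) trials up to and including the first success), and R is independent of the X's. The parameter values are arbitrary.

```python
import random
import statistics

def geometric(p, rng):
    """Number of Bernoulli(p) trials up to and including the first success."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k

def random_sum(p, lam, rng):
    """One draw of S_R = X_1 + ... + X_R with X_i ~ Exp(lam), R ~ Geom(p)."""
    r = geometric(p, rng)
    return sum(rng.expovariate(lam) for _ in range(r))

def simulate(p, lam, n, seed=0):
    rng = random.Random(seed)
    return [random_sum(p, lam, rng) for _ in range(n)]

samples = simulate(p=0.25, lam=2.0, n=50_000)
print("sample mean      :", round(statistics.fmean(samples), 4))
print("sample variance  :", round(statistics.variance(samples), 4))
print("sample quartiles :", [round(q, 4) for q in statistics.quantiles(samples, n=4)])
```

The sample mean should be close to E[R]·E[X] = 1/(pλ); the quartiles and the variance can be compared with those of the distribution derived analytically.
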
Exercise 17.31. Construct an importance sampling Monte Carlo method for evaluating the integral

$I = \int_{0}^{1} \dots \int_{0}^{1} \ln (3 + u_{1} u_{2} \dots u_{11}) u_{1}^{2} u_{2}^{2} \dots u_{11}^{2} d u_{1} d u_{2} \dots d u_{11} .$

Choose the density function f so that it is close enough to the integrand and, on the other hand, can be easily sampled from. A constant density is not acceptable!
Exercise 17.32. Show the advantage of using the control variate method for evaluating the integral

$I = \int_{0}^{π / 2} \int_{0}^{π / 2} \dots \int_{0}^{π / 2} \sin u_{1} \sin u_{2} \dots \sin u_{5} \, d u_{1} d u_{2} \dots d u_{5} .$

Use $u_{1} u_{2} \cdots u_{5}$ to construct a control variate.
Exercise 17.33. Prove that for any pair of random variables (U, V),

$\mathrm{Var}(U) = E [\mathrm{Var}(U \mid V)] + \mathrm{Var}(E [U \mid V]).$

[Hint: Use the fact that $E[U^{2}] = E[E[U^{2} \mid V]]$ and $\mathrm{Var}(U) = E[U^{2}] - (E[U])^{2}$.]
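
The identity in Exercise 17.33 can be observed on simulated data. In the sketch below, the model for (U, V) is an arbitrary choice made for illustration: V is uniform on {0, 1, 2} and, given V = v, U is normal with mean v and standard deviation 1 + v. The conditional quantities are replaced by their within-group empirical counterparts, for which the decomposition holds up to rounding error.

```python
import random
import statistics
from collections import defaultdict

def total_variance_check(n, seed=0):
    """Return (Var(U), E[Var(U|V)], Var(E[U|V])) under the empirical distribution."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for _ in range(n):
        v = rng.randrange(3)           # V uniform on {0, 1, 2}
        u = rng.gauss(v, 1.0 + v)      # U | V = v ~ Normal(v, (1 + v)^2)
        groups[v].append(u)
    all_u = [u for g in groups.values() for u in g]
    grand_mean = statistics.fmean(all_u)
    total = statistics.pvariance(all_u)
    within = sum(len(g) / n * statistics.pvariance(g) for g in groups.values())
    between = sum(len(g) / n * (statistics.fmean(g) - grand_mean) ** 2
                  for g in groups.values())
    return total, within, between

total, within, between = total_variance_check(20_000)
print(f"Var(U) = {total:.6f}")
print(f"E[Var(U|V)] + Var(E[U|V]) = {within:.6f} + {between:.6f} = {within + between:.6f}")
```

A numerical match is not a proof, but it is a quick way to catch a sign or weighting error in one's derivation.
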
Exercise 17.34. Let $X_{1}, X_{2}, \dots, X_{n}$ be independent random variables with expected values $E[X_{i}] = μ_{i}$ and nonzero variances. Consider the following estimator of E[Y]:

$Z = Y + \sum_{i = 1}^{n} c_{i} (X_{i} - μ_{i}) .$

Show that the values of $c_{1}, c_{2}, \dots, c_{n}$ that minimize Var(Z) are

$c_{i} = - \frac{\mathrm{Cov}(Y, X_{i})}{\mathrm{Var}(X_{i})}, \quad i = 1, 2, \dots, n .$
Exercise 17.35. Show that the solution to the minimization problem

$\min_{n_{1}, \dots, n_{k}} \sum_{i = 1}^{k} \frac{p_{i}^{2} σ_{i}^{2}}{n_{i}} \quad \text{such that} \quad n_{1} + n_{2} + \dots + n_{k} = N,$

is given by

$n_{i} = N \frac{p_{i} σ_{i}}{\sum_{j = 1}^{k} p_{j} σ_{j}}, \quad i = 1, 2, \dots, k .$

[Hint: Use Lagrange multipliers.]
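
The claim of Exercise 17.35 can be checked numerically by comparing the stated allocation with many random feasible allocations. The sketch below treats the n_i as positive real numbers (no rounding to integers), and the values of p_i, σ_i, and N are made up for the demonstration.

```python
import random

def objective(p, sigma, alloc):
    """Sum of p_i^2 sigma_i^2 / n_i over the strata."""
    return sum((pi * si) ** 2 / ni for pi, si, ni in zip(p, sigma, alloc))

def stated_allocation(p, sigma, total):
    """n_i = N p_i sigma_i / sum_j p_j sigma_j, as given in the exercise."""
    weights = [pi * si for pi, si in zip(p, sigma)]
    s = sum(weights)
    return [total * w / s for w in weights]

def random_allocation(k, total, rng):
    """A random positive real allocation whose entries sum to `total`."""
    raw = [rng.expovariate(1.0) for _ in range(k)]
    s = sum(raw)
    return [total * r / s for r in raw]

p = [0.1, 0.2, 0.3, 0.4]
sigma = [2.0, 0.5, 1.0, 3.0]
N = 1000
rng = random.Random(0)

best = objective(p, sigma, stated_allocation(p, sigma, N))
smallest_gap = min(
    objective(p, sigma, random_allocation(len(p), N, rng)) - best
    for _ in range(10_000)
)
print(f"objective at stated allocation: {best:.6f}")
print(f"smallest (random - stated) gap: {smallest_gap:.6e}")
```

A nonnegative smallest gap is consistent with the stated allocation being the minimizer; the Lagrange multiplier argument is still needed to prove it.
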
Exercise 17.36. Show that the solution of

$\arg \min_{g} \mathrm{Var}_{g} \left( h (X) \frac{f (X)}{g (X)} \right), \quad X \sim g,$

where g is a PDF, is given by

$g (x) = \frac{| h (x) | f (x)}{\int_{- \infty}^{\infty} | h (x) | f (x) d x} .$

