3
Background for Markov Chain Monte Carlo

3.1 Randomization

3.1.1 Rolling Dice

It is a very old idea, even an ancient one: sometimes we lack the information needed to make informed deterministic decisions, and therefore we resort to higher authorities. It happened in ancient Egypt and China, in pre-Columbian South and Central America [12], and was well documented in Greece (e.g., the Oracle at Delphi) [6]. People used to consult the gods before making important decisions: to go to war or not to go to war, when to start harvesting, how to plan families, friendships, and strategic alliances. This communication with the higher powers was usually established through institutions of priests equipped with special knowledge and special devices for communicating with the gods.

One of the older and more venerable devices for such communication is a die or a set of dice. In its modern version, a die is a six-sided cube with numbers or symbols on each of its sides. Older versions could be five-sided (the pichica of the Incas [12]) or even two-sided (a coin). Putting aside the esoteric nature of these objects, we can also consider them to be simple and handy pre-computer era tools of randomization.

Imagine a story from the old times of tribal wars. The Chief of a tribe, before raiding the neighbors’ lands, seeks the advice of the High Priest. Priests might not talk directly to the gods, but they are capable of collecting and analyzing information. Let us say that, from a historical perspective, one of six such raids normally ends with the death of the raiders’ leader. The Chief wants a deterministic forecast of the enterprise’s success or failure, and does not want to know anything about probabilities. So the High Priest rolls a die. If the number “six” appears on the uppermost side, he predicts the Chief’s death. If he observes a number from “one” to “five,” he announces the planned raid safe. His decision is based on a single randomized experiment. A valid prediction takes into account the available statistics of the raids of the past. Moreover, additional information may be available to the High Priest: let us say, the Chief is old and weak. The Chief’s age seems to increase the chances of a lethal end to the venture. On the other hand, “old” might also mean “wise and experienced,” which increases the chances of success. The degree of risk faced may depend on many other factors, which a good High Priest should take into account. However, in order to incorporate additional information, the Priest might need not a single die, but a more sophisticated randomization tool.

Think of your own experience. Have you ever tossed a coin and made a 50-50 decision based on the results of the toss? This is how we often deal with uncertainty in situations where we need to make a definitive deterministic decision.

Computer-based randomization tools are glorified dice of modern times. With the high speeds of operation and the extensive memory of modern computing devices, we routinely apply randomized procedures to generate multiple scenarios of the future. Stochastic simulation utilizing such scenarios assists us in decision-making in such fields as business, finance, medicine, and even such a rigorous discipline as engineering.

3.1.2 Two Envelopes Revisited

Let us review the problem of two envelopes introduced in the previous chapter to demonstrate the role of prior information. To use it for a different purpose, we have to rid this problem of its applied flavor and consider a rather abstract theoretical version [4]. Suppose you are given two sealed envelopes, one contains a number A, and the other contains a number B, which is different from A. You know nothing about A and B other than that these two are real numbers (including negative, positive, and zero), and no limits are imposed on their values. Both numbers are generated by our Opponent, say, Nature, using an undisclosed procedure, and then put into the envelopes. You have no idea how these two numbers are selected.

You are allowed to choose and open one of the envelopes, and read the number inside. Then you have to decide which of the numbers is larger: the known one you see in the open envelope or the unknown one contained in the envelope which stays sealed. The tricky part of the task is: you have to guarantee that for any numbers A and B chosen by the Opponent, you can guess the larger one with probability strictly exceeding 0.5.

Let us formulate this task mathematically:

(3.1) $P(\mathrm{Guess} \mid A, B) > 0.5 \quad \text{for any numbers } A \neq B,$

where $P(\mathrm{Guess})$ denotes the probability of correctly identifying the larger number,

and then consider some simple strategies.

  • Fully Random Strategy: Always say that the number in the opened envelope is larger. Or always say that it is smaller. Or just toss a coin and do what it says. This way you will always achieve the probability of 0.5, but will never exceed it.
  • Intelligent Benchmark Strategy: Pick a number. Say, zero. Justification: Zero is “exactly in the middle of the real axis.” Use zero as your benchmark in the following decision.

    If A > 0, then choose A;

    if A < 0, then choose B.

    Notice that, for all you know (or rather do not know) about the procedure by which Nature selects A and B, for the sake of simplicity we can neglect the chance that either of the two numbers is exactly equal to 0.

    Strategy II is not bad! As we can easily see, it gives no advantage if A and B are both negative or both positive.

    (3.2) $P(\mathrm{Guess} \mid A \text{ and } B \text{ are of the same sign}) = 0.5.$

    However, if two numbers lie on different sides of zero,

    (3.3) $P(\mathrm{Guess} \mid A \text{ and } B \text{ are of opposite signs}) = 1.$

    If we introduce an additional (though unjustified) assumption that both A and B are independently equally likely to be positive or negative, then P(Guess) = 0.75, which would be really good. However, you have not achieved the goal: you still have not found a winning strategy which would work for any numbers A and B chosen by the Opponent. That means that if a smart opponent knows that you are going to use zero as the benchmark, he or she will always choose an A and a B of the same sign.

    Notice that choosing any other fixed benchmark (positive or negative) instead of zero will bring about the same result if a smart opponent steals the value of your benchmark. That corresponds to the idea that any fixed number could be considered to be positioned “exactly in the middle” of the real axis.

  • Randomized Benchmark Strategy: Pick any random real number R.

    If A > R, then choose A;

    if A < R, then choose B.

As we can see, similarly to Strategy II,

(3.4) $P(\mathrm{Guess} \mid R \notin (A, B)) = 0.5.$

However,

(3.5) $P(\mathrm{Guess} \mid R \in (A, B)) = 1.$

If the probability of R falling between A and B is positive P(R ∈ (A, B)) = p(A, B) > 0 (this should be achieved by an intelligent random choice of R),

(3.6) $P(\mathrm{Guess}) = 0.5\,(1 - p(A, B)) + p(A, B) = 0.5 + 0.5\,p(A, B) > 0.5$

for any numbers A and B that are chosen by the Opponent. The goal is attained!

We have to confess that there is still a question we have tried to avoid so far. This question is: what exactly do we need to do to pick a random real number R so that it is guaranteed to fall between any two real numbers A and B with a positive probability? Construction or generation of such random numbers is the problem we will consider now. The example above gives some motivation for the idea of randomization, which requires a more thorough analysis of random number generation. Simulation of multiple future scenarios will also require procedures for generating multiple random numbers with very specific properties.
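For readers who want to experiment, here is a minimal sketch of the randomized benchmark strategy in R. The choice of a standard normal benchmark is our assumption for illustration: any distribution whose density is positive on the entire real axis guarantees p(A, B) > 0.

  # Randomized benchmark strategy for the two-envelope problem
  A <- 3.7; B <- 5.2                         # Nature's numbers (arbitrary pair)
  guess_larger <- function(a, b) {
    opened <- if (runif(1) < 0.5) a else b   # open one envelope at random
    r <- rnorm(1)                            # random benchmark with full support
    guess <- if (opened > r) opened else setdiff(c(a, b), opened)
    guess == max(a, b)                       # TRUE if the guess is correct
  }
  mean(replicate(10000, guess_larger(A, B))) # strictly exceeds 0.5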

The following section will contain basic references to the modern tools of random number generation. We will not forget about such classical randomization tools as a roll of a die or even a coin toss. However, the computer era opens up new opportunities.

3.2 Random Number Generation

3.2.1 Pseudo-random Numbers

Virtually any modern software involved with statistics is expected to be able to generate random numbers; to be exact, pseudo-random numbers. There exist various techniques for such procedures, which can be reviewed in [15] or [13]. Microsoft Excel has a function =RAND() and R has a command runif which, when applied, activate a uniform pseudo-random number generator. In general, such a generator is an algorithm which starts from an initial value $u_0$, $0 < u_0 < 1$, and uses a transformation $D$ producing a sequence $\{u_i\} = \{D^i(u_0)\}$ of values in the interval $[0, 1]$. The transformation $D$, which we consider a black box (a secret of software developers), guarantees that for any $n$ the sequence of values $(u_1, u_2, \ldots, u_n)$ reproduces the behavior of an i.i.d. sample $(V_1, V_2, \ldots, V_n)$ of uniform random variables $V_i \sim Unif[0, 1]$. By construction, the sequence $\{u_i\}$ is anything but independent: each term is obtained deterministically from the same “seed value” $u_0$. However, this is irrelevant for our purposes as long as the resulting sequence supplies us with a uniform cover of the unit interval emulating the behavior of an independent sample. Special testing tools provide computational and statistical means to verify the performance of pseudo-random number generators, such as suggested in [16], [18], or [28].
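A minimal R illustration of this deterministic nature: restoring the seed value reproduces the “random” sequence exactly.

  set.seed(2023)           # fix the seed value u0
  u <- runif(5)            # five pseudo-random uniform values
  set.seed(2023)           # restore the same seed
  identical(u, runif(5))   # TRUE: the sequence is fully reproducible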

A uniform pseudo-random number generator solves a simple problem: simulating an i.i.d. random sample with elements $V_i \sim Unif[0, 1]$. What if we need a uniform sample on an arbitrary interval $[a, b]$? We will transform the sequence $\{u_i\}$ into the sequence $\{a + (b - a)u_i\}$. It is easy to see that this new sequence will emulate the behavior of i.i.d. variables $W_i \sim Unif[a, b]$. This follows from the fact that $W_i = a + (b - a)V_i \sim Unif[a, b]$.

Similarly to this simple linear transformation from one uniform sample to another uniform sample defined on a different interval, we can consider more general transformations. They will allow us to transform the initial uniform sample $\{u_i\}$ into samples from many other distributions. We will review some basic techniques. For more details one can review the classical sources [5, 10, 27]. For more recent developments see [2], [7], [21], and [24], and also [8] and [14] with a special accent on applications to finance and insurance.

3.2.2 Inverse Transform Method

Let us consider a distribution with continuous c.d.f. $F(x)$. Our goal is to use a uniform sample $\{u_i\}$ to build an i.i.d. sample from this distribution. For this purpose we will define for any $u \in [0, 1]$ the generalized inverse of $F(x)$ as $F^{-1}(u) = \inf\{x : F(x) \geq u\}$.

Lemma 3.1 If $U \sim Unif[0, 1]$, then $V = F^{-1}(U) \sim F$.

Proof: For any $u \in [0, 1]$ and $x \in F^{-1}([0, 1])$ it is true that $F(F^{-1}(u)) \geq u$ and $F^{-1}(F(x)) \leq x$. Therefore,

(3.7) $\{u : F^{-1}(u) \leq x\} = \{u : u \leq F(x)\}$

and

(3.8) $P(V \leq x) = P(F^{-1}(U) \leq x) = P(U \leq F(x)) = F(x).$

Thus the sequence $\{F^{-1}(u_i)\}$ provides us with an i.i.d. sample from $F(x)$.

Here are four steps for the implementation of this procedure:

  1. Write down the c.d.f. $u = F(x)$.
  2. Derive the inverse $F^{-1}(u)$.
  3. Using a standard “black box” procedure, generate a uniform sample $\{u_i\}$.
  4. Using the inverse from step 2, transform the sample from step 3 into
    (3.9) $x_i = F^{-1}(u_i) \sim F, \quad i = 1, \ldots, n.$

Let us use a popular (and important) example of exponential distribution.

Exponential Distribution

To construct an i.i.d. sample from the exponential distribution with density function $f(x) = \lambda e^{-\lambda x}$, $0 \leq x < \infty$, with parameter $\lambda > 0$, we can follow the steps as demonstrated in Figure 3.1:


Figure 3.1 Sampling from exponential distribution with λ = 2.

  1. $F(x) = \int_0^x \lambda e^{-\lambda t}\,dt = 1 - e^{-\lambda x}$.
  2. $u = 1 - e^{-\lambda x} \Rightarrow e^{-\lambda x} = 1 - u \Rightarrow -\lambda x = \ln(1 - u) \Rightarrow x = F^{-1}(u) = -\lambda^{-1}\ln(1 - u)$.
  3. Using the standard procedure, generate a uniform sample $(u_1, u_2, \ldots, u_n)$.
  4. Using the inverse from step 2, transform the sample from step 3 into
    $x_i = -\lambda^{-1}\ln(1 - u_i) \sim Exp(\lambda), \quad i = 1, \ldots, n.$
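A minimal R implementation of these four steps (equivalently, one may use $-\lambda^{-1}\ln u_i$, since $1 - U$ is also uniform):

  lambda <- 2
  n <- 1000
  u <- runif(n)                # step 3: uniform sample
  x <- -log(1 - u) / lambda    # step 4: inverse transform
  c(mean(x), 1 / lambda)       # sample mean vs. theoretical mean 1/lambda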

3.2.3 General Transformation Methods

The inverse transform method is not easy to carry out for all distributions of interest. The usual problem is with step 2: obtaining an analytical form of the inverse c.d.f. However, from Chapter 1 we know that there exist some relationships between distribution families facilitating our task. For instance, gamma distribution is related to the exponential (see Section 1.4), and a similar relationship exists between beta and gamma distributions. More details on relationships between different distribution functions may be found in [1]. Being able to efficiently generate samples from exponential distribution using the inverse transform is not the end of the road. It also helps to deal with some other distributions. Simple examples of random number generation using transformation methods are summarized on the companion website in the Appendices in an Excel file pseudorandom.xlsx. For examples in R see [25].

Gamma and Beta Distributions for Integer Parameters

Suppose we can generate samples from the exponential distribution: $\{x_i\} \sim Exp(\lambda)$. We will use the fact that for integer $\alpha$ and independent $X_j \sim Exp(\lambda)$

$Y = \sum_{j=1}^{\alpha} X_j \sim Gamma(\alpha, \lambda)$

(see Section 1.4), so we can:

  1. Generate $\alpha$ different samples from the exponential: $\{x_{ij}\} \sim Exp(\lambda)$, $j = 1, \ldots, \alpha$, $i = 1, \ldots, n$.
  2. Set $y_i = \sum_{j=1}^{\alpha} x_{ij} \sim Gamma(\alpha, \lambda)$ for all $i = 1, 2, \ldots, n$.

Similarly, for integer α and β we can also use (1.27) from Section 1.5 and do the following:

  1. Generate $\alpha + \beta$ samples $\{x_{ij}\} \sim Exp(1)$.
  2. Set $y_i = \sum_{j=1}^{\alpha} x_{ij} \Big/ \sum_{j=1}^{\alpha+\beta} x_{ij} \sim Beta(\alpha, \beta)$ for all $i = 1, 2, \ldots, n$, as sketched below.
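A minimal R sketch of both constructions; the helper names rgamma_int and rbeta_int are ours, and the parameters are assumed to be integer:

  # Gamma(alpha, lambda) as a sum of alpha i.i.d. Exp(lambda) variables
  rgamma_int <- function(n, alpha, lambda) {
    x <- matrix(-log(runif(n * alpha)) / lambda, nrow = n)
    rowSums(x)
  }
  # Beta(alpha, beta) as a ratio of sums of i.i.d. Exp(1) variables
  rbeta_int <- function(n, alpha, beta) {
    x <- matrix(-log(runif(n * (alpha + beta))), nrow = n)
    rowSums(x[, 1:alpha, drop = FALSE]) / rowSums(x)
  }
  y <- rbeta_int(10000, 19, 15)
  c(mean(y), 19 / 34)   # compare with the theoretical mean alpha/(alpha+beta)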

Normal Distribution (Box–Muller Algorithm)

The following algorithm is one of the most popular methods to generate samples from the standard normal distribution. This or similar methods are built into many standard software packages. We make use of the following representation of the bivariate standard normal distribution (zero mean and unit covariance matrix) with two independent components: for independent $U_1, U_2 \sim Unif[0, 1]$,

$X_1 = \sqrt{-2\ln U_1}\,\cos(2\pi U_2), \qquad X_2 = \sqrt{-2\ln U_1}\,\sin(2\pi U_2)$

and then apply a two-step procedure:

  1. Generate $(u_{i1}, u_{i2}) \sim Unif[0, 1]$, $i = 1, \ldots, n$.
  2. Define for all $i = 1, 2, \ldots, n$
    (3.10) $x_{i1} = \sqrt{-2\ln u_{i1}}\,\cos(2\pi u_{i2}), \quad x_{i2} = \sqrt{-2\ln u_{i1}}\,\sin(2\pi u_{i2}).$
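A minimal R sketch of the Box–Muller algorithm:

  box_muller <- function(n) {
    u1 <- runif(n); u2 <- runif(n)
    r <- sqrt(-2 * log(u1))                        # random radius
    c(r * cos(2 * pi * u2), r * sin(2 * pi * u2))  # combined sample of size 2n
  }
  z <- box_muller(5000)
  c(mean(z), sd(z))        # should be close to 0 and 1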

A little extra work, but as a result we obtain two independent samples from the standard normal distribution (or one larger sample of size 2n). This is our first illustration of using a higher-dimensional technique to solve an essentially one-dimensional problem. Such examples are common in the implementation of Monte Carlo methods.

Discrete Distributions on Finite Sample Spaces

Suppose we need to generate data from a discrete distribution. For the sake of simplicity, assume that all possible values are integers and there is a finite number of them, with probabilities $p_k = P(X = k)$, $k = 1, 2, \ldots, m$.

  1. Generate a sample $(u_1, u_2, \ldots, u_n) \sim Unif[0, 1]$.
  2. Assign for all $i = 1, 2, \ldots, n$

    $x_i = 1$, if $0 \leq u_i < p_1$,

    $x_i = 2$, if $p_1 \leq u_i < p_1 + p_2$,

    $x_i = 3$, if $p_1 + p_2 \leq u_i < p_1 + p_2 + p_3$,

    $\ldots$

    $x_i = m$, if $1 - p_m \leq u_i < 1$.

The sample $(x_1, \ldots, x_n)$ has the desired discrete distribution. This generator will also work for any noninteger values, but for this construction it is important that the sample space is finite.
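A minimal R sketch (findInterval locates each $u_i$ among the cumulative probabilities; R's built-in sample() with the prob argument does the same job):

  rdiscrete <- function(n, p) {
    u <- runif(n)
    findInterval(u, cumsum(p)) + 1   # k such that p1+...+p(k-1) <= u < p1+...+pk
  }
  p <- c(0.2, 0.5, 0.3)
  table(rdiscrete(10000, p)) / 10000 # empirical vs. theoretical frequencies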

Two final examples in this subsection are less straightforward. They require more work and are not always very efficient. Moreover, the performance of these two capricious random number generators depends on circumstances and cannot be fully predicted in advance.

Poisson Distribution

In a Poisson process, times between two consecutive events are distributed exponentially, as mentioned in Section 1.4. It can be expressed as the following statement: for i.i.d. $X_j \sim Exp(\lambda)$,

$K = \max\Big\{k : \sum_{j=1}^{k} X_j \leq 1\Big\} \sim Poisson(\lambda),$

indicating that one needs to add up a random Poisson number of exponential variables to get to 1. Then we can

  1. Generate exponential variables $x_{i1}, x_{i2}, \ldots$ and add them up until $\sum_{j=1}^{k+1} x_{ij} > 1$.
  2. Then the count $k$ which takes you to 1 ($k + 1$ takes you over) is the value of $y_i \sim Poisson(\lambda)$.
  3. Repeat for $i = 1, 2, \ldots, n$.
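A minimal R sketch of this procedure (the helper name rpois_exp is ours):

  rpois_exp <- function(n, lambda) {
    sapply(seq_len(n), function(i) {
      s <- 0; k <- -1
      while (s <= 1) {                   # add exponentials until the sum passes 1
        s <- s - log(runif(1)) / lambda  # one more Exp(lambda) summand
        k <- k + 1
      }
      k                                  # count that still fits below 1
    })
  }
  y <- rpois_exp(10000, 3)
  c(mean(y), var(y))                     # both close to lambda = 3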

This procedure may require an unlimited number of summands in step 1. It becomes prohibitively complicated for large values of the parameter λ. Therefore this sampling procedure, though elementary, is rarely used for practical purposes. Methods similar to those discussed in the following subsection are more common.

Gamma and Beta for Arbitrary Parameters

If the parameters of beta distribution are not necessarily integer, one can use a general representation: for independent $U, V \sim Unif[0, 1]$,

$\frac{U^{1/\alpha}}{U^{1/\alpha} + V^{1/\beta}} \sim Beta(\alpha, \beta),$

which holds conditionally on $U^{1/\alpha} + V^{1/\beta} \leq 1$. Then the procedure is clear, but the additional condition has to be checked.

  1. Generate independent uniform variables $(u_i, v_i) \sim Unif[0, 1]$.
  2. If $u_i^{1/\alpha} + v_i^{1/\beta} \leq 1$, assign $y_i = u_i^{1/\alpha} \big/ \big(u_i^{1/\alpha} + v_i^{1/\beta}\big)$. Otherwise, skip the $i$th element.
  3. Repeat for $i = 1, 2, \ldots, n$.

The qualifying condition in the first step has to be checked for each $i$, and if it does not hold, the $i$th values from the uniform samples are wasted. Effectively, it may bring about much smaller samples from beta distribution. If we can sample from beta, we can also sample from gamma with $\alpha < 1$ using the following property (see Section 1.5, (1.28), also [1]): for independent $Y \sim Beta(\alpha, 1 - \alpha)$ and $Z \sim Exp(1)$,

$ZY \sim Gamma(\alpha, 1).$

The size of the beta sample will also determine the size of the gamma sample.
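A minimal R sketch of this beta generator and the subsequent gamma step; the helper names are ours, and the resulting sample size is random due to rejections:

  # Beta(alpha, beta) for arbitrary positive parameters
  rbeta_ar <- function(n, alpha, beta) {
    u <- runif(n)^(1 / alpha)
    v <- runif(n)^(1 / beta)
    keep <- (u + v) <= 1             # the qualifying condition
    u[keep] / (u[keep] + v[keep])
  }
  # Gamma(alpha, 1) for alpha < 1 via the product of beta and exponential
  rgamma_small <- function(n, alpha) {
    y <- rbeta_ar(n, alpha, 1 - alpha)
    y * rexp(length(y))              # one Exp(1) factor per accepted value
  }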

Wasting time and effort in the first step to generate values which are then rejected in the second step seems undesirable. However, the idea of algorithms rejecting some of the generated values will soon prove to be very productive.

3.2.4 Accept–Reject Methods

If simulation from a distribution is difficult or impossible to organize using the method of inverse transform and generators suggested in the previous subsection, we might use a different trick based on the idea that sometimes a problem which is difficult to solve in one dimension can be reduced to a simpler problem in two dimensions. Let us establish the following theorem which, following Robert and Casella [24], we will characterize as the fundamental theorem of simulation.

Suppose that the distribution of interest has p.d.f. $f(x)$ (target density). Let us define a uniform coverage $\{(x_i, y_i)\}$, $i = 1, \ldots, n$, of a region $A \subset \mathbb{R}^2$ as such a set of independently chosen points in $\mathbb{R}^2$ that for any $B \subseteq A$ the probability $P((x_i, y_i) \in B) = \mathrm{Area}(B)/\mathrm{Area}(A)$, and denote $\{(x_i, y_i)\} \sim Unif(A)$. Then the following statement holds:

Theorem 3.1 If a uniform coverage $\{(x_i, u_i)\} \sim Unif(S)$ is generated for the region under the graph (subgraph) of the target density, $S = \{(x, u) : 0 < u < f(x)\}$, the first coordinates of the coverage form an i.i.d. sample from the distribution with target density, $x_i \sim f(x)$, and for any $x_i$ such that $f(x_i) > 0$ and is finite, the second coordinates are uniform, $u_i \sim Unif[0, f(x_i)]$.

Proof: The idea of the proof consists in representing the target density as $f(x) = \int_0^{f(x)} du$. Then it can be interpreted as the marginal density of the joint distribution of $(X, U) \sim Unif(S)$.

Thus the problem of generating a sample from $X \sim f(x)$ is equivalent to the problem of generating a uniform coverage, or a sample from a two-dimensional uniform distribution. How can adding one dimension simplify the task of sampling from $f(x)$? Direct generation of a uniform coverage over a curvilinear area may not be straightforward. However, in the case when the target density has a finite domain $[a, b]$ and is bounded from above, $\forall x \in [a, b]\;\; f(x) \leq M$, this curvilinear area is fully contained in the rectangle $[a, b] \times [0, M]$ and we can suggest the following two-step accept–reject procedure:

  1. Generate uniform samples $x_i \sim Unif[a, b]$, $u_i \sim Unif[0, M]$, $i = 1, 2, \ldots, n$.
  2. For all $i = 1, 2, \ldots, n$ check an additional condition:

    if $u_i < f(x_i)$ then assign $y_i = x_i$ (accept),

    if $u_i \geq f(x_i)$ then skip (reject).

The resulting sample $\{y_i\}$ may be of a smaller size due to possible rejections, but each of its random elements $\{Y_i\}$ will have the target density $f(x)$ since

(3.11) $P(Y_i \leq y) = P(X_i \leq y \mid U_i < f(X_i)) = \frac{\int_a^y \int_0^{f(x)} du\,dx}{\int_a^b \int_0^{f(x)} du\,dx} = \int_a^y f(x)\,dx.$

Beta Distribution

Beta distribution has a finite domain $[0, 1]$ and for $\alpha \geq 1$, $\beta \geq 1$ is uniformly bounded from above, so it satisfies all the requirements of the above accept–reject construction. We can set $a = 0$, $b = 1$ and determine $M$ as any upper bound for the $Beta(\alpha, \beta)$ density. The initial sample from the two-dimensional uniform distribution is shown by circles in Figure 3.2.


Figure 3.2 Uniform sample and Beta(19,15) p.d.f.

The resulting sample $\{y_i\}$ will have a random size $K$ depending on the number of accepted values $x_i$. It is shown by filled circles in Figure 3.3, while rejected points from the previous picture stay unfilled.


Figure 3.3 Accept–reject sampling from beta distribution.

Now we can address two related questions: first, how large or how small K is; and second, how important is the tightness of bound M? Both questions can be addressed by calculating the probability of acceptance and hence the expected value of a binomial random variable K:

(3.12) $P(\text{accept}) = P(U_i < f(X_i)) = \frac{\int_a^b f(x)\,dx}{(b - a)M} = \frac{1}{(b - a)M}, \qquad E(K) = \frac{n}{(b - a)M}.$

The tightness of the upper bound determines the relative loss in the resulting sample size with respect to the initial sample. Therefore, the closer $M$ is to the maximum of the target density, the better. In the special case of $Beta(19, 15)$ discussed in Chapter 2, we can calculate the maximum attained at $\theta = (\alpha - 1)/(\alpha + \beta - 2) = 18/32 = 0.5625$ as $f(0.5625) \approx 4.654$, meaning that the sample size drops from $n$ to $K$ on average by more than four times. In other words, in order to generate a resulting sample of the needed size, one has to take at least four times as many elements of the initial sample. It looks like a bad deal, but this is still a vast improvement relative to the method suggested in the previous subsection, which requires generating $\alpha + \beta = 34$ elements in order to obtain just one element of a beta sample.

Let us provide a short numerical example with n = 3. In Table 3.1 we draw three points xi uniformly from 0 to 1, calculate the values of Beta(19, 15) p.d.f. at these points, f(xi), and compare these values for each i = 1, 2, 3 with the draws ui from the uniform distribution on [0, M = 4.654]. If ui < f(xi), then accept and record yi = xi, otherwise reject.

Table 3.1 Accept–reject method for Beta(19,15)

i   x_i    f(x_i)    u_i    y_i
1   0.08   8.8E−11   0.63   Reject
2   0.55   4.63      3.59   0.55
3   0.89   4.17E−5   0.37   Reject
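A minimal R sketch reproducing the accept–reject sampler for Beta(19, 15):

  set.seed(1)
  n <- 10000; M <- 4.654         # upper bound for the Beta(19,15) density
  x <- runif(n)                  # first coordinates, uniform on [0, 1]
  u <- runif(n, 0, M)            # second coordinates, uniform on [0, M]
  y <- x[u < dbeta(x, 19, 15)]   # accept points under the density graph
  length(y) / n                  # acceptance rate, roughly 1/M = 0.215
  c(mean(y), 19 / 34)            # compare with the exact mean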

General Case

The requirements of finite domain and uniform upper bound can be relaxed if there exists a density $g(x)$ such that for all $x$ there exists a bound $M$ (not necessarily tight) such that $f(x) \leq M g(x)$. In this case the following modification of the accept–reject algorithm is suggested.

  1. Generate samples $x_i \sim g(x)$, $u_i \sim Unif[0, 1]$, $i = 1, 2, \ldots, n$.
  2. For all $i = 1, 2, \ldots, n$ check an additional condition:

    if $u_i \leq \dfrac{f(x_i)}{M g(x_i)}$ then assign $y_i = x_i$ (accept),

    if $u_i > \dfrac{f(x_i)}{M g(x_i)}$ then skip (reject).

The resulting sample $\{y_i\}$ from the target density may be of a smaller size, which depends on the tightness of the bound $M$.

Let us consider a familiar example of gamma distribution. Suppose we need to generate a sample from Gamma(4, 2), with density

(3.13) $f(x) = \frac{2^4}{\Gamma(4)}\,x^3 e^{-2x} = \frac{8}{3}\,x^3 e^{-2x}, \quad x \geq 0,$

which is not hard to do using the methods from Subsection 3.2.3. However, if we choose the accept–reject technique instead, it will require a choice of instrumental density $g(x)$ different from the uniform: we will need an instrumental distribution defined on $(0, \infty)$, which will also “dominate” the tail of $f(x)$ in the sense that there exists a constant $M$ such that for all $x \geq 0$

(3.14) $f(x) \leq M g(x).$

Fortunately, such a choice is easy to make: exponential distribution $Exp(1)$ with density $g(x) = e^{-x}$ is easy to generate from, the ratio $f(x)/g(x) = \frac{8}{3}x^3 e^{-x}$ is conveniently maximized at $x = 3$, and inequality (3.14) is satisfied with $M = 72e^{-3} \approx 3.585$. The algorithm follows:

  1. Generate two independent samples $u_i, v_i \sim Unif[0, 1]$, $i = 1, 2, \ldots, n$.
  2. Transform $x_i = -\ln(1 - u_i)$ creating a sample $x_i \sim Exp(1)$.
  3. For all $i = 1, 2, \ldots, n$ check an additional condition:

    if $v_i \leq \dfrac{f(x_i)}{M g(x_i)}$ then assign $z_i = x_i$ (accept), otherwise skip (reject).
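A minimal R sketch of this algorithm:

  set.seed(1)
  n <- 10000
  M <- 72 * exp(-3)                     # bound on f(x)/g(x), approx. 3.585
  u <- runif(n); v <- runif(n)
  x <- -log(1 - u)                      # instrumental sample from Exp(1)
  f <- dgamma(x, shape = 4, rate = 2)   # target Gamma(4, 2) density
  g <- exp(-x)                          # instrumental Exp(1) density
  z <- x[v <= f / (M * g)]              # accept with probability f/(M g)
  length(z) / n                         # acceptance rate, roughly 1/M = 0.28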

Figure 3.4 demonstrates a sample covering the subgraph of $M g(x)$, with accepted points $z_i$ depicted as filled circles and rejected points staying unfilled. Simple examples of applications of the accept–reject technique in Microsoft Excel are offered in acceptreject.xlsx.


Figure 3.4 Accept–reject sampling from gamma distribution.

3.3 Monte Carlo Integration

Random number generation developed in Section 3.2 is not the ultimate goal. It will be used as a tool assisting with problems of estimating distribution moments, estimating probabilities, or simulating multiple scenarios of the future. All these problems can be better understood in the context of numerical integration.

3.3.1 Numerical Integration

Numerical integration is the last resort when an integral cannot be evaluated analytically and closed form solutions are not available. There exists a host of techniques which can be reduced to the following simple scheme. Suppose that $D \subset \mathbb{R}$ is a subset of the real axis, $g(x)$ is a real-valued function, and our goal is to evaluate the definite integral

$I = \int_D g(x)\,dx.$

In order to define the procedure of numerical integration we need to define a partition of $D$ into nonoverlapping subintervals $\Delta_i$, $i = 1, 2, \ldots, n$, a choice of points $x_i^* \in \Delta_i$, and weights $w_i$ with $\sum_{i=1}^n w_i = 1$. Then integral $I$ can be approximated by

$I_n = \sum_{i=1}^n w_i\,g(x_i^*).$

The art of numerical integration consists in choosing the partition, points inside the intervals, and weights in such a way that In becomes a good approximation of I without requiring too many subintervals. Smart choices bring about good results for relatively small partition size n.

Suppose now that the integral of interest has the form

(3.15) $I = \int_D h(x)\,f(x)\,dx,$

where $f(x)$ is a p.d.f. of a certain distribution. This representation requires a factorization of the integrand $g(x)$ into two factors: $h(x)$ and a p.d.f. $f(x)$. This factorization is not always obvious and does not have to be unique. However, in order to use the following method of numerical integration, first of all we have to represent the integral of interest in the form (3.15), which requires us to explicitly define both factors $h(x)$ and $f(x)$ forming the integrand $g(x) = h(x)f(x)$.

Monte Carlo, the historical capital of the gambling industry, has a lot to do with random choices; that is where the name comes from. The idea of the Monte Carlo method, instead of trying to find a smart choice of points for numerical integration, is to choose the points randomly. It is interesting how the historical development of Monte Carlo methods is intertwined with the development of modern physics, and in particular with such applications as A-bomb and H-bomb development. As we have mentioned before, the idea of Monte Carlo simulation is not new, but in its modern form its development is closely tied to the emergence of the first computer, ENIAC, in 1946. The Monte Carlo approach was discussed and developed for computer implementation by such renowned scholars as Stanislaw Ulam and John von Neumann and christened, probably, by the physicist Nicholas Metropolis. The paper of Metropolis and Ulam [19] was the first publication presenting the method.

The Monte Carlo method works in two steps.

  1. Generate an i.i.d. random sample $\mathbf{x} = (x_1, \ldots, x_n) \sim f(x)$.
  2. Calculate
    $\hat{I}_n = \frac{1}{n}\sum_{i=1}^{n} h(x_i).$

The justification of the method is simple: $\hat{I}_n$ is the sample mean of the function $h(X)$, while integral $I$ is the distribution mean $E_f h(X)$, where the subscript $f$ indicates the distribution of $X$. Then the Law of Large Numbers (LLN) under very general conditions implies

(3.16) $\hat{I}_n = \frac{1}{n}\sum_{i=1}^{n} h(x_i) \longrightarrow E_f h(X) = I,$

where the exact form of convergence for $n \to \infty$ (in probability, almost sure, etc.) requires its own qualifying conditions. The random variable $\hat{I}_n$ by construction has variance

(3.17) $Var(\hat{I}_n) = \frac{1}{n}\,Var_f[h(X)] = \frac{1}{n}\left(\int_D h^2(x)\,f(x)\,dx - I^2\right),$

which can be estimated from the sample by

(3.18) $v_n = \frac{1}{n^2}\sum_{i=1}^{n}\big(h(x_i) - \hat{I}_n\big)^2,$

and the Central Limit Theorem (CLT) will also apply:

$\frac{\hat{I}_n - I}{\sqrt{v_n}} \longrightarrow N(0, 1).$

This will allow us to estimate the error of approximation in (3.16). Exact conditions for the LLN and CLT to apply are discussed, for instance, in [24] and [14], and will not be of our concern for now.

3.3.2 Estimating Moments

Monte Carlo numerical integration can clearly help to estimate distribution moments. Our first example, though, will be a textbook integral with no analytical solution

numbered Display Equation

We can treat this integral as an integral of the form (3.15) with $D = [0, 1]$, $h(x)$ equal to the integrand above, and $f(x) \sim Unif[0, 1]$. The algorithm is then clear:

  1. Generate an i.i.d. random sample $\mathbf{u} = (u_1, \ldots, u_n) \sim Unif[0, 1]$.
  2. Calculate $\hat{I}_n = \frac{1}{n}\sum_{i=1}^{n} h(u_i)$.

Let us do it for n = 100. We repeat this experiment three times independently, and our answers are . In this case we cannot obtain the closed form solution, therefore it is difficult to judge the quality of Monte Carlo approximation.

Our second example will involve θ ∼ Beta(19, 15), which has been discussed before. Our goal will be to estimate the mean value E(θ). Let us pretend for a while that we do not know that the exact solution can be obtained from (2.25) as E(θ) = α/(α + β) = 19/34 ≈ 0.5588. The benefit of applying a numerical algorithm to a problem with the known exact solution is the ability to obtain the immediate estimate of the error.

If we represent $E(\theta)$ in the form (3.15) with $X = \theta$, $D = [0, 1]$, $h(\theta) = \theta$, and $f(\theta) \sim Beta(19, 15)$, the Monte Carlo method will work as follows:

  1. Generate an i.i.d. random sample $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n) \sim Beta(19, 15)$. Several ways to do it were discussed in Section 3.2.
  2. Calculate $\hat{E}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \theta_i$.

If we use, for instance, the accept–reject method from the previous section, which can be easily implemented in Microsoft Excel (see Exercises), we obtain for samples n = 100 repeated three times the values . The discrepancy between the exact and numerical solutions is not huge, but it still reveals an unpleasant truth: the convergence of the Monte Carlo method is often painfully slow, and even in a relatively simple example n = 100 is not enough to establish good precision.
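A minimal R check of this experiment; for brevity we use R's built-in rbeta in place of the accept–reject sampler:

  set.seed(1)
  theta <- rbeta(100, 19, 15)  # or an accept-reject sample from Section 3.2
  mean(theta)                  # Monte Carlo estimate of E(theta) = 0.5588...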

3.3.3 Estimating Probabilities

If a probability of interest can be expressed as an integral, the Monte Carlo method can help with its estimation. Let us introduce the indicator function of a set $A \subseteq D$ as

$I_A(x) = \begin{cases} 1, & x \in A, \\ 0, & x \notin A, \end{cases}$

where $X$ is a random variable with p.d.f. $f(x)$. Then

(3.19) $P(X \in A) = \int_D I_A(x)\,f(x)\,dx = E_f[I_A(X)],$

and assuming $h(x) = I_A(x)$, probability (3.19) can be estimated by a sample $\mathbf{x} = (x_1, \ldots, x_n)$ as

(3.20) $\hat{P}(X \in A) = \frac{1}{n}\sum_{i=1}^{n} I_A(x_i),$

the proportion of sample points hitting set A.

Number π

Let us inscribe a circle $C$ with center at $(0, 0)$ and radius 1 into the square $S = [-1, 1] \times [-1, 1]$. The probability that a point randomly thrown onto $S$ lands inside the circle is

$P(C) = \frac{\mathrm{Area}(C)}{\mathrm{Area}(S)} = \frac{\pi}{4}.$

Then if we generate a random sample uniformly covering $S$ and estimate probability $P(C)$ by $\hat{P}(C) = n_C/n$, the proportion of sample points landing inside the circle, we can also obtain an estimate for $\pi$: $\hat{\pi} = 4\hat{P}(C)$. The exact sequence of actions is:

  1. Generate a paired sample $\{(u_i, v_i)\} \sim Unif[0, 1]^2$.
  2. Transform it into $\{(x_i, y_i)\} \sim Unif(S)$ setting $x_i = 2u_i - 1$ and $y_i = 2v_i - 1$.
  3. Count all points inside the circle by checking the condition $x_i^2 + y_i^2 \leq 1$. This count $n_C$ will provide $\hat{P}(C) = n_C/n$.
  4. The estimate of the number $\pi$ is $\hat{\pi} = 4n_C/n$.
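A minimal R sketch of this estimate:

  set.seed(1)
  n <- 100
  u <- runif(n); v <- runif(n)
  x <- 2 * u - 1; y <- 2 * v - 1   # uniform points on the square S
  4 * mean(x^2 + y^2 <= 1)         # 4 x proportion inside the circle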

In three experiments with n = 100 in each we obtained . The precision of estimation is worse than in the previous examples. This reflects a general rule: higher-dimensional simulation requires larger sample sizes.

Tails of Normal Distribution

One of the biggest inconveniences of probability theory is that the integral defining the c.d.f. of the standard normal distribution

$\Phi(y) = \int_{-\infty}^{y} \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx$

cannot be evaluated analytically. We can rely on tables or black box computer software. But we can also approximate $\Phi(y)$ using the Monte Carlo approach with $h(x) = I_{(-\infty, y)}(x)$:

  1. Generate a sample $\mathbf{x} = (x_1, \ldots, x_n) \sim \varphi(x)$.
  2. Count all points $x_i$ such that $x_i \leq y$.
  3. Estimate $\Phi(y)$ using $\hat{\Phi}(y) = \#\{i : x_i \leq y\}/n$.

3.3.4 Simulating Multiple Futures

The Black–Scholes formula is a classical Nobel Prize winning result in mathematical finance. It provides a closed form solution to the problem of pricing European options. Let us define a European call option as the right to buy (or not to buy) a share of a certain stock at a designated future time $T$ (exercise time) for a fixed price $K$ (strike price). An option is “in the money” and should be exercised (the right to buy used) if at time $T$ the stock price $S_T$ exceeds $K$. The option owner’s gain is $S_T - K$, the difference between the future market value of the stock and its strike price. If however $S_T \leq K$, there is no point in exercising the option, so the option’s owner lets the option expire. The value of the call option $C_T$ at time $T$ is determined as $C_T = \max(S_T - K,\, 0) = (S_T - K)_+$.

The present value of the option $C_0$ at time 0 should be equal to the discounted expected value of the call at time $T$, where $r$ is the risk-free interest rate assumed in the calculation:

$C_0 = e^{-rT} E(C_T) = e^{-rT} E\big[(S_T - K)_+\big].$

This expectation is not easy to evaluate. The Black–Scholes formula does exactly that under certain additional assumptions. The most important of these assumptions is the “lognormal” model for the price of the underlying stock

(3.21) $S_T = S_0 \exp\Big\{\Big(\mu - \frac{\sigma^2}{2}\Big)T + \sigma\sqrt{T}\,X\Big\},$

where $\mu$ is the expected return per year, $\sigma$ is the measure of spread in prices known as volatility, and $X$ is a standard normal variable modeling uncertainty in future stock prices.

An alternative to the Black–Scholes formula is Monte Carlo simulation. For this simulation we will need to:

  1. Generate multiple future scenarios $\mathbf{x} = (x_1, \ldots, x_n) \sim \varphi(x) \sim N(0, 1)$.
  2. Calculate the corresponding $n$ values $S_T(x_i)$.
  3. Evaluate $h(x_i) = (S_T(x_i) - K)_+$.
  4. Use $\bar{h} = \frac{1}{n}\sum_{i=1}^{n} h(x_i)$ to estimate $E_\varphi h(X)$.
  5. Estimate the present value of the option as $\hat{C}_0 = e^{-rT}\,\bar{h}$.

For a numerical example, consider S0 = K = 50, T = 0.5 (years), μ = r = 0.05, and σ = 0.3. The Black–Scholes formula yields C0 = 4.817. Running n = 100 random future scenarios three times gives us . These values are not exactly close. See MCintegration.xlsx for a simple Microsoft Excel implementation of this procedure. What is the practical point of using Monte Carlo for option price calculation? First of all, we can use larger sample sizes, n = 10,000 or even n = 10^6 or more. A modern computer will do it in a split second. However, we cannot ever beat the exact solution, can we?
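A minimal R sketch of the five steps above, using the same parameters; with a large n the estimate approaches the Black–Scholes value 4.817:

  set.seed(1)
  S0 <- 50; K <- 50; T <- 0.5; r <- 0.05; sigma <- 0.3
  n <- 1e6
  x  <- rnorm(n)                                 # future scenarios
  ST <- S0 * exp((r - sigma^2 / 2) * T + sigma * sqrt(T) * x)
  exp(-r * T) * mean(pmax(ST - K, 0))            # discounted expected payoff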

Unfortunately for the Black–Scholes model, it has its own limitations. Its rise and fall in modern finance is related to the assumptions of the formula being too stringent and often violated in today’s markets. One advantage of the Monte Carlo approach is its additional flexibility w.r.t. the stock price model. If the lognormal model (3.21) is modified to allow for variable volatility and nonnormal returns, it is relatively easy to reflect these modifications in Monte Carlo calculations. The Black–Scholes formula is much harder to modify.

All the examples from this section were performed and are replicated on the book website in Microsoft Excel format in MCintegration.xlsx. All of them share one common feature: The sample sizes we suggest for illustration purposes do not guarantee a high precision of the Monte Carlo method. One way out of this bind would be to use brute force and dramatically increase the sample sizes. Another possible way out is to analyze the precision in terms of variance of Monte Carlo estimates and to recommend some means of variance reduction.

3.4 Precision of Monte Carlo Method

In this section we address the precision of Monte Carlo integration; in other words, the rate of convergence in (3.16). First we discuss the ways to measure precision and then suggest some methods of its improvement.

3.4.1 Monitoring Mean and Variance

As discussed in the previous section, the main idea of Monte Carlo integration is to generate a random sample $\mathbf{x} = (x_1, \ldots, x_n)$ and estimate integral (3.15) via the sample mean

(3.22) $\hat{I}_n = \frac{1}{n}\sum_{i=1}^{n} h(x_i).$

The simplest way to see how good the estimate is and to observe its improvement with the increase of the sample size is to look at the accumulated sample means $\hat{I}_k$, $k = 1, \ldots, n$. Figure 3.6 demonstrates cumulative sample means with $n = 1000$ for the case of $h(x) = (\cos(50x) + \sin(20x))^2$ and $f(x) \sim Unif[0, 1]$, estimating the area under the graph in Figure 3.5 as a trigonometric integral


Figure 3.5 Graph of h(x).

(3.23) $I = \int_0^1 (\cos(50x) + \sin(20x))^2\,dx.$

Figure 3.6 Cumulative sample means.

We do not need to know the exact value of this integral in order to see that the trajectory of sample means converges somewhere in the vicinity of 1. The variance $Var(\hat{I}_n)$, when it can be easily calculated, clearly measures the precision of estimation, and we will use this approach when we can. However, in the absence of convenient formulas for the variance, one can also estimate it using the accumulated sample squared errors $v_k$, defined for $k = n$ in (3.18) and available directly from the data. Figure 3.7 demonstrates accumulated sample variances with $n = 1000$.


Figure 3.7 Estimated variance of the estimated means.

If the Central Limit Theorem can be applied, we can obtain approximate 95% confidence intervals for $I$ as $\hat{I}_k \pm 1.96\sqrt{v_k}$, as can be seen in Figure 3.8, where the dotted lines correspond to the upper and lower bounds of the interval.


Figure 3.8 Confidence intervals for means.

An alternative method to monitor variance and precision suggests running multiple samples in parallel. The graph of three accumulated means for independent samples of the same size 1000 is demonstrated in Figure 3.9.


Figure 3.9 Multiple trajectories for estimated means.

When all three trajectories visually converge, it is a good indication of precision improving with sample size.
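A minimal R sketch producing the trajectories of Figures 3.6–3.8:

  set.seed(1)
  n <- 1000
  h <- function(x) (cos(50 * x) + sin(20 * x))^2
  hx <- h(runif(n))
  I_hat <- cumsum(hx) / (1:n)                      # cumulative sample means
  v <- (cumsum(hx^2) - (1:n) * I_hat^2) / (1:n)^2  # running estimates v_k
  plot(I_hat, type = "l", xlab = "k", ylab = "estimate")
  lines(I_hat + 1.96 * sqrt(v), lty = 3)           # upper confidence bound
  lines(I_hat - 1.96 * sqrt(v), lty = 3)           # lower confidence bound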

3.4.2 Importance Sampling

Let us consider the p.d.f. of the Cauchy distribution restricted to the positive semiaxis:

(3.24) $f(x) = \frac{2}{\pi(1 + x^2)}, \quad x > 0.$

We consider the problem of estimating the probability $p = P(X > 1)$. Using the substitution $u = x^{-1}$ in the integral below, we can easily see that

(3.25) $P(X > 1) = \int_1^\infty \frac{2}{\pi(1 + x^2)}\,dx = \int_0^1 \frac{2}{\pi(1 + u^2)}\,du = P(X \leq 1),$

hence 1 is the median of the distribution (3.24) and therefore $p = 0.5$. Knowledge that $p = 0.5$ is not required in the further analysis, but it is convenient for assessing the results. As usual, let us pretend that we do not know the exact solution and will try to approximate it by a Monte Carlo method. Denote $h(X) = I_{(1, \infty)}(X)$ and then represent

$p = E_f[h(X)] = \int_0^\infty I_{(1,\infty)}(x)\,\frac{2}{\pi(1 + x^2)}\,dx,$

which allows us to draw a sample $\mathbf{x} = (x_1, \ldots, x_n)$ from the distribution of $X$ and use the estimate

$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} I_{(1,\infty)}(x_i).$

Notice that due to (3.25) we can use a different $h(X) = I_{(0, 1)}(X)$ with the same $f(x)$ to estimate $1 - p = P(X \leq 1)$ and then take $\hat{p} = 1 - \frac{1}{n}\sum_{i=1}^{n} I_{(0,1)}(x_i)$. The difference is in counting successes versus counting failures in a series of binomial trials. It might be easier in the sense that the points to be counted belong to a finite interval $[0, 1]$, not to the infinite semi-interval $[1, \infty)$. The variance will stay the same due to the symmetry of the binomial distribution. Sampling from the restricted Cauchy distribution on $(0, \infty)$ can be carried out by a simple inverse transform taking into account $F(x) = \frac{2}{\pi}\arctan(x)$ and creating a sample of $x_i = \tan(\pi u_i/2)$, where $u_i \sim Unif[0, 1]$.

We ran three independent experiments with $n = 100$ and obtained values . Notice that in this case experiments are not even needed to judge the precision of estimation. By construction, $n\hat{p} \sim Bin(n, p)$, so $Var(\hat{p}) = p(1 - p)/n$. For $n = 100$ and $p = 0.5$ we get $Var(\hat{p}) = 0.0025$, corresponding to a standard deviation of 0.05. If we do not know the exact value of $p$, we can use the estimate $\hat{p}$ instead.

However, we can significantly reduce the variance of this estimate, thus increasing the precision of Monte Carlo integration. This improvement can be provided by importance sampling. Let us notice that in the formula

$p = \int_0^\infty I_{(1,\infty)}(x)\,\frac{2}{\pi(1 + x^2)}\,dx$

factorization of the integrand into $h(x)$ and $f(x)$ is not unique. What if we make a different choice, based on (3.25):

$p = \int_0^1 \frac{2}{\pi(1 + u^2)}\,du, \qquad h^*(u) = \frac{2}{\pi(1 + u^2)}, \quad f^*(u) \sim Unif[0, 1],$

and then use the well-established procedure of sampling from the uniform distribution to obtain $u_i$, and then calculate

$\hat{p}^* = \frac{1}{n}\sum_{i=1}^{n} \frac{2}{\pi(1 + u_i^2)}.$

The sequence of actions is very transparent. The most important result of this change is the reduced variance. Direct integration yields

$Var[h^*(U)] = \int_0^1 \Big(\frac{2}{\pi(1 + u^2)}\Big)^2 du - p^2 \approx 0.0105, \qquad Var(\hat{p}^*) = \frac{Var[h^*(U)]}{n} \approx 0.0001,$

bringing about the standard deviation of 0.01 for n = 100. Three experiments we performed with n = 100 yielded .

The general scheme of importance sampling requires designation of an instrumental density $f^*(x)$ and is based on the equality

$E_f[h(X)] = \int h(x)\,f(x)\,dx = \int \frac{h(x)\,f(x)}{f^*(x)}\,f^*(x)\,dx = E_{f^*}[h^*(X)],$

where $h^*(x) = h(x)f(x)/f^*(x)$ is the importance function. In our restricted Cauchy example $f^*(x) \sim Unif[0, 1]$, but in general we could use any convenient density.

At this point we might notice that accept–reject sampling with uniform instrumental density could also be used instead of importance sampling. However, it would end up with a smaller sample size, which could undo the effect of higher precision.

What is the meaning of the term importance sampling, and how can it help with variance reduction? In our example, the reduced variance is achieved because the uniform sample is concentrated on the smaller interval $[0, 1]$, where the density $f(x)$ is higher, instead of spanning the entire semiaxis $(0, \infty)$. All points in the sample $x_i$ are assigned weights $h^*(x_i)$ proportional to the initial density $f(x)$, so the resulting estimate is obtained as a weighted average, where the weights reflect the points’ relative importance.
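A minimal R sketch comparing the two estimates of p:

  set.seed(1)
  n <- 100
  x <- tan(pi * runif(n) / 2)         # inverse transform for restricted Cauchy
  p_naive <- mean(x > 1)              # standard deviation about 0.05
  u <- runif(n)
  p_is <- mean(2 / (pi * (1 + u^2)))  # importance sampling, sd about 0.01
  c(p_naive, p_is)                    # both estimate p = 0.5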

3.4.3 Correlated Samples

As suggested in the beginning of Section 3.4, precision of Monte Carlo estimation can be measured in terms of the variance $Var(\hat{I}_n)$. Let us denote for each sample element (not necessarily i.i.d.)

$\sigma_i^2 = Var[h(X_i)], \qquad \sigma_{ij} = Cov[h(X_i), h(X_j)].$

If all $X_i$ are independent and therefore uncorrelated,

(3.26) $Var(\hat{I}_n) = \frac{1}{n^2}\sum_{i=1}^{n} \sigma_i^2,$

and if in addition all $X_i$ are identically distributed with variance $\sigma^2$, then $Var(\hat{I}_n) = \sigma^2/n$. For an i.i.d. sample $\mathbf{x}$ the statistic

$v_n = \frac{1}{n^2}\sum_{i=1}^{n}\big(h(x_i) - \hat{I}_n\big)^2$

defined in (3.18) is a good estimate for $Var(\hat{I}_n)$. However, if the $X_i$ can be correlated, (3.26) is no longer true and a more general result follows.

Theorem 3.2 If $\sigma_i^2 < \infty$ and the covariances $\sigma_{ij}$ are such that $|\sigma_{ij}| = |Cov[h(X_i), h(X_j)]| < \infty$, then

(3.27) $Var(\hat{I}_n) = \frac{1}{n^2}\Big(\sum_{i=1}^{n} \sigma_i^2 + \sum_{i \neq j} \sigma_{ij}\Big).$

Proof: Denoting $Y_i = h(X_i)$, we can use the properties of mean and variance and regroup the terms so that

$Var(\hat{I}_n) = Var\Big[\frac{1}{n}\sum_{i=1}^{n} Y_i\Big] = \frac{1}{n^2}\Big(\sum_{i=1}^{n} Var[Y_i] + \sum_{i \neq j} Cov[Y_i, Y_j]\Big).$

Notice that in the general case the overall variance of (3.22), and hence the precision of Monte Carlo estimation, is determined both by the variances $\sigma_i$ and the covariances $\sigma_{ij}$, so statistic (3.18), expressed only through the first and not the second, is no longer a good measure of $Var(\hat{I}_n)$; therefore Figures 3.7 and 3.8 are no longer helpful in assessing the precision of estimation.

The general idea of using correlated samples σij ≠ 0 is to achieve higher precision via variance reduction in (3.27) without increasing n. Methods of variance reduction suggest special sampling techniques providing mostly negative covariances σij ≤ 0 and lower value of (3.27). Three of these methods will be illustrated by a simple example in the following subsection.

3.4.4 Variance Reduction Methods

The principal example we will use to demonstrate different approaches to variance reduction is the case $X \sim Unif[0, 1]$. We will recall the following formulas:

(3.28) $EX^k = \int_0^1 x^k\,dx = \frac{1}{k + 1}, \quad k = 1, 2, \ldots,$

hence

(3.29) $Var(X^k) = \frac{1}{2k + 1} - \frac{1}{(k + 1)^2};$

in particular, $Var(X) = 1/12$ and $Var(X^2) = 4/45$. Also,

(3.30) $Cov(X, X^2) = EX^3 - EX \cdot EX^2 = \frac{1}{4} - \frac{1}{6} = \frac{1}{12}$

and

(3.31) $Cov(X, 1 - X) = -Var(X) = -\frac{1}{12}.$

Now we will pretend to forget these facts for a while, and instead of using (3.28) directly we will use the Monte Carlo method to estimate $E[h^{(1)}(X)] = EX$ and $E[h^{(2)}(X)] = EX^2$, where $h^{(1)}(x) = x$ and $h^{(2)}(x) = x^2$.

The most straightforward way to do it is to create an i.i.d. sample

$\mathbf{x} = (x_1, \ldots, x_n) \sim Unif[0, 1]$

and use $\bar{X} = \frac{1}{n}\sum x_i$ to estimate $EX$ and $\overline{X^2} = \frac{1}{n}\sum x_i^2$ to estimate $EX^2$. This way we are assured to get $Var(\bar{X}) = \frac{1}{12n}$ and $Var(\overline{X^2}) = \frac{4}{45n}$. However, we can do much better than that in terms of precision. For that purpose we will suggest smarter sampling schemes involving correlated samples.

Antithetic Variables

The idea of this method is to append the initial i.i.d. sample $(X_1, \ldots, X_n)$ with an antithetic sample $(X_{n+1}, \ldots, X_{2n})$ in such a way that for some $i \leq n$, $j > n$ we obtain negative covariances $\sigma_{ij} = Cov[h(X_i), h(X_j)] < 0$, reducing variance (3.27) for the combined sample $(X_1, \ldots, X_{2n})$.

For our example we append $\mathbf{X} = (X_1, \ldots, X_n)$, $X_i \sim Unif[0, 1]$, with antithetic values $X_{n+i} = 1 - X_i$ for all $i = 1, 2, \ldots, n$. It is easy to check that the combined sample $(X_1, \ldots, X_{2n})$ is still uniform on $[0, 1]$. Figure 3.10 shows an initial sample of size 10 in filled black circles, and antithetic values in unfilled circles. Notice that by the nature of random number generation, the initial sample happens to shift to the right of the interval, containing several values on the extreme right; therefore by construction the antithetic sample shifts to the left, containing some values on the extreme left.


Figure 3.10 Antithetic sampling.

If we consider $h^{(1)}(x) = x$, then for most of the values $i \leq n$, $j > n$, $\sigma_{ij} = 0$. However, for the $n$ pairs $(i, j = n + i)$, $i = 1, 2, \ldots, n$, as established in (3.31), $\sigma_{ij} = -\sigma_i^2$, and then according to (3.27) the variance of $\bar{X}_{2n}$ is zero and our Monte Carlo method is ideally precise even for sample size $n = 1$! If we consider $h^{(2)}(x) = x^2$, the advantage of antithetic sampling is not that striking, but it is still there (see Exercises). A rather simplistic explanation for this method’s efficiency is that the antithetic variables compensate for the initial sample being randomly shifted to the left or to the right, which affects the mean value and also the second moment.
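A minimal R sketch of antithetic sampling:

  set.seed(1)
  n <- 10
  x <- runif(n)
  x2n <- c(x, 1 - x)   # combined sample with the antithetic second half
  mean(x2n)            # exactly 0.5: zero-variance estimate of EX
  mean(x2n^2)          # estimate of EX^2 = 1/3; variance reduced, not zero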

Control Variates

If our goal is to compute $E[h(X)]$, suppose that we can directly calculate $E[g(X)]$, where the “control variate” $g(X)$ is in some sense close or related to $h(X)$. Then we can use the identity

$E[h(X)] = E[g(X)] + E[h(X) - g(X)]$

to define the control variate Monte Carlo estimator for $E[h(X)]$ as

$\hat{I}^{CV}_n = E[g(X)] + \frac{1}{n}\sum_{i=1}^{n}\big(h(x_i) - g(x_i)\big).$

Variance reduction may be achieved due to the fact that

$Var(\hat{I}^{CV}_n) = \frac{1}{n}\,Var[h(X) - g(X)] = \frac{1}{n}\big(Var[h(X)] + Var[g(X)] - 2\,Cov[h(X), g(X)]\big).$

The control variate method has better precision than standard i.i.d. sampling,

$Var(\hat{I}^{CV}_n) < Var(\hat{I}_n) = \frac{1}{n}\,Var[h(X)],$

if $h$ and $g$ are correlated strongly enough to compensate for the additional variation in $g$:

(3.32) $Var[g(X)] < 2\,Cov[h(X), g(X)].$

Let us return to the uniform example with $g(x) = h^{(1)}(x)$, for which $E[g(X)] = 1/2$ was directly calculated using antithetic variables, and $h(x) = h^{(2)}(x)$. Checking condition (3.32), we easily see that the left-hand side is $Var(X) = 1/12$ from (3.29), and the right-hand side is $2Cov(X, X^2) = 2/12$ from (3.30). Therefore, instead of using $\overline{X^2}$ to estimate $E[h^{(2)}(X)]$, we suggest the control variate estimator

(3.33) $\hat{I}^{CV}_n = \frac{1}{2} + \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\sum_{i=1}^{n} x_i.$

Its variance can also be calculated directly from (3.29) and (3.30) as $1/180n$, which is clearly a huge improvement with respect to i.i.d. sampling. This improvement can be explained by the fact that when the sample is randomly shifted to the left, below the theoretical mean 1/2, driving the estimator down, the last two terms in (3.33) represent a positive correction; when the sample is shifted to the right, the correction is negative.
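A minimal R sketch of the control variate estimator (3.33):

  set.seed(1)
  n <- 100
  x <- runif(n)
  naive <- mean(x^2)          # i.i.d. estimator, variance 4/(45n)
  cv <- 0.5 + mean(x^2 - x)   # control variate estimator, variance 1/(180n)
  c(naive, cv, 1/3)           # both estimate EX^2 = 1/3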

Stratified Sampling

To increase the precision in estimating integral

$I = \int_A h(x)\,f(x)\,dx,$

let us partition the region of integration into $k$ nonoverlapping subsets or strata $A_j$: $A = A_1 \cup \cdots \cup A_k$. We can create a stratified sample:

$\{Y_{j1}, \ldots, Y_{jn_j}\} \sim f(x) \text{ restricted to } A_j, \quad j = 1, \ldots, k,$

where $n_1 + \cdots + n_k = n$. Suppose also $p_j = P(X \in A_j) = \int_{A_j} f(x)\,dx$.

We can define the stratified estimator as

(3.34) $\hat{I}^{S}_n = \sum_{j=1}^{k} p_j\,\frac{1}{n_j}\sum_{i=1}^{n_j} h(Y_{ji}).$

With an appropriate choice of partition, $Var(\hat{I}^{S}_n)$ may be smaller than the conventional $Var(\hat{I}_n)$ for a simple i.i.d. sample.

Let us return to our example and evaluate $E(X)$ for $X \sim Unif[0, 1]$. Let us choose strata as $A_1 = (0, 1/5)$, $A_2 = (1/5, 2/5)$, \ldots, $A_5 = (4/5, 1)$ and create a stratified sample according to (3.34) with $n_1 = \cdots = n_5 = 20$, $p_j = 1/5$. We will choose twenty points $Y_{ji}$ uniformly on every subinterval $A_j$. It is easy to see that $Var(Y_{ji}) = 1/(25 \times 12)$, so for $h^{(1)}(x) = x$

$Var(\hat{I}^{S}_n) = \sum_{j=1}^{5} \Big(\frac{1}{5}\Big)^2 \frac{1}{20} \cdot \frac{1}{25 \times 12} = \frac{1}{30000},$

which we compare with the variance for a conventional i.i.d. sample of size $n = 100$ drawn from $Unif[0, 1]$:

$Var(\bar{X}) = \frac{1}{12 \times 100} = \frac{1}{1200}.$

Stratification achieves a substantial reduction in variance (see also Exercises).
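A minimal R sketch of this comparison:

  set.seed(1)
  k <- 5; nj <- 20
  # twenty uniform points in each stratum ((j-1)/5, j/5)
  strata <- lapply(1:k, function(j) runif(nj, (j - 1) / k, j / k))
  I_strat <- mean(sapply(strata, mean))  # stratified estimator, p_j = 1/5
  I_iid <- mean(runif(k * nj))           # conventional i.i.d. estimator, n = 100
  c(I_strat, I_iid)                      # both estimate EX = 1/2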

Examples of variance reduction confirm that there may be a benefit in going beyond i.i.d. sampling. Some simple examples are contained in the Appendices in varreduction.xlsx. Monitoring variance and further variance reduction will be revisited in Chapter 4. Until then we will have to leave the discipline of simulation and delve into an entirely different field of mathematics. Familiarity with this field will be necessary to introduce further development of sampling techniques and to further exploit the idea of dependent samples.

3.5 Markov Chains

3.5.1 Markov Processes

A stochastic process is a process developing in time according to certain probability laws. At each given point of time the state of a stochastic process can be represented by a random variable. The set of all these random variables determines a stochastic process. Time could be either continuous or discrete, and so could the state space: the union of the ranges of values of these random variables. Trajectories or paths of a stochastic process are formed by numeric values of the corresponding random variables. Trajectories of a stochastic process are observable up to the present moment of time. The future allows for uncertainty; therefore the trajectory for the future is not unique: multiple paths are possible. For more general information on stochastic processes see, for instance, Ross [26].

One of the simplest examples of a stochastic process is the random process known as “white noise.” In its discrete version it is defined by a sequence of independent normal random variables with zero mean and constant variance; see Figure 3.11. This process was briefly described in Section 1.7.


Figure 3.11 White noise.

Another example, also discussed in Section 1.7, is the discrete random walk. This sequence of random variables, as shown in Figure 3.12, is defined by the formula $X_{t+1} = X_t + \epsilon_t$, where $\epsilon_t$ is the white noise.


Figure 3.12 Random walk.

A popular generalization of such a process is the autoregressive process of order $p$, defined by

(3.35) $X_{t+1} = a_1 X_t + a_2 X_{t-1} + \cdots + a_p X_{t-p+1} + \epsilon_t,$

where $\epsilon_t$ is the white noise (see also Section 1.7). The state of this process at any given point of time depends on several previous states and also contains a random component. We can say that an autoregressive process has a “long memory.”

However there exist stochastic processes whose next state is determined only by the present state and does not depend on the past:

(3.36) $P(X_{t+1} = x \mid X_t, X_{t-1}, \ldots, X_0) = P(X_{t+1} = x \mid X_t).$

This property of a stochastic process is known as the Markov property, named after the Russian mathematician Andrey Markov. Stochastic processes satisfying the Markov property are known as Markov processes or, in discrete time, as Markov chains. For instance, the random walk and the autoregressive process of order 1 (p = 1), known as AR(1), are Markov processes.

Markov processes serve as good models for many stochastic processes one may encounter in applications. For example, the market efficiency hypothesis suggests that stock price changing with time is a Markov process. If you know the stock prices today, you have full information prescribing the future (random) states of the market, and the knowledge of the past stock prices adds nothing new. In other words, efficient markets assume that “everybody is doing their homework,” and when you try to predict the future, the information regarding past prices is already incorporated in the present. In our further analysis we will try to avoid technical details. For a more rigorous treatment of Markov chains, please refer to Meyn and Tweedie [20], Norris [22], or to Haeggstroem [9] for more algorithmic applications.

3.5.2 Discrete Time, Discrete State Space

Unless otherwise specified, we will consider Markov chains, for which random states are assigned for some discrete times only. In this case we can assume that time takes on integer values only: t = 0, 1, …, N, …. Hence, Markov chains may be considered as sequences of random variables.

The union of the sets of possible states of these random variables will be called the state space of the chain. We will begin by considering finite state spaces: at each point of time the process $\{X_t\}$ takes values from a finite set (state space) $S = \{S_0, \ldots, S_k\}$. We will assume that $X_t = i$ if and only if at time $t$ the process is at the state $S_i$. Extension to an infinite discrete state space does not require any serious additional effort, though we will have to get more technical when we extend the state space to a continuum.

3.5.3 Transition Probability

Here we will assume that the probability of transition from state $i$ at time $n$ into state $j$ at time $n + 1$ (in one step) does not depend on time. Markov chains with this property are known as homogeneous. One-step transition probabilities are denoted

(3.37) $P_{ij} = P(X_{n+1} = j \mid X_n = i).$

Transition probabilities form the one-step transition matrix

(3.38) $P = \begin{pmatrix} P_{00} & \cdots & P_{0k} \\ \vdots & \ddots & \vdots \\ P_{k0} & \cdots & P_{kk} \end{pmatrix}.$

We will denote by $P_{ij}^{(n)}$ the probability of transition from state $i$ into state $j$ in $n$ steps. We can build an $n$-step transition matrix $P^{(n)}$ from these probabilities. It is easy to verify by matrix multiplication that $P^{(n)} = P^n$.

3.5.4 “Sun City”

“Sun City,” which we will consider for our basic example, was described in a Russian children’s book series, and later a cartoon, which was popular when both authors were still kids. The main hero of the books, Dunno (or Know-Nothing), a very charming though ignorant and lazy boy Shorty (a representative of a special race of little people), in his adventures happens to arrive in Sun City. The latter, named for its perpetually sunny weather, is a Shorty utopia noted for its incredible technological advances.

  • – Why do you call it Sun City? The houses are built of sun, are they? - asked Dunno.
  • – No, - the driver laughed. - It is called Sun City because the weather is always good and the sun shines at all times.
  • – Are there never any clouds? - Dunno was surprised.
  • – Why never? It happens, - the driver responded. - But our scientists developed a special powder: as soon as clouds gather, they toss some powder up, and the clouds disappear. (Nikolay Nosov, Dunno in Sun City, [23])

Things seem to be fairly deterministic in Sun City. We begin with this example because we need something dramatically simple to explain a few nontrivial ideas. Let us introduce a little randomness to allow for at least some rain to fall. Otherwise, everything would stay as straightforward as it could be only in a synthetic city.

Let us consider a “mathematical model” of Sun City weather. Suppose that only two states are possible: either it is sunny or it rains. The weather does not change during the day. The weather tomorrow depends only on the weather today (the past is irrelevant): if it is sunny today, it will rain tomorrow with probability 1/6; if it rains today, the probability of rain tomorrow is 1/2. Therefore, the state space consists of only two states: S0—sunny, S1—rainy. The one-step (day-to-day) transition probability matrix is

(3.39) $P = \begin{pmatrix} 5/6 & 1/6 \\ 1/2 & 1/2 \end{pmatrix}.$

The transition probabilities are chosen to allow for an easy hands-on simulation. The example was designed in such a way in order to avoid computer demonstrations. Namely, a simulation of day-to-day weather patterns in Sun City can be performed with one coin (two-sided) and one die (a six-sided cube).
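A minimal R sketch of such a simulation, with states coded as 1 for sunny and 2 for rainy:

  set.seed(1)
  P <- matrix(c(5/6, 1/6,
                1/2, 1/2), nrow = 2, byrow = TRUE)   # matrix (3.39)
  days <- 10000
  weather <- integer(days); weather[1] <- 1          # start with a sunny day
  for (t in 2:days)
    weather[t] <- sample(1:2, 1, prob = P[weather[t - 1], ])
  table(weather) / days    # long-run proportions, close to (0.75, 0.25)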

3.5.5 Utility Bills

Let us consider another example bringing us back from the cartoon world to our daily rut. Consider the payment of utility bills (electricity, heat, water, etc.), which we are supposed to perform regularly, say, on a monthly basis. Suppose that as a utility customer we can find ourselves in one of three states: S0—bills paid in full; S1—bills paid partially; S2—bills unpaid. Assume that the month-to-month transition matrix looks like this:

(3.40)numbered Display Equation

Analysis of this transition matrix, and especially of its second row corresponding to the partial bill payers, can confirm the old wisdom that too often temporary problems have an unfortunate tendency to become permanent. On the other hand, many customers in the database seem to make an earnest effort to come clean.

3.5.6 Classification of States

Two states Si and Sj are communicating if $P^{(n)}_{ij} > 0$ and $P^{(m)}_{ji} > 0$ for some n, m. A subset of the state space in which any two states communicate forms a communicating class. A chain is irreducible if it consists of a single communicating class.

A state i is absorbing if the chain cannot leave this state once it gets there (Pii = 1). Absorbing states do not communicate with any other states.

Consider a chain with transition matrix:

(3.41) [a 4 × 4 transition matrix over the states S0, S1, S2, S3]

This chain can be split into two communicating classes, {S0, S1} and {S2}, and an absorbing state S3.

The recurrence function for state Si is defined as the probability of ever returning to this state after leaving it. This function can be determined by the formula:

(3.42) $f_i = P(X_n = S_i \text{ for some } n \geq 1 \mid X_0 = S_i) = \sum_{n=1}^{\infty} f_i^{(n)},$ where $f_i^{(n)}$ is the probability that the first return to $S_i$ occurs at step $n$.

A state is recurrent if this probability equals 1, and transient if it is strictly less than 1. Within a communicating class all states are of the same type: either all recurrent or all transient. In the following example we can split the chain into two communicating classes of recurrent states, {S0, S1} and {S2, S3}, and a transient state S4.

(3.43) [a 5 × 5 transition matrix over the states S0, S1, S2, S3, S4]

We say that state Si is periodic with period t > 1 if $P^{(n)}_{ii} = 0$ whenever t does not divide n, and t is the largest number with this property. State Si is aperiodic if no such t > 1 exists.

Let a chain have transition matrix

(3.44) $P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$

Evidently, both states of the chain are periodic with period 2.
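A quick numerical check in R (a minimal sketch, using the matrix written out above) shows that the powers of this matrix alternate between the identity matrix and P itself, so the return probabilities $P^{(n)}_{ii}$ are positive only for even n:

    P <- matrix(c(0, 1,
                  1, 0), nrow = 2, byrow = TRUE)
    P %*% P            # identity matrix: return is possible in two steps
    P %*% P %*% P      # equals P: no return in an odd number of steps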

In a finite Markov chain, any aperiodic recurrent state is called ergodic. A Markov chain is ergodic if it is irreducible and all its states are ergodic.

3.5.7 Stationary Distribution

A stationary distribution π = (π0, …, πk) can be defined for an ergodic chain. The components of this row vector can be interpreted as the probabilities of being in the corresponding states independently of the past, or as the long-term proportions of time the chain spends in these states. It is easy to see that the stationary distribution should satisfy the following properties:

(3.45) $\pi P = \pi, \qquad \sum_{i=0}^{k} \pi_i = 1, \qquad \pi_i \geq 0.$

We can check that both states in the Sun City example are ergodic. Then we can look for the stationary distribution, as we know that it exists. First, we can use matrix multiplication to calculate several integer powers of the one-step transition matrix:

$P^2 = \begin{pmatrix} 5/6 & 1/6 \\ 1/2 & 1/2 \end{pmatrix}^2 = \begin{pmatrix} 0.778 & 0.222 \\ 0.667 & 0.333 \end{pmatrix},$

$P^4 = (P^2)^2 = \begin{pmatrix} 0.753 & 0.247 \\ 0.741 & 0.259 \end{pmatrix}.$

One might guess that the rows of these matrices converge to the same vector

$\pi = (0.75,\ 0.25),$

which would indicate that the memory of the past is lost (after just 4 days, it does not matter if we started with a sunny or a rainy day). One can also use linear algebra to solve the system of two dependent linear equations (keeping in mind that the stationary probabilities of two states add up to 1)

(3.46) $\pi_0 = \tfrac{5}{6}\pi_0 + \tfrac{1}{2}\pi_1, \qquad \pi_0 + \pi_1 = 1 \;\Rightarrow\; \pi_0 = \tfrac{3}{4},\ \pi_1 = \tfrac{1}{4}.$
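The same computation is easy to reproduce in R. Here is a minimal sketch for the Sun City matrix (variable names are ours): we stack the balance equations $\pi P = \pi$ with the normalization condition and solve the resulting overdetermined linear system.

    P <- matrix(c(5/6, 1/6,
                  1/2, 1/2), nrow = 2, byrow = TRUE)
    # balance equations (t(P) - I) pi = 0 plus the condition sum(pi) = 1
    A <- rbind(t(P) - diag(2), rep(1, 2))
    b <- c(0, 0, 1)
    qr.solve(A, b)     # least-squares solution: 0.75 0.25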

One can also determine the stationary distribution for the utility bills process. In this example we can get:

(3.47) $\pi_j = \sum_{i=0}^{2} \pi_i P_{ij}, \quad j = 0, 1, 2, \qquad \pi_0 + \pi_1 + \pi_2 = 1.$

A rough guess, based on the higher powers of the transition matrix, already suggests the general shape of the stationary distribution. The exact solution of the corresponding system of linear equations yields

$\pi = (\pi_0, \pi_1, \pi_2) \approx (0.61,\ 0.27,\ 0.12).$

We can see that in the long run, on average, 61% of the customers pay in full, 27% pay partially, and 12% tend not to pay at all.

3.5.8 Reversibility Condition

If we look at a Markov chain reversing the course of time, we will see another Markov chain. The transition probabilities of the reversed chain are calculated according to

(3.48) $Q_{ij} = P(X_n = S_j \mid X_{n+1} = S_i) = \frac{P(X_n = S_j)\, P_{ji}}{P(X_{n+1} = S_i)}.$

A Markov chain is reversible (or reversible in time) if the transition matrices of the direct and reversed chains coincide: Pij = Qij.

Let the stationary distribution exist for a reversible Markov chain. Then the following relationship, known as the detailed balance equation,

(3.49) $\pi_i P_{ij} = \pi_j P_{ji}$

follows from (3.48) when n tends to infinity. Let us check the detailed balance equation for the Sun City example:

(3.50) $\pi_0 P_{01} = \frac{3}{4} \cdot \frac{1}{6} = \frac{1}{8} = \frac{1}{4} \cdot \frac{1}{2} = \pi_1 P_{10}.$

However, for the utility bills example the detailed balance equation (3.49) does not hold: substituting the stationary probabilities and the entries of the transition matrix, one finds pairs of states with $\pi_i P_{ij} \neq \pi_j P_{ji}$. (Note that for a two-state ergodic chain detailed balance always holds, which is why the Sun City check could not fail; with three or more states it may be violated.)

3.5.9 Markov Chains with Continuous State Spaces

We will consider only the case when the state space is continuous and one dimensional. In this situation many of the definitions given above require substantial modifications. The transition probability from a state to any single state in the continuous case is equal to 0, so we have to consider transition probabilities from a state x into a measurable subset A of the state space:

(3.51) $P(x, A) = P(X_{n+1} \in A \mid X_n = x).$

The set of these probabilities for all x and A is called the transition kernel of the Markov chain. For a homogeneous chain the kernel does not depend on time. If A = (−∞, v], the transition kernel defines the cumulative distribution function of the transition:

(3.52) $F(v \mid x) = P(x, (-\infty, v]).$

If this function has a continuous derivative

(3.53) $f(v \mid x) = \frac{\partial}{\partial v} F(v \mid x),$

then $f(v \mid x)$ is called the one-step transition density function. The n-step transition density function can be defined as a convolution:

(3.54) $f^{(n)}(v \mid x) = \int f^{(n-1)}(u \mid x)\, f(v \mid u)\, du.$

If the limit

(3.55) $\lim_{n \to \infty} f^{(n)}(v \mid x) = f(v)$

exists and does not depend on x, then f(v) is known as the stationary state density. If a Markov chain is reversible, it satisfies the detailed balance equation

(3.56) $f(x)\, f(v \mid x) = f(v)\, f(x \mid v).$
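As a simple illustration (our own example, not taken from the text), consider the Gaussian autoregressive chain $X_{n+1} = \rho X_n + \varepsilon_n$ with $\varepsilon_n \sim N(0, \sigma^2)$ and $|\rho| < 1$, so that $f(v \mid x)$ is the $N(\rho x, \sigma^2)$ density and the stationary density $f$ is that of $N(0, \sigma^2/(1-\rho^2))$. Then

$f(x)\, f(v \mid x) \propto \exp\left( -\frac{(1-\rho^2)x^2 + (v - \rho x)^2}{2\sigma^2} \right) = \exp\left( -\frac{x^2 - 2\rho x v + v^2}{2\sigma^2} \right),$

which is symmetric in $x$ and $v$, so the detailed balance equation (3.56) holds and this chain is reversible.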

We can see that many properties of Markov chains which we introduced for finite and discrete state spaces can be extended to continuous state spaces. In what follows we will avoid exact proofs of the finer properties of the algorithms; therefore, we will not need more precise definitions.

3.6 Simulation of a Markov Chain

Suppose we know the transition matrix for Sun City weather, but what we are really interested in is the stationary distribution, which would give us a long-term weather forecast for some distant future. Let us also pretend that we are very mathematically ignorant:

  • – We cannot multiply matrices.
  • – We cannot solve systems of linear equations.

This seems ridiculous for such an easy example. However, imagine a 10,000 × 10,000 transition matrix (not unusual in some applications). Without a computer we probably would not venture into either of the two approaches suggested above. Is there another way? Actually, there is one suggestion which does not require computer help (and if you have a computer at hand, you might still prefer this approach to either of the two above). Just for the sake of clarity we will illustrate this approach using our simple example. It will still work for a 10,000 × 10,000 transition matrix, but will require a little more computing power. Have a fair coin and a fair die available.

Let us generate a long enough sequence of days using the following algorithm:

  1. Choose one of the states (sun or rain) for Day 0. Do not think too hard; this choice will not matter.
  2. For n = 0, 1, 2, …, N use the following rule:
    1. If Day n is sunny, roll your die: if it shows “six,” move to “rain” at Day n + 1; otherwise stay at “sun.”
    2. If Day n is rainy, toss your coin: if it lands heads up, move to “sun” at Day n + 1; otherwise stay at “rain.”

This simple algorithm will generate a Markov chain of length N with the correct transition probabilities. The long-term proportion of sunny days for large N will then converge to the stationary probability of “sun,” and similarly for “rain.”

One can also do the same in a more civilized way. In the online Appendices we suggest a computer implementation of this algorithm and a graphical illustration using Microsoft Excel in MCsimulation.xlsx. We can also use R or other software instead of a coin and a die; see the R sketch following Figure 3.13. Figure 3.13 demonstrates the behavior of a chain of length 50 simulated in Mathematica, jumping between two states: sun (0) and rain (1).


Figure 3.13 Sun City simulation.

We see that in this particular case the probability of rain was estimated as 0.24 using the relative frequency of the rainy days in the simulated chain (12 out of 50 days). To increase the precision of estimation, take a longer run.
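The coin-and-die algorithm above also admits a direct translation into R. Here is a minimal sketch (the seed and variable names are our own):

    set.seed(1)
    N <- 50
    P <- matrix(c(5/6, 1/6,
                  1/2, 1/2), nrow = 2, byrow = TRUE)
    state <- integer(N + 1)
    state[1] <- 0                  # Day 0: "sun" (0); the choice does not matter
    for (n in 1:N) {
      # draw the next state from the row of P for the current state
      state[n + 1] <- sample(0:1, size = 1, prob = P[state[n] + 1, ])
    }
    mean(state == 1)               # relative frequency of rainy days

Replacing N <- 50 with, say, N <- 5000 should bring the relative frequency of rain much closer to the stationary probability 0.25.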

We can also generate a Markov chain for utility bills. See the appendix for the algorithm in R; in Figure 3.14 we suggest a graphical illustration for N = 50 with the chain moving between three states: 0, 1, and 2. The probabilities of the three states are estimated as 0.48, 0.34, and 0.18. For better estimates we should substantially increase the number of steps.


Figure 3.14 Utility bills simulation.

Notice that this approach represents another modification of Monte Carlo integration. The goal is to estimate the probabilities of the states using long-term relative frequencies of these states (similar to the example with the estimation of π). What is different is that instead of generating an independent sample directly from the target distribution, we generate a Markov chain converging to its stationary distribution.

3.7 Applications

3.7.1 Bank Sizes

Markov chains are used in many diverse applications: statistical process control, reliability, business, and insurance. There exist interesting applications to financial statistics. We will suggest one example from the authors’ research experience.

The study involved Russian commercial banks during the period from 2007 to 2012. The objective was to build a reasonable model which would describe the distribution of bank size in Russia and its mobility (predicting upward and downward movements in size).

Using available data, all Russian banks were split into four classes depending on their size (total capital). These classes could be roughly characterized as: small banks, medium banks, large banks, and giant banks. The boundaries for the classes were selected as in Table 3.2.

Table 3.2 Classification of banks

                    Absolute values in 2011:Q4   Relative values with respect to total equity of the banking sector   Relative values, logarithmic scale   Description
Threshold level 1   0.3 billion rubles   0.006%   −5.070378   Minimum total equity for a bank entering the industry (the restriction applies since 01/01/2012)
Threshold level 2   1 billion rubles   0.021%   −3.866405   Minimum total equity (the restriction is to be introduced in 2015)
Threshold level 3   19.6 billion rubles   0.411%   −0.890144   Top 30 banks

Belonging to a certain class at the end of the year is considered as being in State 0 through State 3. In order to complete the state space, two more states were introduced: the Entry state (for the banks founded in a given year) and the Exit state (for the banks ceasing their activity). Under the assumption that the bank size process satisfies the Markov property (the probability of moving to a different class next year is determined only by the class where the bank currently is, and does not depend on the past), we can build a Markov chain by estimating the transition probabilities, which are summarized in Table 3.3.

Table 3.3 Transition probabilities for bank sizes

State in t
State in t−1 Class 1 Class 2 Class 3 Class 4 Entry Exit
Class 1 0.9675 0.0212 0.0002 0 0 0.0111
Class 2 0.0282 0.9455 0.0156 0 0 0.0107
Class 3 0 0.0183 0.9673 0.0025 0 0.0119
Class 4 0 0 0.0130 0.9805 0 0.0065
Entry 0.5806 0.1613 0.2581 0 0 0
Exit 0 0 0 0 0 1

The constructed Markov chain is not irreducible and therefore is not ergodic. It contains an absorbing Exit state and a transient Entry state. Such a chain does not have a stationary distribution. We will not have much use for such chains in this book. However, the model itself might be useful for predicting the structure of the banking sector of the economy.

Using statistical data we can estimate the distribution of bank sizes in 2012. It is summarized in Table 3.4.

Table 3.4 Distribution of bank sizes for 2012

Class 1 Class 2 Class 3 Class 4
Number of active banks 379 299 269 30
Shares of size categories 38.80% 30.60% 27.49% 3.11%
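Although this chain has no stationary distribution, the estimated transition probabilities can still be used for short-term projections. As a rough illustration (our own sketch; the Entry state is ignored here, which is a simplification on our part), one can multiply the 2012 shares from Table 3.4 by the transition probabilities of Table 3.3 to obtain a one-year-ahead forecast of the size structure:

    # rows: classes 1-4; columns: classes 1-4 and Exit (from Table 3.3)
    P <- matrix(c(0.9675, 0.0212, 0.0002, 0,      0.0111,
                  0.0282, 0.9455, 0.0156, 0,      0.0107,
                  0,      0.0183, 0.9673, 0.0025, 0.0119,
                  0,      0,      0.0130, 0.9805, 0.0065),
                nrow = 4, byrow = TRUE)
    shares2012 <- c(0.3880, 0.3060, 0.2749, 0.0311)  # Table 3.4
    shares2012 %*% P   # projected shares of classes 1-4 and of exiting banks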

3.7.2 Related Failures of Car Parts

Vehicle manufacturers do not have to make their failure reports public, thus historically very few data have been made available for statistical analysis. Recently, a lot of attention has been attracted to consumer vehicle reliability. In 2005, Toyota faced lawsuits for withholding vehicle failure data, and many repair and recall data were collected in the aftermath. In response to these events, a rival company, Hyundai, made public a lot of their records, including a dataset of 58,029 manufacturer warranty claims on the Hyundai Accent from 1996 to 2000. For each claim made, the following information was included: the vehicle identification number (VIN), the dates on which the vehicle was shipped, sold, and repaired, the amount of time (in days) from the previous failure and from sale to repair, and the type of car usage (light, medium, or heavy). In [3] the author addressed distribution models for these failure times measured for various components, performing a preliminary analysis of each component repaired. In that study all failures were treated as independent events. However, the author concluded that failures of various components may be associated [11]. Our goal is to find ways to model this association. In this subsection we will touch on the Markov chain approach, and in Section 7.6 we will return to the same problem with a very different toolkit [17].

Five main components that most frequently failed and caused warranty claims are chosen out of the full list of 60 detected in Hyundai Accent warranty claim records. They are: the spark plug assembly (A), ignition coil assembly (B), computer assembly (C), crankshaft position sensor (D), and heated oxygen sensor (E). The spark plug assembly brings power to the spark plug, which provides the spark for the motor to start. Spark plugs are typically replaced every 50,000 to 100,000 miles, but this rate can change depending on such factors as frequency of oil changes and quality of oil used. The ignition coil assembly regulates the current to the spark plugs, helping to ignite the spark. While these two parts are engineered to wear independently, they display such a high level of interaction that it can be difficult to tell which one needs repair. The computer assembly includes engine sensors, and it controls electronics for fuel-injection emission controls and the ignition. The crankshaft position sensor controls ignition system timings and reads RPMs. Finally, the heated oxygen sensor determines the gas-fuel mix ratio by analyzing the air from the exhaust and adjusting the ratio as needed. The latter two components are controlled by the computer assembly. Once again, failure by one should not necessarily affect another, yet these three components are all closely related and could all need to be replaced if, for example, a single event caused the system to short out.

With regard to this dataset, data were recorded for only the cars that had at least one of the five main components fail within the warranty period. Thus, the main focus in this study is on the cars for which component failures are registered during this time. We also took into consideration the fact that with some cars, a main component needed to be replaced more than once. The assumption that the future need of replacements and repairs is most closely related to the most recent repairs already made reflects the Markov property. The ultimate goal in this section is to suggest a Markov chain model for the failures of five engine assembly components. Such a model would be helpful for better estimation of lifetimes and prediction of certain parts’ failures given the car’s history and recent replacements and repairs. The most important part of the modeling process is to describe the state space and transition matrix, and then to estimate its elements.

Seven states were identified: State S0 corresponds to the state of a car before any repairs are done. State Si, i = 1, …, 5, indicates: “Part (i) needs repair.” Finally, state S6 is the state: “No more repairs needed up to the end of the warranty period.” We will assume that all repairs are done immediately when diagnosed. The exact time of repair is not reflected in the transition matrix, only the sequence of repairs. Simultaneous repairs of several parts are recorded as two or more consecutive repairs, starting with the more common one (A to E order).

Transition probabilities Pij will be estimated based on the relative frequency of such an event: a car detected at state Si would at some point before the warranty expiration move directly to state Sj. These estimated probabilities are given in Table 3.5.

Table 3.5 Transition probabilities for car repairs

State in t
State in t−1   S0   S1   S2   S3   S4   S5   S6
S0 (Initial) 0 0.3531 0.2486 0.1344 0.1414 0.1224 0
S1 (A) 0 0.1814 0.0266 0.0100 0.0114 0.0209 0.7462
S2 (B) 0 0.0632 0.0793 0.0303 0.0152 0.0109 0.8010
S3 (C) 0 0.0292 0.0425 0.0756 0.0295 0.0153 0.8079
S4 (D) 0 0.0216 0.0093 0.0195 0.0169 0.0087 0.9240
S5 (E) 0 0.0286 0.0102 0.0076 0.0102 0.0203 0.9235
S6 (Terminal) 0 0 0 0 0 0 1

All the cars in the database will start at state S0, but by definition no car will return to it after the first repair. Therefore S0 is a transient state. States S1–S5 are communicating. After one repair is needed, another might be needed at a later date. State S6 is the terminal or absorbing state. We get into this state only when no more status changes happen before the end of the warranty period. You cannot move directly from S0 to S6, because all cars in our database had at least one repair recorded.

What kind of information can be provided by the contents of Table 3.5? Let us consider, for instance, four highlighted numbers. A relatively high probability P11 = 0.1814 corresponds to more than one replacement of the spark plug assembly (A) during the warranty period, which can be considered unusual but not totally impossible, as it represents a relatively common maintenance operation. A potentially more troubling issue is the relatively high value of P33 = 0.0756, especially in conjunction with the relatively modest P03 = 0.1344: relative to the frequency of first-time repairs of the expensive computer assembly (C), repeat repairs within the warranty period appear quite common. In general, all diagonal elements Pii for i = 1, …, 5 correspond to multiple repairs of the same component. A couple of relatively high off-diagonal elements, P21 = 0.0632 and P32 = 0.0425, indicate potentially interesting patterns: a spark plug assembly (A) replacement directly following the replacement of the ignition coil assembly (B), and the ignition coil assembly (B) replaced directly after a problem with the computer (C). Leaving further conclusions to technical experts, we may suggest that even a simple application of Markov chains might provide some insight into the statistical dependence of components of complex engineering systems.
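One possible next step (our own sketch, not part of the original study) uses the standard theory of absorbing chains. Taking Q to be the transient part of Table 3.5 (states S0 through S5), the fundamental matrix $N = (I - Q)^{-1}$ collects the expected numbers of visits to each transient state; its first row then estimates the expected number of repairs of each component per car before the warranty expires.

    # transient part Q of Table 3.5: states S0-S5 (absorbing S6 dropped)
    Q <- matrix(c(0, 0.3531, 0.2486, 0.1344, 0.1414, 0.1224,
                  0, 0.1814, 0.0266, 0.0100, 0.0114, 0.0209,
                  0, 0.0632, 0.0793, 0.0303, 0.0152, 0.0109,
                  0, 0.0292, 0.0425, 0.0756, 0.0295, 0.0153,
                  0, 0.0216, 0.0093, 0.0195, 0.0169, 0.0087,
                  0, 0.0286, 0.0102, 0.0076, 0.0102, 0.0203),
                nrow = 6, byrow = TRUE)
    # fundamental matrix: expected numbers of visits to transient states
    N <- solve(diag(6) - Q)
    round(N[1, ], 3)   # row S0: expected visits to S0-S5, i.e., expected
                       # numbers of repairs of parts A-E for a new car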

The short analysis above also makes it clear that a more thorough study should include such an important factor as the time from repair to repair, which is totally overlooked by our transition matrix. There are also covariates, such as the type of car usage, which should be taken into account.

References

  1. Abramowitz, M., and Stegun, I. A. (1972). Handbook of Mathematical Functions. New York: Dover.
  2. Asmussen, S., and Glynn, P. (2007). Stochastic Simulation: Algorithms and Analysis. Stochastic Modelling and Applied Probability. Berlin: Springer.
  3. Baik, J. (2010). Warranty analysis on engine: case study. Technical report, Korea Open National University, Seoul, 1–25.
  4. Cover, T. (1987). Pick the largest number. In: T. Cover and B. Gopinath (editors), Open Problems in Communication and Computation. Springer-Verlag.
  5. Devroye, L. (1986). Non-Uniform Random Variate Generation. New York: Springer.
  6. Fontenrose, J. E. (1978). The Delphic Oracle, its Responses and Operations, With a Catalogue of Responses. Berkeley: University of California Press.
  7. Gamerman, D., and Lopes, H. F. (2006). Markov Chain Monte Carlo. Chapman & Hall/CRC.
  8. Glasserman, P. (2004). Monte Carlo Methods in Financial Engineering. New York: Springer.
  9. Haeggstroem, O. (2003). Finite Markov Chains and Algorithmic Applications, No. 52, Student Texts. London: London Mathematical Society.
  10. Hammersley, J. M., and Handscomb, D. C. (1964). Monte Carlo Methods. Boca Raton FL: Chapman & Hall, CRC Press.
  11. Heyes, A. M. (1998). Automotive component failures. Engineering Failure Analysis, 5(2), 698–707.
  12. Karsten, R. (1926). The Civilization of the South American Indians With Special Reference to Magic and Religion. New York: Alfred A. Knopf.
  13. Knuth, D. E. (1998). The Art of Computer Programming. Volume 2 (Seminumerical Algorithms), 3rd ed. Reading, MA: Addison-Wesley.
  14. Korn, R., Korn, E., and Kroisandt, G. (2010). Monte Carlo Methods and Models in Finance and Insurance. Chapman & Hall/CRC.
  15. L’Ecuyer, P. (1994). Uniform random number generation, Annals of Operations Research, 53, 77–120.
  16. L’Ecuyer, P., and Simard, R. (2002). TestU01: A software library in ANSI C for empirical testing of random number generators. Available at: http://www.iro.montreal.ca/~lecuyer.
  17. Kumerow, J., Lenz, N., Sargent, K., Shemyakin, A., and Wifvat, K. (2014). Modelling related failures of vehicle components via Bayesian copulas, ISBA-2014, Cancun, Mexico, § 307, 195.
  18. Marsaglia, G. (1996). The Marsaglia random number CDROM including the Diehard battery of tests of randomness. Available at: http://stat.fsu.edu/pub/diehard.
  19. Metropolis, N., and Ulam, S. (1949). The Monte Carlo method. Journal of the American Statistical Association, 44, 335–341.
  20. Meyn, S. P., and Tweedie, R.L. (2005). Markov Chains and Stochastic Stability. Cambridge University Press.
  21. Mikhailov, G. A., and Voitishek, A. V. (2006). Numerical Statistical Modeling. Monte Carlo Methods. Moscow: Akademiia (in Russian).
  22. Norris, J. R. (1998). Markov Chains. Cambridge University Press.
  23. Nosov, N. (1958). Dunno in Sun City. Moscow: Mir (in Russian).
  24. Robert, C. P., and Casella, G. (2004). Monte Carlo Statistical Methods, 2nd ed. Springer-Verlag.
  25. Robert, C. P., and Casella, G. (2010). Introducing Monte Carlo Methods with R, Springer-Verlag.
  26. Ross, S. M. (1996). Introduction to Stochastic Processes. New York: Wiley.
  27. Rubinstein, R. Y. (1981). Simulation and the Monte Carlo method. New York: Wiley.
  28. Rukhin, A., Soto, J., Nechvatal, J., Smid, M., Barker, E., Leigh, S., Levenson, M., Vangel, M., Banks, D., Heckert, A., Dray, J., and Vo, S. (2001). A statistical test suite for random and pseudorandom number generators for cryptographic applications. Special Publication 800-22, National Institute of Standards and Technology.

Exercises

  1. Generate a random sample of size 100 from the distribution with density f(x) = 2exp(−2x), x ≥ 0. Check the feasibility of the obtained data using: a histogram, the mean, the variance, and the EDF.

  2. Generate a sample from the double exponential distribution with variance 1: $f(x) = Ce^{-k|x|}$, −∞ < x < ∞. Implement the accept–reject method to transform this sample into a sample from the standard normal distribution. What is the resulting sample size? How does it confirm the theoretical results?

  3. Use the Monte Carlo method with f(x) ∼ Exp(α) to evaluate the integrals

    [first integral]

    and

    [second integral]

    for α = 3 and α = 4.

  4. Generate a sample (x1, …, xn), n = 100, from the Poisson distribution X ∼ Poiss(λ = 1.5). Then use it to estimate P(X > 2) using techniques from Section 3.3.

  5. Using the Monte Carlo method with a generated sample xi ∼ Gamma(4, 2), estimate the integral

    [integral]

    In order to generate a sample xi ∼ Gamma(4, 2), apply:

    1. Representation of gamma distribution Gamma(4, 2) as a sum of exponentials.
    2. Accept–reject technique with instrumental density Gamma(2, 1).
    3. Importance sampling with instrumental density Gamma(2, 1).

    Compare the results.

  6. Use the method of antithetic variables to estimate E(X²), X ∼ Unif[0, 1]. Obtain the exact numerical value for the variance and compare it with the simulation results (use 100 simulations with n = 100).

  7. Use the method of control variates to estimate E(X³), X ∼ Unif[0, 1]. Obtain the exact numerical value for the variance with control variate g(x) = x² and compare it with the simulation results (use 100 simulations with n = 100).

  8. Use the method of stratified sampling to estimate E(X²), X ∼ Unif[0, 1]. Estimate the variance from the simulation results (use 100 simulations with n = 100).

  9. For the Markov chain with transition matrix

    (3.57) [transition matrix P]
    1. Calculate the third, the fourth, and the fifth powers of P.
    2. Using linear algebra, evaluate the stationary distribution of the chain. Work by hand or use R, if you like. Submit all relevant code.
    3. Develop a Monte Carlo algorithm for generating a chain with the given transition probabilities. Estimate the stationary probabilities. Use a chain of length sufficient to provide precision to within 0.01.
  10. Are all the states in Table 3.3 communicating? Which states are transient, recurrent, or absorbing?

  11. Characterize states S1–S5 in Table 3.5 as transient or recurrent. Prove your point using the definitions.
