4
Markov Chain Monte Carlo Methods

In our introduction to Markov chain Monte Carlo (further abbreviated as MCMC) methods we will try to avoid all details which are not immediately necessary for understanding the main ideas of the algorithms. Thus we sacrifice some mathematical rigor and computational convenience. We also avoid detailed descriptions of more modern and more complicated MCMC methods. For a more systematic exposition of MCMC we can recommend the excellent texts of Gamerman and Lopes [9] and Robert and Casella [26]. Computational details are treated with attention by Bolstad [3] and Kruschke [17], and also with special reference to R in [25].

4.1 Markov Chain Simulations for Sun City and Ten Coins

In this section we will ask our readers to pretend to be even more ignorant about the Sun City weather situation than in Sections 3.5 and 3.6. Here we assume not only that we are unable to analytically derive the stationary distribution for the Markov chain of daily weather changes from its transition matrix, but also that this matrix itself is unknown to us.

Therefore, in order to determine the long-term proportion of rainy days (which can also be treated as the mean value of the Bernoulli variable taking two values: 1 if it rains, and 0 if it is sunny), we can use neither the analytic tools based on solving linear equations developed in Section 3.5 nor the direct Markov chain simulation described in Section 3.6.

For such a weird situation we may introduce the following two-stage algorithm, which combines Markov chain simulation with the accept–reject techniques of Section 3.3. To begin with, we will propose a simulation from a Markov chain with a totally different transition matrix. The choice of this “proposal” matrix is very basic: first, it should have the same state space as the “target” chain, and second, it should be easy to simulate. Tossing a fair coin each day independently for rain or for sun will correspond to sampling from a Markov chain yt with the “proposal” transition matrix

$$Q = \begin{pmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{pmatrix}$$

and will serve the declared purposes well enough. The two states, which in the “target” chain with the correct probabilities (3.39) would be “light” or less probable (rain) and “heavy” or more probable (sun), are equally likely to occur in the proposal chain. Therefore, if we count rainy and sunny days in the sample (y1, …, yN) from the proposal, their proportion will be approximately 50-50, which is substantially different from the proportion expected for the Sun City.

Figure 4.1 shows a simulated chain where hollow circles correspond to the states of the proposal chain. There are 21 points in the “rainy” state 1 and 29 points in the “sunny” state 0, bringing about a proposed proportion of rainy days of 21/50 = 0.42. This is far away from the Sun City theoretical proportion of 0.25, so this sampling alone is not very effective in the determination of the long-term frequency of rainy days.


Figure 4.1 Simulation with the proposed transition: N = 50.

Thus follows the necessity of the second stage of the algorithm: an accept–reject procedure. Let us observe the proposal chain day-by-day and update it as follows: nothing changes in the proposal chain if it proposes a next-day change to a heavier or equally heavy state (from sun to sun, rain to rain, or rain to sun). Only if a change from sun to rain is suggested by the proposal chain do we accept this “jump” to a lighter state with probability 1/3 and reject it with probability 2/3. Rejecting in this context means that instead of jumping into a rainy state, we stay with sun for one more day.

It is easy to verify that with an arbitrary choice of the initial state X0, the resulting chain Xt, obtained by applying the accept–reject procedure to the proposal chain for t = 1, …, N, will have transition probabilities

$$P(X_t = 1 \mid X_{t-1} = 0) = 0.5 \times \frac{1}{3} = \frac{1}{6},$$

$$P(X_t = 0 \mid X_{t-1} = 0) = 0.5 + 0.5 \times \frac{2}{3} = \frac{5}{6},$$

$$P(X_t = 0 \mid X_{t-1} = 1) = P(X_t = 1 \mid X_{t-1} = 1) = 0.5,$$

as is supposed to be for the target: the Sun City. Mission accomplished! It should be mentioned, though, that the secret of our success is in being able to use the magic number 1/3, which helped us to successfully restore the target transition matrix

$$P = \begin{pmatrix} 5/6 & 1/6 \\ 1/2 & 1/2 \end{pmatrix}$$

(rows and columns ordered as sun = 0, rain = 1) from the simulation which started with the pretty arbitrary proposal

$$Q = \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}.$$

As it happens, this magic number 1/3 expresses some partial knowledge of the stationary distribution of the target chain: it is exactly the ratio 0.25/0.75 of the stationary probabilities of rain and sun. So instead of using the target transition matrix to directly simulate a chain, we use information regarding its stationary distribution to alter a chain with the proposal transition so that it moves toward the target. What is the point of such a simulation? Same as before, the most evident result is obtaining a numeric estimate for the mean of the target distribution (in other words, the long-term proportion of rainy days in the Sun City).
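For readers who prefer to see the two-stage procedure in code, here is a minimal sketch in R; the seed, chain length, and coding of states (1 = rain, 0 = sun) are our own choices, not taken from the text.

```r
# Two-stage Sun City simulation: fair-coin proposal plus accept-reject.
set.seed(1)
N <- 10000
y <- rbinom(N, 1, 0.5)   # proposal chain: fair coin, 1 = rain, 0 = sun
x <- numeric(N)
x[1] <- y[1]
for (t in 2:N) {
  if (x[t - 1] == 0 && y[t] == 1) {
    # sun -> rain is a jump to the "lighter" state:
    # accept with probability 1/3, otherwise stay sunny one more day
    x[t] <- ifelse(runif(1) < 1/3, 1, 0)
  } else {
    # moves to a heavier or equally heavy state are always accepted
    x[t] <- y[t]
  }
}
mean(x)   # long-term proportion of rainy days, close to 0.25
```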

The simulation with N = 50 from Figure 4.1 with an additional accept–reject procedure applied at each time step is illustrated in Figure 4.2, where the filled circles correspond to the accepted states of the proposal chain, and hollow ones correspond to the rejected ones. The proposal chain spent 21 days at the “rainy” state 1 and 29 at the “sunny” state 0. However, 9 circles in state 1 stayed unfilled and the chain at these times dropped down to 0, which means that 9 proposed rainy dates were rejected, and on these 9 days the resulting chain stayed in the sunny state. The resulting chain includes only 12 rainy days, which brings the estimate of the long-term proportion of rainy days down to 12/50 = 0.24, which is much better than the proposed proportion of 0.42.


Figure 4.2 Two-step Markov chain Sun City simulation: N = 50.

The procedure we implemented is a special case of the so-called independent Metropolis algorithm (IMA). It is a representative of MCMC: Markov chain Monte Carlo methods. For the Sun City example it was used to introduce a Markov chain simulation which required neither exact knowledge of the stationary distribution nor even knowledge of the transition matrix, yet brought about a plausible numerical result.

In a more general Bayesian framework, we may want to sample from the target posterior distribution, whose density, as we might remember from Chapter 2, is proportional to the product of the prior and the likelihood, and thus is typically known only to within a constant. For this purpose we build a Markov chain with the state space coinciding with the parametric set, such that its stationary distribution is indeed the posterior. The transition matrix or, in the case of an infinite state space, the transition kernel of this chain might not be directly available. Therefore we might have to start with a proposal chain, which will be updated at every time step in order to converge to the target. This convergence may take a long time, thus we will need chains much longer than the one we used in the previous example.

Ten Coin Tosses

To further illustrate this version of the Metropolis algorithm, we will revisit the ten coins example introduced in Chapter 2 and then reviewed in Chapter 3. Suppose that the probability of success θ is known to have a Beta(12, 12) prior distribution and the data x consist of 7 successes in 10 trials. We have derived the posterior distribution with density π(θ∣x) ∼ Beta(19, 15), and the Bayesian estimator we are looking for is the posterior mean E(θ∣x). Section 2.6 contains the analytical solution. In Section 3.3 we suggested several methods to generate independent random samples directly from the posterior and estimate the posterior mean via the Monte Carlo approach. Now we want to estimate the posterior mean applying Markov chain simulation.

We will try to estimate the mean value of the posterior distribution π(θ∣x) ∼ Beta(19, 15) using Markov chain Monte Carlo simulation instead of the analytical formula

$$E(\theta \mid x) = \frac{19}{19 + 15} = \frac{19}{34} \approx 0.559.$$

Here we will suggest a procedure similar to the one we just used for the Sun City, which will generate a Markov chain whose stationary distribution is the desired Beta(19, 15). This procedure is certainly of no practical use, being much less efficient than the analytical calculations in Section 2.6 or the independent Monte Carlo simulation in Section 3.3. However, it demonstrates the potential of Markov chain simulation for Bayesian estimation purposes.

The first stage of the algorithm requires us to choose a transition kernel with support on the same state space as the target (posterior) and to generate a Markov chain with this transition kernel. The parametric set Θ = [0, 1] forms an infinite state space for this proposal. The easiest choice of distribution to sample is the uniform Unif[0, 1]. We will build the proposal chain by generating independent copies yt ∼ Unif[0, 1], t = 1, 2, ..., N, using the independent transition kernel g(yt∣yt − 1) = g(yt) ≡ 1 for any values yt − 1, yt ∈ [0, 1]. These copies will uniformly fill up the horizontal band {1, …, N} × [0, 1] in Figure 4.3, where time t on the horizontal axis is discrete, and the parametric state space on the y-axis is continuous.


Figure 4.3 Ten coins. Proposal sample.

The normalized histogram of the proposal sample is shown in Figure 4.4 along with the posterior density π(θ∣x) ∝ θ18(1 − θ)14. Notice that this figure represents a static summary of the chain, and thus the axes are different from the previous graph: the horizontal axis corresponds to the parametric state space, and the values on the vertical axis serve to compare the empirical distribution of the proposal sample to the density of the posterior over the state space. As we can clearly see, the histogram corresponds to the uniform distribution, and its shape does not even remotely resemble the target. Too many points in the proposal sample are located toward the ends of the interval [0, 1], and too few are grouped around the mode. The second stage of the algorithm will result in a “thinning” of the proposal chain: values will drift toward the center of the state space corresponding to the higher posterior.


Figure 4.4 Ten coins. Histogram of the proposal and target density.

We will establish a natural ordering by “weight” of the states θi ∈ Θ:

$$\theta_1 \preceq \theta_2 \iff \pi(\theta_1 \mid x) \le \pi(\theta_2 \mid x),$$

which will simplify the following deliberations.

The second stage of the algorithm will require setting up an accept–reject procedure which will allow the proposal chain to gravitate from the outskirts of Θ toward the “heavier” states corresponding to higher target density, which are located around its mode. For that purpose we will consider the ratio

$$R(\theta_1, \theta_2) = \frac{\pi(\theta_2 \mid x)}{\pi(\theta_1 \mid x)} = \frac{\theta_2^{18}(1 - \theta_2)^{14}}{\theta_1^{18}(1 - \theta_1)^{14}}$$

for any two states in Θ (notice that the unknown normalizing constant of the posterior cancels out). Then we can write down the natural ordering condition in terms of the ratio R:

$$\theta_1 \preceq \theta_2 \iff R(\theta_1, \theta_2) \ge 1,$$

and define the acceptance probability as

$$\alpha(\theta_1, \theta_2) = \min\{1,\, R(\theta_1, \theta_2)\}. \tag{4.1}$$

We will then generate a supplementary uniform sample u1, …, uN ∼ Unif[0, 1] and construct a new Markov chain zt, starting with an arbitrary initial value z0 ∈ [0, 1] and at each time t using the proposal chain to update:

$$z_t = \begin{cases} y_t & \text{if } u_t \le \alpha(z_{t-1}, y_t), \\ z_{t-1} & \text{otherwise}. \end{cases} \tag{4.2}$$

The role of the supplementary sample (ut) is to establish the correct acceptance probability

$$P(z_t = y_t \mid z_{t-1}, y_t) = \alpha(z_{t-1}, y_t) = \min\{1,\, R(z_{t-1}, y_t)\},$$

so that a proposed heavier state is always accepted, while a lighter state is accepted only with a certain probability based on the ratio R.
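A minimal R sketch of this accept–reject stage for the ten coins target may look as follows; it works with the unnormalized density θ18(1 − θ)14, and the seed, chain length, and initial value are our own choices.

```r
# Independent Metropolis algorithm for the Beta(19, 15) target.
set.seed(1)
N <- 1000
f <- function(theta) theta^18 * (1 - theta)^14   # unnormalized posterior density
y <- runif(N)    # proposal sample from Unif[0, 1]
u <- runif(N)    # supplementary uniform sample
z <- numeric(N)
z[1] <- 0.5      # arbitrary initial value z0
for (t in 2:N) {
  R <- f(y[t]) / f(z[t - 1])                          # ratio of target "weights"
  z[t] <- if (u[t] <= min(1, R)) y[t] else z[t - 1]   # update (4.2)
}
mean(z)          # estimate of the posterior mean 19/34 = 0.559
```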

The resulting chains are shown in Figure 4.5 for N = 1000 and in Figure 4.6 for N = 100. In both pictures the filled circles correspond to accepted states of the proposal chain, while the hollow circles correspond to the rejected states. One may see that while the desired thinning of the edges of the state space is achieved in both cases and the corrected chain concentrates around the target mean, zooming in with the smaller sample size makes it evident that the corrected chain often gets stuck at certain states, which happens due to the high rejection rate. This highly undesirable effect, leading to high positive autocorrelations in the final chain, will be discussed later.


Figure 4.5 Ten coins. Markov chain simulation, N = 1000.


Figure 4.6 Ten coins. Markov chain simulation, N = 100.

Numerical estimates for the posterior mean, which we know to be approximately 0.559, are obtained as sample means of the resulting chains: one for Figure 4.5 with N = 1000, and a clearly less precise one for Figure 4.6 with N = 100. It is also worthwhile to notice that for the resulting chain zt not only does the sample mean approximate the posterior mean, but its histogram also resembles the posterior density. This can be observed in Figure 4.7.


Figure 4.7 Ten coins. Histogram of the resulting chain and target density.

The procedure we described above is a special case of the independent Metropolis algorithm introduced earlier for the Sun City simulation. A more formal definition of the algorithm and its more general version are given in the next section.

4.2 Metropolis–Hastings Algorithm

Let us describe the general scheme which will allow us to solve two related problems:

  • Generate a random sample from the distribution with posterior density π(θ∣x).
  • Estimate posterior mean of θ, E(θ|x).

Both problems are open when no analytical formula exists for the posterior mean, and direct sampling from the posterior is not possible. Moreover, as usual in Bayesian practice, the posterior density π(θ|x) may be known only to within a constant multiple. In this case, a Markov chain simulation may become not only feasible but also practical.

Consider a function f(θ) ∝ π(θ∣x) defined on Θ. Choose a valid proposal p.d.f. g(θ) on the same domain Θ which admits an easy sampling procedure. Then we will build a Markov chain zt, t = 1, …, N, whose stationary distribution coincides with the posterior playing the role of the target distribution.

Independent Metropolis Algorithm (IMA)

For t = 0 choose any z0 from Θ. Define the ratio

$$R(\theta_1, \theta_2) = \frac{f(\theta_2)\, g(\theta_1)}{f(\theta_1)\, g(\theta_2)}.$$

For t = 1, 2, …

  • Generate a random proposal yt ∼ g(θ);
  • Generate a random ut ∼ Unif[0, 1];
  • Define
    $$z_t = \begin{cases} y_t & \text{if } u_t \le \min\{1,\, R(z_{t-1}, y_t)\}, \\ z_{t-1} & \text{otherwise}. \end{cases}$$

The obtained Markov chain zt will have the posterior π(θ|x) as its stationary distribution; therefore the sample mean will serve as a reasonable estimate for the posterior mean. Putting aside for now the important issue of the rate of convergence to the stationary distribution, and even the very fact of this convergence, which has yet to be proven, we may report the successful solution of both problems posed at the beginning of the section. Let us also introduce the following natural generalization of this algorithm, allowing one to introduce step-by-step changes in the proposal.

Suppose that for any θ1 ∈ Θ we can sample from the conditional density g(θ∣θ1) defined on Θ, playing the role of the proposal kernel. Then we will build a Markov chain zt, t = 1, …, N, using the modified algorithm below.

Metropolis–Hastings Algorithm (MHA)

For t = 0 choose any z0 from Θ. Define the ratio

$$R_g(\theta_1, \theta_2) = \frac{f(\theta_2)\, g(\theta_1 \mid \theta_2)}{f(\theta_1)\, g(\theta_2 \mid \theta_1)}. \tag{4.3}$$

For t = 1, 2, …

  • Generate a random proposal yt ∼ g(θ∣zt − 1);
  • Generate a random ut ∼ Unif[0, 1];
  • Define
    $$z_t = \begin{cases} y_t & \text{if } u_t \le \min\{1,\, R_g(z_{t-1}, y_t)\}, \\ z_{t-1} & \text{otherwise}. \end{cases}$$
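A generic sketch of the MHA in R might look as follows; the function and argument names (mha, rprop, dprop) are our own, not a standard API. The IMA is recovered as the special case where rprop ignores its argument and dprop(a, b) reduces to g(a).

```r
# Generic Metropolis-Hastings sampler.
# f: unnormalized target density; rprop(z) draws y from g(. | z);
# dprop(a, b) evaluates the proposal density g(a | b).
mha <- function(N, z0, f, rprop, dprop) {
  z <- numeric(N)
  z[1] <- z0
  for (t in 2:N) {
    y <- rprop(z[t - 1])
    # Hastings ratio (4.3) for the move from z[t-1] to y
    R <- (f(y) * dprop(z[t - 1], y)) / (f(z[t - 1]) * dprop(y, z[t - 1]))
    z[t] <- if (runif(1) <= min(1, R)) y else z[t - 1]
  }
  z
}
```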

Avoiding fine details of the actual proof, we can still suggest a sketch of the justification of the general MHA. It is based on the verification of the detailed balance equation (3.56). Suppose that we redefine the ordering on Θ using (4.3):

$$\theta_1 \preceq \theta_2 \iff R_g(\theta_1, \theta_2) \ge 1.$$

Evidently, the detailed balance equation will not hold for the proposal alone, without a further accept/reject step:

$$f(\theta_1)\, g(\theta_2 \mid \theta_1) \ne f(\theta_2)\, g(\theta_1 \mid \theta_2),$$

but if we define the acceptance probability α(θ1, θ2) = min{1, Rg(θ1, θ2)} and the transition kernel g*(θ2∣θ1) = α(θ1, θ2) × g(θ2∣θ1), reflecting possible rejection of the proposal g, it is easy to see that

$$f(\theta_1)\, g^*(\theta_2 \mid \theta_1) = f(\theta_2)\, g^*(\theta_1 \mid \theta_2). \tag{4.4}$$

Indeed, let us assume first that θ1 ⪯ θ2. In this case, by definition,

$$\alpha(\theta_1, \theta_2) = 1, \qquad g^*(\theta_2 \mid \theta_1) = g(\theta_2 \mid \theta_1),$$

$$\alpha(\theta_2, \theta_1) = R_g(\theta_2, \theta_1) = \frac{f(\theta_1)\, g(\theta_2 \mid \theta_1)}{f(\theta_2)\, g(\theta_1 \mid \theta_2)}, \qquad g^*(\theta_1 \mid \theta_2) = \frac{f(\theta_1)\, g(\theta_2 \mid \theta_1)}{f(\theta_2)},$$

and both the left-hand side and the right-hand side in (4.4) are equal to the product f(θ1) × g(θ2∣θ1). On the contrary, if θ2 ⪯ θ1, then

$$\alpha(\theta_2, \theta_1) = 1, \qquad g^*(\theta_1 \mid \theta_2) = g(\theta_1 \mid \theta_2),$$

$$\alpha(\theta_1, \theta_2) = R_g(\theta_1, \theta_2) = \frac{f(\theta_2)\, g(\theta_1 \mid \theta_2)}{f(\theta_1)\, g(\theta_2 \mid \theta_1)}, \qquad g^*(\theta_2 \mid \theta_1) = \frac{f(\theta_2)\, g(\theta_1 \mid \theta_2)}{f(\theta_1)},$$

and both sides in (4.4) are equal to f2) × g1∣θ2). This means that a Markov chain with the transition kernel g* = αg, constructed by accepting the proposal with kernel g with probability α, whatever the proposal is, will converge to the stationary target f. For an elegant geometric interpretation of the MHA, see Billera and Diaconis [2].

There are publications (see, e.g., Dongarra and Sullivan [5]) expressing the opinion that the Metropolis–Hastings algorithm is one of the 10 most influential computer algorithms of the twentieth century. One can argue about exact rankings like this, but the story of the MHA is indeed impressive. It can be dated back to the early 1950s. The early history of the algorithm is fairly well known and is associated with Los Alamos, NM. Like many other inventions influencing the development of the discipline of statistics, it is closely tied to its applications. Out of the five authors of the paper [20], which introduced and justified the algorithm in its independent version (Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller), the name of Edward Teller is by far the most notorious. Teller, along with Stanislaw Ulam, was one of the two most prominent physicists associated with the H-bomb project.

A certain cloud of controversy surrounds the invention and real authorship of the MHA. Besides Teller, other physicists such as Ulam and even Enrico Fermi have been mentioned as possible sources of its idea (Teller [31]). Another honorary mention is due to the emergence of the then brand-new computer MANIAC (1952), which made it possible to implement the algorithm numerically. Nevertheless, it is probably fair that the algorithm was named after Nicholas Metropolis, which seems to be justified by more than the purely alphabetical order of the authors of the above-mentioned seminal paper [20]. His contribution to the paper and his role in the development of both MANIAC and MHA well deserve this level of appreciation (for more details, see Robert and Casella [27]). The second name associated with the MHA is that of the Canadian statistician W. K. Hastings, who further generalized the Metropolis algorithm and put it into its current mathematical framework in his later paper [15], which was further developed by Peskun [21].

MHA was historically the first of the sampling methods organizing simulation of a random sample in a Markov chain fashion, which were given the name of MCMC: Markov chain Monte Carlo. Most of the popularity of MCMC methods nowadays is related to their use in Bayesian statistics, and their modern development is to a large extent connected with the needs and demands of Bayesian statistical models.

The initial independent version of MHA was successfully complemented by further development including such modifications as random walk algorithms considered in Section 4.3. It was also closely followed by a very different idea of Gibbs sampling discussed in Section 4.4. The rest of the chapter deals with the diagnostic issues and provides a brief survey of the modern development of MCMC.

4.3 Random Walk MHA

When we consider any problem of parametric estimation, we can identify two main sources of error. One is the bias, and the other is the variance of the estimator as a random variable. The bias, which is detrimental to the accuracy of estimation, is defined as the systematic error b(φ) = E(φ*) − φ, where φ is the parameter of interest and φ* is its estimator. The variance is responsible for the precision of estimation. The mean squared error of estimation, which is routinely considered a natural measure of error, or the most popular loss function, can be defined as

$$MSE(\varphi^*) = E(\varphi^* - \varphi)^2 = b^2(\varphi) + Var(\varphi^*).$$

If we consider estimation of posterior mean φ = E(θ∣x) to be the main purpose of the MHA, it is logical to try to accomplish two tasks: to keep the bias at bay, and also to minimize the variance of the estimator of φ obtained by Markov chain Monte Carlo sampling. Let us start with the variance considerations. Keeping in mind that according to the principle of Monte Carlo methods, the MHA estimator of the posterior mean is just the sample mean of the constructed chain, we can analyze the variance of the chain in the same fashion as we did for correlated samples in Section 3.4. Theorem 3.2 still works and identifies positive correlation between the sample elements as the enemy: the main source of variance.

Looking back at Figure 4.6, we can detect several flat segments, where the chain does not move and gets stuck at certain states for a few steps. The same picture could be observed for the longer chain in Figure 4.5; it is just harder to visualize. The most obvious reason for that is too low an acceptance rate, defined as the overall proportion of newly proposed states which are accepted by the algorithm. By construction of the independent Metropolis algorithm, a low acceptance rate may be caused by a poor choice of the proposal, bringing about too many rejections. If the proposed states are very “light” or improbable from the target point of view, they are likely to be rejected by the MHA due to their low acceptance probability (4.1). What is the problem with flat segments? They indicate high positive autocorrelation, which was defined in Section 1.7 as the correlation between the adjacent elements of the chain. Thus the overall variance in the chain stays high according to Theorem 3.2.

The simplest solution to the problem of flat segments and high positive autocorrelation would be a more adequate choice of the proposal. Better proposals could be sought within the framework of the IMA. However, they are not that easy to find. We do not want a proposal to be underdispersed, or tightly concentrated at certain regions of the state space. Underdispersed proposals tend to ignore other regions, which is especially dangerous if these other regions are target-heavy and need to be visited in order to build a representative sample from the target. The evident advantage of the uniform proposal used in the ten coins example is its being overdispersed, in the sense of allowing greater variability than the target. It runs around the state space, and the low acceptance rate is just the price we have to pay for overdispersion. The curse of the IMA is having to choose a proposal avoiding both extremes: underdispersion and low acceptance rate due to overdispersion (the latter being the lesser evil).

An opportunity to maneuver between these two extremes lies in a more flexible general MHA structure. Instead of wild up-and-down jumps between the states dictated by an overdispersed independent proposal often rejected by the MHA, we can try to propose each new state relatively close to the previous one guaranteeing a more pedestrian traveling style across the state space.

This approach is epitomized by the random walk Metropolis–Hastings algorithm (RWMHA). In its most popular version, the conditional proposal in (4.3) is chosen as normal, centered at the previous state:

$$g(\theta_2 \mid \theta_1) = \frac{1}{\sigma\sqrt{2\pi}}\, \exp\left\{-\frac{(\theta_2 - \theta_1)^2}{2\sigma^2}\right\}, \tag{4.5}$$

that is, yt ∼ N(zt − 1, σ2).

An important property of this proposal is its symmetry: g(θ2∣θ1) = g(θ1∣θ2), which reduces expression (4.3) to

$$R_g(\theta_1, \theta_2) = \frac{f(\theta_2)}{f(\theta_1)}.$$

Other choices of symmetric proposals are certainly possible, but the normal distribution with varying σ usually provides sufficient flexibility. An important calibration or scaling issue is the choice of the parameter σ [28]. Too large a σ can bring about excessive overdispersion and in some cases even force the proposal to leave the target state space, which is undesirable. Too small a σ leads to a chain with a very high acceptance rate, which is not necessarily good news. Such a chain may take a long time to travel across the state space, and guess what: with a small enough variance, it can exhibit a substantial bias!
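A hedged R sketch of the RWMHA for the ten coins target is given below; all names and settings are our own choices. Note the design decision: we explicitly set the target density to zero outside [0, 1], so out-of-range proposals are always rejected, unlike the unqualified run discussed next.

```r
# Random walk MHA for the unnormalized Beta(19, 15) target.
rwmha <- function(N, sigma, z0 = 0.5) {
  # target density set to zero outside [0, 1], so such proposals are rejected
  f <- function(theta) ifelse(theta > 0 & theta < 1, theta^18 * (1 - theta)^14, 0)
  z <- numeric(N)
  z[1] <- z0
  accepted <- 0
  for (t in 2:N) {
    y <- rnorm(1, mean = z[t - 1], sd = sigma)   # symmetric normal proposal (4.5)
    if (runif(1) <= min(1, f(y) / f(z[t - 1]))) {
      z[t] <- y
      accepted <- accepted + 1
    } else {
      z[t] <- z[t - 1]
    }
  }
  list(chain = z, rate = accepted / (N - 1))
}
out <- rwmha(1000, sigma = 0.1)
mean(out$chain)   # numerical estimate of the posterior mean 0.559
out$rate          # calculated acceptance rate
```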

The main problem can be stated as follows: convergence of an underdispersed random walk sample to the target is guaranteed, but it can be slow. The first values of the chain may have a distribution far away from the target, and it is not clear how many such values there are or how long it takes to get the chain into the stationary mode.

Figures 4.8, 4.9, and 4.10 illustrate the performance of the RWMHA for the ten coins example (compare with the IMA chains in Figures 4.5 and 4.6). Filled circles represent the accepted values, and the solid line corresponds to the cumulative sample means providing the current estimates of the posterior mean as the chain runs on. Clearly, the choice of σ = 0.2 in Figure 4.8, with no qualifying conditions imposed on the proposal, proves unfortunate, because the chain leaves the state space after about 300 steps and drifts away (the unnormalized expression θ18(1 − θ)14 stays positive for θ > 1 due to the even power, so nothing stops an unrestricted chain at the boundary). The chain is way overdispersed, though its calculated acceptance rate of 0.806 is rather high.


Figure 4.8 Ten coins RWMHA: σ = 0.2.


Figure 4.9 Ten coins RWMHA: σ = 0.1.

The moderate value of σ = 0.1 in Figure 4.9 brings about good mixing (the standard term for a nice travel pattern around the state space) and a reasonable numerical estimate of 0.566 for the posterior mean of 0.559, with an acceptance rate of 0.66. Convergence to the posterior mean is decent even after only 200 or 300 steps of the chain.

On the other hand, the too small value of σ = 0.01 in Figure 4.10 causes an overly high acceptance rate of 0.977, and while the final numerical estimate of 0.599 for the posterior mean is bad enough, just look what a disaster it could have been if we had stopped at N = 300 instead of moving on to 1000 iterations. A poor choice of the initial value for an RWMHA chain causes a substantial bias of the sample mean as the estimate of the posterior mean. Also, a high acceptance rate with low σ values does not prevent positive autocorrelations.


Figure 4.10 Ten coins RWMHA: σ = 0.01.

We will revisit the issues of acceptance rate, bias, and variance reduction in a more general context. At this point we have to admit that there are a number of issues with the choice of the proposal for the MHA implementation. These issues are magnified in multidimensional parameter spaces, causing slow convergence, or even breakdown of convergence, of the chain to the target. In the next section we consider a different MCMC algorithm, which is specifically devised for the multiparametric case and in many instances proves to be more effective than the MHA.

4.4 Gibbs Sampling

Triangular Uniform Distribution

Let us begin with a simple example. Consider the state space S = [0, 1] × [0, 1] and the target distribution with density f(x, y) = 2 if x + y ≤ 1 (the uniform distribution on A = {(x, y): x + y ≤ 1}). How can we generate a sample from this distribution, which is just uniform on A? In a fashion similar to the Monte Carlo estimation of the number π in Section 3.3, we can generate a sample on the unit square S, and then use an accept/reject procedure to throw out the extra points not satisfying the condition x + y ≤ 1. The only problem is that roughly half of the initial sample is lost during the accept/reject procedure. We can save some time by using the fact that the bivariate target distribution admits two simple one-dimensional conditional distributions with densities f(x∣y) ∼ Unif(0, 1 − y) and f(y∣x) ∼ Unif(0, 1 − x). Drawing from each of these uniform distributions is easy.

Let us define an iterative process:

For t = 0 choose x0 uniformly from [0, 1]. Then choose y0 uniformly from [0, 1 − x0].

For t = 1, 2, …

  • Generate a random value xt ∼ Unif[0, 1 − yt − 1];
  • Generate a random value yt ∼ Unif[0, 1 − xt];
  • Use the pairs (xt, yt) as a sample from Unif(A).
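A minimal R sketch of this iterative process, with our own choices of seed and sample size:

```r
# Gibbs sampler for the uniform distribution on the triangle x + y <= 1.
set.seed(1)
N <- 5000
x <- numeric(N)
y <- numeric(N)
x[1] <- runif(1)                 # x0 uniform on [0, 1]
y[1] <- runif(1, 0, 1 - x[1])    # y0 uniform on [0, 1 - x0]
for (t in 2:N) {
  x[t] <- runif(1, 0, 1 - y[t - 1])   # draw from f(x | y)
  y[t] <- runif(1, 0, 1 - x[t])       # draw from f(y | x)
}
# plot(x, y) shows the pairs neatly filling the triangular region A
```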

The resulting sample will neatly fill out the triangular region A. This procedure, involving dimension reduction, was possible because of our knowledge of the conditional distributions and their relatively simple form. It allowed us to sample from univariate conditionals rather than directly from the two-dimensional joint distribution. This idea lies at the foundation of Gibbs sampling. In general, the Gibbs sampling scheme for two parameters can be described as follows.

Gibbs Sampling

Suppose f(θ, φ) ∝ π(θ, φ∣x) is (to within a multiplicative constant) the target density with data x (assumed to be known) and two parameters θ and φ. Let f1(θ) and f2(φ) be the two marginal densities, and let

$$g_1(\theta \mid \varphi) = \frac{f(\theta, \varphi)}{f_2(\varphi)}, \qquad g_2(\varphi \mid \theta) = \frac{f(\theta, \varphi)}{f_1(\theta)}$$

be the two conditional densities. If direct sampling from the conditionals is feasible, the algorithm is as follows:

For t = 0 choose θ0 arbitrarily. Then draw φ0 from g2(φ∣θ0).

For t = 1, 2, …

  • Generate a random value θt ∼ g1(θ∣φt − 1);
  • Generate a random value φt ∼ g2(φ∣θt);
  • Use the pairs (θt, φt) as a sample from f(θ, φ).

The name associated with the algorithm is somewhat misleading, because Gibbs himself had nothing to do with its idea. Geman and Geman [12] introduced the term while developing sampling schemes for Gibbs random fields in their work on image processing problems. Gelfand and Smith [10] were probably the first to point out the full potential of Gibbs sampling (and MCMC in general) for Bayesian statistics.

By construction, each step t of the algorithm is determined by the state at the previous step t − 1; therefore the resulting sample forms a Markov chain. Convergence to the target is achieved by satisfying the detailed balance equation. It is easy to see that updates of individual parameters with proposals g1 and g2 are reversible, as f1(θ)g2(φ∣θ) = f2(φ)g1(θ∣φ), but the entire tth step might not be, and it is much harder to prove that the stationary distribution of the chain is indeed the target. It was first done by Besag [1] under certain additional conditions. For a more detailed discussion see Gamerman and Lopes [9].

Normal Distribution

The following example (see Robert and Casella [25]) addresses the problem of sampling from the bivariate normal distribution. This distribution was defined in Section 1.6 and will play an important role in further discussions in Chapters 5 and 6.

Suppose (θ, φ) ∼ BN(0, Σ), with zero means, unit variances, and covariance matrix

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix},$$

where ρ = ρ(θ, φ) measures the correlation between two components of the random vector.

The algorithm below is a Gibbs-style Markov chain sampling emphasizing the role of conditionals. An alternative direct sampling scheme not involving conditionals is introduced later in Section 5.5 and used in Section 6.6.

It is known from Chapter 1 that both marginals f1(θ) and f2(φ) are standard normal, and the conditionals are also normal. They can be represented as

$$\theta \mid \varphi \sim N(\rho\varphi,\, 1 - \rho^2), \qquad \varphi \mid \theta \sim N(\rho\theta,\, 1 - \rho^2).$$

Using standard procedures for sampling from univariate normal distributions, we will apply Gibbs sampling.

For t = 0 make an arbitrary choice of θ0 = c. Then draw φ0 from N(ρc, 1 − ρ2). By construction, ρ(θ0, φ0) = ρ.

For t = 1, 2, …

  • Generate a random value θt ∼ N(ρφt − 1, 1 − ρ2);
  • Generate a random value φt ∼ N(ρθt, 1 − ρ2);
  • Use the pairs (θt, φt) as a sample from BN(0, Σ).

The correlation structure is maintained at every step, since by construction we get ρ(θt, φt) = ρ. Notice that due to the arbitrary choice of the initial point θ0 = c, this algorithm does not immediately start at the stationary distribution, and it takes some time to get there. However, it is easy to see that the two substeps at step t may be combined to yield the following one-step (t − 1 to t) conditionals:

$$\theta_t \mid \theta_{t-1} \sim N(\rho^2 \theta_{t-1},\, 1 - \rho^4), \qquad \varphi_t \mid \varphi_{t-1} \sim N(\rho^2 \varphi_{t-1},\, 1 - \rho^4),$$

and by recursion

$$\theta_t \mid \theta_0 \sim N(\rho^{2t} \theta_0,\, 1 - \rho^{4t}),$$

so that as t → ∞ the distribution of θt (and similarly of φt) converges to standard normal.
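A minimal R sketch of this Gibbs sampler, with our own choices of ρ, sample size, and a deliberately poor starting value to make the initial transient visible:

```r
# Gibbs sampler for the bivariate normal distribution with correlation rho.
set.seed(1)
N <- 5000
rho <- 0.8
s <- sqrt(1 - rho^2)   # standard deviation of both conditionals
theta <- numeric(N)
phi <- numeric(N)
theta[1] <- 3          # deliberately poor arbitrary starting value c
phi[1] <- rnorm(1, rho * theta[1], s)
for (t in 2:N) {
  theta[t] <- rnorm(1, rho * phi[t - 1], s)   # draw theta | phi
  phi[t]   <- rnorm(1, rho * theta[t], s)     # draw phi | theta
}
cor(theta, phi)        # close to rho once the initial transient fades
```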

The Gibbs sampling algorithm can clearly be generalized to the case of three or more parameters. Technically, it could be considered a version of the Metropolis–Hastings algorithm for multiple parameters without the rejection part. As one can see, the main benefit of this method is avoiding any accept/reject steps and not losing any generated sample elements. The key is: we need to know how to sample directly from the conditional distributions, which fortunately is possible in many interesting applications. However, when posteriors are too complicated and their conditionals are not analytically tractable, as happens to be the case with most applications in the second part of the book, Gibbs sampling cannot be used productively, and the MHA or its modifications seem to be the last resort. “Metropolis within Gibbs” is a hybrid sampling scheme which can be applied when some of the conditional distributions are tractable and some are not, necessitating application of the MHA for those particular components of the vector parameter. In this book we will avoid a too detailed discussion of Gibbs sampling, concentrating instead on the diagnostic issues which are common to all MCMC methods.

4.5 Diagnostics of MCMC

Discussing the rate of convergence of general Monte Carlo methods in Chapter 3, we placed the main emphasis on variance reduction. Monte Carlo estimates were obtained as the sample means of samples drawn directly from the posterior distribution. Sample means converge to the theoretical posterior mean by virtue of the Law of Large Numbers. For Markov chain Monte Carlo methods we have to consider an additional source of error: the sample does not come directly from the posterior distribution, it merely converges to the posterior, and the rate of this convergence has to be taken into account. The arbitrary choice of the initial point, and the behavior of the chain in its randomly long initial segment before it gets into the stationary distribution, cause a possible bias of the MCMC estimators. The rate of convergence for various MCMC procedures, including the MHA and the Gibbs sampler, can be estimated by rigorous mathematical tools, but such results are often very technical and complicated [26]. Therefore monitoring and reducing both bias and variance based on the chain output plays an important role in practical applications of MCMC techniques.

4.5.1 Monitoring Bias and Variance of MCMC

The main visual tool which can be used for monitoring the behavior of an MCMC chain is its time series plot or trace plot. In the case of multiple parameters, trace plots can be drawn for each individual component of the vector parameter. Trace plots are equally applicable to MHA chains and Gibbs sampler chains. Without using this exact name, we have already seen many trace plots for most of the examples in this chapter. For instructive purposes we tended to show both accepted and rejected states for the MHA in Figures 4.5 and 4.6. To avoid unnecessary confusion, one may want to drop the rejected points from consideration (recording the overall acceptance rate as an additional monitoring tool) and show the time series of all the accepted states against the time scale, appended with the path of accumulated sample means. Examples of such trace plots were presented in Figures 4.8, 4.9, and 4.10.

The visual clue suggesting good convergence is the stabilization of the time series of accepted states around the centerline of accumulated means, achieved in the case of Figure 4.9. It is evidently an imprecise and subjective criterion of convergence and it has to be confirmed by more definitive statistical procedures.

Another standard graphical summary of a Markov chain is its histogram. It should be mentioned that histograms do not preserve the order of observations in the chain and thus lose potentially important information which is captured in trace plots. However, histograms provide valid insight into the shape of the sampling distribution of the chain, which can be compared to the posterior density if its form is known. Such features as multimodality, skewness, and fat tails visible in an MCMC histogram can be considered a serious diagnostic warning if they are not expected from the posterior density. Histograms of the chains from Figures 4.9 and 4.10 are demonstrated in Figures 4.11 and 4.12, the latter looking decent but centered at a completely wrong point.


Figure 4.11 Ten coins RWMHA histogram: σ = 0.1.


Figure 4.12 Ten coins RWMHA histogram: σ = 0.01.

As we have noticed, high variance in an MCMC sample may be caused by positive autocorrelations in the chain, and in the case of multiple parameters also by cross-correlations. Therefore the sample autocorrelation function (SACF) is a valuable additional tool for monitoring variance. The sample autocorrelation at lag k for a finite time series yt, t = 1, …, N, is defined as

$$r(k) = \frac{\sum_{t=1}^{N-k} (y_t - \bar{y})(y_{t+k} - \bar{y})}{\sum_{t=1}^{N} (y_t - \bar{y})^2}. \tag{4.6}$$
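In R, the quantity defined in (4.6) is, up to the standard choice of divisor, what the built-in acf() function computes and plots; for example, for the RWMHA chain from the earlier sketch:

```r
# SACF of the RWMHA chain from the earlier sketch
acf(out$chain, lag.max = 20)
```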

Graphs of SACF for the chains from Figures 4.9 and 4.10 are shown in Figures 4.13 and 4.14.


Figure 4.13 Ten coins RWMHA SACF: σ = 0.1.


Figure 4.14 Ten coins RWMHA SACF: σ = 0.01.

The vertical lines on the graphs correspond to the values of the sample autocorrelation r(k) with increasing lag k. Ideally, we want the autocorrelations to be negligible, or at least to cut off or die down when the lag increases, as happens in Figure 4.13. Sometimes negative autocorrelations can bring about faster convergence, as stated in Theorem 3.2, but positive correlations such as those observed in Figure 4.14 inevitably slow it down.

A different approach to monitoring convergence requires running multiple chains simultaneously and projecting them onto one multiple chain plot using the same time scale. Intuitively, when chains starting at several distant states stick together, this is a clear indication of convergence to the stationary state. To prevent multiple chain diagrams from becoming too messy, it might be reasonable to drop the plots of accepted states and picture only the multiple time series of accumulated means, as done in Figures 4.15 and 4.16. Pay attention to the difference in the vertical scales of these figures!


Figure 4.15 Multiple chains for σ = 0.1.


Figure 4.16 Multiple chains for σ = 0.01.

Though multiple chain plots are often very illustrative, clearly exposing the lack of convergence, there is always a question: is it worthwhile spending time on running m small chains of length N rather than running one long chain of length mN, which certainly has more time to converge than each of the smaller chains?

4.5.2 Burn-in and Skip Intervals

All of the trace plots and multiple chain graphs in the previous subsection exhibit very erratic behavior for the first 100 steps or so, and stabilize (or not) later on. This is to be expected, since the initial state was chosen in an arbitrary fashion, causing potential bias of the estimation, and its effect lingers on in the chain until it is completely forgotten, when and if the chain falls into the stationary state. One simple but hard decision is to cut the starting segment off in order to reduce the bias. This starting segment is known as the burn-in phase of the chain. When the decision is made, the first b sample elements are disregarded, and all further inference is based on the remaining sample of size N − b.

This decision is hard for two reasons. First, it is always bad to lose data, even if they are somewhat compromised. The second reason is our very vague idea regarding the exact point where this cut-off has to be implemented. If we cut too early, we do not get rid of the bias; if we cut too late, we throw out some completely valid data. The stochastic nature of the MCMC time series means that, all inputs being equal, some chains will randomly behave better than others, so no pre-fixed value b is going to ideally fit all possible trajectories of the process.

Some usual choices of b are between 0.1N and 0.5N, though this is not a definitive rule. In Figure 4.9, the choice of b = 0.1N = 100 seems plausible, cutting off most of the “garbage” observations in the beginning of the chain and dramatically improving convergence. In Figure 4.6, the choice of b = 0.1N = 10 also improves convergence, while in Figure 4.5 burn-in does not seem to be needed. Recommendations on the appropriate burn-in phase are an important part of some diagnostic checks.

Using burn-in to reduce the bias, we also achieve a certain reduction of variance, which is likely to be higher in the initial segment of the chain. However, a more radical way to reduce the variance is to fight autocorrelation. If the adjacent values of the MCMC time series are too close to each other, and thus highly positively correlated, as happens in Figure 4.10 (which can be observed in Figure 4.14), or in the case of flat segments due to a low acceptance rate, as observed in Figure 4.6, one can consider a rarefaction of the chain, skipping some of the adjacent values. Defining a skip interval k means taking into consideration only every kth element of the sample: t = 1, k + 1, 2k + 1, …, and throwing out everything else. This approach brings about a radical reduction in the sample size, from N to N/k, which may be costly.
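In code, both operations amount to simple indexing; a minimal R sketch, reusing the chain from the RWMHA example above and our own choices of b and k:

```r
# Burn-in and skip interval applied to a chain z.
z <- out$chain                  # chain from the RWMHA sketch above
b <- floor(0.1 * length(z))     # burn-in: discard the first 10%
k <- 5                          # skip interval: keep every 5th element
z_used <- z[seq(b + 1, length(z), by = k)]
mean(z_used)                    # estimate based on the reduced sample
```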

There exist some general results describing conditions under which no burn-in or skip interval is beneficial at all (see [18]). However, both procedures are of a certain value and can be implemented when they are called for as a result of diagnostic checks.

4.5.3 Diagnostics of MCMC

In this section we will briefly describe several of the most popular techniques for assessing the rate of convergence of MCMC sampling, addressing both bias and variance. All of these methods can be applied in many different MCMC settings, including both the MHA and the Gibbs sampler, and can be used for vector parameters. This description is far from complete, and there exist many other methods of convergence diagnostics. Justification of most of the diagnostic tools discussed in this subsection requires a serious level of mathematical competence and is far beyond the level of the book. Nevertheless, their basic ideas can still be presented in an intuitive way, which we are trying to do. Even if the readers are not given enough instruction to run specific diagnostic checks on their own, they will have some familiarity with the topic when invited to run the checks built into special software packages (e.g., BUGS, CODA, JAGS, Stan, and various R tools [17]).

Heidelberger–Welch: Stationarity Test

Using the general theory of time series, we can describe an MCMC chain as stationary (no drift, constant variance and autocorrelations) when it is in or close to its stationary state. On the contrary, if a chain is not stationary, especially in its initial states, it might be beneficial to introduce a burn-in period, cutting off its nonstationary part. Heidelberger and Welch [16] propose to apply a stationarity test based on the Cramer–von Mises statistic to the entire chain of length N. If the chain passes the stationarity test, no burn-in is needed. If the test fails, introduce a burn-in of b = 0.1N and repeat. If it fails again, increase the burn-in to 0.2N and, if needed, to 0.5N. If the test still fails, it is necessary to increase the total length N of the initial chain, say, to 1.5N, and repeat the process. The procedure itself is transparent, other than the determination of the critical values of the Cramer–von Mises statistic. It helps to detect a bias in the chain and to determine the optimal length of the burn-in period.

Geweke: Comparing Two Segments of the Chain

If the ultimate goal of the MCMC sampling is to estimate the posterior mean, it is natural to compare its estimates based on two different segments of the chain. Geweke [13] suggested using the first 0.1N and the last 0.5N sample elements and comparing the sample means of these two subsamples, which are supposed to be close enough. The test is based on the difference of the subsample means. As with the Heidelberger–Welch method, the procedure is very simple, and the devil is in the details: how does one determine the critical value of the test statistic? Geweke's method helps to detect bias and address the variance issues, suggesting a reasonable length of the burn-in.

Raftery–Lewis: Quantile Estimation

Suppose the objective of running an MCMC chain is to estimate a posterior quantile (the most popular being the median, but other percentile points might be of interest too). In the case of quantile estimation, an MCMC time series can be treated as a binary Markov chain with two states: above and below a certain bound. This simplification makes it possible (see [24]) to use the properties of the binomial distribution in order to estimate such characteristics as the chain length N, the burn-in b, and the skip interval k after one single trial run of the chain.

Gelman–Rubin: Multiple Chains

The most popular MCMC diagnostic tool so far is the shrink factor R introduced by Gelman and Rubin in [11], based on a normal approximation for Bayesian posterior inference and involving a multiple chain construction. Gelman–Rubin's method requires one first to draw initial values for multiple chains from an overdispersed estimate of the target distribution. The number of chains should be substantial: at least 10 for a single parameter and more for multiple parameter cases. All chains run for time N, and the second halves of the chains (0.5N observations per chain) are used for the estimation of the pooled within-chain variance W and the between-chain variance B, which depends on the number of chains. The overall estimate of the variance is V = (1 − 2/N)W + (2/N)B, and the shrink factor is determined as

$$R = \sqrt{\frac{V}{W}}.$$

For good convergence of the MCMC process, all chains are supposed to shrink together, and the shrink factor should be close to 1.
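A minimal R sketch of this computation is given below; it assumes the chains' second halves are stored as the columns of a matrix and uses the variance decomposition above (with n = N/2 observations per chain, the formula V = (1 − 2/N)W + (2/N)B becomes (1 − 1/n)W + (1/n)B). The exact definition of B varies slightly across sources, so treat this as an illustration rather than a definitive implementation.

```r
# Shrink factor from m parallel chains (second halves as columns of Z).
shrink_factor <- function(Z) {
  n <- nrow(Z)                    # observations used per chain (0.5N)
  W <- mean(apply(Z, 2, var))     # pooled within-chain variance
  B <- n * var(colMeans(Z))       # between-chain variance
  V <- (1 - 1/n) * W + (1/n) * B  # overall variance estimate
  sqrt(V / W)                     # close to 1 indicates good convergence
}
```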

The main criticism of the Gelman–Rubin diagnostic relates to such issues as the sensitivity of the results to the choice of starting points, the validity of the normal approximation, and the necessity to discard a huge volume of observations (mainly, the first halves of the multiple overdispersed chains), which may not be further used in the inference.

The Metropolis–Hastings algorithm and Gibbs sampling are just two instances of general MCMC sampling methods. The main problem of these methods is that the sample does not start at the target distribution, and it needs time to get there. Modern development of MCMC is driven to a large extent by two objectives: suppressing the bias and reducing the variance. In order to accomplish the first objective, one needs to be able to get to the stationary distribution as fast as possible. The second objective, in the case of the MHA, is usually achieved by the choice of the proposal. In the next section we briefly describe several directions of modification of the basic MCMC algorithms in view of these two objectives.

4.6 Suppressing Bias and Variance

4.6.1 Perfect Sampling

The problem of determining the length of the burn-in period would be resolved if only we knew the exact time when the chain gets into its stationary state, achieving the target distribution. The general properties of Markov chains state that after a chain gets into the stationary state, it stays there forever. Convergence has been achieved, and from that moment on we can draw directly from the target distribution. The intuitive justification of this fact is the understanding that in its stationary state the chain has completely forgotten its past.

Returning to the Sun City example with just two states forming the state space, the bias is related to the difference between the two possible starting points. Beginning the simulation on a rainy day or on a sunny day matters. However, if we start two chains, one from sun and one from rain, then as soon as they hit the same state X at time T, which is known as coupling (and sooner or later this is going to happen), the past becomes irrelevant. The future behavior of both chains is determined only by the common state they share at time T, and not by the previous history. Therefore there is no necessity to run two chains any more: we can run one chain starting from the state X, and as soon as the observations prior to time T are discarded (burn-in), the bias is eliminated.

Imagine now that a chain has a state space with a finite number k of states. Then we can start k chains at time 0 from all possible states, and when any two chains couple, proceed with just one chain instead of these two, so that the total number of chains to run is reduced by 1. Sooner or later all the chains are going to couple together at some state X. The time T when this is achieved is our new starting point, and from there on we proceed with one chain which is known to stay at the stationary distribution, creating a perfect sampling or exact sampling scheme. This procedure, introduced by Propp and Wilson [23], is known as coupling from the past (CFTP).

The price we pay for getting a perfect sample from the target distribution is high: a total of kT chain elements are being discarded. Finite state spaces are also not the most interesting ones. There exist modifications of CFTP for infinite state spaces.

One such modification is the alternative technique of backward sampling or backward coupling, suggested by Foss and Tweedie in [7] using fine mathematical properties of Markov chains. Instead of running multiple chains from the past into the future, it requires running one path backward using the reversibility condition, and after reaching a specific state in the past at time −T, running a forward chain from that state. It can be proven that under rather wide assumptions this one backward/forward path will end up at the stationary state at time 0. However, the backward/forward run may take an unlimited amount of time, thus rendering the entire procedure hardly workable without additional adjustments. It seems that perfect sampling holds a lot of promise but has hardly proven yet to be superior to other MCMC methods. For more recent developments we can refer to two excellent surveys, [4] and [14].

4.6.2 Adaptive MHA

The idea of variance reduction in MCMC can be approached in many different ways. One of the most obvious ideas is related to the fact that “bad mixing” and slow convergence of chains due to high variance can be caused by a bad proposal distribution, which is too far from the target. Bad proposals are inevitable when we do not know much about the target. However, by observing a Monte Carlo chain, we may learn more while the chain still runs. Can we modify the proposal “on the run,” adjusting it as the chain goes?

Adaptive algorithms do exactly that, using information obtained from the past values of the chain to modify the proposal for the future. The main problem with this approach is the loss of the Markov property (according to this property, the future of a chain given its present state cannot depend on the past). However, there exist many sophisticated ways to prove that under certain conditions either the Markov property still holds or its loss does not prevent the chain from converging to the target. A combination of proposal adaptations with perfect sampling may be promising [29] but has not proven to be very useful yet. For more references, see also [14].

4.6.3 ABC and Other Methods

The last decades gave rise to many different MCMC techniques, including particle filtering, slice sampling, Hamiltonian Monte Carlo, quasi-Monte Carlo, and so on. One of the more promising directions is approximate Bayesian computation, or ABC, which provides some solutions when the likelihood is not completely specified and thus traditional MCMC methods fail [14]. Modern MCMC methods can be much faster, more precise, and more efficient than the basic Metropolis–Hastings algorithm and even Gibbs sampling. Unfortunately, trying to keep things simple, we have to avoid further discussion of modern MCMC tools, suggesting that the readers use the multiple reference sources provided above.

4.7 Time-to-Default Analysis of Mortgage Portfolios

In this section we will get back to basics and provide an illustration of the use of the RWMHA. We will apply it to a practical problem of mortgage defaults and discuss some interesting diagnostic advantages of Bayesian approach when implemented via MCMC.

4.7.1 Mortgage Defaults

Mortgage defaults played a critical role in the first financial crisis of the twenty-first century. Indeed, the majority of these were initiated in the subprime sector of the mortgage market. Accordingly, the practice and foundations of subprime lending have received great attention in the literature. Crucial to these conversations is an understanding of the inherent risk in subprime lending. In this section we discuss a Bayesian reliability model for the probability of default (PD) in mortgage portfolios. This approach was first suggested by Soyer and Xu [30], and then developed by Galloway et al. [8]. In the latter work, survival analysis methods were applied to a subprime mortgage portfolio of a major commercial bank. We will briefly discuss some results of this study.

Subprime mortgages are typically higher-interest contracts, considered riskier and not complying with the “prime” status, which would bring about a lower interest rate. The subprime portfolio in question is not homogeneous: it includes mortgages initiated during a 10-year period and represents a variety of terms, conditions, interest rates, and different sectors of the mortgage market. Available data are de-personalized, stripped of any identification, and aggregated. The month of inception and the month of default (if it happened) are the only variables recorded. Early payoffs or refinancing, as other causes of early termination, could be considered competing risks, but the data on these are unavailable. A large number of unobserved factors makes it necessary to find a way to reflect the latent (hidden) heterogeneity in default rates among the mortgage holders.

Out of the many possible approaches to modeling PD, one suggests defining a time-to-default (TTD) variable measuring the life of the mortgage from inception to default. Constructing a parametric model for the distribution of the TTD variable for the entire portfolio will provide estimates of the PD for any given period of time. TTD models are popular in many applications, including insurance, reliability, and biostatistics. We will first refer to one recent application from a seemingly unrelated field of business statistics which has suggested a new approach to modeling mortgage PD.

4.7.2 Customer Retention and Infinite Mixture Models

A long-standing problem of “customer retention,” measuring the length of the period of a customer's loyalty, is especially important in the context of subscriptions which can be renewed or canceled, either at regular time intervals (e.g., monthly or annually) or in continuous time. This problem, historically related to journal subscriptions and customer loyalty programs, has attracted attention recently in connection with online subscription services. The customer retention time T is defined for a population of customers as a random variable measuring the time until cancellation. It may serve as an indicator of the success of a marketing policy. A simple model for T can be derived from the discrete-time Beta-geometric (BG) model proposed by Fader and Hardie [6], which is very successful in describing subscription patterns. This model is based on the following assumptions.

  • (BG1) Cancellations are observed at discrete periods, that is, T ∈ {1, 2, 3, …}.
  • (BG2) The probability of cancellation for an individual customer, θ, remains constant throughout their holding. Thus T is described by the geometric distribution defined in (1.7) with distribution function

    $$ F(t \mid \theta) = 1 - (1 - \theta)^{t}, \quad t = 1, 2, 3, \ldots $$

  • (BG3) The probability of cancellation, θ, varies among subscribers and can be characterized by the Beta(α, β) distribution (1.25) with parameters α, β > 0 and probability density function

    $$ f_B(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}, \quad 0 < \theta < 1. $$

This is an example of an infinite mixture model, where population heterogeneity is expressed by the risk parameter θ pertaining to individual customers and randomly distributed across the population. Integrating θ out brings about the distribution of T depending on the population parameters α and β. This use of conditional and marginal distributions resembles the Bayesian approach, but its purpose may be considered as opposite: while in the Bayesian approach we may use data T = t with prior fB to develop the posterior distribution of the individual customer's risk θ given T = t, the infinite mixture approach leads us to the marginal distribution of T, unconditioned on θ, characterizing the entire population of customers. This requires exactly the calculation of the integral in the denominator of the Bayes formula (2.26), which we were so happy to avoid in most of the examples of Chapter 2.
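For the BG model this mixing integral can be taken in closed form: P(T = t) = B(α + 1, β + t − 1)/B(α, β), where B(·, ·) is the beta function. The following minimal R sketch (the values of a and b are purely illustrative and not tied to any data in this chapter) computes the marginal probabilities both from the closed form and by direct numerical integration of the denominator of the Bayes formula:

  # Integrating theta out in the BG model: theta ~ Beta(a, b) and
  # P(T = t | theta) = theta * (1 - theta)^(t - 1), so the marginal is
  # P(T = t) = B(a + 1, b + t - 1) / B(a, b). Illustrative a and b.
  a <- 1; b <- 2
  t <- 1:8

  # Closed form via the beta function B(., .)
  p.closed <- beta(a + 1, b + t - 1) / beta(a, b)

  # The same marginal by numerical integration of the Bayes-formula denominator
  p.numeric <- sapply(t, function(tt)
    integrate(function(th) th * (1 - th)^(tt - 1) * dbeta(th, a, b), 0, 1)$value)

  round(cbind(t, p.closed, p.numeric), 4)

The two columns agree, illustrating that the population-level distribution of T no longer involves the individual parameter θ.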

The discrete-time Beta-geometric model assumptions oversimplify the reality of the mortgage default setting. To begin with, the TTD for mortgages is typically measured on a continuous-time scale, that is, defaults can occur at any time. This is not a big problem, since one can use a continuous-time exponential-gamma model, where the exponential distribution with parameter θ measures the individual TTD and the gamma distribution characterizes the distribution of θ over the population of customers. This model was also considered in [6]. In the case of the exponential distribution, the hazard rate, just like the probability of cancellation θ in the BG model, does not change with time.

Further, one has to take into account that while penalties for subscription cancellation may be substantial, they rarely get as dramatic as the consequences of a mortgage default. Depending on the circumstances of the mortgage portfolio and its holder, the hazard rate may increase or decrease throughout the duration of the holding. To accommodate these features, Fader and Hardie also proposed a continuous-time Weibull-gamma (WG) model for customer retention. We can translate it into the language of mortgage defaults, treating T as the TTD.

  • (WG1) Defaults can occur at any time, that is, T ≥ 0.
  • (WG2) The risk of default for an individual may increase or decrease throughout their holding. Thus T can be characterized by the Weibull distribution with probability density function fW(t|θ, τ) and corresponding distribution function

    $$ F_W(t \mid \theta, \tau) = 1 - e^{-\theta t^{\tau}}, \quad t \geq 0. $$

    Note that the parameter τ > 0 represents the magnitude and direction of change of the hazard rate (risk of default) over time, with τ = 1 indicating a constant hazard rate and τ > 1 (τ < 1) indicating an increase (decrease) in the hazard rate. Further, θ > 0 represents the transformed scale parameter of the Weibull distribution, with larger θ reflecting a larger risk of default.
  • (WG3) The risk of default varies among mortgage holders. To this end, heterogeneity in the scale parameter θ is modeled by a Gamma(α, λ) distribution with parameters α, λ > 0 and probability density function

    $$ f_G(\theta \mid \alpha, \lambda) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, \theta^{\alpha - 1} e^{-\lambda \theta}, \quad \theta > 0. \tag{4.7} $$

It follows that the joint Weibull-gamma distribution of (t, θ) is characterized by the density function

$$ f(t, \theta \mid \alpha, \lambda, \tau) = f_W(t \mid \theta, \tau)\, f_G(\theta \mid \alpha, \lambda) = \theta \tau t^{\tau - 1} e^{-\theta t^{\tau}} \cdot \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, \theta^{\alpha - 1} e^{-\lambda \theta}. $$

Finally, integrating θ out produces a form of the Burr Type XII distribution:

$$ F_{WG}(t \mid \alpha, \lambda, \tau) = \int_0^{\infty} F_W(t \mid \theta, \tau)\, f_G(\theta \mid \alpha, \lambda)\, d\theta = 1 - \left( \frac{\lambda}{\lambda + t^{\tau}} \right)^{\alpha}. \tag{4.8} $$
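The closed form (4.8) is easy to verify numerically. Here is a minimal R sketch (parameter values are illustrative) comparing the Burr XII survival function to direct integration of the Weibull survival function against the gamma mixing density:

  # Check of (4.8): the Weibull survival exp(-theta * t^tau), averaged over
  # theta ~ Gamma(alpha, rate = lambda), equals (lambda / (lambda + t^tau))^alpha.
  S.wg <- function(t, alpha, lambda, tau) (lambda / (lambda + t^tau))^alpha

  S.num <- function(t, alpha, lambda, tau)
    integrate(function(th) exp(-th * t^tau) * dgamma(th, shape = alpha, rate = lambda),
              0, Inf)$value

  c(closed = S.wg(12, 2, 50, 0.8), numerical = S.num(12, 2, 50, 0.8))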

4.7.3 Latent Classes and Finite Mixture Models

The Weibull-gamma framework under assumptions (WG1)–(WG3) provides a foundation for modeling TTD. However, it does not reflect the segmentation of portfolios into groups with different default risk patterns. For instance, in (4.8) the parameter τ determines a hazard rate which is increasing or decreasing with time for the entire population. There exist two conflicting intuitive justifications for these two patterns. First, with time the equity of the mortgage holder grows, and therefore the losses associated with default also grow; that would cause a sensible mortgage holder to become less and less default prone over time. Second, as is well known, for many households in shaky financial situations the mortgage payments themselves represent the main burden driving down the family finances; this means that the further into the mortgage history, with more payments already made, the higher the risk of default becomes. This suggests the existence of at least two different segments of the population with different values of the parameter τ of the WG model. Let us start with dividing the population into two groups.

To this end, let p ∈ (0, 1) be the segmentation weight reflecting the proportional size of the first group of mortgages and 1 − p be the size of the second group of mortgages. Further, assume default rates may fluctuate throughout the duration of the holding and may vary by group status. Then TTD T can be characterized by a Weibull segmentation model with probability density function fWS(t|θ, τ1, τ2, p) and corresponding distribution function

$$ F_{WS}(t \mid \theta, \tau_1, \tau_2, p) = p\, F_W(t \mid \theta, \tau_1) + (1 - p)\, F_W(t \mid \theta, \tau_2) = 1 - p\, e^{-\theta t^{\tau_1}} - (1 - p)\, e^{-\theta t^{\tau_2}}, \tag{4.9} $$

where τ1 and τ2 represent the magnitude and direction of change of the hazard rates for the two segments, respectively.

This is a finite mixture model, corresponding to two homogeneous Weibull segments of the population. If risk rates expressed by the parameter θ are assumed to be constant across mortgage holders in each segment, then this is a suitable TTD model. Otherwise, heterogeneity in θ can be modeled for each of the two segments by a separate Gamma(α, λ) distribution. In conjunction with the previous equation, it follows that a Weibull-gamma mixture model (WGM) for T has distribution function

$$ F_{WGM}(t \mid \alpha, \lambda, \tau_1, \tau_2, p) = p\, F_{WG}(t \mid \alpha, \lambda, \tau_1) + (1 - p)\, F_{WG}(t \mid \alpha, \lambda, \tau_2), \tag{4.10} $$

with corresponding density function fWGM(t|α, λ, τ1, τ2, p).
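For later computations it may help to see (4.9) and (4.10) written out in code. Here is a minimal R sketch of the two distribution functions under the parameterizations stated above:

  # Distribution functions of the WS model (4.9) and the WGM model (4.10)
  F.w  <- function(t, theta, tau) 1 - exp(-theta * t^tau)                       # Weibull (WG2)
  F.wg <- function(t, alpha, lambda, tau) 1 - (lambda / (lambda + t^tau))^alpha # Burr XII (4.8)

  F.ws <- function(t, theta, tau1, tau2, p)                                     # (4.9)
    p * F.w(t, theta, tau1) + (1 - p) * F.w(t, theta, tau2)

  F.wgm <- function(t, alpha, lambda, tau1, tau2, p)                            # (4.10)
    p * F.wg(t, alpha, lambda, tau1) + (1 - p) * F.wg(t, alpha, lambda, tau2)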

With these properties in mind, we arrive at the WGM, which recognizes the segmentation of portfolios into two distinct groups while allowing heterogeneity in default risks among individual mortgage holders within each group. Similar strategies have been used to model heterogeneous survival data in various contexts. In fact, the WGM can be treated as an extension of the models proposed by Peter Fader and Bruce Hardie in a series of papers for predicting customer retention in subscription settings. Although mortgage defaults and subscription cancellations differ in nature and consequences, there are some similarities between the two. On the other hand, the WGM can be considered an extension of the TTD approach of Soyer and Xu [30].

Specification of the WGM requires the selection of the vector of model parameters (α, λ, τ1, τ2, p). The convention in financial and marketing applications is to estimate parameters from available data by the maximum likelihood estimation (MLE) approach. However, this ignores potentially powerful prior knowledge and beliefs about model parameters. Such knowledge may be based on past experience and field expertise expressed in the form of a prior distribution. Also, MLE for mixture models often lacks robustness, being too sensitive to irregularities in the data. In contrast, we present a Bayesian alternative that evaluates parameters through a combination of observed data and prior knowledge. Though far from standard, Bayesian applications in survival analysis settings are increasingly popular. We can refer to a study of the Bayesian treatment of Weibull mixtures [19]. Another recent paper [22] applied Bayesian techniques to problems of loan prepayment, somewhat related to default analysis.

We illustrate the frequentist and Bayesian applications of the WGM using empirical data for a major American commercial bank [8]. These data include the monthly numbers of defaults for the subprime mortgage portfolios, pooled over 10 years from 2002 to 2011. We will see that in comparison to its MLE competitor, the Bayesian specification of the WGM utilizing the random walk Metropolis algorithm enjoys increased stability and helps to detect overparameterization. We consider the two strategies below: the maximum likelihood method and Bayesian estimation.

4.7.4 Maximum Likelihood Estimation

In most financial and marketing applications, maximum likelihood is the most widely used method of parametric estimation. Let t1, t2, …, tn be the observed TTD for a random sample of n mortgages. Further, define the likelihood function

$$ L(\alpha, \lambda, \tau_1, \tau_2, p) = \prod_{i=1}^{n} f_{WGM}(t_i \mid \alpha, \lambda, \tau_1, \tau_2, p). $$

Taking into account aggregation (defaults recorded at discrete monthly times) and right censoring of the data (finite period of observation), we rewrite the likelihood function. Let n be the total number of mortgages in the portfolio, m the observation period in months, and kj the number of defaults recorded at time (month) j for j = 1, …, m. Then

$$ L_{WGM}(\alpha, \lambda, \tau_1, \tau_2, p) = \prod_{j=1}^{m} \left[ F_{WGM}(j) - F_{WGM}(j-1) \right]^{k_j} \left[ 1 - F_{WGM}(m) \right]^{n - \sum_{j=1}^{m} k_j}, $$

$$ \ln L_{WGM} = \sum_{j=1}^{m} k_j \ln \left[ F_{WGM}(j) - F_{WGM}(j-1) \right] + \left( n - \sum_{j=1}^{m} k_j \right) \ln \left[ 1 - F_{WGM}(m) \right]. $$

Maximum likelihood estimates for the parameters of the WGM are obtained by numerical maximization of the likelihood function. Figure 4.17 shows the fitted WGM parametric curve next to the smoothed histogram of the data. The first 60 months of observations were used for parameter estimation (m = 60), and the rest of the observations (j > 60) were used for model validation. It is clear that the model demonstrates a very poor fit at the right-hand tail.
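The following minimal R sketch shows one way such a numerical maximization may be organized. The monthly default counts k below are synthetic placeholders (the bank data are not reproduced here), and the optimization works on log- and logit-transformed parameters to respect the natural constraints:

  # Numerical MLE for the WGM under aggregated, right-censored data:
  # kj defaults in month j = 1..m out of n mortgages in the portfolio.
  F.wgm <- function(t, alpha, lambda, tau1, tau2, p)
    p * (1 - (lambda / (lambda + t^tau1))^alpha) +
    (1 - p) * (1 - (lambda / (lambda + t^tau2))^alpha)

  negloglik <- function(par, k, n) {
    alpha <- exp(par[1]); lambda <- exp(par[2])   # log scale keeps parameters positive
    tau1  <- exp(par[3]); tau2   <- exp(par[4])
    p     <- plogis(par[5])                       # logit scale keeps p in (0, 1)
    m <- length(k)
    pj <- F.wgm(1:m, alpha, lambda, tau1, tau2, p) -
          F.wgm(0:(m - 1), alpha, lambda, tau1, tau2, p)
    -(sum(k * log(pj)) + (n - sum(k)) * log(1 - F.wgm(m, alpha, lambda, tau1, tau2, p)))
  }

  set.seed(1)
  k <- rpois(60, 5)                       # hypothetical monthly default counts, m = 60
  fit <- optim(c(0, 2, 0.5, -2, -1.5), negloglik, k = k, n = 5000)
  exp(fit$par[1:4]); plogis(fit$par[5])   # back-transformed parameter estimates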


Figure 4.17 WGM model fit with MLE.

Will a Bayesian model perform better?

4.7.5 A Bayesian Model

Bayesian formulation of the WGM (4.10) requires a specification of prior distributions for the relevant parameters. These are chosen to reflect knowledge of the parameters based on past experience and subject-matter expertise. Here, objective priors will be selected to reflect the absence of reliable prior information and to allow for the use of Bayesian methodology without introducing too much subjectivity. The proper choice of objective priors, according to Yang and Berger [32], is as follows:

$$ \pi(\alpha) \propto \frac{\sqrt{\alpha\, \psi'(\alpha) - 1}}{\alpha}, $$

$$ \pi(\lambda) \propto \frac{1}{\lambda}, $$

$$ \pi(\tau_1) \propto \frac{1}{\tau_1}, $$

$$ \pi(\tau_2) \propto \frac{1}{\tau_2}, $$

$$ \pi(p) \propto p^{-1/2} (1 - p)^{-1/2}. $$

Note that the choice of noninformative priors for λ and θ is standard for most scale parameters, π(p) is the Beta(1/2, 1/2) prior, and the least intuitive and most exotic of all is the prior for the shape parameter of the gamma distribution, which is expressed in terms of the polygamma (trigamma) function ψ′(α). Assume also that these prior distributions are independent. Further, let (n, m, k1, k2, …, km) be the observed mortgage default data. Then the posterior distribution of the parameters of the WGM conditioned on the data is characterized by the density function

$$ \pi(\alpha, \lambda, \tau_1, \tau_2, p \mid n, m, k_1, \ldots, k_m) \propto L_{WGM}(\alpha, \lambda, \tau_1, \tau_2, p)\; \pi(\alpha)\, \pi(\lambda)\, \pi(\tau_1)\, \pi(\tau_2)\, \pi(p), $$

where LWGM is the likelihood function defined above.
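Putting the likelihood and the priors together, here is a minimal R sketch of the log-posterior (up to an additive constant). The prior densities encode our reading of the forms listed above and should be treated as assumptions rather than an exact reproduction of the original specification:

  # Log-posterior for the Bayesian WGM: aggregated log-likelihood plus log priors.
  F.wgm <- function(t, alpha, lambda, tau1, tau2, p)
    p * (1 - (lambda / (lambda + t^tau1))^alpha) +
    (1 - p) * (1 - (lambda / (lambda + t^tau2))^alpha)

  log.prior <- function(alpha, lambda, tau1, tau2, p)
    0.5 * log(alpha * trigamma(alpha) - 1) - log(alpha) -   # gamma shape prior (assumed form)
    log(lambda) - log(tau1) - log(tau2) +                   # scale-type 1/x priors
    dbeta(p, 0.5, 0.5, log = TRUE)                          # Beta(1/2, 1/2) prior for p

  log.post <- function(par, k, n) {
    alpha <- par[1]; lambda <- par[2]; tau1 <- par[3]; tau2 <- par[4]; p <- par[5]
    m <- length(k)
    pj <- F.wgm(1:m, alpha, lambda, tau1, tau2, p) -
          F.wgm(0:(m - 1), alpha, lambda, tau1, tau2, p)
    sum(k * log(pj)) + (n - sum(k)) * log(1 - F.wgm(m, alpha, lambda, tau1, tau2, p)) +
      log.prior(alpha, lambda, tau1, tau2, p)
  }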

In the Bayesian setting, let us run a random walk Metropolis–Hastings chain with a normal conditional proposal, as illustrated in Section 4.3 for the case of a single scalar parameter. With a five-component parameter vector, the simplest solution, though not the most economical one, is to apply block updates, that is, to update all five parameter components simultaneously at each step. The acceptance rate will obviously suffer, because the entire block is rejected if even one of the five proposed components is a poor move. Therefore a longer chain is required to provide good mixing, and more careful work has to be done on fine-tuning the proposal, which in this case reduces to the choice of the standard deviations σ for all parameters. The results of RWMHA with a very long run of 3 million steps are illustrated in Figures 4.18 and 4.19 and Table 4.1.
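A minimal R sketch of such a block-update sampler is given below; it reuses log.post and F.wgm from the previous sketch, runs on synthetic data, and uses illustrative tuning constants rather than the production settings behind Table 4.1:

  # Block-update random walk Metropolis for the five-parameter WGM: all five
  # components are perturbed at once by independent normal increments whose
  # standard deviations play the role of the scaling constants sigma.
  set.seed(2)
  k <- rpois(60, 5); n <- 5000                   # hypothetical data, m = 60

  N <- 10000
  chain <- matrix(NA, N, 5,
                  dimnames = list(NULL, c("alpha", "lambda", "tau1", "tau2", "p")))
  cur <- c(2, 50, 2, 0.1, 0.2)                   # illustrative starting point
  cur.lp <- log.post(cur, k, n)
  sigma <- c(0.2, 5, 0.05, 0.01, 0.01)           # proposal sd for each component

  for (i in 1:N) {
    prop <- cur + rnorm(5, sd = sigma)           # one simultaneous (block) proposal
    lp <- if (all(prop[1:4] > 0) && prop[5] > 0 && prop[5] < 1)
      log.post(prop, k, n) else -Inf             # reject moves outside the support
    if (log(runif(1)) < lp - cur.lp) { cur <- prop; cur.lp <- lp }  # accept, else stay
    chain[i, ] <- cur
  }
  mean(apply(diff(chain), 1, function(d) any(d != 0)))   # crude acceptance rate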


Figure 4.18 WGM model fit with Bayes estimates (RWMHA).

Figure 4.19 Trace plots for the WGM model parameters (α, λ, τ1, τ2, p).

Table 4.1 Bayesian parameter estimates for WGM

Parameter   Estimate          Error             Scaling   Value
α           2.422 × 10^11     2.824 × 10^9      σα        10^9
λ           1.022 × 10^15     1.241 × 10^13     σλ        5 × 10^12
τ1          2.179             1.227 × 10^−3     στ1       0.05
τ2          9.386 × 10^−2     3.231 × 10^−3     στ2       0.2
p           0.199             1.293 × 10^−4     σp        0.007

Figure 4.18, representing the model fit, looks like a vast improvement relative to MLE. However, Table 4.1 tells a different story.

Parameter estimates for α and λ behave rather strangely. Table 4.1 suggests very large values for both, and the corresponding scaling parameters are also very high. A good overall fit is achieved without clear convergence of the MCMC estimates, which signals trouble. A closer look at the trace plots might provide additional insight.

As we see in Figure 4.19, the traces for α and λ (the two charts in the upper row) seem to move up and down in coordination (flat intervals correspond to long series of rejections). The other parameters (τi in the middle row and p in the lower row) behave reasonably well, other than sharing the same flat intervals. The information provided by the trace plots is valuable: parameters α and λ are closely dependent!

A possible interpretation of this phenomenon is strikingly simple: the model is overparameterized. We know that if the parameter θ of the Weibull distribution in our mixture has a gamma distribution Gamma(α, λ), then according to (1.19)

$$ E(\theta) = \frac{\alpha}{\lambda}, $$

while Var(θ) = α/λ². Therefore the case of both α → ∞ and λ → ∞ with the ratio α/λ → c corresponds to Var(θ) → 0, that is, to θ = c being a positive constant common for all individual observations. This consideration takes us back from the WGM (4.10) to the model (4.9) with fewer parameters, which assumes homogeneity of the risk rates θ in the population. We can characterize (4.9) as a Weibull segmentation (WS) model, with no need to model heterogeneity within the two segments. The results for this model (also obtained using RWMHA) are demonstrated in Figure 4.20 (model fit), Table 4.2 (parameter estimates), and Figure 4.21 (trace plots). The overall results are much more attractive for the simpler WS model. The most interesting finding for the application might be the interpretation of the parameter values in Table 4.2: two segments were detected. One, including roughly 20% of the portfolio (the value of p), corresponds to τ1 ≈ 2.2 > 1 (an increasing hazard rate). The other, with roughly 80%, includes mortgage holders with τ2 ≈ 0.1 < 1 (a decreasing hazard rate); see [8]. It is interesting that a similar proportion has been found for some subprime loans by different authors [30].


Figure 4.20 WS model fit with Bayes estimates (RWMHA).

Figure 4.21 Trace plots for the WS model parameters (θ, τ1, τ2, p).

Table 4.2 Parameter estimates for the Bayesian WS model, subprime portfolios

Parameter   Estimate          Error             Scaling   Value
θ           2.381 × 10^−4     9.154 × 10^−7     σθ        3 × 10^−5
τ1          2.187             1.286 × 10^−3     στ1       0.02
τ2          0.115             5.286 × 10^−3     στ2       0.07
p           0.197             2.076 × 10^−4     σp        0.004

Aside from some interesting practical results, from a methodological point of view this example illustrates the additional diagnostic opportunities (trace plots) provided by MCMC.

References

  1. Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B, 36(2), 192–236.
  2. Billera, L., and Diaconis, P. (2001). A geometric interpretation of the Metropolis-Hastings algorithm. Statistical Science, 16(4), 335–339.
  3. Bolstad, W. M. (2009). Understanding Computational Bayesian Statistics. New York: John Wiley & Sons.
  4. Craiu, R., and Meng, X. L. (2011). Perfection within reach: exact MCMC sampling. In: Handbook of Markov Chain Monte Carlo, 199–226. Boca Raton: Chapman & Hall/CRC Press.
  5. Dongarra, J., and Sullivan, F. (2000). The top 10 algorithms. Computing in Science & Engineering, 2, 22–33.
  6. Fader, P. S., and Hardie, B. G. S. (2007). How to project customer retention. Journal of Interactive Marketing, 21, 76–90.
  7. Foss, S. G., and Tweedie, R. L. (1998) Perfect simulation and backward coupling. Stochastic Models, 14(1–2), 187–208.
  8. Galloway, M., Johnson, A., and Shemyakin, A. (2014). Time-to-Default Analysis of Mortgage Portfolios, MCMSki-V, Chamonix, France.
  9. Gamerman, D., and Lopes, H. F. (2006). Markov Chain Monte Carlo. Chapman & Hall/CRC.
  10. Gelfand, A. E., and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409.
  11. Gelman, A., and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457–511.
  12. Geman, S., and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
  13. Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculations of posterior moments. In: J. M. Bernardo, J. Berger, A. P. Dawid, and A. F. M. Smith, (editors), Bayesian Statistics, 169–193, Oxford: Oxford University Press.
  14. Green, P. J., Latuszynski, K., Pereyra, M., and Robert, C. P. (2015). Bayesian computation: a summary of the current state, and samples backwards and forwards. Statistics and Computing, 25(4), 835–862.
  15. Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications, Biometrika, 57, 97–109.
  16. Heidelberger, P., and Welch, P. D. (1983). Simulation run length control in the presence of the initial transient. Operations Research, 31, 1109–1144.
  17. Kruschke, J. (2014). Doing Bayesian Data Analysis. A Tutorial with R, JAGS and Stan, 2nd ed. Elsevier, Academic Press.
  18. MacEachern, S. N., and Berliner, L. M. (1994). Subsampling the Gibbs sampler. The American Statistician, 48, 188–190.
  19. Marin, J. M., Rodriguez-Bernal, M. T., and Wiper, M. P. (2005). Using Weibull mixture distributions to model heterogeneous survival data. Communications in Statistics Simulation and Computation, 34, 673–684.
  20. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1091.
  21. Peskun, P. (1973). Optimal Monte Carlo sampling using Markov chains. Biometrika, 60, 607–612.
  22. Popova, I., Popova, E., and George, E. I. (2008). Bayesian forecasting of prepayment rates for individual pools of mortgages. Bayesian Analysis, 3, 393–426.
  23. Propp, J., and Wilson, D. (1996). Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms, 9, 223–252.
  24. Raftery, A., and Lewis, S. (1992). How many iterations in the Gibbs sampler? In: J. M. Bernardo, J. Berger, A. P. Dawid, and A. F. M. Smith (editors), Bayesian Statistics, 763–773, Oxford: Oxford University Press.
  25. Robert, C. P., and Casella, G. (2010). Introducing Monte Carlo Methods with R. Springer-Verlag.
  26. Robert, C. P., and Casella, G. (2004). Monte Carlo Statistical Methods, 2nd ed. Springer-Verlag.
  27. Robert, C. P., and Casella, G. (2011). A short history of MCMC: subjective recollections from incomplete data. Statistical Science, 26(1), 102–115.
  28. Roberts, G.O., Gelman, A., and Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1), 110–120.
  29. Shemyakin, A. (2007). An adaptive backward coupling Metropolis algorithm for truncated distributions. Model Assisted Statistics and Applications, 2(3), 137–143.
  30. Soyer, R., and Xu, F. (2010). Assessment of mortgage default risk via Bayesian reliability models. Applied Stochastic Models in Business and Industry, 26, 308–330.
  31. Teller, E. (2001). Memoirs: A Twentieth Century Journey in Science and Politics. Perseus Publishing.
  32. Yang, R., and Berger, J. O. (1998). A catalog of noninformative priors, Technical Report 42, Duke University Department of Statistical Science.

Exercises

  1. Generate random samples of size 100 and size 1000 for the Sun City example using Excel or R. First use the “fair coin” proposal, and then apply the accept/reject technique from Section 4.1. Compare the resulting estimates of the long-term proportion of rainy days with the theoretical probability 0.25.

  2. Implement IMA (using any software) to estimate posterior mean with Poisson likelihood for the data x = (3, 1, 4, 3, 2) and exponential prior with the mean 2, also known as Gamma(1, 0.5). Choose an exponential proposal density with the parameter value you consider the best. Modify the proposal if you need it badly enough. Use sample size N = 1000. Draw relevant graphs: trace plot, histogram, and sample autocorrelation function.

  3. Implement RWMHA (using any software) to estimate posterior mean with Poisson likelihood for the data x = (3, 1, 4, 3, 2) and exponential prior with the mean 2, also known as Gamma(1, 0.5). Choose a normal proposal density with the parameter value σ you consider the best. Modify the proposal if you need it badly enough. Use sample size N = 1000. Draw relevant graphs: trace plot, histogram, and sample autocorrelation function.

  4. Calculate Geweke statistics for the chains obtained in Exercises 4.2 and 4.3. Compare these values.

  5. Calculate Gelman–Rubin shrink factors for the chains obtained in Exercises 4.2 and 4.3. Compare these values.

  6. Apply burn-in with b = 0.1N and b = 0.5N to the chain obtained in Exercise 4.2. Apply skip with k = 10. Does burn-in or skip or both improve your estimates?

  7. Apply burn-in with b = 0.1N and b = 0.5N to the chain obtained in Exercise 4.3. Apply skip with k = 10. Does burn-in or skip or both improve your estimates?

  8. Calculate Geweke statistics for the chains obtained in Exercises 4.6 and 4.7. Compare these values.

  9. Calculate Gelman–Rubin shrink factors for the chains obtained in Exercises 4.6 and 4.7. Compare these values.
