Chapter 12

Computationally efficient Markov chain simulation

The basic Gibbs sampler and Metropolis algorithm can be seen as building blocks for more advanced Markov chain simulation algorithms that can work well for a wide range of problems. In Sections 12.1 and 12.2, we discuss reparameterizations and settings of tuning parameters to make Gibbs and Metropolis more efficient. Section 12.4 presents Hamiltonian Monte Carlo, a generalization of the Metropolis algorithm that includes ‘momentum’ variables so that each iteration can move farther in parameter space, thus allowing faster mixing, especially in high dimensions. We follow up in Sections 12.5 and 12.6 with an application to a hierarchical model and a discussion of our program Stan, which implements HMC for general models.

12.1   Efficient Gibbs samplers

Transformations and reparameterization

The Gibbs sampler is most efficient when parameterized in terms of independent components; Figure 11.2 shows an example with highly dependent components that create slow convergence. The simplest way to reparameterize is by a linear transformation of the parameters, but posterior distributions that are not approximately normal may require special methods.

The same arguments apply to Metropolis jumps. In a normal or approximately normal setting, the jumping kernel should ideally have the same covariance structure as the target distribution, which can be approximately estimated based on the normal approximation at the mode (as we discussed in Chapter 13). Markov chain simulation of a distribution with multiple modes can be greatly improved by allowing jumps between modes.

Auxiliary variables

Gibbs sampler computations can often be simplified or convergence accelerated by adding auxiliary variables, for example indicators for mixture distributions, as described in Chapter 22. The idea of adding variables is also called data augmentation and is often a useful conceptual and computational tool, both for the Gibbs sampler and for the EM algorithm (see Section 13.4).

Example. Modeling the t distribution as a mixture of normals

A simple but important example of auxiliary variables arises with the t distribution, which can be expressed as a mixture of normal distributions, as noted in Chapter 3 and discussed in more detail in Chapter 17. We illustrate with the example of inference for the parameters μ, σ² given n independent data points from the tν(μ, σ²) distribution, where for simplicity we assume ν is known. We also assume a uniform prior distribution on μ, log σ. The t likelihood for each data point is equivalent to the model,

yi | Vi ~ N(μ, Vi),   Vi ~ Inv-χ²(ν, σ²),    (12.1)

where the Vi’s are auxiliary variables that cannot be directly observed. If we perform inference using the joint posterior distribution, p(μ, σ², V | y), and then just consider the simulations for μ, σ, these will represent the posterior distribution under the original t model.

There is no direct way to sample the parameters μ, σ² in the t model, but it is straightforward to perform the Gibbs sampler on V, μ, σ² in the augmented model:

1.  Conditional posterior distribution of each Vi. Conditional on the data y and the other parameters of the model, each Vi is a normal variance parameter with a scaled inverse-χ² prior distribution, and so its posterior distribution is also scaled inverse-χ² (see Section 2.6):

The n parameters Vi are independent in their conditional posterior distribution, and we can directly apply the Gibbs sampler by sampling from their scaled inverse-χ² distributions.

2.  Conditional posterior distribution of μ. Conditional on the data y and the other parameters of the model, information about μ is supplied by the n data points yi, each with its own variance. Combining with the uniform prior distribution on μ yields,

3.  Conditional posterior distribution of σ². Conditional on the data y and the other parameters of the model, all the information about σ comes from the variances Vi. The conditional posterior distribution is,

from which we can sample directly.
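As a concrete illustration of these three steps, here is a minimal sketch in R of the Gibbs sampler for the t model with known ν. The function name t_gibbs and the starting values are our own choices, and the conditional draws use the standard scale-mixture forms (a scaled inverse-χ² draw for each Vi, a precision-weighted normal draw for μ, and a gamma draw for σ² under the uniform prior on log σ).

# Gibbs sampler for the t_nu(mu, sigma^2) model using the auxiliary variables V_i.
# A sketch only; y is the data vector and nu is the (known) degrees of freedom.
t_gibbs <- function(y, nu, n_iter = 2000) {
  n <- length(y)
  mu <- mean(y)
  sigma2 <- var(y)                      # crude starting values
  draws <- matrix(NA, n_iter, 2, dimnames = list(NULL, c("mu", "sigma")))
  for (t in 1:n_iter) {
    # 1. each V_i | . is scaled inverse-chi^2 with nu + 1 degrees of freedom
    V <- (nu * sigma2 + (y - mu)^2) / rchisq(n, nu + 1)
    # 2. mu | . is normal with precision-weighted mean of the y_i
    mu <- rnorm(1, sum(y / V) / sum(1 / V), sqrt(1 / sum(1 / V)))
    # 3. sigma^2 | . is a gamma draw given the V_i (uniform prior on log sigma)
    sigma2 <- rgamma(1, shape = n * nu / 2, rate = (nu / 2) * sum(1 / V))
    draws[t, ] <- c(mu, sqrt(sigma2))
  }
  draws
}
# example: sims <- t_gibbs(y = rt(100, df = 4) * 2 + 5, nu = 4)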

Parameter expansion

For some problems, the Gibbs sampler can be slow to converge because of posterior dependence among parameters that cannot simply be resolved with a linear transformation. Paradoxically, adding an additional parameter—thus performing the random walk in a larger space—can improve the convergence of the Markov chain simulation. We illustrate with the t example above.

Example. Fitting the t model (continued)

In the latent-parameter form (12.1) of the t model, convergence will be slow if a simulation draw of σ is close to zero, because the conditional distributions will then cause the Vi’s to be sampled with values near zero, and then the conditional distribution of σ will again be concentrated near zero, and so on. Eventually the simulations will get unstuck but it can be slow for some problems. We can fix things by adding a new parameter whose only role is to allow the Gibbs sampler to move in more directions and thus avoid getting stuck. The expanded model is,

yi ~ N(μ, α²Ui),   Ui ~ Inv-χ²(ν, τ²),

where α > 0 can be viewed as an additional scale parameter. In this new model, α²Ui plays the role of Vi in (12.1) and ατ plays the role of σ. The parameter α has no meaning on its own and we can assign it a noninformative uniform prior distribution on the logarithmic scale.

The Gibbs sampler on this expanded model now has four steps:

1.  For each i, Ui is updated much as Vi was before:

2.  The mean, μ, is updated as before:

3.  The variance parameter τ² is updated much as σ² was before:

4.  Finally, we must update α², which is easy since conditional on all the other parameters in the model it is simply a normal variance parameter:

The parameters α², U, τ in this expanded model are not identified in that the data do not supply enough information to estimate each of them. However, the model as a whole is identified as long as we monitor convergence of the summaries μ, σ = ατ, and Vi = α²Ui for i = 1, …, n. (Or, if the only goal is inference for the original t model, we can simply save μ and σ from the simulations.)

The Gibbs sampler under the expanded parameterization converges more reliably because the new parameter α breaks the dependence between τ and the Vi’s.

We discuss parameter expansion for hierarchical models in Section 15.5 and illustrate in Appendix C.

12.2   Efficient Metropolis jumping rules

For any given posterior distribution, the Metropolis-Hastings algorithm can be implemented in an infinite number of ways. Even after reparameterizing, there are still endless choices in the jumping rules, Jt. In many situations with conjugate families, the posterior simulation can be performed entirely or in part using the Gibbs sampler, which is not always efficient but generally is easy to program, as we illustrated with the hierarchical normal model in Section 11.6. For nonconjugate models we must rely on Metropolis-Hastings algorithms (either within a Gibbs sampler or directly on the multivariate posterior distribution). The choice of jumping rule then arises.

There are two main classes of simple jumping rules. The first are essentially random walks around the parameter space. These jumping rules are often normal jumping kernels with mean equal to the current value of the parameter and variance set to obtain efficient algorithms. The second approach uses proposal distributions that are constructed to closely approximate the target distribution (either the conditional distribution of a subset in a Gibbs sampler or the joint posterior distribution). In the second case the goal is to accept as many draws as possible with the Metropolis-Hastings acceptance step being used primarily to correct the approximation. There is no natural advantage to altering one parameter at a time except for potential computational savings in evaluating only part of the posterior density at each step.

It is hard to give general advice on efficient jumping rules, but some results have been obtained for random walk jumping distributions that have been useful in many problems. Suppose there are d parameters, and the posterior distribution of θ = (θ1, …, θd), after appropriate transformation, is multivariate normal with known variance matrix Σ. Further suppose that we will take draws using the Metropolis algorithm with a normal jumping kernel centered on the current point and with the same shape as the target distribution: that is, J(θ* | θt−1) = N(θ* | θt−1, c²Σ). Among this class of jumping rules, the most efficient has scale c ≈ 2.4/√d, where efficiency is defined relative to independent sampling from the posterior distribution. The efficiency of this optimal Metropolis jumping rule for the d-dimensional normal distribution can be shown to be about 0.3/d (by comparison, if the d parameters were independent in their posterior distribution, the Gibbs sampler would have efficiency 1/d, because after every d iterations, a new independent draw of θ would be created). Which algorithm is best for any particular problem also depends on the computation time for each iteration, which in turn depends on the conditional independence and conjugacy properties of the posterior density.

A Metropolis algorithm can also be characterized by the proportion of jumps that are accepted. For the multivariate normal random walk jumping distribution with jumping kernel the same shape as the target distribution, the optimal jumping rule has acceptance rate around 0.44 in one dimension, declining to about 0.23 in high dimensions (roughly d > 5). This result suggests an adaptive simulation algorithm:

1.  Start the parallel simulations with a fixed algorithm, such as a version of the Gibbs sampler, or the Metropolis algorithm with a normal random walk jumping rule shaped like an estimate of the target distribution (using the covariance matrix computed at the joint or marginal posterior mode scaled by the factor 2.4/√d).

2.  After some number of simulations, update the Metropolis jumping rule as follows.

(a)   Adjust the covariance of the jumping distribution to be proportional to the posterior covariance matrix estimated from the simulations.

(b)   Increase or decrease the scale of the jumping distribution if the acceptance rate of the simulations is much too high or too low, respectively. The goal is to bring the acceptance rate toward its approximate optimal value of 0.44 (in one dimension) or 0.23 (when many parameters are being updated at once using vector jumping).

This algorithm can be improved in various ways, but even in its simple form, we have found it useful for drawing posterior simulations for some problems with d ranging from 1 to 50.
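The following R sketch puts these ideas together for a random-walk Metropolis sampler: during warm-up the jumping covariance is periodically reset to an empirical estimate of the posterior covariance scaled by (2.4/√d)², and the fixed phase then runs without further adaptation. For brevity the sketch adapts the covariance (step 2(a)) but not the overall scale (step 2(b)); the function log_post, the adaptation interval of 500 iterations, and the small ridge added to the covariance estimate are all our own choices.

# Random-walk Metropolis with jumping covariance c^2 * Sigma, c = 2.4/sqrt(d), adapted during warm-up only.
adaptive_metropolis <- function(log_post, theta0, n_warm = 2000, n_keep = 2000) {
  d <- length(theta0)
  c_scale <- 2.4 / sqrt(d)
  Sigma <- diag(d)                                     # initial guess at the posterior covariance
  theta <- theta0
  warm <- matrix(NA, n_warm, d)
  for (t in 1:n_warm) {
    prop <- theta + c_scale * drop(rnorm(d) %*% chol(Sigma))
    if (log(runif(1)) < log_post(prop) - log_post(theta)) theta <- prop
    warm[t, ] <- theta
    if (t %% 500 == 0)                                 # step 2(a): rescale to the estimated covariance
      Sigma <- cov(warm[1:t, , drop = FALSE]) + 1e-6 * diag(d)
  }
  keep <- matrix(NA, n_keep, d)
  for (t in 1:n_keep) {                                # fixed phase: no further adaptation
    prop <- theta + c_scale * drop(rnorm(d) %*% chol(Sigma))
    if (log(runif(1)) < log_post(prop) - log_post(theta)) theta <- prop
    keep[t, ] <- theta
  }
  keep
}
# example: draws <- adaptive_metropolis(function(th) sum(dnorm(th, 0, 1, log = TRUE)), rep(0, 5))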

Adaptive algorithms

When an iterative simulation algorithm is ‘tuned’—that is, modified while it is running—care must be taken to avoid converging to the wrong distribution. If the updating rule depends on previous simulation steps, then the transition probabilities are more complicated than as stated in the Metropolis-Hastings algorithm, and the iterations will not in general converge to the target distribution. To see the consequences, consider an adaptation that moves the algorithm more quickly through flat areas of the distribution and moves more slowly when the posterior density is changing rapidly. This would make sense as a way of exploring the target distribution, but the resulting simulations would spend disproportionately less time in the flat parts of the distribution and more time in variable parts; the resulting simulation draws would not match the target distribution unless some sort of correction is applied.

To be safe, we typically run any adaptive algorithm in two phases: first, the adaptive phase, where the parameters of the algorithm can be tuned as often as desired to increase the simulation efficiency, and second, a fixed phase, where the adapted algorithm is run long enough for approximate convergence. Only simulations from the fixed phase are used in the final inferences.

12.3   Further extensions to Gibbs and Metropolis

Slice sampling

A random sample of θ from the d-dimensional target distribution, p(θ | y), is equivalent to a random sample from the area under the distribution (for example, the shaded area under the curve in the illustration of rejection sampling in Figure 10.1 on page 264). Formally, sampling is performed from the (d+1)-dimensional distribution of (θ, u), where, for any θ, p(θ, u | y) ∝ 1 for u ∈ [0, p(θ | y)] and 0 otherwise. Slice sampling refers to the application of iterative simulation algorithms on this uniform distribution. The details of implementing an effective slice sampling procedure can be complicated, but the method can be applied in great generality and can be especially useful for sampling one-dimensional conditional distributions in a Gibbs sampling structure.
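For the one-dimensional case, a simple slice sampler can be written in a few lines of R. This is a sketch of the ‘stepping out and shrinkage’ procedure in the style of Neal (2003); log_f is the unnormalized log density of whatever conditional distribution is being updated, and the initial interval width w is a tuning constant of our choosing.

# One step of a one-dimensional slice sampler (stepping out, then shrinkage).
slice_sample <- function(x0, log_f, w = 1) {
  log_u <- log_f(x0) - rexp(1)                  # log of the auxiliary height u ~ U(0, f(x0))
  L <- x0 - runif(1) * w                        # randomly position an initial interval of width w
  R <- L + w
  while (log_f(L) > log_u) L <- L - w           # step out until both ends are outside the slice
  while (log_f(R) > log_u) R <- R + w
  repeat {                                      # sample within the interval, shrinking after rejections
    x1 <- runif(1, L, R)
    if (log_f(x1) > log_u) return(x1)
    if (x1 < x0) L <- x1 else R <- x1
  }
}
# example: x <- 0; draws <- replicate(1000, x <<- slice_sample(x, function(t) dnorm(t, log = TRUE)))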

Reversible jump sampling for moving between spaces of differing dimensions

In a number of settings it is desirable to carry out a trans-dimensional Markov chain simulation, in which the dimension of the parameter space can change from one iteration to the next. One example where this occurs is in model averaging where a single Markov chain simulation is constructed that includes moves among a number of plausible models (perhaps regression models with different sets of predictors). The ‘parameter space’ for such a Markov chain simulation includes the traditional parameters along with an indication of the current model. A second example includes finite mixture models (see Chapter 22) in which the number of mixture components is allowed to vary.

It is still possible to perform the Metropolis algorithm in such settings, using the method of reversible jump sampling. We use notation corresponding to the case where a Markov chain moves among a number of candidate models. Let Mk, k = 1, …, K, denote the candidate models and θk the parameter vector for model k with dimension dk. A key aspect of the reversible jump approach is the introduction of additional random variables that enable the matching of parameter space dimensions across models. Specifically, if a move from k to k* is being considered, then an auxiliary random variable u with jumping distribution J(u | k, k*, θk) is generated. A series of one-to-one deterministic functions are defined that do the dimension-matching with (θk*, u*) = gk,k*(θk, u) and dk + dim(u) = dk* + dim(u*). The dimension matching ensures that the balance condition needed to prove the convergence of the Metropolis-Hastings algorithm in Chapter 11 continues to hold here.

We present the reversible jump algorithm in general terms followed by an example. For the general description, let πk denote the prior probability on model k, p(θk | Mk) the prior distribution for the parameters in model k, and p(y | θk, Mk) the sampling distribution under model k. Reversible jump Markov chain simulation generates samples from p(k, θk | y) using the following three steps at each iteration:

1.  Starting in state (k, θk) (that is, model Mk with parameter vector θk), propose a new model Mk* with probability Jk,k* and generate an augmenting random variable u from the proposal density J(u | k, k*, θk).

2.  Determine the proposed model’s parameters, (θk*, u*) = gk,k*(θk, u).

3.  Define the ratio

r = [p(y | θk*, Mk*) p(θk* | Mk*) πk* Jk*,k J(u* | k*, k, θk*)] / [p(y | θk, Mk) p(θk | Mk) πk Jk,k* J(u | k, k*, θk)] × |det(∂gk,k*(θk, u)/∂(θk, u))|    (12.2)

and accept the new model with probability min(r, 1).

The resulting posterior draws provide inference about the posterior probability for each model as well as the parameters under that model.

Example. Testing a variance component in a logistic regression

The application of reversible jump sampling, especially the use of the auxiliary random variables u, is seen most easily through an example.

Consider a probit regression for survival of turtles in a natural selection experiment. Let yij denote the binary response for turtle i in family j with Pr(yij = 1) = pij for i = 1, …, nj and j = 1, …, J. The weight xij of the turtle is known to affect survival probability, and it is likely that familial factors also play a role. This suggests the model, pij = Φ(α0 + α1xij + bj). It is natural to model the bj’s as exchangeable family effects, bj ~ N(0, τ²). The prior distribution p(α0, α1, τ) is not central to this discussion so we do not discuss it further here.

Suppose for the purpose of this example that we seek to test whether the variance component τ is needed by running a Markov chain that considers the model with and without the varying intercepts, bj. As emphasized in Chapters 6–7, we much prefer to fit the model with the variance parameter and assess its importance by examining its posterior distribution. However, it might be of interest to consider the model that allows τ = 0 as a discrete possibility, and we choose this example to illustrate the reversible jump algorithm.

Let M0 denote the model with τ = 0 (no variance component) and M1 denote the model including the variance component. We use numerical integration to compute the marginal likelihood p(y | α0, α1, τ) for model M1. Thus the bj’s are not part of the iterative simulation under model M1. The reversible jump algorithm takes π0 = π1 = 0.5 and J0,0 = J0,1 = J1,0 = J1,1 = 0.5. At each step we either take a Metropolis step within the current model (with probability 0.5) or propose a jump to the other model. If we are in model 0 and are proposing a jump to model 1, then the auxiliary random variable is u ~ J(u) (scaled inverse-χ² in this case) and we define the parameter vector for model 1 by setting τ² = u and leaving α0 and α1 as they were in the previous iteration. The ratio (12.2) is then

because the prior distributions on α and the models cancel, and the Jacobian of the transformation is 1. The candidate model is accepted with probability min(r, 1).

There is no auxiliary random variable for going from model 1 to model 0. In that case we merely set τ = 0, and the acceptance probability is the reciprocal of the above. In the example we chose J(τ²) based on a pilot analysis of model M1 (an inverse-χ² distribution matching the posterior mean and variance).

Simulated tempering and parallel tempering

Multimodal distributions can pose special problems for Markov chain simulation. The goal is to sample from the entire posterior distribution and this requires sampling from each of the modes with significant posterior probability. Unfortunately it is easy for Markov chain simulations to remain in the neighborhood of a single mode for a long period of time. This occurs primarily when two (or more) modes are separated by regions of extremely low posterior density. Then it is difficult to move from one mode to the other because, for example, Metropolis jumps to the region between the two modes are rejected.

Simulated tempering is one strategy for improving Markov chain simulation performance in this case. As usual, we take p(θ | y) to be the target density. The algorithm works with a set of K+1 distributions pk(θ | y), k = 0, 1, …, K, where p0(θ | y) = p(θ | y), and p1, …, pK are distributions with the same basic shape but with improved opportunities for mixing across the modes, and each of these distributions comes with its own sampler (which might, for example, be a separately tuned Metropolis or HMC algorithm). As usual, the distributions pk need not be fully specified; it is only necessary that the user can compute unnormalized density functions qk, where qk(θ) equals pk(θ | y) multiplied by a constant which can depend on y and k but not on the parameters θ. (We write qk(θ), but with the understanding that, since the qk’s are built for a particular posterior distribution p(θ | y), they can in general depend on y.)

One choice for the ladder of unnormalized densities qk is

qk(θ) = p(θ | y)^(1/Tk),

for a set of ‘temperature’ parameters Tk > 0. Setting Tk = 1 reduces to the original density, and large values of Tk produce less highly peaked modes. (That is, ‘high temperatures’ add ‘thermal noise’ to the system.) A single composite Markov chain simulation is then developed that randomly moves across the K+1 distributions, with T0 set to 1 so that q0(θ) ∝ p(θ | y). The state of the composite Markov chain at iteration t is represented by the pair (θt, st), where st is an integer identifying the distribution used at iteration t. Each iteration of the composite Markov chain simulation consists of two steps:

1.  A new value θt+1 is selected using the Markov chain simulation with stationary distribution qst.

2.  A jump from the current sampler st to an alternative sampler j is proposed with probability Jst,j. We accept the move with probability min(r, 1), where

r = (cj qj(θt+1) Jj,st) / (cst qst(θt+1) Jst,j).

The constants ck for k = 0, 1, …, K are set adaptively (that is, assigned initial values and then altered after the simulation has run a while) to approximate the inverses of the normalizing constants for the distributions defined by the unnormalized densities qk. The chain will then spend an approximately equal amount of time in each sampler.

At the end of the Markov chain simulation, only those values of θ simulated from the target distribution (q0) are used to obtain posterior inferences.
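Here is a minimal R sketch of simulated tempering for a two-mode, one-dimensional target. The temperature ladder, the random-walk scale, and the number of iterations are arbitrary choices, and the constants ck are simply left at 1 rather than adapted, so the chain will not divide its time evenly across the rungs; it nevertheless has the correct stationary distribution, and only the draws made at the target rung (T = 1) are kept.

# Simulated tempering on a bimodal target with ladder q_k(theta) = p(theta | y)^(1/T_k).
log_p <- function(theta) log(0.5 * dnorm(theta, -10, 1) + 0.5 * dnorm(theta, 10, 1))
Temp  <- c(1, 5, 25, 125)                       # first rung (T = 1) is the target distribution
log_c <- rep(0, length(Temp))                   # log c_k, left at 0 here; could be set adaptively
log_q <- function(theta, k) log_p(theta) / Temp[k]

theta <- 0; k <- 1; draws <- numeric(0)
for (t in 1:20000) {
  # Metropolis update of theta within the current tempered density q_k
  prop <- rnorm(1, theta, 2)
  if (log(runif(1)) < log_q(prop, k) - log_q(theta, k)) theta <- prop
  # propose a new rung j uniformly at random (a symmetric jumping rule)
  j <- sample(length(Temp), 1)
  if (log(runif(1)) < (log_c[j] + log_q(theta, j)) - (log_c[k] + log_q(theta, k))) k <- j
  if (k == 1) draws <- c(draws, theta)          # keep only draws made at the target rung
}
# hist(draws) should show both modes, near -10 and +10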

Parallel tempering is a variant of the above algorithm in which K + 1 parallel chains are simulated, one for each density qk in the ladder. Each chain moves on its own but with occasional flipping of states between chains, with a Metropolis accept-reject rule similar to that in simulated tempering. At convergence, the simulations from chain 0 represent draws from the target distribution.

Other auxiliary variable methods have been developed that are tailored to particular structures of multivariate distributions. For example, highly correlated variables such as arise in spatial statistics can be simulated using multigrid sampling, in which computations are done alternately on the original scale and on coarser scales that do not capture the local details of the target distribution but allow faster movement between states.

Particle filtering, weighting, and genetic algorithms

Particle filtering describes a class of simulation algorithms involving parallel chains, in which existing chains are periodically tested and allowed to die, live, or split, with the rule set up so that chains in lower-probability areas of the posterior distribution are more likely to die and those in higher-probability areas are more likely to split. The idea is that a large number of chains can explore the parameter space, with the birth/death/splitting steps allowing the ensemble of chains to more rapidly converge to the target distribution. The probabilities of the different steps are set up so that the stationary distribution of the entire process is the posterior distribution of interest.

A related idea is weighting, in which a simulation is performed that converges to a specified but wrong distribution, g(θ), and then the final draws are weighted by p(θ | y)/g(θ). In more sophisticated implementations, this reweighting can be done throughout the simulation process. It can sometimes be difficult or expensive to sample from p(θ | y) and faster to work with a good approximation g if available. Weighting can be combined with particle filtering by using the weights in the die/live/split probabilities.

Genetic algorithms are similar to particle filtering in having multiple chains that can live or die, but with the elaboration that the updating algorithms themselves can change (‘mutate’) and combine (‘sexual reproduction’). Many of these ideas are borrowed from the numerical analysis literature on optimization but can also be effective in a posterior simulation setting in which the goal is to converge to a distribution rather than to a single best value.

12.4   Hamiltonian Monte Carlo

An inherent inefficiency in the Gibbs sampler and Metropolis algorithm is their random walk behavior—as illustrated in Figures 11.1 and 11.2 on pages 276 and 277, the simulations can take a long time zigging and zagging while moving through the target distribution. Reparameterization and efficient jumping rules can improve the situation (see Sections 12.1 and 12.2), but for complicated models this local random walk behavior remains, especially for high-dimensional target distributions.

Hamiltonian Monte Carlo (HMC) borrows an idea from physics to suppress the local random walk behavior in the Metropolis algorithm, thus allowing it to move much more rapidly through the target distribution. For each component θj in the target space, Hamiltonian Monte Carlo adds a ‘momentum’ variable φj. Both θ and φ are then updated together in a new Metropolis algorithm, in which the jumping distribution for θ is determined largely by φ. Each iteration of HMC proceeds via several steps, during which the position θ and momentum φ evolve based on rules imitating the behavior of position and momentum in physical space; the steps can move rapidly where possible through the space of θ and can even turn corners in parameter space so as to preserve the total ‘energy’ of the trajectory. Hamiltonian Monte Carlo is also called hybrid Monte Carlo because it combines MCMC and deterministic simulation methods.

In HMC, the posterior density p(θ | y) (which, as usual, needs only be computed up to a multiplicative constant) is augmented by an independent distribution p(φ) on the momenta, thus defining a joint distribution, p(θ, φ | y) = p(φ) p(θ | y). We simulate from the joint distribution but we are only interested in the simulations of θ; the vector φ is thus an auxiliary variable, introduced only to enable the algorithm to move faster through the parameter space.

In addition to the posterior density (which, as usual, needs to be computed only up to a multiplicative constant), HMC also requires the gradient of the log-posterior density. In practice the gradient must be computed analytically; numerical differentiation requires too many function evaluations to be computationally effective. If θ has d dimensions, this gradient is d log p(θ | y)/dθ = (d log p(θ | y)/dθ1, …, d log p(θ | y)/dθd). For most of the models we consider in this book, this vector is easy to determine analytically and then program. When writing and debugging the program, we recommend also programming the gradient numerically (using finite differences of the log-posterior density) as a check on the programming of the analytic gradients. If the two subroutines do not return identical results to several decimal places, there is likely a mistake somewhere.
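The check can be automated with a small helper such as the following R sketch, which compares a user-supplied analytic gradient against central finite differences; the names log_post and grad_log_post are placeholders for whatever routines you have written for your model.

# Compare an analytic gradient of the log posterior with central finite differences.
check_gradient <- function(log_post, grad_log_post, theta, h = 1e-4) {
  numeric_grad <- sapply(seq_along(theta), function(j) {
    e <- rep(0, length(theta)); e[j] <- h
    (log_post(theta + e) - log_post(theta - e)) / (2 * h)
  })
  cbind(analytic = grad_log_post(theta), numeric = numeric_grad)
}
# example with a standard normal log density, whose gradient is -theta:
# check_gradient(function(th) sum(dnorm(th, log = TRUE)), function(th) -th, c(0.3, -1.2))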

The momentum distribution, p(φ)

It is usual to give φ a multivariate normal distribution (recall that φ has the same dimension as θ) with mean 0 and covariance set to a prespecified ‘mass matrix’ M (so called by analogy to the physical model of Hamiltonian dynamics). To keep it simple, we commonly use a diagonal mass matrix, M. If so, the components of φ are independent, with φj ~ N(0, Mjj) for each dimension j = 1, …, d. It can be useful for M to roughly scale with the inverse covariance matrix of the posterior distribution, (var(θ | y))⁻¹, but the algorithm works in any case; better scaling of M will merely make HMC more efficient.

The three steps of an HMC iteration

HMC proceeds by a series of iterations (as in any Metropolis algorithm), with each iteration having three parts:

1.  The iteration begins by updating φ with a random draw from its posterior distribution—which, as specified, is the same as its prior distribution, φ ~ N(0, M).

2.  The main part of the Hamiltonian Monte Carlo iteration is a simultaneous update of (θ, φ), conducted in an elaborate but effective fashion via a discrete mimicking of physical dynamics. This update involves L ‘leapfrog steps’ (to be defined in a moment), each scaled by a factor ε. In a leapfrog step, both θ and φ are changed, each in relation to the other. The L leapfrog steps proceed as follows:

Repeat the following steps L times:

(a)   Use the gradient (the vector derivative) of the log-posterior density of θ to make a half-step of φ:

φ ← φ + ½ ε d log p(θ | y)/dθ.

(b)   Use the ‘momentum’ vector φ to update the ‘position’ vector θ:

θ ← θ + ε M⁻¹ φ.

Again, M is the mass matrix, the covariance of the momentum distribution p(φ). If M is diagonal, the above step amounts to scaling each dimension of the θ update. (It might seem redundant to include ε in the above expression: why not simply absorb it into M, which can itself be set by the user? The reason is that it can be convenient in tuning the algorithm to alter ε while keeping M fixed.)

(c)   Again use the gradient of the log-posterior density of θ to half-update φ:

φ ← φ + ½ ε d log p(θ | y)/dθ.

Except at the first and last step, updates (c) and (a) above can be performed together. The stepping thus starts with a half-step of φ, then alternates L − 1 full steps of the parameter vector θ and the momentum vector φ, and concludes with a half-step of φ. This algorithm (called a ‘leapfrog’ because of the splitting of the momentum updates into half steps) is a discrete approximation to physical Hamiltonian dynamics in which both position and momentum evolve in continuous time.

In the limit of ε near zero, the leapfrog algorithm preserves the joint density p(θ, φ | y). We will not give the proof, but here is some intuition. Suppose the current value of θ is at a flat area of the posterior. Then d log p(θ | y)/dθ will be zero, and in step 2 above, the momentum φ will remain constant. Thus the leapfrog steps will skate along in θ-space with constant velocity. Now suppose the algorithm moves toward an area of low posterior density. Then d log p(θ | y)/dθ will be negative in this direction, thus in step 2 inducing a decrease in the momentum in the direction of movement. As the leapfrog steps continue to move into an area of lower density in θ-space, the momentum continues to decrease. The decrease in log p(θ | y) is matched (in the limit ε → 0, exactly so) by a decrease in the ‘kinetic energy’; equivalently, log p(φ) increases, so that the joint density stays constant. And if iterations continue to move in the direction of decreasing density, the leapfrog steps will slow to zero and then turn back or curve around the dip. Now consider the algorithm heading in a direction in which the posterior density is increasing. Then d log p(θ | y)/dθ will be positive in that direction, leading in step 2 to an increase in momentum in that direction. As log p(θ | y) increases, log p(φ) decreases correspondingly (the trajectory picks up speed) until the trajectory eventually moves past or around the mode and then starts to slow down.

For finite ε, the joint density p(θ, φ | y) does not remain entirely constant during the leapfrog steps but it will vary only slowly if ε is small. For reasons we do not discuss here, the leapfrog integrator has the pleasant property that combining L steps of error δ does not produce Lδ error, because the dynamics of the algorithm tend to send the errors weaving back and forth around the exact value that would be obtained by a continuous integration. Keeping the discretization error low is important because of the next part of the HMC algorithm, the accept/reject step.

3.  Label θt−1, φt−1 as the value of the parameter and momentum vectors at the start of the leapfrog process and θ*, φ* as the value after the L steps. In the accept-reject step, we compute

r = [p(θ* | y) p(φ*)] / [p(θt−1 | y) p(φt−1)].

4.  Set θt = θ* with probability min(r, 1), and θt = θt−1 otherwise.

Strictly speaking it would be necessary to set φt as well, but since we do not care about φ in itself and it gets immediately updated at the beginning of the next iteration (see step 1 above), there is no need to keep track of it after the accept/reject step.

As with any other MCMC algorithm, we repeat these iterations until approximate convergence, as assessed by R̂ being near 1 and the effective sample size being large enough for all quantities of interest; see Section 11.4.
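The following R function is a minimal sketch of one complete HMC iteration as just described, with a diagonal mass matrix stored as the vector M of momentum variances. The functions log_post and grad_log_post are assumed to be supplied by the user (a full implementation for the example of Section 12.5 appears in Appendix C); if the trajectory lands in a region of zero posterior density, the ratio r is zero and the iteration stays at the previous value of θ.

# One Hamiltonian Monte Carlo iteration (steps 1-4 above) with a diagonal mass matrix M.
hmc_iteration <- function(theta, log_post, grad_log_post, epsilon, L, M) {
  d <- length(theta)
  phi <- rnorm(d, 0, sqrt(M))                            # step 1: draw momentum phi ~ N(0, M)
  theta_old <- theta
  log_joint_old <- log_post(theta) + sum(dnorm(phi, 0, sqrt(M), log = TRUE))
  phi <- phi + 0.5 * epsilon * grad_log_post(theta)      # first half-step of phi
  for (l in 1:L) {
    theta <- theta + epsilon * phi / M                   # full step of theta
    phi <- phi + (if (l == L) 0.5 else 1) * epsilon * grad_log_post(theta)  # full step, or final half-step, of phi
  }
  log_joint_star <- log_post(theta) + sum(dnorm(phi, 0, sqrt(M), log = TRUE))
  r <- exp(log_joint_star - log_joint_old)               # steps 3-4: accept-reject on the joint density
  if (is.finite(r) && runif(1) < r) theta else theta_old
}
# example: one iteration for a standard normal target in 3 dimensions
# hmc_iteration(rep(1, 3), function(th) -sum(th^2) / 2, function(th) -th, epsilon = 0.1, L = 10, M = rep(1, 3))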

Restricted parameters and areas of zero posterior density

HMC is designed to work with all-positive target densities. If at any point during an iteration the algorithm reaches a point of zero posterior density (for example, if the steps go below zero when updating a parameter that is restricted to be positive), we stop the stepping and give up, spending another iteration at the previous value of θ. The resulting algorithm preserves detailed balance and stays in the positive zone.

An alternative is ‘bouncing,’ where again the algorithm checks that the density is positive after each step and, if not, changes the sign of the momentum so that the trajectory heads back in the direction from which it came. This again preserves detailed balance and is typically more efficient than simply rejecting the iteration, for example with a hard boundary for a parameter that is restricted to be positive.

Another way to handle bounded parameters is via transformation, for example taking the logarithm of a parameter constrained to be positive or the logit for a parameter constrained to fall between 0 and 1, or more complicated joint transformations for sets of parameters that are constrained jointly. One must then work out the Jacobian of the transformation and use it to determine the log posterior density and its gradient in the new space.

Setting the tuning parameters

HMC can be tuned in three places: (i) the probability distribution for the momentum variables φ (which, in our implementation, requires specifying the diagonal elements of a covariance matrix, that is, a scale parameter for each of the d dimensions of the parameter vector), (ii) the scaling factor ε of the leapfrog steps, and (iii) the number of leapfrog steps L per iteration.

As with the Metropolis algorithm in general, these tuning parameters can be set ahead of time, or they can be altered completely at random (a strategy which can sometimes be helpful in keeping an algorithm from getting stuck), but one has to take care when altering them given information from previous iterations. Except in some special cases, adaptive updating of the tuning parameters alters the algorithm so that it no longer converges to the target distribution. So when we set the tuning parameters, we do so during the warm-up period: that is, we start with some initial settings, then run HMC for a while, then reset the tuning parameters based on the iterations so far, then discard the early iterations that were used for warm-up. This procedure can be repeated if necessary, as long as the saved iterations use only simulations after the last setting of the tuning parameters.

How, then, to set the parameters that govern HMC? We start by setting the scale parameters for the momentum variables to some crude estimate of the scale of the target distribution. (One can also incorporate covariance information but here we will assume a diagonal covariance matrix so that all that is required is the vector of scales.) By default we could simply use the identity matrix.

We then set the product εL to 1. This roughly calibrates the HMC algorithm to the ‘radius’ of the target distribution; that is, L steps, each of length ε times the already-chosen scale of φ, should roughly take you from one side of the distribution to the other. A default starting point could be ε = 0.1, L = 10.

Finally, theory suggests that HMC is optimally efficient when its acceptance rate is approximately 65% (based on an analysis similar to that which finds an optimal 23% acceptance rate for the multidimensional Metropolis algorithm). The theory is based on all sorts of assumptions but seems like a reasonable guideline for optimization in practice. For now we recommend a simple adaptation in which HMC is run with its initial settings and then adapted if the average acceptance probability (as computed from the simulations so far) is not close to 65%. If the average acceptance probability is lower, then the leapfrog jumps are too ambitious and you should lower ε and correspondingly increase L (so the product εL remains 1). Conversely, if the average acceptance probability is much higher than 65%, then the steps are too cautious and we recommend raising ε and lowering L (not forgetting that L must be an integer). These rules do not solve all problems, and it should be possible to develop diagnostics to assess the efficiency of HMC to allow for more effective adaptation of the tuning parameters.
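As a sketch of this adjustment rule, the following R fragment halves or doubles ε depending on the average acceptance probability from a warm-up run (assumed available here as avg_accept) while keeping the product εL near 1; the threshold of ±0.1 around the 65% target is an arbitrary choice of ours.

# Crude warm-up adjustment of the HMC tuning parameters, keeping epsilon * L approximately 1.
adjust_hmc <- function(epsilon, L, avg_accept, target = 0.65) {
  if (avg_accept < target - 0.1) epsilon <- epsilon / 2   # jumps too ambitious: take smaller steps
  if (avg_accept > target + 0.1) epsilon <- epsilon * 2   # jumps too cautious: take bigger steps
  L <- max(1, round(1 / epsilon))                         # keep epsilon * L roughly equal to 1
  list(epsilon = epsilon, L = L)
}
# example: adjust_hmc(0.1, 10, avg_accept = 0.35) returns epsilon = 0.05, L = 20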

Varying the tuning parameters during the run

As with MCMC tuning more generally, any adaptation can go on during the warm-up period, but adaptation performed later on, during the simulations that will be used for inference, can cause the algorithm to converge to the wrong distribution. For example, suppose we were to increase the step size after high-probability jumps and decrease it when the acceptance probability is low. Such an adaptation seems appealing but would destroy the detailed balance (that is, the property of the algorithm that the flow of probability mass from point A to B is the same as from B to A, for any points A and B in the posterior distribution) that is used to prove that the posterior distribution of interest is the stationary distribution of the Markov chain.

Completely random variation of ε and L, however, causes no problems with convergence and can be useful. If we randomly vary the tuning parameters (within specified ranges) from iteration to iteration while the simulation is running, the algorithm has a chance to take long tours through the posterior distribution when possible and make short movements where the iterations are stuck in a cramped part of the space. The price for this variation is some potential loss of optimality, as the algorithm will also take short steps where long tours would be feasible and try for long steps where the space is too cramped for such jumps to be accepted.

Locally adaptive HMC

For difficult HMC problems, it would be desirable for the tuning parameters to vary as the algorithm moves through the posterior distribution, with the mass matrix M scaling to the local curvature of the log density, the step size ε getting smaller in areas where the curvature is high, and the number of steps L being large enough for the trajectory to move far through the posterior distribution without being so large that the algorithm circles around and around. To this end, researchers have developed extensions of HMC that adapt without losing detailed balance. These algorithms are more complicated and can require more computations per iteration but can converge more effectively for complicated distributions. We describe two such algorithms here but without giving the details.

The no-U-turn sampler. In the no-U-turn sampler, the number of steps is determined adaptively at each iteration. Instead of running for a fixed number of steps, L, the trajectory in each iteration continues until it turns around (more specifically, until we reach a negative value of the dot product between the momentum variable φ and the distance traveled from the position θ at the start of the iteration). This rule essentially sends the trajectory as far as it can go during that iteration. If such a rule is applied alone, the simulations will not converge to the desired target distribution. The full no-U-turn sampler is more complicated, going backward and forward along the trajectory in a way that satisfies detailed balance. Along with this algorithm comes a procedure for adaptively setting the mass matrix M and step size ε; these parameters are tuned during the warm-up phase and then held fixed during the later iterations which are kept for the purpose of posterior inference.

Riemannian adaptation. Another approach to optimization is Riemannian adaptation, in which the mass matrix M is set to conform with the local curvature of the log posterior density at each step. Again, the local adaptation allows the sampler to move much more effectively but the steps of the algorithm need to become more complicated to maintain detailed balance. Riemannian adaptation can be combined with the no-U-turn sampler.

Neither of the above extensions solves all the problems with HMC. The no-U-turn sampler is self-tuning and computationally efficient but, like ordinary Hamiltonian Monte Carlo, has difficulties with very short-tailed and long-tailed distributions, in both cases having difficulties transitioning from the center to the tails, even in one dimension. Riemannian adaptation handles varying curvature and non-exponentially tailed distributions but is impractical in high dimensions.

Combining HMC with Gibbs sampling

There are two ways in which ideas of the Gibbs sampler fit into Hamiltonian Monte Carlo. First, it can make sense to partition variables into blocks, either to simplify computation or to speed convergence. Consider a hierarchical model with J groups, with parameter vector θ = (η(1), η(2), …, η(J), ψ), where each of the η(j)’s is itself a vector of parameters corresponding to the model for group j and ψ is a vector of hyperparameters, and for which the posterior distribution can be factored as p(θ | y) ∝ p(ψ) ∏j p(η(j) | ψ) p(y(j) | η(j)). In this case, even if it is possible to update the entire vector θ at once using HMC, it may be more effective—in computation speed or convergence—to cycle through J + 1 updating steps, altering each η(j) and then ψ during each cycle. This way we only have to work with at most one of the likelihood factors, p(y(j) | η(j)), at each step. Parameter expansion can be used to facilitate quicker mixing through the joint distribution.

The second way in which Gibbs sampler principles can enter HMC is through the updating of discrete variables. Hamiltonian dynamics are only defined on continuous distributions. If some of the parameters in a model are defined on discrete spaces (for example, latent indicators for mixture components, or a parameter that follows a continuous distribution but has a positive probability of being exactly zero), they can be updated using Gibbs steps or, more generally, one-dimensional updates such as Metropolis or slice sampling (see Section 12.3). The simplest approach is to partition the space into discrete and continuous parameters, then alternate HMC updates on the continuous subspace and Gibbs, Metropolis, or slice updates on the discrete components.

12.5   Hamiltonian dynamics for a simple hierarchical model

We illustrate the tuning of Hamiltonian Monte Carlo with the model for the educational testing experiments described in Chapter 5. HMC is not necessary in this problem—the Gibbs sampler works just fine, especially after the parameter expansion which allows more efficient movement of the hierarchical variance parameter (see Section 12.1)—but it is helpful to understand the new algorithm in a simple example. Here we go through all the steps of the algorithm. The code appears in Section C.4, starting on page 601.

In order not to overload our notation, we label the eight school effects (defined as θj in Chapter 5) as αj; the full vector of parameters θ then has d = 10 dimensions, corresponding to α1, …, α8, μ, τ.

Gradients of the log posterior density. For HMC we need the gradients of the log posterior density for each of the ten parameters, a set of operations that are easily performed with the normal distributions of this model:

As a debugging step we also compute the gradients numerically using finite differences of ±0.0001 on each component of θ. Once we have checked that the two gradient routines yield identical results, we use the analytic gradient in the algorithm as it is faster to compute.
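For reference, here is our own R sketch of the log posterior and its analytic gradient for this model, assuming the uniform prior distribution on (μ, τ) used in Chapter 5; y and sigma are the vectors of school estimates and standard errors from Table 5.2, and the parameter vector is ordered as (α1, …, α8, μ, τ). The expressions follow from differentiating the normal log densities of the model.

# Log posterior and analytic gradient for the eight-schools model, theta = (alpha_1..alpha_8, mu, tau),
# assuming a uniform prior on (mu, tau). y and sigma are the school estimates and standard errors.
log_post <- function(th, y, sigma) {
  alpha <- th[1:8]; mu <- th[9]; tau <- th[10]
  if (tau <= 0) return(-Inf)
  sum(dnorm(y, alpha, sigma, log = TRUE)) + sum(dnorm(alpha, mu, tau, log = TRUE))
}
grad_log_post <- function(th, y, sigma) {
  alpha <- th[1:8]; mu <- th[9]; tau <- th[10]
  d_alpha <- (y - alpha) / sigma^2 - (alpha - mu) / tau^2
  d_mu    <- sum(alpha - mu) / tau^2
  d_tau   <- sum((alpha - mu)^2) / tau^3 - 8 / tau
  c(d_alpha, d_mu, d_tau)
}
# numerical check, using the check_gradient sketch from Section 12.4:
# th <- c(rnorm(9, 0, 15), 5)
# check_gradient(function(t) log_post(t, y, sigma), function(t) grad_log_post(t, y, sigma), th)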

The mass matrix for the momentum distribution. As noted above, we want to scale the mass matrix to roughly match the posterior distribution. That said, we typically only have a vague idea of the posterior scale before beginning our computation; thus this scaling is primarily intended to forestall the problems that would arise if there are gross disparities in the scaling of different dimensions. In this case, after looking at the data in Table 5.2 we assign a rough scale of 15 for each of the parameters in the model and crudely set the mass matrix to Diag(15, …, 15).

Starting values. We run 4 chains of HMC with starting values drawn at random to crudely match the scale of the parameter space, in this case following the idea above and drawing the ten parameters in the model from independent N(0, 15²) distributions.

Tuning ε and L. To give the algorithm more flexibility, we do not set ε and L to fixed values. Instead we choose central values ε0, L0 and then at each step draw ε and L independently from uniform distributions on (0, 2ε0) and [1, 2L0], respectively (with the distribution for L being discrete uniform, as L must be an integer). We have no reason to think this particular jittering is ideal; it is just a simple way to vary the tuning parameters in a way that does not interfere with convergence of the algorithm. Following the general advice given above, we start by setting ε0L0 = 1 and L0 = 10. We simulate 4 chains for 20 iterations just to check that the program runs without crashing.

We then do some experimentation. We first run 4 chains for 100 iterations and see that the inferences are reasonable (no extreme values, as can sometimes happen when there is poor convergence or a bug in the program) but not yet close to convergence, with several values of R̂ that are more than 2. The average acceptance probabilities of the 4 chains are 0.23, 0.59, 0.02, and 0.57, well below 65%, so we suspect the step size is too large.

We decrease ε0 to 0.05, increase L0 to 20 (thus keeping ε0L0 constant), and rerun the 4 chains for 100 iterations, now getting acceptance rates of 0.72, 0.87, 0.33, and 0.55, with the chains still far from mixing. At this point we increase the number of simulations to 1000. The simulations now are close to convergence, with R̂ less than 1.2 for all parameters, and average acceptance probabilities are more stable, at 0.52, 0.68, 0.75, and 0.51. We then run 4 chains at 10,000 simulations at these tuning parameters and achieve approximate convergence, with R̂ less than 1.1 for all parameters.

In this particular example, HMC is unnecessary, as the Gibbs sampler works fine on an appropriately transformed scale. In larger and more difficult problems, however, Gibbs and Metropolis can be too slow, while HMC can move efficiently through high-dimensional parameter spaces.

Transforming to log τ

When running HMC on a model with constrained parameters, the algorithm can go outside the boundary, thus wasting some iterations. One remedy is to transform the space to be unconstrained. In this case, the simplest way to handle the constraint τ > 0 is to transform to log τ. We then must alter the algorithm in the following ways:

1.  We redefine θ as (α1, …, α8, μ, log τ) and do all jumping on this new space.

2.  The (unnormalized) posterior density p(θy) is multiplied by the Jacobian, τ, so we add log τ to the log posterior density used in the calculations.

3.  The gradient of the log posterior density changes in two ways: first, we need to account for the new term added just above; second, the derivative for the last component of the gradient is now with respect to log τ rather than τ and so must be multiplied by the Jacobian, τ:

4.  We change the mass matrix to account for the transformation. We keep α1, …, α8, μ with masses of 15 (roughly corresponding to a posterior distribution with a scale of 15 in each of these dimensions) but set the mass of log τ to 1.

5.  We correspondingly change the initial values by drawing the first nine parameters from independent N(0, 15²) distributions and log τ from N(0, 1).

HMC runs as before. Again, we start with ε = 0.1 and L = 10 and then adjust to get a reasonable acceptance rate.
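In code, the change amounts to a one-line Jacobian adjustment of the log posterior and a chain-rule adjustment of the last gradient component. The sketch below builds on the log_post and grad_log_post functions sketched earlier in this section, with the tenth parameter now interpreted as log τ.

# Reparameterization in terms of log(tau): add log(tau) to the log posterior (the log Jacobian),
# and adjust the last component of the gradient by the chain rule.
log_post_logtau <- function(th, y, sigma) {
  tau <- exp(th[10])
  log_post(c(th[1:9], tau), y, sigma) + th[10]            # + log tau from the Jacobian
}
grad_log_post_logtau <- function(th, y, sigma) {
  tau <- exp(th[10])
  g <- grad_log_post(c(th[1:9], tau), y, sigma)
  c(g[1:9], g[10] * tau + 1)                              # d/d(log tau) = tau * d/d(tau), plus 1 from the Jacobian term
}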

12.6   Stan: developing a computing environment

Hamiltonian Monte Carlo takes a bit of effort to program and tune. In more complicated settings, though, we have found HMC to be faster and more reliable than basic Markov chain simulation algorithms.

To mitigate the challenges of programming and tuning, we have developed a computer program, Stan (Sampling through adaptive neighborhoods) to automatically apply HMC given a Bayesian model. The key steps of the algorithm are data and model input, computation of the log posterior density (up to an arbitrary constant that cannot depend on the parameters in the model) and its gradients, a warm-up phase in which the tuning parameters are set, an implementation of the no-U-turn sampler to move through the parameter space, and convergence monitoring and inferential summaries at the end.

We briefly describe how each of these steps is done in Stan. Instructions and examples for running the program appear in Appendix C.

Entering the data and model

Each line of a Stan model goes into defining the log probability density of the data and parameters, with code for looping, conditioning, computation of intermediate quantities, and specification of terms of the log joint density. Standard distributions such as the normal, gamma, binomial, Poisson, and so forth, are preprogrammed, and arbitrary distributions can be entered by directly programming the log density. Algebraic manipulations and functions such as exp and logit can also be included in the specification; it is all just sent into C++.

To compute gradients, Stan uses automatic analytic differentiation, using an algorithm that parses arbitrary C++ expressions and then applies basic rules of differential calculus to construct a C++ program for the gradient. Automatic differentiation carries some overhead compared to carefully hand-coded derivatives, so for computational efficiency we have preprogrammed the gradients for various standard statistical expressions to make up some of this difference. We use special scalar variable classes that evaluate the function and at the same time construct the full expression tree used to generate the log probability. Then the reverse pass walks backward down the expression tree (visiting every dependent node before any node it depends on), propagating partial derivatives by the chain rule. The walk over the expression tree implicitly employs dynamic programming to minimize the number of calculations. The resulting autodifferentiation is typically much faster than computing the gradient numerically via finite differences.

In addition to the data, parameters, and model statements, a Stan call also needs the number of chains, the number of iterations per chain, and various control parameters that can be set by default. Starting values can be supplied or else they are generated from preset default random variables.

Setting tuning parameters in the warm-up phase

As noted above, it can be tricky to tune Hamiltonian Monte Carlo for any particular example. The no-U-turn sampler helps with this, as it eliminates the need to assign the number of steps L, but we still need to set the mass matrix M and step size . During a prespecified warm-up phase of the simulation, Stan adaptively alters M and using ideas from stochastic optimization in numerical analysis. This adaptation will not always work—for distributions with varying curvature, there will not in general be any single good set of tuning parameters—and if the simulation is having difficulty converging, it can make sense to look at the values of M and chosen for different chains to better understand what is happening. Convergence can sometimes be improved by reparameterization. More generally, it could make sense to have different tuning parameters for different areas of the distribution—this is related to ideas such as Riemannian adaptation, which at the time of this writing we are incorporating into Stan.

No-U-turn sampler

Stan runs HMC using the no-U-turn sampler, preprocessing where possible by transforming bounded variables to put them on an unconstrained scale. For complicated constraints this cannot always be done automatically and then it can make sense for the user to reparameterize in writing the model. While running, Stan keeps track of acceptance probabilities (as well as the simulations themselves), which can be helpful in getting inside the algorithm if there are problems with mixing of the chains.

Inferences and postprocessing

Stan produces multiple sequences of simulations. For our posterior inferences we discard the iterations from the warm-up period (but we save them as possibly of diagnostic use if the algorithm is not mixing well) and compute R̂ and neff as described in Section 11.4.

12.7   Bibliographic note

For the relatively simple ways of improving simulation algorithms mentioned in Sections 12.1 and 12.2, Tanner and Wong (1987) discuss data augmentation and auxiliary variables, and Hills and Smith (1992) and Roberts and Sahu (1997) discuss different parameterizations for the Gibbs sampler. Higdon (1998) discusses some more complicated auxiliary variable methods, and Liu and Wu (1999), van Dyk and Meng (2001), and Liu (2003) present different approaches to parameter expansion. The results on acceptance rates for efficient Metropolis jumping rules appear in Gelman, Roberts, and Gilks (1995); more general results for Metropolis-Hastings algorithms appear in Roberts and Rosenthal (2001) and Brooks, Giudici, and Roberts (2003).

Gelfand and Sahu (1994) discuss the difficulties of maintaining convergence to the target distribution when adapting Markov chain simulations, as discussed at the end of Section 12.2. Andrieu and Robert (2001) and Andrieu and Thoms (2008) consider adaptive Markov chain Monte Carlo algorithms.

Slice sampling is discussed by Neal (2003), and simulated tempering is discussed by Geyer and Thompson (1993) and Neal (1996b). Besag et al. (1995) and Higdon (1998) review several ideas based on auxiliary variables that have been useful in high-dimensional problems arising in genetics and spatial models.

Reversible jump MCMC was introduced by Green (1995); see also Richardson and Green (1997) and Brooks, Giudici, and Roberts (2003) for more on trans-dimensional MCMC.

Mykland, Tierney, and Yu (1994) discuss an approach to MCMC in which the algorithm has regeneration points, or subspaces of θ, so that if a finite sequence starts and ends at a regeneration point, it can be considered as an exact (although dependent) sample from the target distribution. Propp and Wilson (1996) and Fill (1998) introduce a class of MCMC algorithms called perfect simulation in which, after a certain number of iterations, the simulations are known to have exactly converged to the target distribution.

The book by Liu (2001) covers a wide range of advanced simulation algorithms including those discussed in this chapter. The monograph by Neal (1993) also overviews many of these methods. Hamiltonian Monte Carlo was introduced by Duane et al. (1987) in the physics literature and Neal (1994) for statistics problems. Neal (2011) reviews HMC, Hoffman and Gelman (2013) introduce the no-U-turn sampler, and Girolami and Calderhead (2011) introduce Riemannian updating; see also Betancourt and Stein (2011) and Betancourt (2012, 2013). Romeel (2011) explains how leapfrog steps tend to reduce discretization error in HMC. Leimkuhler and Reich (2004) discuss the mathematics in more detail. Griewank and Walther (2008) is a standard reference on algorithmic differentiation.

12.8   Exercises

1.  Efficient Metropolis jumping rules: Repeat the computation for Exercise 11.2 using the adaptive algorithm given in Section 12.2.

2.  Simulated tempering: Consider the Cauchy model, yi ~ Cauchy(θ, 1), i = 1, …, n, with two data points, y1 = 1.3, y2 = 15.0.

(a)   Graph the posterior density.

(b)   Program the Metropolis algorithm for this problem using a symmetric Cauchy jumping distribution. Tune the scale parameter of the jumping distribution appropriately.

(c)   Program simulated tempering with a ladder of 10 inverse-temperatures, 0.1, …, 1.

(d)   Compare your answers in (b) and (c) to the graph in (a).

3.  Hamiltonian Monte Carlo: Program HMC in R for the bioassay logistic regression example from Chapter 3.

(a)   Code the gradients analytically and numerically and check that the two programs give the same result.

(b)   Pick reasonable starting values for the mass matrix, step size, and number of steps.

(c)   Tune the algorithm to an approximate 65% acceptance rate.

(d)   Run 4 chains long enough so that each has an effective sample size of at least 100. How many iterations did you need?

(e)   Check that your inferences are consistent with those from the direct approach in Chapter 3.

4.  Coverage of intervals and rejection sampling: Consider the following model: yj ~ Binomial(nj, θj), where θj = logit⁻¹(α + βxj), and with independent prior distributions on α and β. Assume J = 10, the xj values are randomly drawn from a U(−1, 1) distribution, and nj ~ Poisson⁺(5), where Poisson⁺ is the Poisson distribution restricted to positive values.

(a)   Sample a dataset at random from the model, estimate α and β using Stan, and make a graph simultaneously displaying the data, the fitted model, and uncertainty in the fit (shown via a set of inverse logit curves that are thin and gray (in R, lwd=.5, col=“gray”)).

(b)   Did Stan’s posterior 50% interval for α contain its true value? How about β?

(c)   Use rejection sampling to get 1000 independent posterior draws from (α, β).
