Chapter 13

Modal and distributional approximations

The early chapters of the book describe simulation approaches that work in low-dimensional problems. With complicated models, it is rare that samples from the posterior distribution can be obtained directly, and Chapters 11 and 12 describe iterative simulation algorithms that can be used with such models. In this chapter we describe various approaches based on distributional approximations. These methods are useful for quick inferences, as starting points for Markov chain simulation algorithms, and for large problems where iterative simulation approaches are too slow. The approximations that we describe are relatively simple to compute and can provide valuable information about the fit of the model.

In Section 13.1 we discuss algorithms for finding posterior modes. Beyond being useful in constructing distributional approximations, the posterior mode is often used in statistical practice as a point estimate, sometimes in the guise of a penalized likelihood estimate (where the logarithm of the prior density is considered as a penalty function). Section 13.2 discusses how, if the goal is to summarize the posterior distribution by a mode, it can make sense to use a different prior distribution than would be used in full Bayesian inference. Section 13.3 presents normal and normal-mixture approximations centered at the mode. We continue in Sections 13.4–13.6 with EM (expectation maximization) and related approaches for finding marginal posterior modes, along with related approximations based on factorizing the joint posterior distribution. Finally, Sections 13.7 and 13.8 introduce variational Bayes and expectation propagation, two methods for constructing approximations to a distribution based on conditional moments.

The proliferation of algorithms for Bayesian computing reflects the proliferation of challenging applied problems that we are trying to solve using Bayesian methods. These problems typically have large numbers of unknown parameters, hence the appeal of Bayesian inference and hence also the struggles with computation. When different approximating strategies are available, it can make sense to fit the model in multiple ways and then use the tools described in Chapters 6 and 7 to evaluate and compare them.

13.1   Finding posterior modes

In Bayesian computation, we search for modes not for their own sake, but as a way to begin mapping the posterior density. In particular, we have no special interest in finding the absolute maximum of the posterior density. If many modes exist, we should try to find them all, or at least all the modes with non-negligible posterior mass in their neighborhoods. In practice, we often first search for a single mode, and if it does not look reasonable in a substantive sense, we continue searching through the parameter space for other modes. To find all the local modes—or to make sure that a mode that has been found is the only important mode—sometimes one must run a mode-finding algorithm several times from different starting points.

Even better, where possible, is to find the modes of the marginal posterior density of a subset of the parameters. One then analyzes the distribution of the remaining parameters, conditional on the first subset. We return to this topic in Sections 13.4 and 13.5.

A variety of numerical methods exist for solving optimization problems and any of these, in principle, can be applied to find the modes of a posterior density. Rather than attempt to cover this vast topic comprehensively, we introduce two simple methods that are commonly used in statistical problems.

Conditional maximization

Often the simplest method of finding modes is conditional maximization, also called stepwise ascent; simply start somewhere in the target distribution—for example, setting the parameters at rough estimates—and then alter one set of components of θ at a time, leaving the other components at their previous values, at each step increasing the log posterior density. Assuming the posterior density is bounded, the steps will eventually converge to a local mode. The method of iterative proportional fitting for loglinear models (see Section 16.7) is an example of conditional maximization. To search for multiple modes, run the conditional maximization routine starting at a variety of points spread throughout the parameter space. It should be possible to find a range of reasonable starting points based on rough estimates of the parameters and problem-specific knowledge about reasonable bounds on the parameters.

For many standard statistical models, the conditional distribution of each parameter given all the others has a simple analytic form and is easily maximized. In this case, applying a conditional maximization algorithm is easy: just maximize the density with respect to one set of parameters at a time, iterating until the steps become small enough that approximate convergence has been reached. We illustrate this process in Section 13.6 for the example of the hierarchical normal model.
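The logic of stepwise ascent is easy to express in code. Here is a minimal sketch in Python, assuming a user-supplied log posterior function and a list of functions that each return the conditional maximizer of one component; both are hypothetical placeholders for a specific model:

import numpy as np

def conditional_maximization(log_post, cond_max, theta0, tol=1e-8, max_iter=1000):
    # log_post(theta): log posterior density (possibly unnormalized)
    # cond_max[j](theta): value of component j maximizing log_post
    #   with all other components held at their values in theta
    theta = np.array(theta0, dtype=float)
    lp = log_post(theta)
    for _ in range(max_iter):
        for j in range(len(cond_max)):
            theta[j] = cond_max[j](theta)   # each update increases log_post
        lp_new = log_post(theta)
        if lp_new - lp < tol:               # steps have become small enough
            return theta
        lp = lp_new
    return theta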

Newton’s method

Newton’s method, also called the Newton-Raphson algorithm, is an iterative approach based on a quadratic Taylor series approximation of the log posterior density,

L(θ) = log p(θ|y).

It is also acceptable to use an unnormalized posterior density, since Newton’s method uses only the derivatives of L(θ), and any multiplicative constant in p is an additive constant in L. As we have seen in Chapter 4, the quadratic approximation is generally fairly accurate when the number of data points is large relative to the number of parameters. Start by determining the functions L′(θ) and L″(θ), the vector of derivatives and matrix of second derivatives, respectively, of the logarithm of the posterior density. The derivatives can be determined analytically or numerically. The mode-finding algorithm proceeds as follows:

1.  Choose a starting value, θ0.

2.  For t = 1, 2, 3, …,

(a)   Compute L′(θ^(t−1)) and L″(θ^(t−1)). The Newton’s method step at time t is based on the quadratic approximation to L(θ) centered at θ^(t−1).

(b)   Set the new iterate, θ^t, to maximize the quadratic approximation; thus,

θ^t = θ^(t−1) − [L″(θ^(t−1))]⁻¹ L′(θ^(t−1)).

The starting value, θ0, is important; the algorithm is not guaranteed to converge from all starting values, particularly in regions where −L″ is not positive definite. Starting values may be obtained from crude parameter estimates, or conditional maximization could be used to generate a starting value for Newton’s method. The advantage of Newton’s method is that convergence is extremely fast once the iterates are close to the solution, where the quadratic approximation is accurate. If the iterations do not converge, they typically move off quickly toward the edge of the parameter space, and the next step may be to try again with a new starting point.
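A minimal sketch of this iteration in Python, assuming functions grad and hess that return L′(θ) and L″(θ) for the model at hand:

import numpy as np

def newton_mode(grad, hess, theta0, tol=1e-10, max_iter=100):
    # grad(theta), hess(theta): gradient vector and Hessian matrix of
    # the log posterior density; analytic or numerical
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(theta), grad(theta))
        theta_new = theta - step            # maximizer of the local quadratic
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta                            # may not have converged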

Quasi-Newton and conjugate gradient methods

Computation and storage of −L″ in Newton’s method may be costly. Quasi-Newton methods such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method form an approximation to −L″ iteratively using only gradient information.

Conjugate gradient methods also use only gradient information, but instead of moving along the current steepest-ascent direction, they form each successive search direction using conjugate direction formulas. Conjugate gradient is likely to require more iterations than Newton and quasi-Newton methods, but it uses less computation per iteration and requires less storage.
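In practice these methods are usually invoked through a library rather than coded directly. The sketch below uses scipy's general-purpose optimizer to try both a quasi-Newton and a conjugate gradient fit; the function neg_log_post is an illustrative stand-in, not a model from this chapter:

import numpy as np
from scipy.optimize import minimize

def neg_log_post(theta):
    # illustrative stand-in for -log p(theta|y) of some model
    return 0.5 * np.sum(theta**2) + 0.1 * theta[0]**4

theta0 = np.array([2.0, -1.0])
fit_bfgs = minimize(neg_log_post, theta0, method="BFGS")   # quasi-Newton
fit_cg = minimize(neg_log_post, theta0, method="CG")       # conjugate gradient
print(fit_bfgs.x, fit_cg.x)   # both should find the same mode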

Numerical computation of derivatives

If the first and second derivatives of the log posterior density are difficult to determine analytically, one can approximate them numerically using finite differences. Each component of L′ can be estimated numerically at any specified value θ = (θ1, …, θd) by

dL/dθi ≈ [L(θ + δi ei) − L(θ − δi ei)] / (2δi),    (13.1)

where δi is a small value and, using linear algebra notation, ei is the unit vector corresponding to the ith component of θ. The values of δi are chosen based on the scale of the problem; typically, values such as 0.0001 are low enough to approximate the derivative and high enough to avoid roundoff error on the computer. The second derivative matrix at θ is numerically estimated by applying the differencing again; for each i, j:

d²L/dθi dθj ≈ [L(θ + δi ei + δj ej) − L(θ − δi ei + δj ej) − L(θ + δi ei − δj ej) + L(θ − δi ei − δj ej)] / (4 δi δj).    (13.2)
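These finite-difference formulas translate directly into code. The following sketch assumes a function L returning the log posterior density at a parameter vector, and uses a single value of δ for all components:

import numpy as np

def grad_fd(L, theta, delta=1e-4):
    # central-difference approximation to L'(theta), as in (13.1)
    d = len(theta)
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d); e[i] = 1.0
        g[i] = (L(theta + delta*e) - L(theta - delta*e)) / (2*delta)
    return g

def hess_fd(L, theta, delta=1e-4):
    # apply the differencing twice to approximate L''(theta), as in (13.2)
    d = len(theta)
    H = np.zeros((d, d))
    for i in range(d):
        ei = np.zeros(d); ei[i] = 1.0
        for j in range(d):
            ej = np.zeros(d); ej[j] = 1.0
            H[i, j] = (L(theta + delta*ei + delta*ej)
                       - L(theta - delta*ei + delta*ej)
                       - L(theta + delta*ei - delta*ej)
                       + L(theta - delta*ei - delta*ej)) / (4 * delta**2)
    return H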

13.2   Boundary-avoiding priors for modal summaries

Posterior modes on the boundary of parameter space

The posterior mode is a good point summary of a symmetric posterior distribution. If the posterior is asymmetric, however, the mode can be a poor point estimate.

Consider, for example, the posterior distribution for the group-level scale parameter in the 8-schools example, displayed on page 121 and repeated here as Figure 13.1. The mode of the (marginal) posterior distribution is at τ = 0, corresponding to the model in which the effects of coaching on college admissions tests are the same in all eight schools. This conclusion is consistent with the data (as indicated by zero being the posterior mode) but on substantive grounds we do not believe the true variation to be exactly zero: the coaching programs in the eight schools differed, and so the effects should vary, if only by a small amount.

Figure 13.1  Marginal posterior density, p(τ|y), for the standard deviation of the population of school effects θj in the educational testing example. If we were to choose to summarize this distribution by its mode, we would be in the uncomfortable position of setting τ̂ = 0, an estimate on the boundary of parameter space.

Figure 13.2  From a simple one-dimensional hierarchical model with scale parameter 0.5 and data in 10 groups: (a) Sampling distribution of the marginal posterior mode of τ under a uniform prior distribution, based on 1000 simulations of data from the model. (b) 100 simulations of the marginal likelihood, p(y|τ). In this example, the point estimate is noisy and the likelihood function is not very informative about τ.

From a fully Bayesian perspective, the posterior distribution represented in Figure 13.1 poses no problem. The uniform prior distribution on τ allows this parameter to be arbitrarily small, but it assigns zero probability to the event that τ = 0 exactly. The resulting posterior distribution is defined on a continuous space, and we can summarize it with random simulations or, if we want a point summary, by the posterior median (which in this example takes on the reasonable value of 4.9).

The problem of a zero mode of the marginal likelihood is not unique to the 8-schools example. We illustrate with a stripped-down example with J = 10 groups and one observation per group:

yj | θj ~ N(θj, 1), for j = 1, …, J,

for simplicity modeling the θj’s with a normal distribution centered at zero:

θj ~ N(0, τ²).

In our simulation, we assume τ = 0.5.

Figure 13.3  Various possible zero-avoiding prior densities for τ, the group-level standard deviation parameter in the 8 schools example. We prefer the gamma with shape parameter 2, which is zero at τ = 0 (thus ensuring a nonzero posterior mode) but remains positive for all τ > 0. In contrast, the lognormal and inverse-gamma priors effectively shut off τ in some positive region near zero, or rule out high values of τ. These are behaviors we do not want in a default prior distribution. All these priors are intended for use in constructing penalized likelihood (posterior mode) estimates; if we were doing full Bayes and averaging over the posterior distribution of τ, we would be happy with a uniform or half-Cauchy prior density, as discussed in Section 5.7.

From this model, we create 1000 simulated datasets y; for each we determine the marginal likelihood and the value at which it takes on its maximum.

Figure 13.2a shows the sampling distribution of the maximum marginal likelihood estimate of τ (in this simple example, we can just solve for τ̂² in the equation 1 + τ̂² = (1/J) Σ_{j=1}^{J} yj², with the boundary constraint that τ̂² ≥ 0). In almost half the simulations, the marginal likelihood is maximized at τ̂ = 0. There is enough noise here that it is almost impossible to do anything more than bound the group-level variance; the data do not allow an accurate estimate.
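For readers who want to reproduce this behavior, here is a minimal simulation in the spirit of Figure 13.2a, using the closed-form maximizer just described:

import numpy as np

rng = np.random.default_rng(0)
J, tau = 10, 0.5
boundary = 0
for _ in range(1000):
    theta = rng.normal(0.0, tau, J)
    y = rng.normal(theta, 1.0)          # y_j ~ N(theta_j, 1)
    # marginally y_j ~ N(0, 1 + tau^2), so 1 + tau_hat^2 = mean(y_j^2),
    # subject to the boundary constraint tau_hat^2 >= 0
    tau_hat2 = max(np.mean(y**2) - 1.0, 0.0)
    boundary += (tau_hat2 == 0.0)
print(boundary / 1000)                  # close to half, as in Figure 13.2a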

Zero-avoiding prior distribution for a group-level variance parameter

The problems in the above examples arise because the mode is taken as a posterior summary. If we are planning to summarize the posterior distribution by its mode (perhaps for computational convenience or as a quick approximation, as discussed in this chapter), it can make sense to choose the prior distribution accordingly.

What is an appropriate noninformative prior distribution, p(τ), that will avoid boundary estimates in an 8-schools-like problem in which inferences are summarized by the posterior mode? To start with, p(τ) must be zero at τ = 0.

Several convenient probability models on positive random variables have this property, including the lognormal (log τ ~ N(μτ, στ²)) which is convenient given our familiarity with the normal distribution, and the inverse-gamma (τ² ~ Inv-gamma(ατ, βτ)) which is conditionally conjugate in the hierarchical models we have been considering. Unfortunately, both these classes of prior distribution cut off too sharply near zero. The lognormal and the inverse-gamma both have effective lower bounds, below which the prior density declines so rapidly as to effectively shut off some range of τ near zero. If the scale parameters on these models are set to be vague enough, this lower bound can be made extremely low, but then the prior becomes strongly peaked. There is thus no reasonable default setting with these models; the user must either choose a vague prior which rules out values of τ near zero, or a distribution that is highly informative on the scale of the data.

Instead we prefer a prior model such as τ ~ Gamma(2, 2/A), a gamma distribution with shape 2 and a small rate parameter 2/A (hence a large scale, with prior mean A). This density is zero at τ = 0, increases linearly from there, and eventually curves gently back toward zero for large values of τ. The linear behavior at τ = 0 ensures that no matter how concentrated the likelihood is near zero, the posterior distribution will remain consistent with the data, a property that does not hold for the lognormal or inverse-gamma prior distributions.
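Explicitly, writing the gamma distribution in the shape-rate parameterization (so that Gamma(2, 2/A) has mean A), the recommended prior density is

\[
p(\tau \mid A) \;=\; \Bigl(\frac{2}{A}\Bigr)^{2}\,\tau\,e^{-2\tau/A},
\qquad \tau > 0,
\]

which rises linearly in τ near the origin, as described above.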

Again, the purpose of this prior distribution is to give a good estimate when the posterior distribution for τ is to be summarized by its mode, as is often the case in statistical computations with hierarchical models. If we were planning to use posterior simulations, we would generally not see any advantage to the gamma prior distribution and instead would go with the uniform or half-Cauchy as default choices, as discussed in Chapter 5.

Boundary-avoiding prior distribution for a correlation parameter

We next illustrate the difficulty of estimating correlation parameters, using a simple model of a varying-intercept, varying-slope regression with a 2 × 2 group-level variance matrix.

Within each group j = 1, …, J, we assume a linear model:

yij ~ N(aj + bj xij, 1), for i = 1, …, nj.

For our simulation we draw the xi’s independently from a unit normal distribution and set nj = 5 for all j. As before, we consider J = 10 groups.

The two regression parameters in each group j are modeled as a random draw from a normal distribution:

(aj, bj) ~ N(0, Σ),  with  Σ = ( τ1²  ρτ1τ2 ; ρτ1τ2  τ2² ).

As in the previous example, we average over the linear parameters θ and work with the marginal likelihood, which can be computed analytically as

p(y | τ1, τ2, ρ) ∝ Π_{j=1}^{J} N(β̂j | 0, Vj + Σ),

where β̂j and Vj are the least-squares estimate and corresponding covariance matrix from regressing y on x for the data in group j.

For this example, we assume the true values of the variance parameters are τ1 = τ2 = 0.5 and ρ = 0. For the goal of getting an estimate of ρ that is stable and far from the boundary, setting the true value to 0 should be a best-case scenario. Even here, though, it turns out we have troubles.

As before, we simulate data and compute the marginal likelihood 1000 times. For this example we are focusing on ρ, so we look at the value of ρ in the maximum marginal likelihood estimate of (τ1, τ2, ρ), and we also look at the profile likelihood for ρ; that is, the function max over (τ1, τ2) of p(y | τ1, τ2, ρ). As is standard in regression models, all these definitions are implicitly conditional on x, a point we discuss further in Section 14.1. For each simulation, we compute the profile likelihood as a function of ρ using a numerical optimization routine applied separately to each ρ in a grid. The optimization is easy enough, because the marginal likelihood function can be written in closed form. The marginal posterior density for ρ, averaging over a uniform prior on (τ1, τ2, ρ), would take more effort to work out but would yield similar results.

Figure 13.4  From a simulated varying-intercept, varying-slope hierarchical regression with identity group-level covariance matrix: (a) Sampling distribution of the maximum marginal likelihood estimate of the group-level correlation parameter, based on 1000 simulations of data from the model. (b) 100 simulations of the marginal profile likelihood, max over (τ1, τ2) of p(y | τ1, τ2, ρ). In this example, the maximum marginal likelihood estimate is extremely variable and the likelihood function is not very informative about ρ. (In some cases, the profile likelihood for ρ is flat in some places; this occurs when the corresponding estimate of one of the variance parameters (τ1 or τ2) is zero, in which case ρ is not identified.)

Figure 13.4 displays the results. In 1000 simulations, the maximum marginal likelihood estimate of group-level correlation is on the boundary over 10% of the time, and the profile marginal likelihood for ρ is typically not very informative. In a fully Bayesian setting, we would average over ρ; in a penalized likelihood framework, we want a more stable point estimate.

If the plan is to summarize inference by the posterior mode of ρ, we would replace the U(−1, 1) prior distribution with p(ρ) ∝ (1 − ρ)(1 + ρ), which is equivalent to a Beta(2, 2) distribution on the transformed parameter (ρ+1)/2. The prior and resulting posterior densities are zero at the boundaries, and thus the posterior mode will never be −1 or 1. However, as with the Gamma(2, 2/A) prior distribution for τ discussed earlier in this section, the prior density for ρ is linear near the boundaries and thus will not contradict any likelihood.

Degeneracy-avoiding prior distribution for a covariance matrix

More generally, we want any mode-based point estimate or computational approximation of a covariance matrix to be non-degenerate, that is, to have positive variance parameters and a positive-definite correlation matrix. Again, we can ensure this property in the posterior mode by choosing a prior density that goes to zero when the covariance matrix is degenerate. By analogy to the one- and two-dimensional solutions above, for a general d × d covariance matrix we choose the Wishart(d + 3, AI) prior density, which is zero but with a positive constant derivative at the boundary. As before, we can set A to a large value based on the context of the problem. The resulting estimate of the covariance matrix is always positive definite but without excluding estimates near the boundary if they are supported by the likelihood.

In the limit of large values of A, the Wishart(d + 3, AI) prior distribution on a covariance matrix corresponds to independent Gamma(2, δ) prior distributions on each of the d eigenvalues, with δ → 0, and thus can be seen as a generalization of our default model for variance parameters given above. In two dimensions, the multivariate model in the limit A → ∞ corresponds to the prior distribution p(ρ) ∝ (1 − ρ)(1 + ρ) on the correlation, as before.

Again, we see this family of default Wishart prior distributions as a noninformative choice if the plan is to summarize or approximate inference for the covariance matrix by the posterior mode. For full Bayesian inference, there would be no need to choose a prior distribution that hits zero at the boundary; we would prefer something like a scaled inverse-Wishart model that generalizes the half-Cauchy prior distribution discussed in Section 5.7.

13.3   Normal and related mixture approximations

Fitting multivariate normal densities based on the curvature at the modes

Once the mode or modes have been found (perhaps after including a boundary-avoiding prior distribution as discussed in the previous section), we can construct an approximation based on the (multivariate) normal distribution. For simplicity we first consider the case of a single mode at θ̂, where we fit a normal distribution to the first two derivatives of the log posterior density function at θ̂:

papprox(θ|y) = N(θ | θ̂, Vθ).

The variance matrix is the inverse of the curvature of the log posterior density at the mode,

Vθ = [−L″(θ̂)]⁻¹,

and this second derivative can be calculated analytically for some problems or else approximated numerically as in (13.2). As usual, before fitting a normal density, it makes sense to transform parameters as appropriate, often using logarithms and logits, so that they are defined on the whole real line with roughly symmetric distributions (remembering to multiply the posterior density by the Jacobian of the transformation, as in Section 1.8).
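As a sketch of the whole procedure in Python (with log_post a placeholder for a specific model's log posterior density): find the mode with a quasi-Newton optimizer, and take its built-in inverse-Hessian approximation as the variance matrix. For more accuracy one could instead invert the finite-difference second-derivative matrix (13.2) evaluated at the mode.

import numpy as np
from scipy.optimize import minimize

def normal_approx(log_post, theta0):
    # Fit N(theta_hat, V): the mode of log_post and the inverse curvature there
    fit = minimize(lambda th: -log_post(th), theta0, method="BFGS")
    theta_hat = fit.x
    V = fit.hess_inv   # BFGS's running approximation to (-L'')^{-1}
    return theta_hat, V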

Laplace’s method for analytic approximation of integrals

Instead of approximating just the posterior distribution with a normal density, we can use Laplace’s method to approximate the integral of a smooth function times the posterior, ∫ h(θ) p(θ|y) dθ. Writing the integrand as exp(u(θ)), with u(θ) = log(h(θ)p(θ|y)), and expanding u in a quadratic Taylor series around its maximum, the approximation is proportional to a (multivariate) normal density in θ, and its integral is just

∫ h(θ) p(θ|y) dθ ≈ h(θ0) p(θ0|y) (2π)^(d/2) |−u″(θ0)|^(−1/2),

where d is the dimension of θ, and θ0 is the point at which u(θ) is maximized.

If h(θ) is a fairly smooth function, this approximation can be reasonable, due to the approximate normality of the posterior distribution, p(θ|y), for large sample sizes (recall Chapter 4). Because Laplace’s method is based on normality, it is most effective for unimodal posterior densities, or when applied separately to each mode of a multimodal density. We use Laplace’s method to compute the relative masses of the densities in a normal-mixture approximation to a multimodal density (13.4).
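As a small check on the formula, the following sketch applies Laplace's method to a one-dimensional integral with a known answer, E(θ²) under a N(2, 0.5²) posterior (exact value 2² + 0.5² = 4.25); all the numbers here are made up for illustration:

import numpy as np
from scipy.optimize import minimize_scalar

m, s = 2.0, 0.5   # posterior p(theta|y) = N(m, s^2); exact E(theta^2) = 4.25

def u(theta):     # u(theta) = log(h(theta) p(theta|y)) with h(theta) = theta^2
    return (2 * np.log(abs(theta))
            - 0.5 * ((theta - m) / s) ** 2
            - np.log(s * np.sqrt(2 * np.pi)))

t0 = minimize_scalar(lambda t: -u(t), bracket=(1.0, 2.0, 4.0)).x
d = 1e-4
u2 = (u(t0 + d) - 2 * u(t0) + u(t0 - d)) / d**2     # numerical u''(t0)
laplace = np.exp(u(t0)) * np.sqrt(2 * np.pi / -u2)  # d = 1 dimension here
print(laplace)    # about 4.26, close to the exact 4.25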

Laplace’s method using unnormalized densities. If we are only able to compute the unnormalized density q(θ|y), we can apply Laplace’s method separately to hq and q to evaluate the numerator and denominator here:

E(h(θ)|y) = ∫ h(θ) q(θ|y) dθ / ∫ q(θ|y) dθ.

Mixture approximation for multimodal densities

Now suppose we have found K modes in the posterior density. The posterior distribution can then be approximated by a mixture of K multivariate normal distributions, each with its own mode θ̂k, variance matrix Vθk, and relative mass ωk. That is, the target density p(θ|y) can be approximated by

papprox(θ|y) = Σ_{k=1}^{K} ωk N(θ | θ̂k, Vθk).    (13.3)

For each k, the mass ωk of the kth component of the multivariate normal mixture can be estimated by equating the posterior density, p(θ̂k|y), or the unnormalized posterior density, q(θ̂k|y), to the approximation, papprox(θ̂k|y), at each of the K modes. If the modes are fairly widely separated and the normal approximation is appropriate for each mode, then we obtain

ωk = q(θ̂k|y) |Vθk|^(1/2),

which yields the normal-mixture approximation

papprox(θ|y) ∝ Σ_{k=1}^{K} q(θ̂k|y) exp( −(1/2)(θ − θ̂k)ᵀ Vθk⁻¹ (θ − θ̂k) ).    (13.4)

Multivariate t approximation instead of the normal

For a broader distribution, one can replace each normal density by a multivariate t with some small number of degrees of freedom, ν. The corresponding approximation is a mixture density function that has the functional form,

papprox(θ|y) ∝ Σ_{k=1}^{K} q(θ̂k|y) [ 1 + (1/ν)(θ − θ̂k)ᵀ Vθk⁻¹ (θ − θ̂k) ]^(−(d+ν)/2),

where d is the dimension of θ. A value such as ν = 4, which provides three finite moments for the approximating density, has worked in some examples.

Several strategies can be used to improve the approximate distribution further, including analytically fitting the approximation to locations other than the modes, such as saddle points or tails, of the distribution; analytically or numerically integrating out some components; or moving to an iterative scheme such as variational Bayes or expectation propagation, as described in Sections 13.7 and 13.8.

Sampling from the approximate posterior distributions

It is easy to draw random samples from the multivariate normal or t-mixture approximations. To generate a single sample from the approximation, first choose one of the K mixture components using the relative probability masses of the mixture components, ωk, as multinomial probabilities. Appendix A provides details on how to sample from a single multivariate normal or t distribution using the Cholesky factorization of the scale matrix.
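A sketch of this two-stage sampling in Python, with hypothetical inputs (a list of modes, their variance matrices, and unnormalized masses ωk):

import numpy as np

def sample_mixture(rng, modes, covs, omegas, S):
    # modes: K mode vectors theta_hat_k; covs: K variance matrices V_theta_k;
    # omegas: relative masses omega_k (need not be normalized)
    p = np.asarray(omegas, float)
    p = p / p.sum()
    K, dim = len(modes), len(modes[0])
    draws = np.empty((S, dim))
    for s in range(S):
        k = rng.choice(K, p=p)            # pick a component by its mass
        A = np.linalg.cholesky(covs[k])   # Cholesky factor, as in Appendix A
        draws[s] = modes[k] + A @ rng.standard_normal(dim)
    return draws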

The sample drawn from the approximate posterior distribution can be used in importance sampling, or an improved sample can be obtained using importance resampling, as described in Section 10.4.

13.4   Finding marginal posterior modes using EM

In problems with many parameters, normal approximations to the joint distribution are often useless, and the joint mode is typically not helpful. It is often useful, however, to base an approximation on a marginal posterior mode of a subset of the parameters; we use the notation θ = (γ, φ) and suppose we are interested in first approximating p(φ|y). After approximating p(φ|y) as a normal or t or a mixture of these, we may be able to approximate the conditional distribution, p(γ|φ, y), as normal (or t, or a mixture) with parameters depending on φ. In this section we address the first problem, and in the next section we address the second.

The EM algorithm can be viewed as an iterative method for finding the mode of the marginal posterior density, p(φ|y), and is extremely useful for many common models for which it is hard to maximize p(φ|y) directly but easy to work with p(γ|φ, y) and p(φ|γ, y). Examples of the EM algorithm appear in the later chapters of this book, including Sections 18.4, 18.6, and 22.2; we introduce the method here.

If we think of φ as the parameters in our problem and γ as missing data, the EM algorithm formalizes a relatively old idea for handling missing data: start with a guess of the parameters and then (1) replace missing values by their expectations given the guessed parameters, (2) estimate parameters assuming the missing data are equal to their estimated values, (3) re-estimate the missing values assuming the new parameter estimates are correct, (4) re-estimate parameters, and so forth, iterating until convergence. In fact, the EM algorithm is more efficient than these four steps would suggest since each missing value is not estimated separately; instead, those functions of the missing data that are needed to estimate the model parameters are estimated jointly.

The name ‘EM’ comes from the two alternating steps: finding the expectation of the needed functions (the sufficient statistics) of the missing values, and maximizing the resulting posterior density to estimate the parameters as if these functions of the missing data were observed. For many standard models, both steps—estimating the missing values given a current estimate of the parameter and estimating the parameters given current estimates of the missing values—are straightforward. EM is widely applicable because many models, including mixture models and some hierarchical models, can be re-expressed as distributions on augmented parameter spaces, where the added parameters γ can be thought of as missing data.

Derivation of the EM and generalized EM algorithms

In the notation of this chapter, EM finds the modes of the marginal posterior distribution, p(φ|y), averaging over the parameters γ. A more conventional presentation, in terms of missing and complete data, appears in Section 18.2. We show here that each iteration of the EM algorithm increases the value of the log posterior density until convergence. We start with the simple identity

log p(φ|y) = log p(γ, φ|y) − log p(γ|φ, y)

and take expectations of both sides, treating γ as a random variable with the distribution p(γ|φold, y), where φold is the current (old) guess. The left side of the above equation does not depend on γ, so averaging over γ yields

log p(φ|y) = Eold(log p(γ, φ|y)) − Eold(log p(γ|φ, y)),    (13.5)

where Eold is an average over γ under the distribution p(γ|φold, y). The last term on the right side of (13.5), Eold(log p(γ|φ, y)), is maximized at φ = φold (see Exercise 13.6). The other term, the expected log joint posterior density, Eold(log p(γ, φ|y)), is repeatedly used in computation,

Eold(log p(γ, φ|y)) = ∫ (log p(γ, φ|y)) p(γ|φold, y) dγ.

This expression is called Q(φ|φold) in the EM literature, where it is viewed as the expected complete-data log-likelihood.

Now consider any value φnew for which Eold(log p(γ, φnew|y)) > Eold(log p(γ, φold|y)). If we replace φold by φnew, we increase the first term on the right side of (13.5), while not increasing the second term, and so the total must increase: log p(φnew|y) > log p(φold|y). This idea motivates the generalized EM (GEM) algorithm: at each iteration, determine Eold(log p(γ, φ|y))—considered as a function of φ—and update φ to a new value that increases this function. The EM algorithm is the special case in which the new value of φ is chosen to maximize Eold(log p(γ, φ|y)), rather than merely increase it. The EM and GEM algorithms both have the property of increasing the marginal posterior density, p(φ|y), at each iteration.

Because the marginal posterior density, p(φ|y), increases in each step of the EM algorithm, and because the Q function is maximized at each step, EM converges to a local mode of the posterior density except in some special cases. (Because the GEM algorithm does not maximize at each step, it does not necessarily converge to a local mode.) The rate at which the EM algorithm converges to a local mode depends on the proportion of ‘information’ about φ in the joint density, p(γ, φ|y), that is missing from the marginal density, p(φ|y). It can be slow to converge if the proportion of missing information is large; see the bibliographic note at the end of this chapter for many theoretical and applied references on this topic.

Implementation of the EM algorithm

The EM algorithm can be described algorithmically as follows.

1.  Start with a crude parameter estimate, φ^0.

2.  For t = 1, 2, …:

(a)   E-step: Determine the expected log posterior density function,

Eold(log p(γ, φ|y)) = ∫ (log p(γ, φ|y)) p(γ|φold, y) dγ,

where the expectation averages over the conditional posterior distribution of γ, given the current estimate, φold = φ^(t−1).

(b)   M-step: Let φ^t be the value of φ that maximizes Eold(log p(γ, φ|y)). For the GEM algorithm, it is only required that Eold(log p(γ, φ|y)) be increased, not necessarily maximized.

As we have seen, the marginal posterior density, p(φ|y), increases at each step of the EM algorithm, so that, except in some special cases, the algorithm converges to a local mode of the posterior density.

Finding multiple modes. A simple way to search for multiple modes with EM is to start the iterations at many points throughout the parameter space. If we find several modes, we can roughly compare their relative masses using a normal approximation, as described in the previous section.

Debugging. A useful debugging check when running an EM algorithm is to compute the logarithm of the marginal posterior density, log p(φ^t|y), at each iteration, and check that it increases monotonically. This computation is recommended for all problems for which it is relatively easy to compute the marginal posterior density.
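The overall structure, including this debugging check, can be coded once and reused. In this sketch the three supplied functions are placeholders for a specific model:

def em(e_step, m_step, log_marg_post, phi0, max_iter=100, tol=1e-8):
    # e_step(phi): expected sufficient statistics of the missing data gamma
    # m_step(stats): value of phi maximizing the expected log posterior
    # log_marg_post(phi): log p(phi|y), used only for the debugging check
    phi, lp = phi0, log_marg_post(phi0)
    for _ in range(max_iter):
        phi_new = m_step(e_step(phi))
        lp_new = log_marg_post(phi_new)
        assert lp_new >= lp - 1e-10, "log p(phi|y) decreased: bug or roundoff"
        if lp_new - lp < tol:
            break
        phi, lp = phi_new, lp_new
    return phi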

Example. Normal distribution with unknown mean and variance and partially conjugate prior distribution

Suppose we weigh an object on a scale n times, and the weighings, y1, …, yn, are assumed independent with a N(μ, σ²) distribution, where μ is the true weight of the object. For simplicity, we assume a N(μ0, τ0²) prior distribution on μ (with μ0 and τ0 known) and the standard noninformative uniform prior distribution on log σ; these form a partially conjugate joint prior distribution.

Because the model is not fully conjugate, there is no standard form for the joint posterior distribution of (μ, σ) and no closed-form expression for the marginal posterior density of μ. We can, however, use the EM algorithm to find the marginal posterior mode of μ, averaging over σ; that is, (μ, σ) corresponds to (φ, γ) in the general notation.

Joint log posterior density. The logarithm of the joint posterior density is

log p(μ, σ|y) = −(1/(2τ0²))(μ − μ0)² − (n+1) log σ − (1/(2σ²)) Σ_{i=1}^{n} (yi − μ)²,    (13.6)

ignoring terms that do not depend on μ or σ².

E-step. For the E-step of the EM algorithm, we must determine the expectation of (13.6), averaging over σ and conditional on the current guess, μold, and y:

Eold(log p(μ, σ|y)) = −(1/(2τ0²))(μ − μ0)² − (n+1) Eold(log σ) − (1/2) Eold(1/σ²) Σ_{i=1}^{n} (yi − μ)².    (13.7)

We must now evaluate Eold(log σ) and Eold(1/σ²). Actually, we need evaluate only the latter expression, because the former expression is not linked to μ in (13.7) and thus will not affect the M-step. The expression Eold(1/σ²) can be evaluated by noting that, given μ, the posterior distribution of σ² is that for a normal distribution with known mean and unknown variance, which is scaled inverse-χ²:

σ² | μold, y ~ Inv-χ²( n, (1/n) Σ_{i=1}^{n} (yi − μold)² ).

Then the conditional posterior distribution of 1/σ² is a scaled χ², and

Eold(1/σ²) = E(1/σ² | μold, y) = n / Σ_{i=1}^{n} (yi − μold)².

We can then re-express (13.7) as

Eold(log p(μ, σ|y)) = −(1/(2τ0²))(μ − μ0)² − (1/2) ( n / Σ_{i=1}^{n} (yi − μold)² ) Σ_{i=1}^{n} (yi − μ)² + constant.    (13.8)

M-step. For the M-step, we must find the μ that maximizes the above expression. For this problem, the task is straightforward, because (13.8) has the form of a normal log posterior density, with prior distribution μ ~ N(μ0, τ0²) and n data points yi, each with variance σ̂²old = (1/n) Σ_{i=1}^{n} (yi − μold)². The M-step is achieved by the mode of the equivalent posterior density, which is

μnew = ( μ0/τ0² + (n/σ̂²old) ȳ ) / ( 1/τ0² + n/σ̂²old ),

where ȳ = (1/n) Σ_{i=1}^{n} yi.

If we iterate this computation, μ converges to the marginal mode of p(μ|y).
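The two steps are only a few lines of code. The sketch below implements the iteration just derived on simulated data (the data values and prior settings are arbitrary, for illustration only):

import numpy as np

def em_mu(y, mu0, tau0, n_iter=100):
    # Marginal posterior mode of mu, averaging over sigma, for the model
    # y_i ~ N(mu, sigma^2), mu ~ N(mu0, tau0^2), p(log sigma) uniform
    n, ybar = len(y), np.mean(y)
    mu = ybar                                   # crude starting value
    for _ in range(n_iter):
        s2 = np.mean((y - mu) ** 2)             # so Eold(1/sigma^2) = 1/s2
        mu = ((mu0 / tau0**2 + n * ybar / s2)
              / (1 / tau0**2 + n / s2))         # M-step
    return mu

rng = np.random.default_rng(1)
y = rng.normal(10.0, 2.0, size=20)              # made-up data
print(em_mu(y, mu0=5.0, tau0=10.0))             # converges to the marginal mode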

Extensions of the EM algorithm

Variants and extensions of the basic EM algorithm increase the range of problems to which the algorithm can be applied, and some versions can converge much more quickly than the basic EM algorithm. In addition, the EM algorithm and its extensions can be supplemented with calculations that obtain the second derivative matrix for use as an estimate of the asymptotic variance at the mode. We describe some of these modifications here.

The ECM algorithm. The ECM algorithm is a variant of the EM algorithm in which the M-step is replaced by a set of conditional maximizations, or CM-steps. Suppose that φ^t is the current iterate. The E-step is unchanged: the expected log posterior density is computed, averaging over the conditional posterior distribution of γ given the current iterate. The M-step is replaced by a set of S conditional maximizations. At the sth conditional maximization, the value of φ^(t+s/S) is found that maximizes the expected log posterior density among all φ such that gs(φ) = gs(φ^(t+(s−1)/S)), with the gs(·) known as constraint functions. The output of the last CM-step, φ^(t+S/S) = φ^(t+1), is the next iterate of the ECM algorithm. The set of constraint functions, gs(·), s = 1, …, S, must satisfy certain conditions in order to guarantee convergence to a stationary point. The most common choice of constraint function is the indicator function for the sth subset of the parameters. The parameter vector φ is partitioned into S disjoint and exhaustive subsets, (φ1, …, φS), and at the sth conditional maximization step, all parameters except those in φs are constrained to equal their current values, φj^(t+s/S) = φj^(t+(s−1)/S) for j ≠ s. An ECM algorithm based on a partitioning of the parameters is an example of a generalized EM algorithm. Moreover, if each of the CM-steps maximizes by setting first derivatives equal to zero, then ECM shares with EM the property that it will converge to a local mode of the marginal posterior distribution of φ. Because the log posterior density, p(φ|y), increases with every iteration of the ECM algorithm, its monotone increase can still be used for debugging.

As described in the previous paragraph, ECM performs several CM-steps after each E-step. The multicycle ECM algorithm performs additional E-steps during a single iteration. For example, one might perform an additional E-step before each conditional maximization. Multicycle ECM algorithms require more computation for each iteration than the ECM algorithm but can sometimes reach approximate convergence with fewer total iterations.

The ECME algorithm. The ECME algorithm is an extension of ECM that replaces some of the conditional maximization steps of the expected log joint density, Eold(log p(γ, φ|y)), with conditional maximization steps of the actual log posterior density, log p(φ|y). The last E in the acronym refers to the choice of maximizing either the actual log posterior density or the expected log posterior density. Iterations of ECME also increase the log posterior density at each iteration. Moreover, if each conditional maximization sets first derivatives equal to zero, ECME will converge to a local mode.

ECME can be especially helpful at increasing the rate of convergence, since the actual marginal posterior density is being increased on some steps rather than the full posterior density averaged over the current estimate of the distribution of the other parameters. The increase in speed of convergence can be dramatic when faster converging numerical methods (such as Newton’s method) are applied directly to the marginal posterior density on some of the CM-steps. For example, if one CM-step requires a one-dimensional search to maximize the expected log joint posterior density then the same effort can be applied directly to the logarithm of the marginal posterior density of interest.

The AECM algorithm. The ECME algorithm can be further generalized by allowing different alternating definitions of γ at each conditional maximization step. This generalization is most straightforward when γ represents missing data, and where there are different ways of completing the data at different maximization steps. In some problems the alternation can allow much faster convergence. The AECM algorithm shares with EM the property of converging to a local mode with an increase in the posterior density at each step.

Supplemented EM and ECM algorithms

The EM algorithm is attractive because it is often easy to implement and has stable and reliable convergence properties. The basic algorithm and its extensions can be enhanced to produce an estimate of the asymptotic variance matrix at the mode, which is useful in forming approximations to the marginal posterior density. The supplemented EM (SEM) algorithm and the supplemented ECM (SECM) algorithm use information from the log joint posterior density and repeated EM- or ECM-steps to obtain the approximate asymptotic variance matrix for the parameters φ.

To describe the SEM algorithm we introduce the notation M(φ) for the mapping defined implicitly by the EM algorithm, φ^(t+1) = M(φ^t). The asymptotic variance matrix V is

V = Vjoint (I − DM)⁻¹,

where DM is the Jacobian matrix for the EM map evaluated at the marginal mode, φ̂, and Vjoint is the asymptotic variance matrix based on the logarithm of the joint posterior density averaged over γ:

Vjoint = [ −(d²/dφ²) Eold(log p(γ, φ|y)) ]⁻¹, evaluated at φ = φ̂, with the expectation taken over γ ~ p(γ|φ̂, y).

Typically, Vjoint can be computed analytically so that only DM is required. The matrix DM is computed numerically at each marginal mode using the E- and M-steps according to the following algorithm.

1.  Run the EM algorithm to convergence, thereby obtaining the marginal mode, φ̂. (If multiple runs of EM lead to different modes, apply the following steps separately for each mode.)

2.  Choose a starting value φ^0 for the SEM calculation such that φ^0 does not equal φ̂ for any component. One possibility is to use the same starting value that is used for the original EM calculation.

3.  Repeat the following steps to get a sequence of matrices R^t, t = 1, 2, 3, …, for which each element rij^t converges to the appropriate element of DM. In the following we describe the steps used to generate R^t given the current EM iterate, φ^t.

(a)   Run the usual E-step and M-step with input φ^t to obtain φ^(t+1).

(b)   For each element of φ, say φi:

i   Define φ^t(i) equal to φ̂ except for the ith element, which is replaced by its current value φi^t.

ii   Run one E-step and one M-step treating φ^t(i) as the input value of the parameter vector, φ. Denote the result of these E- and M-steps as φ^(t+1)(i). The ith row of R^t is computed as

rij^t = ( φj^(t+1)(i) − φ̂j ) / ( φi^t − φ̂i ),   for each j.

When the value of an element rij no longer changes, it represents a numerical estimate of the corresponding element of DM. Once all of the elements in a row have converged, we need no longer repeat the final step for that row. If some elements of φ are independent of γ, then EM will converge immediately to the mode for those components, with the corresponding elements of DM equal to zero. SEM can be easily modified in such cases to obtain the variance matrix.

The same approach can be used to supplement the ECM algorithm with an estimate of the asymptotic variance matrix. The SECM algorithm is based on the following result:

V = Vjoint (I − DM^ECM)⁻¹ (I − DM^CM),

with DM^ECM defined and computed in a manner analogous to DM in the above discussion except with ECM in place of EM, and where DM^CM is the rate of convergence for conditional maximization applied directly to log p(φ|y). This latter matrix depends only on Vjoint and ∇s = ∇gs(φ̂), s = 1, …, S, the gradients of the constraint functions gs at φ̂.

These gradient vectors are trivial to calculate for a constraint that directly fixes components of φ. In general, SECM appears to require analytic work to compute Vjoint and ∇s, s = 1, …, S, in addition to applying the numerical computation for DM^ECM, but some of these calculations can be performed using results from the ECM iterations.

Parameter-expanded EM (PX-EM)

The various methods discussed in Section 12.1 for improving the efficiency of Gibbs samplers have analogues for mode-finding (and in fact were originally constructed for that purpose). For example, the parameter expansion idea illustrated with the t model on page 295 was originally developed in the context of the EM algorithm. In this setting, the individual latent data variances Vi are treated as missing data, and the ECM algorithm maximizes over the parameters μ, τ, and α in the posterior distribution.

13.5   Approximating conditional and marginal posterior densities

Approximating the conditional posterior density, p(γ|φ, y)

As stated at the beginning of Section 13.4, the normal, multivariate t, and other analytically convenient distributions can be poor approximations to a joint posterior distribution. Often, however, we can partition the parameter vector as θ = (γ, φ), in such a way that an analytic approximation works well for the conditional posterior density, p(γ|φ, y). In general, the approximation will depend on φ. We write the approximate conditional density as papprox(γ|φ, y). For example, in the normal hierarchical model in Section 5.4, we fitted a normal distribution to p(θ|μ, τ, y) (in that example, the normal conditional distribution is an exact fit).

Approximating the marginal posterior density, p(φ|y), using an analytic approximation to p(γ|φ, y)

The mode-finding techniques and normal approximation of Sections 13.1 and 13.3 can be applied directly to the marginal posterior density if the marginal distribution can be obtained analytically. If not, then the EM algorithm (Section 13.4) may allow us to find the mode of the marginal posterior density and construct an approximation. On occasion it is not possible to construct an approximation to p(φ|y) using any of those methods. Fortunately we may still derive an approximation if we have an analytic approximation to the conditional posterior density, p(γ|φ, y). We can use the trick used in (5.19) in Section 5.4 to generate an approximation to p(φ|y). The approximation is constructed as the ratio of the joint posterior distribution to the analytic approximation of the conditional posterior distribution:

papprox(φ|y) = p(γ, φ|y) / papprox(γ|φ, y).    (13.9)

The key to this method is that if the denominator has a standard analytic form, we can compute its normalizing factor, which, in general, depends on φ. When using (13.9), we must also specify a value of γ (possibly as a function of φ), since the left side does not involve γ at all. If the analytic approximation to the conditional distribution is exact, the factors of γ in the numerator and denominator cancel, and we obtain the marginal posterior density exactly. If the analytic approximation is not exact, a natural value to use for γ is the center of the approximate distribution (for example, E(γ|φ, y) under the normal or t approximations).

For example, suppose we have approximated the d-dimensional conditional density function, p(γ|φ, y), by a multivariate normal density with mean γ̂ and scale matrix Vγ, both of which depend on φ. We can then approximate the marginal density of φ by

papprox(φ|y) ∝ p(γ̂(φ), φ|y) |Vγ(φ)|^(1/2),    (13.10)

where φ is included in parentheses to indicate that the mean and scale matrix must be evaluated at each value of φ. The same result holds if a t approximation is used; in either case, the normalizing factor in the denominator of (13.9) is proportional to |Vγ(φ)|^(−1/2).

Improving an approximation using importance resampling. We can improve the approximation with importance sampling, using draws of γ from each value of φ at which the approximation is computed. For any given value of φ, we can write the marginal posterior density as

p(φ|y) = Eapprox( p(γ, φ|y) / papprox(γ|φ, y) ),

where Eapprox averages over γ using the conditional posterior distribution, papprox(γ|φ, y). The importance sampling estimate of p(φ|y) can be computed by simulating S values γ^s from the approximate conditional distribution, computing the joint density and approximate conditional density at each γ^s, and then averaging the S values of p(γ^s, φ|y) / papprox(γ^s|φ, y). This procedure is then repeated for each point on the grid of φ. If the normalizing constant for the joint density p(γ, φ|y) is itself unknown, then more complicated computational procedures must be used.
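In code, the procedure is a loop over the grid of φ. This sketch assumes hypothetical helpers: joint_lp (the log joint density) and approx (a factory returning an object with rvs and logpdf methods for the approximate conditional distribution):

import numpy as np

def marginal_est(phi_grid, joint_lp, approx, S, rng):
    # joint_lp(gamma, phi): log p(gamma, phi|y), with known normalization
    # approx(phi): approximate conditional p_approx(gamma|phi, y)
    est = np.empty(len(phi_grid))
    for i, phi in enumerate(phi_grid):
        g = approx(phi)
        gammas = g.rvs(S)                  # S draws from the approximation
        ratios = np.exp([joint_lp(gam, phi) - g.logpdf(gam) for gam in gammas])
        est[i] = np.mean(ratios)           # estimate of p(phi|y) at this point
    return est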

13.6   Example: hierarchical normal model (continued)

We illustrate mode-based computations with the hierarchical normal model that we used in Section 11.6. In that section, we illustrated the Gibbs sampler and the Metropolis algorithm as two different ways of drawing posterior samples. In this section, we describe how to get approximate inference by finding the mode of p(μ, log σ, log τ|y), the marginal posterior density, and a normal approximation centered at the mode. Given (μ, log σ, log τ), the individual means θj have independent normal conditional posterior distributions.

Table 13.1  Convergence of stepwise ascent to a joint posterior mode for the coagulation example. The joint posterior density increases at each conditional maximization step, as it should. The posterior mode is in terms of log σ and log τ, but these values are transformed back to the original scale for display in the table.

Crude initial parameter estimates

Initial parameter estimates for the computations are easily obtained by estimating θj as ȳ.j, the average of the observations in the jth group, for each j, and estimating σ² as the average of the J within-group sample variances, sj². We then crudely estimate μ and τ² as the mean and variance of the J estimated values of θj. For the coagulation data in Table 11.2, the crude estimates are shown in the first column of Table 13.1.

Conditional maximization to find the joint mode of p(θ, μ, log σ, log τ|y)

Because of the conjugacy of the normal model, it is easy to perform conditional maximization on the joint posterior density, updating each parameter in turn by its conditional mode. In general, we analyze scale parameters such as σ and τ on the logarithmic scale. The conditional modes for each parameter are easy to compute, especially because we have already determined the conditional posterior density functions in computing the Gibbs sampler for this problem in Section 11.6. After obtaining a starting guess for the parameters, the conditional maximization proceeds as follows, where the parameters can be updated in any order.

1.  Conditional mode of each θj. The conditional posterior distribution of θj, given all other parameters in the model, is normal and given by (11.10). For j = 1, …, J, we can maximize the conditional posterior density of θj given (μ, σ, τ, y) (and thereby increase the joint posterior density) by replacing the current estimate of θj by θ̂j in (11.10).

2.  Conditional mode of μ. The conditional posterior distribution of μ, given all other parameters in the model, is normal and given by (11.12). For conditional maximization, replace the current estimate of μ by μ̂ in (11.13).

3.  Conditional mode of log σ. The conditional posterior density for σ² is scaled inverse-χ² and given by (11.14). The mode of the conditional posterior density of log σ is obtained by replacing the current estimate of log σ with log σ̂, with σ̂² defined in (11.15). (From Appendix A, the conditional mode of σ² is (n/(n+2)) σ̂². The factor of n/(n+2) does not appear in the conditional mode of log σ because of the Jacobian factor when transforming from p(σ²) to p(log σ); see Exercise 13.7.)

4.  Conditional mode of log τ. The conditional posterior density for τ² is scaled inverse-χ² and given by (11.16). The mode of the conditional posterior density of log τ is obtained by replacing the current estimate of log τ with log τ̂, with τ̂² defined in (11.17).

Numerical results of conditional maximization for the coagulation example are presented in Table 13.1, from which we see that the algorithm has required only three iterations to reach approximate convergence in this small example. We also see that this posterior mode is extremely close to the crude estimate, which occurs because the shrinkage factors are all near zero. Incidentally, the mode displayed in Table 13.1 is only a local mode; the joint posterior density also has another mode at the boundary of the parameter space; we are not especially concerned with that degenerate mode because the region around it includes little of the posterior mass (see Exercise 13.8).

In a simple applied analysis, we might stop here with an approximate posterior distribution centered at this joint mode, or even just stay with the simpler crude estimates. In other hierarchical examples, however, there might be quite a bit of pooling, as in the educational testing problem of Section 5.5, in which case it is advisable to continue the analysis, as we describe below.

Factoring into conditional and marginal posterior densities

As discussed, the joint mode often does not provide a useful summary of the posterior distribution, especially when J is large relative to the nj’s. To investigate this possibility, we consider the marginal posterior distribution of a subset of the parameters. In this example, using the notation of the previous sections, we set γ = (θ1, …, θJ) = θ and φ = (μ, log σ, log τ), and we consider the posterior distribution as the product of the marginal posterior distribution of φ and the conditional posterior distribution of θ given φ. The subvector (μ, log σ, log τ) has only three components no matter how large J is, so we expect asymptotic approximations to work better for the marginal distribution of φ than for the joint distribution of (γ, φ).

From (11.9) in the Gibbs sampling analysis of the coagulation data in Chapter 11, the conditional posterior density of the normal means, p(θ|μ, σ, τ, y), is a product of independent normal densities with means θ̂j and variances Vθj that are easily computable functions of (μ, σ, τ, y).

The marginal posterior density, p(μ, log σ, log τ|y), of the remaining parameters can be determined using formula (13.9), where the conditional distribution papprox(θ|μ, log σ, log τ, y) is actually exact. Thus,

p(μ, log σ, log τ|y) = p(θ, μ, log σ, log τ|y) / p(θ|μ, log σ, log τ, y).    (13.11)

Because the denominator is exact, this identity must hold for any θ; to simplify calculations, we set θ = θ̂ = (θ̂1, …, θ̂J), to yield

p(μ, log σ, log τ|y) ∝ p(θ̂, μ, log σ, log τ|y) Π_{j=1}^{J} Vθj^(1/2),    (13.12)

with the final factor coming from the normalizing constant of the normal distribution in the denominator, and where θ̂j and Vθj are defined by (11.11).

Finding the marginal posterior mode of p(μ, log σ, log τ|y) using EM

The marginal posterior mode of (μ, σ, τ)—the maximum of (13.12)—cannot be found analytically because the θ̂j’s and Vθj’s are functions of (μ, σ, τ). One possible approach is Newton’s method, which requires computing derivatives and second derivatives analytically or numerically. For this problem, however, it is easier to use the EM algorithm.

To obtain the mode of p(μ, log σ, log τ|y) using EM, we average over the parameter θ in the E-step and maximize over (μ, log σ, log τ) in the M-step. The logarithm of the joint posterior density of all the parameters is

log p(θ, μ, log σ, log τ|y) = −(J−1) log τ − (1/(2τ²)) Σ_{j=1}^{J} (θj − μ)² − n log σ − (1/(2σ²)) Σ_{j=1}^{J} Σ_{i=1}^{nj} (yij − θj)² + constant,    (13.13)

where n = Σj nj is the total number of observations.

E-step. The E-step, averaging over θ in (13.13), requires determining the conditional posterior expectations Eold((θj − μ)²) and Eold((yij − θj)²) for all j. These are both easy to compute using the conditional posterior distribution p(θ|μ, σ, τ, y), which we have already determined in (11.9):

Eold((θj − μ)²) = (θ̂j − μ)² + Vθj.

Using a similar calculation,

Eold((yij − θj)²) = (yij − θ̂j)² + Vθj.

For both expressions, θ̂j and Vθj are computed from (11.11) based on (μ, log σ, log τ)old.

M-step. It is now straightforward to maximize Eold(log p(θ, μ, log σ, log τ|y)) as a function of (μ, log σ, log τ). The maximizing values are (μnew, log σnew, log τnew), with (μ, σ, τ)new obtained by maximizing (13.13):

μnew = (1/J) Σ_{j=1}^{J} θ̂j

σ²new = (1/n) Σ_{j=1}^{J} Σ_{i=1}^{nj} [ (yij − θ̂j)² + Vθj ]

τ²new = (1/(J−1)) Σ_{j=1}^{J} [ (θ̂j − μnew)² + Vθj ].

The derivation of these is straightforward (see Exercise 13.9).

Checking that the marginal posterior density increases at each step. Ideally, at each iteration of EM, we would compute the log of (13.12) using the just calculated (μ, log σ, log τ)new. If the function does not increase, there is a mistake in the analytic calculations or the programming, or possibly a roundoff error, which can be checked by altering the precision of the calculations.

We apply EM to the coagulation example, using the values of (μ, σ, τ) from the joint mode as a starting point; numerical results appear in Table 13.2, where we see that the EM algorithm has approximately converged after only three steps. As is typical in this sort of problem, the variance parameters σ and τ are larger at the marginal mode than at the joint mode. The logarithm of the marginal posterior density, log p(μ, log σ, log τ|y), has been computed to the (generally unnecessary) precision of three decimal places for the purpose of checking that it does, indeed, monotonically increase. (If it had not, we would have debugged the program before including the example in the book!)

Table 13.2  Convergence of the EM algorithm to the marginal posterior mode of (μ, log σ, log τ) for the coagulation example. The marginal posterior density increases at each EM iteration, as it should. The posterior mode is in terms of log σ and log τ, but these values are transformed back to the original scale for display in the table.

Table 13.3  Summary of posterior simulations for the coagulation example, based on draws from the normal approximation to p(μ, log σ, log τ|y) and the exact conditional posterior distribution, p(θ|μ, log σ, log τ, y). Compare to joint and marginal modes in Tables 13.1 and 13.2.

Constructing an approximation to the joint posterior distribution

Having found the mode, we can construct a normal approximation based on the 3 × 3 matrix of second derivatives of the marginal posterior density, p(μ, log σ, log τ|y), in (13.12). To draw simulations from the approximate joint posterior density, first draw (μ, log σ, log τ) from the approximate normal marginal posterior density, then draw θ from the conditional posterior distribution, p(θ|μ, log σ, log τ, y), which is already normal and so does not need to be approximated. Table 13.3 gives posterior intervals for the model parameters from these simulations.

Comparison to other computations

If we determine that the approximate inferences are not adequate, the approximation that we have developed can still serve as a comparison point for more complicated algorithms, and as a source of starting points. For example, we can obtain a roughly overdispersed approximation to the target distribution by sampling from the t4 approximation for (μ, log σ, log τ); we can then subsample using importance resampling (see Section 10.4) and use these draws as starting points for the iterative simulations.

13.7   Variational inference

Variational Bayes is an algorithmic framework, similar to EM, for approximating a joint distribution. EM proceeds by alternately evaluating conditional expectations of the log density and using these to maximize a function of a set of hyperparameters (which in turn define the conditional distribution used to compute the expectation in the next step, and so forth), converging to a point estimate of the hyperparameters and thus an approximation to the posterior distribution. In variational Bayes, the iterations lead to a closed-form approximation that is the closest fit to the posterior distribution (in a sense defined below) within some specified class of functions.

Minimization of Kullback-Leibler divergence

In variational Bayes, a parametric approximation g(θ) is constructed iteratively using an expectation procedure that, as we shall show, has the effect of minimizing the Kullback-Leibler divergence from the target posterior distribution p(θ|y),

KL(g‖p) = Eg( log( g(θ) / p(θ|y) ) ) = ∫ g(θ) log( g(θ) / p(θ|y) ) dθ.    (13.15)

The absolute minimum of this divergence is 0, which is attained when g = p. The difficulty is that variational Bayes is typically applied in settings where we cannot directly summarize p—that is, we cannot easily take posterior draws from p(θ|y), nor can we easily compute expectations of interest, Ep(h(θ)). In variational Bayes we work with some simpler parameterized class of distributions g that are easier to handle.

Here we shall use the notation φ for the hyperparameters of the variational approximation. Thus we write our approximating function g as g(θ|φ), and the algorithm proceeds by starting with some guess of φ and then iteratively updating it in a way that is mathematically guaranteed to decrease the Kullback-Leibler divergence (13.15) at each step. As with EM, at some point φ no longer makes any visible changes. At that point we stop the iteration and use g(θ|φ), given the most recent update of φ, as our approximation to the posterior density. It can make sense to check the results by running the algorithm several times from different starting points.

The class of approximate distributions

There are various ways of defining the class of distributions for the variational approximation, g(θ|φ). A standard approach is to constrain the components of θ to be independent; thus,

g(θ) = Π_{j=1}^{J} gj(θj),

for a J-dimensional parameter θ. In that case, the family of distributions gj over which to optimize can be determined from the mathematical form of the posterior density function, p(θ|y).

It works like this: for each j, we examine the expectation of the log posterior density, log p(θ|y), considering it as a function of θj and averaging over the distributions g−j that represent the other J − 1 dimensions of θ. This is similar to Gibbs sampling except that we are interested in the average rather than the conditional density. At this point in setting up the variational Bayes algorithm, we do not yet need to evaluate the expectation Eg−j(log p(θ|y)); we merely need to figure out its mathematical form as a function of θj. Once we have done this for each parameter θj, we have determined the functional forms of the approximating distributions, gj(θj|φj).

As with EM, variational Bayes works best on exponential family models with conditionally conjugate prior distributions, in which case the approximating distributional families can typically be determined by inspection and the necessary expectations can be calculated in closed form. In nonconjugate models, variational Bayes can still be done by working within restricted functional forms such as normal distributions.

The variational Bayes algorithm

Once the classes of approximating distributions gj(θj|φj) have been identified, the computation begins with guesses of all the hyperparameters φ. We then cycle through the distributions gj, at each step updating the corresponding vector of hyperparameters φj. We use the notation Eg−j to indicate an average over the approximating distribution of all the parameters other than θj, conditional on the current iteration of g−j. The result is a (J−1)-dimensional integral, but for many models we are able to evaluate these expectations analytically.

We shall sketch the proof that the steps of variational Bayes decrease KL(g∥p) and thus gradually bring the approximating distribution g(θ) closer to the target posterior distribution p(θ|y). But first we show how the algorithm works in a simple but nontrivial example.

Example. Educational testing experiments

We illustrate with the hierarchical model for the 8 schools from Section 5.5. As with our demonstration of HMC in Section 12.5, we label the eight school effects (defined as θj in Chapter 5) as αj; the full vector of parameters θ then has 10 dimensions, corresponding to α1, …, α8, μ, τ, and the log posterior density is

log p(θ|y) = −(1/2) Σ_{j=1}^8 (yj − αj)²/σj² − (1/2) Σ_{j=1}^8 (αj − μ)²/τ² − 8 log τ + const.   (13.16)

We shall follow standard practice with variational Bayes and approximate p(θ|y) by a product of independent densities; thus,

g(θ) = (Π_{j=1}^8 g(αj)) g(μ) g(τ).   (13.17)

Determining the form of the approximating distributions. We begin variational Bayes as follows. For each of the ten variables in the model, we inspect the log posterior density, consider its expectation averaging over the independent distribution g for the other nine variables, and determine its parametric form:

•  For each αj, we look at E log p, averaging over the other seven α’s, μ, and τ; that is, averaging (13.16) over all the factors of (13.17) except for g(αj). The result is a quadratic function of αj. For this inspection, the details of the averaging distributions g are irrelevant; it is enough to know that they are independent. All we need to do is look at (13.16) and consider what will happen if we average over all parameters other than αj. The result is,

E log p(θ|y) = −(1/(2σj²))(yj − αj)² − (1/2)E(1/τ²)(αj − E(μ))² + const,

where various expectations that do not involve αj have been swept into the constant term. To be more explicit, we could write the expectation above as Eg−αj log p, indicating that it averages over all the factors of g in (13.17) except for g(αj).

In any case, we recognize E log p as a quadratic function of αj, and thus e^{E log p(θ|y)} is proportional to a normal density when considered as a function of αj. We can identify the parameters of this normal distribution by completing the square of the quadratic expression or, more intuitively from a statistical perspective, by recognizing the expression as equivalent to two pieces of information, one centered at yj with inverse-variance 1/σj² and one centered at E(μ) with inverse-variance E(1/τ²). We combine these by weighting the means and adding the inverse-variances, thus getting the following form for the variational Bayes component for αj:

g(αj) = N(αj | (yj/σj² + E(μ)E(1/τ²)) / (1/σj² + E(1/τ²)), (1/σj² + E(1/τ²))^−1).   (13.18)

•  For μ, we inspect (13.16). Averaging over all the parameters other than μ, the expression E log p(θ|y) has the form −(1/2)E(1/τ²) Σ_{j=1}^8 (E(αj) − μ)² + const. As above, this is the logarithm of a normal density; the parameters of this distribution can be determined by considering it as a combination of 8 pieces of information:

g(μ) = N(μ | (1/8) Σ_{j=1}^8 E(αj), (8E(1/τ²))^−1).   (13.19)

•  Finally, averaging over all parameters other than τ gives a density that can be recognized as inverse-gamma in τ² or, in the parameterization we prefer,

g(τ):  τ² ~ Inv-χ²(7, (1/7) Σ_{j=1}^8 E((αj − μ)²)),   (13.20)

with the expectations E((αj − μ)²) taken over the approximating distribution g.

The above expressions are essentially identical to the derivations of the conditional distributions for the Gibbs sampler for the hierarchical normal model in Section 11.6 and the EM algorithm in Section 13.6, the only difference being that in the 8-schools example we assume the data variances σj² are known.

Determining the conditional expectations. Rewriting the above factors in generic notation, we have:

g(αj) = N(αj | Mαj, Sαj²), for j = 1, …, 8,   (13.21)
g(μ) = N(μ | Mμ, Sμ²),   (13.22)
g(τ):  τ² ~ Inv-χ²(7, Mτ²).   (13.23)

We will need these to get the conditional expectations for each of the above three steps:

•  To specify the distribution for αj in (13.18), we need E(μ), which is Mμ from (13.22), and E(1/τ²), which is 1/Mτ² from (13.23).

•  To specify the distribution for μ in (13.19), we need E(αj), which is Mαj from (13.21), and E(1/τ²), which is 1/Mτ² from (13.23).

•  To specify the distribution for τ in (13.20), we need E((αj − μ)²), which is (Mαj − Mμ)² + Sαj² + Sμ² from (13.21) and (13.22), using the assumption that the densities g are independent in the variational approximation.

Figure 13.5  Progress of variational Bayes for the parameters governing the variational approximation for the hierarchical model for the 8 schools. After a random starting point, the parameters require about 50 iterations to reach approximate convergence. The lower-right graph shows the Kullback-Leibler divergence KL(g∥p) (calculated up to an arbitrary additive constant); KL(g∥p) is guaranteed to decrease monotonically if the variational algorithm is programmed correctly.

Starting values. To get variational Bayes started, we need to initialize, not the variables α, μ, τ, but the parameters of their distributions g(α), g(μ), g(τ). For simplicity we draw the unconstrained parameters Mα1, …, Mα8, Mμ from independent N(0, 1) distributions and the positive parameters Sα1, …, Sα8, Sμ from independent U(0, 1) distributions. The scale Mτ² of (13.23) likewise needs a positive starting value, as it enters the first update of the αj’s.

Running the algorithm. We iterate through α1, …, α8, μ, τ, at each step updating the corresponding distribution among (13.18)–(13.20) using the expectations from the current values of the other distributions. That is, we compute the parameters in (13.18)–(13.20) by plugging in the expectations described in the bullet points above, using the current values of the M’s and S’s. Then we label the newly computed means and standard deviations in (13.18)–(13.20) as the updated M’s and S’s. The algorithm thus looks a lot like EM, with the difference that it is distributions, rather than point estimates, that are being updated. The first five plots in Figure 13.5 show the progress of the parameters of the distributions (13.21)–(13.23). With these particular starting points, the algorithm takes a while to get moving, but once it gets unstuck, it quickly finds a solution.
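To make the update cycle concrete, here is a minimal sketch of the iteration in R, assuming the update forms (13.18)–(13.20) derived above; the variable names are ours, and y and sigma hold the 8-schools data from Table 5.2.

# 8-schools data: estimated effects and their standard errors
y     <- c(28, 8, -3, 7, -1, 1, 18, 12)
sigma <- c(15, 10, 16, 11, 9, 11, 10, 18)
J <- 8
# random starting values for the parameters of g
M_alpha <- rnorm(J); S_alpha <- runif(J)
M_mu    <- rnorm(1); S_mu    <- runif(1)
M_tau2  <- runif(1)   # scale of Inv-chi^2(7, M_tau2); must start positive
for (iter in 1:100) {
  E_inv_tau2 <- 1 / M_tau2                    # E(1/tau^2) under (13.23)
  # update g(alpha_j), j = 1, ..., 8, as in (13.18)
  prec    <- 1 / sigma^2 + E_inv_tau2
  M_alpha <- (y / sigma^2 + M_mu * E_inv_tau2) / prec
  S_alpha <- sqrt(1 / prec)
  # update g(mu) as in (13.19)
  M_mu <- mean(M_alpha)
  S_mu <- sqrt(1 / (J * E_inv_tau2))
  # update g(tau) as in (13.20), using E((alpha_j - mu)^2) from above
  E_sq   <- (M_alpha - M_mu)^2 + S_alpha^2 + S_mu^2
  M_tau2 <- sum(E_sq) / (J - 1)
}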

Checking that the fit is improving. As noted above, the Kullback-Leibler divergence (13.15) should decrease at each step of variational Bayes. In this example we can evaluate the expression analytically, and so we do. Because we are comparing these values and do not care about their absolute level, we can simplify the analysis by ignoring constants that do not depend on the parameters θ.

For the example at hand, the Kullback-Leibler divergence is, up to an additive constant,

KL(g∥p) = Eg(log g(θ)) − Eg(log p(θ|y)) + const,

with both expectations available in closed form from (13.16) and the factors (13.21)–(13.23).

The lower-right plot in Figure 13.5 shows the steady decrease in KL(g∥p) as the algorithm progresses.

Figure 13.6  Progress of inferences for the effects in schools A, B, and C, for 100 iterations of variational Bayes. The lines and shaded regions show the median, 50% interval, and 90% interval for the variational distribution. Shown to the right of each graph are the corresponding quantiles for the full Bayes inference as computed via simulation.

Comparing variational and full Bayes solutions. Figure 13.6 shows the progress of the variational algorithm for three of the parameters, α1, α2, α3, corresponding to the effects of the coaching programs in the first three schools in the dataset. The functional form here is Gaussian, so it will necessarily fail to capture some of the subtleties of the posterior distribution, as can be seen by comparison to the asymmetric full Bayes intervals in this example. In addition, this variational fit does not allow for dependence among the αj’s and thus would be inappropriate for some purposes. That said, the approximation fits the marginal distributions fairly well, and variational Bayes represents a fast and scalable approach for inference in this sort of problem with large datasets.

Unlike MCMC, which eventually converges to the posterior density, variational inference converges to an approximation—the closest fit within a restricted class. So in a case like this, where we can also run MCMC long enough for convergence, it makes sense to try to understand variational Bayes by comparing it to the actual posterior density.

Proof that each step of variational Bayes decreases the Kullback-Leibler divergence

The Kullback-Leibler divergence (13.15) is defined using the (normalized) posterior density, p(θ|y). The first step in understanding variational Bayes is to re-express it in terms of the unnormalized density, p(θ, y) = p(θ)p(y|θ), which we can calculate. The re-expression goes like this: p(θ|y) = p(θ, y)/p(y), and thus,

KL(g∥p) = Eg(log(g(θ)/p(θ, y))) + log p(y).   (13.24)

We cannot in general evaluate that last term, log p(y), but for the purpose of variational inference all we need to realize is that it does not depend on g. Hence, for any given model p and data y, decreasing the first term on the right of (13.24) is equivalent to decreasing KL(g∥p). The negative of that first term, Eg(log(p(θ, y)/g(θ))), is called the variational lower bound.

Next we show that each step of the variational algorithm is guaranteed to increase the variational lower bound, or equivalently to decrease the global Kullback-Leibler divergence (13.24). Consider the step where the distribution gj(θj) is being updated. We shall decompose the variational lower bound using the factorization we have assumed for the approximating distribution, g(θ) = gj(θj)g−j(θ−j):

Eg(log(g(θ)/p(θ, y))) = ∫∫ gj(θj)g−j(θ−j) (log gj(θj) + log g−j(θ−j) − log p(θ, y)) dθj dθ−j
   = ∫ gj(θj) [log gj(θj) − Eg−j(log p(θ, y))] dθj + ∫ g−j(θ−j) log g−j(θ−j) dθ−j.   (13.25)

We were able to turn the double integrals into single integrals in the last line above because we have assumed that gj and g-j are normalized probability densities and thus integrate to 1.

Here we are considering only the step at which gj is being updated, so we can ignore the last term in (13.25), as it depends only on g−j, and we can consider the expression in brackets in the first term to be (temporarily) fixed, in that it does not depend on gj:

Eg−j(log p(θ, y)) = ∫ g−j(θ−j) log p(θ, y) dθ−j.   (13.26)

As θ−j has been integrated out, expression (13.26) is a function only of θj and y, and it can be considered as the logarithm of an unnormalized probability density of θj, which we shall call g*j:

g*j(θj) ∝ exp(Eg−j(log p(θ, y))).   (13.27)

That is, log g*j is obtained by considering log p(θ, y) as a function of θj and taking the expectation over all the other components of θ, averaging over the current iteration of g−j (as illustrated in detail for the 8-schools example earlier in this section). Expression (13.25) then becomes,

∫ gj(θj) log(gj(θj)/g*j(θj)) dθj + const.   (13.28)

This last expression is, up to a constant, just KL(gj∥g*j), the Kullback-Leibler divergence of gj with respect to g*j, and it is minimized by setting gj = g*j as defined in (13.27).

Thus, when it is possible to evaluate the expectations in (13.26) and thus determine g*j, we have our update, and the variational algorithm is guaranteed to decrease the global Kullback-Leibler divergence (13.15) at each step. If the expectation (13.26) cannot be computed in closed form, what is needed is some update to gj that decreases (13.28), thus bringing gj closer to g*j in that step.

Model checking

The variational Bayes approximation is a generative model—that is, a proper probability distribution for the parameters θ—so we can check its fit by drawing a sample θs, s = 1, …, S, from the fitted g(θ) and then, for each θs, drawing a replicated dataset yrep,s. For the 8-schools example, we would first draw 1000 replications of the parameters α1, …, α8 from the final g in our variational Bayes calculation, then sample 1000 replicated datasets for the 8 schools. Or, if we were interested in predictions for new schools, we would first draw 1000 values of (μ, τ) from g, then, for each of these simulations, draw 8 new schools from N(μ, τ²), and then draw one new data observation for each new school (conditional on some assumed σ).
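Continuing the R sketch above (with our variable names M_alpha, S_alpha, sigma), the first of these replication schemes might look like:

# posterior predictive replications from the fitted variational g:
# draw alpha from g, then replicated data given alpha and the known sigma
S <- 1000
alpha_rep <- sapply(1:J, function(j) rnorm(S, M_alpha[j], S_alpha[j]))
y_rep     <- sapply(1:J, function(j) rnorm(S, alpha_rep[, j], sigma[j]))
# y_rep is an S x J matrix of replicated datasets for model checking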

In general we would expect that, compared to full Bayes, variational inference would fit the observed data more closely: as a point estimate in the space of distributions, variational Bayes can overfit the data. Nonetheless, a model check can be a good idea, as it could still reveal problems with the inferences.

Variational Bayes followed by importance sampling or particle filtering

Variational methods are commonly used as an approximate method when simulation-based full Bayes is too computationally expensive, as with very large models or datasets. In such cases it might make sense to use the variational estimate as a starting point for a stochastic algorithm leading to a better approximation to the target distribution.

The simplest idea would be importance sampling: in the sorts of problems where variational methods are tractable, we can easily compute both the target density p(θ|y) (at least up to a constant) and the approximation g(θ), and we can also get fast simulations directly from g. We can then take S draws θs from g and, for each, compute the importance weight p(θs|y)/g(θs). (As usual, we only need these weights up to an arbitrary multiplicative constant; thus it would be fine to use unnormalized densities in place of g or p.) We could then compute expectations using weighted averages.

Unfortunately, the direct use of importance weighting from a variational approximation can be disastrous: the variational Bayes fit tends to be less variable than the target distribution, hence the distribution of importance ratios can have long tails, leading to unstable averages. So instead we recommend importance resampling, in which we first draw from g and then resample without replacement using the importance weights. (It is crucial to resample without replacement so that any sampled points with extremely high weights do not dominate the simulations.) As in general with importance resampling, it is not always clear how many draws to take at each of the two stages of sampling.
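A minimal sketch of this two-stage scheme in R, assuming hypothetical functions r_g (a sampler returning an S × d matrix of draws from g), log_g (the log density of g), and log_p_unnorm (the unnormalized log target):

# importance resampling from a variational approximation g
S <- 10000
theta <- r_g(S)                            # draws from g (hypothetical sampler)
lw <- log_p_unnorm(theta) - log_g(theta)   # log importance ratios
w  <- exp(lw - max(lw))                    # subtract max for numerical stability
idx <- sample(S, 50, replace = FALSE, prob = w)
starts <- theta[idx, , drop = FALSE]       # e.g., starting points for MCMC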

A more general approach would be particle filtering, again using draws from the variational approximation as a starting point and then moving through the target density using Metropolis or Hamiltonian Monte Carlo, splitting and removing points as appropriate. Implementing this for any particular example could represent a large investment in programming time, but for a large problem, or in a computing environment in which particle filtering has already been set up, it could make sense.

EM as a special case of variational Bayes

Variational inference proceeds in J steps, each time updating one conditional distribution gj and averaging over the other factors of g. EM has two steps (the E-step and the M-step), alternately estimating a parameter φ and averaging over the other parameters γ. EM can be seen as a special case of variational Bayes in which (a) the parameters are partitioned into two parts, φ and γ; (b) the approximating distribution for φ is required to be a point mass (thus, updating g(φ) is equivalent to updating the point estimate of φ); and (c) the approximating distribution for γ is unconstrained, so that g(γ) = p(γ|φ, y), conditional on the most recent update of φ.

More general forms of variational Bayes

Variational inference as described above is an approach for approximating a target distribution p over some class of approximating distributions g, using an iterative algorithm that at each step reduces the Kullback-Leibler divergence, KL(g∥p). As noted above, this can be done in closed form only for models with certain conjugacy properties. Such models include many important special cases (such as normal distributions and finite mixtures), but more generally the idea of variational Bayes can be extended by replacing the objective function (the criterion to be minimized) with an approximation of KL(g∥p). For many problems, including logistic regression, good approximations are available, so that an algorithm that optimizes the new criterion yields a good approximation to the posterior distribution. Typically the approximation itself changes with each step, being defined based on the most recent update of the approximating function g.

13.8   Expectation propagation

Expectation propagation (EP) is another deterministic iterative algorithm in which the posterior distribution p(θ|y) is approximated by a best-fit distribution from some specified parametric family. Expectation propagation differs from variational Bayes in its optimization criterion and also in how it is computed. We shall first describe the algorithm in general and then go through the steps of applying it to logistic regression.

We start with the target distribution p(θ|y), which we shall write as f(θ), suppressing the dependence on y, which is not directly relevant for these computations. We assume that f has some convenient factorization,

f(θ) = Π_{i=0}^n fi(θ).   (13.29)

As with many Bayesian computations, all we need for the fi’s are the unnormalized density functions.

Expectation propagation can be expressed more generally, but in this description it is convenient to think of f0(θ) as the prior density and each fi(θ) as the likelihood for one data point. The computational advantage of the factorization arises from the possibility of computing certain expectations rapidly when the density is factorized in this way, as we discuss below.

As with variational Bayes, expectation propagation works by iteratively approximating the target distribution by some g(θ) which is constrained to follow some parametric form. The algorithm begins with some guess for g and then proceeds via iterative updating. A key difference between the two methods is that variational inference is typically based on a separation of g into factors for each parameter (thus, g(θ) = Π_{j=1}^J gj(θj)), whereas expectation propagation factorizes g based on a partition of the data; thus,

g(θ) = Π_{i=0}^n gi(θ).   (13.30)

At convergence, each factor of (13.30) is intended to approximate the corresponding factor of (13.29). The trick is that these approximations are done one at a time, but in the context of the entire distribution.

Exponential families, sufficient statistics, and natural parameters. The approximating distribution g(θ) should be in the exponential family—as discussed on page 36, this means that the density can be written as a normalizing function times the exponential of a linear function of sufficient statistics of θ. For example, take the normal distribution:

N(θ|μ, σ²) = (1/(√(2π)σ)) exp(−μ²/(2σ²)) exp((μ/σ²)θ − (1/(2σ²))θ²).

In this parameterization, the normalizing function is the ugly-looking (1/(√(2π)σ)) exp(−μ²/(2σ²)), but this does not really matter. What is important are the sufficient statistics, θ and θ². In expectation propagation, we compute the expectations of the sufficient statistics under various combinations of the approximating distribution and the target distribution. The coefficients of the sufficient statistics inside the exponential of the above expression—here, μ/σ² and −1/(2σ²)—are called the natural parameters of the model.

In typical applications of expectation propagation, the approximating distribution is restricted to the multivariate normal family; thus, g(θ) = N(θ|μ, Σ). Here there are two sets of sufficient statistics, the vector θ and the outer-product matrix θθ^T, and the corresponding natural parameters are proportional to the scaled mean vector Σ^−1μ and the precision matrix Σ^−1.

The expectation propagation algorithm. At each step of the iterative algorithm, we take the current approximating function g(θ) and pull out the approximating factor gi(θ), replacing it by the corresponding factor fi(θ) from the target distribution. We define the (unnormalized) cavity distribution,

g−i(θ) = g(θ)/gi(θ),

and the (unnormalized) tilted distribution,

g−i(θ)fi(θ).

We then construct an approximation to the tilted distribution, using a moment-matching approach described below. This approximation is the updated g(θ). We then back out the updated approximating factor, gi(θ) = g(θ)/g−i(θ). The result is a new gi(θ) which approximates fi(θ) in the context of g−i. This also explains why the algorithm needs to iterate: the context changes with each step, until convergence.

Moment matching. The core of the expectation propagation algorithm is the step in which the approximation to the tilted distribution, g−i(θ)fi(θ), is constructed within the parametric form specified for g(θ). This is done by matching moments: that is, setting the expectations of the sufficient statistics under g to the corresponding expectations of θ under the tilted distribution g−i(θ)fi(θ).

For example, if g(θ) has the form N(θ|μ, Σ), then in the moment-matching step we set μ to the mean of θ and Σ to the variance matrix of θ under the tilted distribution.

The difficult part of this step is computing these expectations, which in theory could require an integration over a high-dimensional space (that of the entire parameter vector θ). In practical implementations of expectation propagation, these integrals can be done in closed form or via a transformation that reduces the problem to a low-dimensional integral. What makes this work is that the tilted distribution is mostly g-i (which is easy to handle because it follows a specified parametric form such as the multivariate normal) with only one difficult factor fi. For many models, fi can be expressed in such a way that its integral over g-i is well behaved.

If g is updated after each moment-matching step, the algorithm is called sequential EP, whereas if g is updated only after all tilted moments have been computed the algorithm is called parallel EP. Parallel EP is typically much faster as it requires less frequent updates of the higher-dimensional function g.

Moment matching corresponds to minimizing the Kullback-Leibler divergence from the tilted distribution to the updated approximation, but the iterative matching of these local fits does not guarantee that the Kullback-Leibler divergence from the full posterior distribution to the overall approximation is minimized. There is no guarantee of convergence for EP, but for models with log-concave factors fi and initialization to the prior distribution, the algorithm has been used successfully in many applications.

Expectation propagation for logistic regression

Consider the model of independent data yi ~ Bin(mi, logit^−1(Xiθ)), i = 1, …, n, with prior distribution θ ~ N(μ0, Σ0). Here, Xi is the ith row of the n × k matrix X of predictors. It is not difficult to iteratively solve for the (k-dimensional) posterior mode of θ and then compute the matrix of second derivatives of the log posterior density, thus obtaining a mode-centered normal approximation (for details, see Section 16.2), but we can get a better normal approximation using expectation propagation, as follows.

We use a normal approximating function with factors gi(θ) = N(θ|μi, Σi), i = 0, …, n. We set g0 equal to the prior distribution and, to start, for each i = 1, …, n, set the natural parameter Σi^−1μi to the zero vector and Σi^−1 to the identity matrix Ik times some positive number, corresponding to a starting distribution that is precise enough to be computationally stable but not so sharply localized that the algorithm is slow to move away from its initial value.

The iteration proceeds by stepping through the data points. For each i:

1.  Compute the parameters of the cavity distribution, g−i(θ) = g(θ)/gi(θ) = N(θ|μ−i, Σ−i):

Σ−i^−1 = Σ^−1 − Σi^−1,   Σ−i^−1μ−i = Σ^−1μ − Σi^−1μi.

2.  Project the cavity distribution onto the one-dimensional subspace defined by the data vector Xi. The projected distribution is a one-dimensional normal with mean and variance,

M−i = Xiμ−i,   V−i = XiΣ−iXi^T.

Steps 1 and 2 can be combined so that only the scalar moments XiΣXi^T and Xiμ are required:

V−i = (1/(XiΣXi^T) − 1/Vi)^−1,   M−i = V−i (Xiμ/(XiΣXi^T) − Mi/Vi).

3.  Define the (unnormalized) tilted distribution of η = Xiθ:

N(η|M−i, V−i) (logit^−1(η))^{yi} (1 − logit^−1(η))^{mi−yi}.

Compute moments 0, 1, 2 of this unnormalized distribution to get moments of the tilted distribution of η:

Ek = ∫ η^k N(η|M−i, V−i) (logit^−1(η))^{yi} (1 − logit^−1(η))^{mi−yi} dη,   k = 0, 1, 2.

Compute M = E1/E0 and V = E2/E0 − (E1/E0)², the mean and variance of the tilted distribution of η, using numerical integration; an adaptive Gauss–Kronrod quadrature method works well here. To perform these (one-dimensional) integrals, we need lower and upper bounds of integration. Ideally we would set these based on the mode and curvature of the tilted distribution, but for simplicity we might just use M−i ± δ√V−i, based on the mean and standard deviation of the cavity distribution. The multiplier δ is set to some large number such as 10 to ensure that the mass of the tilted distribution is contained in the range of integration.

4.  Subtract off the cavity distribution to get the moments of the updated approximating factor gi(η) = N(η|Mi, Vi):

Vi = (1/V − 1/V−i)^−1,   Mi = Vi (M/V − M−i/V−i).

5.  Transform these to get the moments of the updated approximating factor defined on the full space, gi(θ) = N(θ|μi, Σi):

Σi^−1 = Xi^T Xi/Vi,   Σi^−1μi = Xi^T Mi/Vi.

Recall that Mi and Vi are scalars, Σi^−1μi is a k × 1 vector, and Σi^−1 is a k × k matrix.

6.  Combine this updated gi with the cavity distribution g−i to get the updated approximating distribution, g(θ) = N(θ|μ, Σ). This is done by adding the natural parameters of the component parts:

Σ^−1 = Σ−i^−1 + Σi^−1,   Σ^−1μ = Σ−i^−1μ−i + Σi^−1μi.

(Steps 1–6 are illustrated in the R sketch following this list.)

In parallel EP this step is skipped, and the updated posterior is computed only after all the approximating factors gi(θ) have been updated. With large n and k, this saves computation time.

7.  Now return to step 1 with a new value of i.
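Here is a minimal sketch in R of one site update (steps 1–6 above), under the stated assumptions; the objects Sigma and mu (the current global normal), M_i and V_i (the current site moments, stored as scalars, a slight simplification of the matrix initialization described above), X_i (a 1 × k matrix), and the data (y_i, m_i) are ours.

# one EP site update for logistic regression (steps 1-6); names are ours
s <- drop(X_i %*% Sigma %*% t(X_i))        # X_i Sigma X_i^T
m <- drop(X_i %*% mu)                      # X_i mu
V_cav <- 1 / (1/s - 1/V_i)                 # steps 1-2, combined scalar form
M_cav <- V_cav * (m/s - M_i/V_i)
tilted <- function(eta, k)                 # step 3: integrand for moment k
  eta^k * dnorm(eta, M_cav, sqrt(V_cav)) *
    plogis(eta)^y_i * (1 - plogis(eta))^(m_i - y_i)
lo <- M_cav - 10 * sqrt(V_cav); hi <- M_cav + 10 * sqrt(V_cav)
E0 <- integrate(tilted, lo, hi, k = 0)$value
E1 <- integrate(tilted, lo, hi, k = 1)$value
E2 <- integrate(tilted, lo, hi, k = 2)$value
M <- E1/E0; V <- E2/E0 - M^2               # tilted mean and variance
V_i_new <- 1 / (1/V - 1/V_cav)             # step 4: subtract the cavity
M_i_new <- V_i_new * (M/V - M_cav/V_cav)
Prec  <- solve(Sigma)                      # steps 5-6: swap the site factor
shift <- Prec %*% mu                       # into the global natural parameters
Prec  <- Prec  - crossprod(X_i)/V_i     + crossprod(X_i)/V_i_new
shift <- shift - t(X_i)*(M_i/V_i)       + t(X_i)*(M_i_new/V_i_new)
Sigma <- solve(Prec); mu <- drop(Sigma %*% shift)
M_i <- M_i_new; V_i <- V_i_new             # store the updated site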

What makes the algorithm computationally feasible is that, in each step, the relevant factor of the likelihood depends on the parameters only through the linear combination Xiθ. It is a different linear combination at each step, but during any particular step, the required integrals are one-dimensional.

In addition, the algorithm works just as easily for any fixed normal prior distribution on θ, as this simply folds into the factor g0. For a hierarchical model containing additional hyperparameters, another step is needed.

We illustrate expectation propagation for a simple logistic regression with uniform prior distribution.

Example. Bioassay logistic regression with two coefficients

Section 3.7 describes an experiment on 20 rats in four groups of 5, each group exposed to a different level of a toxin. For consistency with the notation immediately above, we write the model as yi ~ Bin(mi, logit^−1(θ1 + θ2xi)), i = 1, …, 4. The row vector of predictors for observation i is then Xi = (1, xi). The model is completed with a flat prior distribution on θ = (θ1, θ2). As usual in such settings, this uniform prior distribution is not a reasonable summary of any scientific understanding of the problem but rather serves as a placeholder, with the understanding that it can be augmented with substantive information as needed.

The data are in Table 3.1 on page 74. In Section 4.1 we fit the basic mode-based normal approximation, yielding a point estimate of (0.8, 7.7) and a covariance matrix as shown in Figure 4.1 on page 86. For comparison, Figure 3.3 on page 76 displays the contours of the exact posterior density. The actual density is skewed, with a long tail toward large values of θ1 and θ2, and so we would hope that the expectation propagation approximation would shift in that direction, relative to the mode, to better fit the full distribution.

Figure 13.7  Progress of expectation propagation for a simple logistic regression with intercept and slope parameters. The bivariate normal approximating distribution is characterized by a mean and standard deviation in each dimension and a correlation. The algorithm reached approximate convergence after 4 iterations.

Figure 13.8  (a) Progress of the normal approximating distribution during the iterations of expectation propagation. The small ellipse at the bottom (which is actually a circle if x and y axes are placed on a common scale) is the starting distribution; after a few iterations the algorithm converges. (b) Comparison of the approximating distribution from EP (solid ellipse) to the simple approximation based on the curvature at the posterior mode (dotted ellipse) and the exact posterior density (dashed oval). The exact distribution is not normal so the EP approximation is not perfect, but it is closer than the mode-based approximation. All curves show contour lines for the density at 0.05 times the mode (which for the normal distribution contains approximately 95% of the probability mass; see discussion on page 85).

We then run the algorithm, with starting values Σi^−1μi = 0 and Σi^−1 = I for i = 1, …, 4. During the progress of the iterations we keep track of the 2 × 1 vector Σ^−1μ and the 2 × 2 matrix Σ^−1; these are the natural parameters of the normal approximating distribution. To understand them better, we reparameterize as μ1, μ2, σ1, σ2, ρ. Figure 13.7 shows the progress of these parameters over 10 iterations of the algorithm—a total of 40 steps—which is more than enough in this case for practical convergence. Figure 13.8 compares the final approximating distribution from expectation propagation to the simpler normal approximation based on the curvature at the posterior mode, and to the exact posterior density. In this example EP performs well: the approximation is shifted toward the mass of the distribution, as we would hope.

In summary, expectation propagation is an appealing algorithm: it is fast and direct to implement. Unlike EM, it approximates the entire distribution rather than just supplying a point estimate; and, unlike the usual implementations of variational Bayes, it fits the joint distribution rather than just the margins. Expectation propagation can be difficult to apply in general settings, however, as it requires a likelihood or prior factorization in which the required integrals can be expressed in some simple form. An active goal of research for all these deterministic approximate methods is to develop general implementations that can work with arbitrary density functions, as can now be done stochastically using Gibbs, Metropolis, and Hamiltonian Monte Carlo.

Extensions of expectation propagation

There is a provably convergent, but slower, double-loop algorithm for EP, which can be combined with regular EP so that the slower algorithm is used only if regular EP does not converge.

Sometimes convergence of EP can be improved by damping, that is, by making only partial updates of g after moment matching. Fractional EP (or power EP) is an extension of EP that can improve stability when the approximation g is not flexible enough or when the propagation of information is difficult due to vague prior information. Fractional updating can be viewed as minimization of an α-divergence, which includes the directed Kullback-Leibler divergences and the Hellinger distance as special cases. Fractional EP thus provides flexibility in the choice of minimized divergence; it can also be used to improve convergence, with standard EP recovered by setting α close to 1 in the final iterations. Improved marginal posteriors for θi can be obtained by applying expression (13.9) or faster approximations thereof.

13.9   Other approximations

Integrated nested Laplace approximation (INLA)

Another form of posterior approximation involves partitioning the parameters into a large set γ conditional on a smaller set of hyperparameters φ. The idea is to construct a joint Gaussian approximation for p(γ|φ, y) and apply expression (13.9) to approximate both p(φ|y) and p(γi|φ, y). Approximations to p(γi|y) are then obtained by numerically integrating over the low-dimensional papprox(φ|y) (hence the name integrated nested Laplace approximation). INLA works best when there are not many hyperparameters in the model, because then the space of hyperparameters is small enough that their marginal posterior distribution can be reasonably approximated by a sample on a discrete grid. The algorithm was developed for hierarchical models in which the parameters of the data model have a joint normal prior, so that the conditional normal approximation is easily constructed.

Central composite design integration (CCD)

If we wish to improve on the modal approximation but the computation of p(φ|y) is costly, we want to minimize the number of evaluation points around the mode. Clever deterministic placement of points can provide lower variance with the same number of posterior evaluations as sampling-based approaches. Central composite design is a useful method for obtaining a moderate number of representative points from posteriors of moderate dimensionality; for example, a 5-dimensional model uses 27 integration points under this method, while a 15-dimensional model uses 287 points. CCD uses a fractional factorial design to avoid the exponential increase in the number of evaluation points as the dimensionality of the posterior increases, while still allowing estimation of the curvature of the posterior distribution around the mode. The integration is a finite sum with special weights. The accuracy of CCD falls between the modal approximation and full integration with a grid or Monte Carlo. We use the CCD method in Chapter 21 for Gaussian processes.

Approximate Bayesian computation (ABC)

The term ‘approximate Bayesian computation’ is applied to a set of statistical procedures based on drawing parameters θ from an initial or approximate distribution, then sampling replicated data yrep from the model given θ, and then accepting or rejecting the draw based on the closeness of yrep to the observed data y. The attraction of approximate Bayesian computation is that it does not require computation of the likelihood function, only the ability to simulate yrep given θ from the data distribution; the difficulty is in the assessment of the closeness of yrep to y.

The most basic form of ABC has the form of simple rejection sampling:

•  Draw θ from the prior distribution p(θ) and then yrep from the data distribution, p(yrep|θ), thus obtaining a single draw of yrep from its marginal distribution.

•  Compute a discrepancy measure d(yrep, y), where d is defined so that it is zero if y and yrep are identical and is larger the more ‘different’ they are, in some relevant dimensions.

•  Accept θ if d(yrep, y) < ε for some preset threshold ε; otherwise reject.

The result is to accept draws from the prior distribution in proportion to the probability that they yield replicated data close to the observed data. For small ε this probability is approximately proportional to the likelihood, hence the accepted set of simulation draws is an approximation of the posterior distribution.
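A minimal sketch of this rejection scheme in R, assuming hypothetical functions r_prior() (a prior sampler), r_data(theta) (the forward simulator), and d() (the discrepancy), along with a threshold eps:

# rejection ABC: keep prior draws whose simulated data land within eps of y
abc_one <- function() {
  theta <- r_prior()                # draw from the prior (hypothetical)
  y_rep <- r_data(theta)            # simulate data given theta (hypothetical)
  if (d(y_rep, y) < eps) theta else NULL
}
draws <- replicate(10000, abc_one(), simplify = FALSE)
abc_sample <- do.call(rbind, draws) # accepted draws approximate p(theta|y)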

ABC involves three challenges. First, one needs to define a discrepancy measure d, which ideally should capture the aspects of the data that are relevant for estimating the parameters of the model (that is, the sufficient statistics) without requiring yrep to match y on irrelevant ‘noise’ dimensions. Second, ε needs to be set small enough that the data provide information, but not so small that all (or almost all) the simulations get rejected. Third, if the prior distribution is broad, the rejection rate can be unacceptably high even if the discrepancy measure and threshold have been chosen well.

These challenges can be partly addressed by combining ABC with other ideas of posterior simulation. For example, it is not necessary to draw from the prior; one can draw simulations from another distribution and then correct using importance sampling. Or one can use MCMC steps to move in interesting regions of parameter space. As with many ideas in Bayesian simulation, research is stimulated by the practical challenges of approximating certain distributions that arise in practice.

A related idea is the substitution likelihood, in which one uses a rank likelihood, or a likelihood that depends only on quantiles and not on what happens between them, in place of a full likelihood specification. These almost-likelihoods are put in place of the likelihood in Bayes’ rule. The advantage of this approach is that it allows a specified joint distribution model (sometimes called a copula) to be applied in settings where the assumed marginal distributions would not fit. It is thus a computational approximation that allows a popular class of statistical models to be applied more broadly.

13.10   Unknown normalizing factors

Finally, we discuss the application of numerical integration to compute normalizing factors, a problem that arises in some complicated models that we largely do not discuss in this book. We include this section here to introduce the problem, which is an active area of research; see the bibliographic note for some references on the topic.

Most of the models we present are based on combining standard classes of models for which the normalizing constants are known; for example, all the distributions in Appendix A have exactly known densities. Even the nonconjugate models we usually use are combinations of standard parts.

For standard models, we can compute p(θ) and p(y|θ) exactly, or up to unknown multiplicative constants, and the expression

p(θ|y) ∝ p(θ)p(y|θ)

has a single unknown normalizing constant—the denominator of Bayes’ rule, p(y). A similar result holds for a hierarchical model with data y, local parameters γ, and hyperparameters φ. The joint posterior density has the form

p(γ, φ|y) ∝ p(φ)p(γ|φ)p(y|γ, φ),

which, once again, has only a single unknown normalizing constant. In each of these situations we can apply standard computational methods using the unnormalized density.

Unknown normalizing factors in the likelihood. A new and different problem arises when the sampling density p(y|θ) has an unknown normalizing factor that depends on θ. Such models often arise in problems that are specified conditionally, such as in spatial statistics. For a simple example, pretend we knew that the univariate normal density was of the form p(y|μ, σ) ∝ exp(−(1/(2σ²))(y − μ)²), but with the normalizing factor 1/(√(2π)σ) unknown. Performing our analysis as before without accounting for the factor of 1/σ would lead to an incorrect posterior distribution for σ. (See Exercise 10.11 for a simple nontrivial example of an unnormalized density.)
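A quick numerical check in R illustrates the point: the unnormalized kernel integrates to √(2π)σ, which depends on the parameter σ and so cannot be treated as a constant.

# the kernel exp(-(y - mu)^2/(2 sigma^2)) integrates to sqrt(2*pi)*sigma,
# a function of sigma, so the 'constant' cannot be dropped
sigma <- 2
integrate(function(y) exp(-(y - 1)^2 / (2 * sigma^2)), -Inf, Inf)$value
# 5.013257, which equals sqrt(2 * pi) * sigma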

In general we use the following notation:

p(y|θ) = (1/z(θ)) q(y|θ),

where q is a generic notation for an unnormalized density, and

z(θ) = ∫ q(y|θ) dy   (13.31)

is called the normalizing factor of the family of distributions—being a function of θ, we can no longer call it a ‘constant’—and q(y|θ) is a family of unnormalized densities. We consider the situation in which q(y|θ) can be computed easily but z(θ) is unknown. Combining the density p(y|θ) with a prior density, p(θ), yields the posterior density

p(θ|y) ∝ p(θ) q(y|θ)/z(θ).

To perform posterior inference, one must determine p(θ|y), as a function of θ, up to an arbitrary multiplicative constant.

An unknown, but constant, normalizing factor in the prior density, p(θ), causes no problems because it does not depend on any model parameters.

Unknown normalizing factors in hierarchical models. An analogous situation arises in hierarchical models if the population distribution has an unknown normalizing factor that depends on the hyperparameters. Consider a model with data y, first-level parameters γ, and hyperparameters φ. For simplicity, assume that the likelihood, p(y|γ), is known exactly but the population distribution is known only up to an unnormalized density, q(γ|φ) = z(φ)p(γ|φ). The joint posterior density is then

p(γ, φ|y) ∝ p(φ) q(γ|φ)/z(φ) p(y|γ),

and the factor z(φ) must be considered. If the likelihood, p(y|γ), also has an unknown normalizing factor, it too must be considered in order to work with the posterior distribution.

Posterior computations involving an unknown normalizing factor

A basic computational strategy. If the integral (13.31), or the analogous expression for the hierarchical model, cannot be evaluated analytically, numerical integration can be used, perhaps involving more advanced approaches such as bridge and path sampling, discussed below. An additional difficulty is that one must evaluate (or estimate) the integral as a function of θ, or of φ in the hierarchical case. The following basic strategy, combining analytic and simulation-based integration methods, can be used for computation with a posterior distribution containing unknown normalizing factors.

1.  Obtain an analytic estimate of z(θ) using some approximate method, for example Laplace’s method centered at a crude estimate of θ.

2.  Construct an approximation to the posterior distribution, as discussed earlier in this chapter. Such approximations can often be integrated directly.

3.  For more exact computation, evaluate z(θ) (see below) whenever the posterior density needs to be computed for a new value of θ. Computationally, this approach treats z(θ) as an approximately ‘known’ function that happens to be expensive to compute.

Other strategies are possible in specific problems. If θ (or φ, in the hierarchical version of the problem) is only one- or two-dimensional, it may be reasonable to compute z(θ) over a finite grid and interpolate to obtain an estimate of z(θ) as a function of θ. It is still recommended to perform the approximate steps 1 and 2 above so as to get a rough idea of the location of the posterior distribution—for any given problem, z(θ) need not be computed in regions of θ for which the posterior probability is essentially zero.

Computing the normalizing factor. The normalizing factor can be computed, for each value of θ, using any of the numerical integration approaches applied to (13.31). Applying approximation methods such as Laplace’s is fairly straightforward, with the notation changed so that integration is over y, rather than θ, or changed appropriately to evaluate normalizing constants as a function of hyperparameters in a hierarchical model.

The importance sampling estimate is based on the identity

z(θ) = ∫ q(y|θ) dy = ∫ (q(y|θ)/g(y)) g(y) dy = Eg(q(y|θ)/g(y)),

where Eg averages over y under the approximate density g(y). The estimate of z(θ) is (1/S) Σ_{s=1}^S q(ys|θ)/g(ys), based on simulations ys from g(y). Again, estimation of a normalizing factor for a hierarchical model is analogous.
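In R, with hypothetical functions g_draw (a sampler for g), g_dens (its density), and q_dens (the unnormalized sampling density), the estimate takes two lines; note that the same draws ys are reused across values of θ, a point discussed next.

# importance sampling estimate of z(theta), reusing one set of draws from g
ys <- g_draw(1000)                               # draws y^s from g (hypothetical)
z_hat <- function(theta) mean(q_dens(ys, theta) / g_dens(ys))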

Some additional subtleties arise, however, when applying this method to evaluate z(θ) for many values of θ. First, we can use the same approximation function, g(y)—and in fact the same simulations, y1, …, yS—to estimate z(θ) for different values of θ. Compared to performing a new simulation for each value of θ, using the same simulations saves computing time and increases accuracy (with the overall savings in time, we can simulate a larger number S of draws), but in general this can be done only in a local range of θ where the densities q(y|θ) are similar enough to each other that they can be approximated by the same density. Second, we have some freedom in our computations because the evaluation of z(θ) as a function of θ is required only up to a proportionality constant. Any arbitrary constant that does not depend on θ becomes part of the constant in the posterior density and does not affect posterior inference. Thus, the approximate density g(y) is not required to be normalized, as long as we use the same function g(y) to approximate q(y|θ) for all values of θ, or if we know, or can estimate, the relative normalizing constants of the different approximation functions used in the problem.

Bridge and path sampling

When computing integrals numerically, we typically want to evaluate several of them (for example, when computing the marginal posterior densities of different models) or to compute them for a range of values of a continuous parameter (as with continuous model expansion or when working with models whose normalizing factors depend on the parameters in the model and cannot be determined analytically).

In these settings with a family of normalizing factors to be computed, importance sampling can be generalized in a number of useful ways. Continuing our notation above, we let φ be the continuous or discrete parameter indexing the family of densities p(γ|φ, y). The numerical integration problem is to average over γ in this distribution, for each value of φ (or for a continuous range of values of φ). In general, for these methods it is only necessary to compute the densities p up to arbitrary normalizing constants.

One approach is to perform importance sampling using the density at some central value, p(γ|φ*, y), as the approximating distribution for the entire range of φ. This approach is convenient as it does not require the construction of a special papprox but rather uses a distribution from a family that we already know how to handle (probably using Markov chain simulation).

If the distributions p(γ|φ, y) are far enough apart that no single φ* can effectively cover all of them, we can move to bridge sampling, in which γ is sampled from two distributions, p(γ|φ0, y) and p(γ|φ1, y). Here, φ0 and φ1 represent two points near the ends of the space of φ (think of the family of distributions as a suspension bridge held up at two points). The bridge sampling estimate of the integral for any φ is a weighted average of the importance sampling estimates from φ0 and φ1. The weights depend on φ and can be computed using a simple iterative formula.

Bridge sampling is a general idea that arises in many statistical contexts and can be further generalized to allow sampling from more than two points, which makes sense if the distributions vary widely over φ. In the limit in which a sample is drawn from the entire continuous range of distributions p(γ|φ, y) indexed by φ, we can apply path sampling, a differential form of bridge sampling. In path sampling, a sample (γ, φ) is drawn from a joint posterior distribution, and the derivative of the log joint posterior density, d log p(γ, φ|y)/dφ, is computed at the simulated values and numerically integrated over φ to obtain an estimate of the log marginal density, log p(φ|y), over a continuous range of values of φ. This simulation-based computation uses the identity,

(d/dφ) log p(φ|y) = E(U(γ, φ) | φ, y),

where U(γ, φ) = d log p(γ, φ|y)/dφ. Numerically integrating these values gives an estimate of log p(φ|y) (up to an additive constant) as a function of φ.
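As a minimal sketch in R, suppose phi holds the simulated values of φ and u the corresponding values of U(γ, φ) (both our names); the log marginal density can then be traced out with, say, the trapezoid rule (binning or smoothing the draws first would reduce noise):

# path sampling: integrate the simulated gradient U over phi
ord <- order(phi)
phi_s <- phi[ord]; u_s <- u[ord]
log_p_phi <- cumsum(c(0, diff(phi_s) * (head(u_s, -1) + tail(u_s, -1)) / 2))
# log_p_phi estimates log p(phi|y) at the points phi_s, up to a constant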

Bridge and path sampling are related to parallel tempering (see page 299), which uses a similar structure of samples from an indexed family of distributions. Depending on the application, the marginal distribution of φ can be specified for computational efficiency or convenience (as with tempering) or estimated (as with the computation of marginal densities).

13.11   Bibliographic note

An accessible source of general algorithms for conditional maximization (stepwise ascent), Newton’s method, and other computational methods is Press et al. (1986). Gill, Murray, and Wright (1981) is a classic book that is useful for understanding more complicated optimization problems.

The boundary-avoiding prior densities in Section 13.2 are discussed by Chung, Rabe-Hesketh, et al. (2013a,b).

Laplace’s method for integration was developed in a statistical context by Tierney and Kadane (1986), who demonstrated the accuracy of applying the method separately to the numerator and denominator of (13.3). Extensions and refinements were made by Kass, Tierney, and Kadane (1989) and Wong and Li (1992). Geweke (1989) discusses modal approximations for importance sampling and proposes the k-variate split normal density as an improved approximation for asymmetric posterior densities.

The EM algorithm was first presented in full generality and under that name, along with many examples, by Dempster, Laird, and Rubin (1977); the formulation in that article is in terms of finding the maximum likelihood estimate, but, as the authors note, the same arguments hold for finding posterior modes. That article and the accompanying discussion contributions also refer to many earlier implementations in specific problems; see also Meng and Pedlow (1992). EM was first presented in a general statistical context by Orchard and Woodbury (1972) as the ‘missing information principle’ and first derived in mathematical generality by Baum et al. (1970). Little and Rubin (2002, Chapter 8) discuss the EM algorithm for missing data problems. SEM was introduced in Meng and Rubin (1991); ECM in Meng and Rubin (1993) and Meng (1994a); SECM in van Dyk, Meng, and Rubin (1995); and ECME in Liu and Rubin (1994). AECM appears in Meng and van Dyk (1997), and the accompanying discussion provides further connections. Many of the iterative simulation methods discussed in Chapter 11 for simulating posterior distributions can be regarded as stochastic extensions of EM; Tanner and Wong (1987) is an important paper in drawing this connection. Parameter-expanded EM was introduced by Liu, Rubin, and Wu (1998), and related ideas appear in Meng and van Dyk (1997), Liu and Wu (1999), and Liu (2003).

Some references on variational Bayes include Jordan et al. (1999), Jaakkola and Jordan (2000), Blei, Ng, and Jordan (2003), and Gershman, Hoffman, and Blei (2012). Hoffman et al. (2012) present a stochastic variational algorithm that is computable for large datasets.

Expectation propagation comes from Minka (2001). This and other deterministic approximate Bayesian methods are reviewed by Bishop (2006) and Rasmussen and Williams (2006). Cseke and Heskes (2011) consider several methods to improve marginal posteriors obtained from Laplace’s method or expectation propagation. Rue, Martino, and Chopin (2009) describe integrated nested Laplace approximation and the CCD integration scheme, with more information in Rue (2013). Heskes et al. and Marin et al. (2012) review approximate Bayesian computation. Some work on substitution likelihoods appears in Dunson and Taylor (2005), Hoff (2007), and Murray et al. (2013).

Bridge sampling was introduced by Meng and Wong (1996). Gelman and Meng (1998) generalize from bridge sampling to path sampling and provide references to related work that has appeared in the statistical physics literature. Meng and Schilling (1996) provide an example in which several of these methods are applied to a problem in factor analysis. Kong et al. (2003) set up a general theoretical framework that includes importance sampling and bridge sampling as special cases.

The method of computing normalizing constants for statistical problems using importance sampling has been applied by Ott (1979) and others. Models with unknown normalizing functions arise often in spatial statistics; see, for example, Besag (1974) and Ripley (1981, 1988). Geyer (1991) and Geyer and Thompson (1992, 1993) develop the idea of estimating the normalizing function using simulations from the model and have applied these methods to problems in genetics. Pettitt, Friel, and Reeves (2003) use path sampling to estimate normalizing constants for a class of models in spatial statistics. Computing normalizing functions is an area of active current research, as more and more complicated Bayesian models are coming into use.

13.12   Exercises

1.  Multimodality: Consider a simple one-parameter model of independent data, yi ~ Cauchy(θ, 1), i = 1, …, n, with uniform prior density on θ. Suppose n = 2.

(a)   Prove that the posterior distribution is proper.

(b)   Under what conditions will the posterior density be unimodal?

2.  Normal approximation and importance resampling:

(a)   Repeat Exercise 3.12 using the normal approximation to produce posterior simulations for (α, β).

(b)   Use importance resampling to improve on the normal approximation.

(c)   Compute the importance ratios for your simulations. Plot a histogram of the importance ratios and comment on their distribution. Compute an estimate of effective sample size using (10.4) on page 266.

3.  Mode-based approximation: Consider the model, yj ~ Binomial(nj, θj), where θj = logit^−1(α + βxj), for j = 1, …, J, and with independent prior distributions, α ~ t4(0, 2²) and β ~ t4(0, 1). Suppose J = 10, the xj values are drawn at random from a U(0, 1) distribution, and nj ~ Poisson+(5), where Poisson+ is the Poisson distribution restricted to positive values.

(a)   Sample a dataset at random from the model

(b)   Use rejection sampling to get 1000 independent posterior draws from (α, β).

(c)   Approximate the posterior density for (α, β) by a normal centered at the posterior mode with covariance matrix fit to the curvature at the mode.

(d)   Take 1000 draws from the two-dimensional t4 distribution with that center and scale matrix and use importance sampling to estimate E(α|y) and E(β|y).

4.  Analytic approximation to a subset of the parameters: suppose that the joint posterior distribution p(θ1, θ2|y) is of interest and that it is known that the t distribution provides an adequate approximation to the conditional distribution, p(θ1|θ2, y). Show that both the normal and t approaches described in the last paragraph of Section 13.5 lead to the same answer.

5.  Estimating the number of unseen species (see Fisher, Corbet, and Williams, 1943, Efron and Thisted, 1976, and Seber, 1992): suppose that during an animal trapping expedition the number of times an animal from species i is caught is xi ~ Poisson(λi). For parts (a)–(d) of this problem, assume a Gamma(α, β) prior distribution for the λi’s, with a uniform hyperprior distribution on (α, β). The only observed data are yk, the number of species observed exactly k times during a trapping expedition, for k = 1, 2, 3, ….

(a)   Write the distribution p(xi|α, β).

(b)   Use the distribution of xi to derive a multinomial distribution for y given that there are a total of N species.

(c)   Suppose that we are given y = (118, 74, 44, 24, 29, 22, 20, 14, 20, 15, 12, 14, 6, 12, 6, 9, 9, 6, 10, 10, 11, 5, 3, 3), so that 118 species were observed only once, 74 species were observed twice, and so forth, with a total of 496 species observed and 3266 animals caught. Write down the likelihood for y using the multinomial distribution with 24 cells (ignoring unseen species). Use any method to find the mode of (α, β) and an approximate second derivative matrix.

(d)   Derive an estimate and approximate 95% posterior interval for the number of additional species that would be observed if 10,000 more animals were caught.

(e)   Evaluate the fit of the model to the data using appropriate posterior predictive checks.

(f)   Discuss the sensitivity of the inference in (d) to each of the model assumptions.

6.  Derivation of the monotone convergence of the EM algorithm: prove that the function Eold log p(γ|φ, y) in (13.5) is maximized at φ = φold. (Hint: express the expectation as an integral and apply Jensen’s inequality to the concave logarithm function.)

7.  Conditional maximization for the hierarchical normal model: show that the conditional modes of σ and τ associated with (11.14) and (11.16), respectively, are correct.

8.  Joint posterior modes for hierarchical models:

(a)   Show that the posterior density for the coagulation example from Table 11.2 on page 288 has a degenerate mode at τ = 0 and θj = μ for all j.

(b)   The rest of this exercise demonstrates that the degenerate mode represents a small part of the posterior distribution. First estimate an upper bound on the integral of the unnormalized posterior density in the neighborhood of the degenerate mode. (Approximate the integrand so that the integral is analytically tractable.)

(c)   Now approximate the integral of the unnormalized posterior density in the neighborhood of the other mode using the density at the mode and the second derivative matrix of the log posterior density at the mode.

(d)   Finally, estimate an upper bound on the posterior mass in the neighborhood of the degenerate mode.

9.  EM algorithm:

(a)   For the hierarchical normal model in Section 13.6, derive the expressions (13.14) for μnew, σnew, and τnew.

(b)   Pick values for the hyperparameters in this model, then simulate fake data, then apply EM to estimate the model. Compare the EM estimate to the assumed true model.

10.  Variational Bayes: Consider probit regression, which is just like logistic regression except that the function logit^−1 is replaced by the normal cumulative distribution function; that is, Pr(yi = 1) = Φ(a + bxi), for i = 1, …, n. Set up and program variational Bayes for a probit regression with two coefficients, using the latent-data formulation (so that zi ~ N(a + bxi, 1), and yi = 1 if zi > 0 and 0 otherwise):

(a)   Write the log posterior density, log p(a, b, z|y), up to an arbitrary additive constant.

(b)   Assuming a variational approximation g that is independent in its n + 2 dimensions, determine the functional form of each of the factors in g.

(c)   Write the steps of the variational Bayes algorithm and program them in R.

11.  Unknown normalizing functions: compute the normalizing factor for the following unnormalized sampling density,

as a function of A, B, C. (Hint: it will help to integrate out analytically as many of the parameters as you can.)

1 In the literature on this algorithm, the approximating function is typically called q. We use g for consistency with our general notation in which g is an approximating density, while we reserve q for unnormalized densities, as in Chapter 10.
