Appendix B

Outline of proofs of limit theorems

The basic result of large-sample Bayesian inference is that as more and more data arrive, the posterior distribution of the parameter vector approaches multivariate normal. If the likelihood model happens to be correct, then we can also prove that the limiting posterior distribution is centered at the true value of the parameter vector. In this appendix, we outline a proof of the main results. The practical relevance of the theorems is discussed in Chapter 4.

We derive the limiting posterior distribution in three steps. The first step is the convergence of the posterior distribution to a point, for a discrete parameter space. If the data truly come from the hypothesized family of probability models, the point of convergence will be the true value of the parameter. The second step applies the discrete result to regions in a continuous parameter space, to show that the mass of the continuous posterior distribution becomes concentrated in smaller and smaller neighborhoods of a particular value of the parameter. Finally, the third step of the proof shows the accuracy of the normal approximation in the vicinity of the posterior mode.

Mathematical framework

The key assumption for the results presented here is that the data are independent and identically distributed: we label the data as y = (y1, …, yn), with probability density f(y) = ∏i f(yi). We use the notation f(·) for the true distribution of the data, in contrast to p(y|θ), the distribution of our probability model. The data y may be discrete or continuous.

We are interested in a (possibly vector) parameter θ, defined on a space Θ, for which we have a prior distribution, p(θ), and a likelihood, p(y|θ) = ∏i p(yi|θ), which assumes the data are independent and identically distributed. As illustrated by the counterexamples discussed in Section 4.3, some conditions are required on the prior distribution and the likelihood, as well as on the space Θ, for the theorems to hold.

It is necessary to assume a true distribution for y, because the theorems only hold in probability; for almost every problem, it is possible to construct data sequences y for which the posterior distribution of θ will not have the desired limit. The theorems are of the form, ‘The posterior distribution of θ converges in probability (as n → ∞) to …’; the ‘probability’ is with respect to f(y), the true distribution of y.

We label θ0 as the value of θ that minimizes the Kullback-Leibler divergence KL(θ) of the distribution p(yi|θ) in the model relative to the true distribution, f(·). The Kullback-Leibler divergence is defined at any value of θ by

KL(θ) = E( log( f(yi) / p(yi|θ) ) ) = ∫ log( f(yi) / p(yi|θ) ) f(yi) dyi,   (B.1)

where the expectation averages over the true distribution of yi. This is a measure of ‘discrepancy’ between the model distribution p(yi|θ) and the true distribution f(yi), and θ0 may be thought of as the value of θ that minimizes this discrepancy. We assume that θ0 is the unique minimizer of KL(θ). It turns out that as n increases, the posterior distribution p(θ|y) becomes concentrated about θ0.

Suppose that the likelihood model is correct; that is, there is some true parameter value θ for which f(yi) ≡ p(yi|θ). In this case, it is easily shown via Jensen’s inequality that (B.1) is minimized at the true parameter value, which we can then label as θ0 without risk of confusion.
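As a small numerical illustration of (B.1) and of the Jensen’s-inequality argument, consider the following Python sketch. The Bernoulli model and the assumed true success probability of 0.3 are not part of the development above; they are chosen only so that KL(θ) can be evaluated in closed form over a grid, where it is zero at the true value and positive elsewhere.

# A minimal numerical sketch (illustrative assumptions): a Bernoulli model
# p(y|theta), true data distribution f = Bernoulli(0.3). Evaluate KL(theta)
# as in (B.1) on a grid and check that it is minimized at theta0 = 0.3.
import numpy as np

f = 0.3                                  # assumed true success probability
thetas = np.linspace(0.01, 0.99, 99)     # candidate model parameters

def kl(theta, f=f):
    # KL(theta) = sum over y in {0, 1} of f(y) * log( f(y) / p(y|theta) )
    return (f * np.log(f / theta)
            + (1 - f) * np.log((1 - f) / (1 - theta)))

kl_values = np.array([kl(t) for t in thetas])
theta0 = thetas[np.argmin(kl_values)]
print(theta0)    # ~0.30: the minimizer coincides with the true value
print(kl(f))     # 0.0: KL is zero at the true value, positive elsewhere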

Convergence of the posterior distribution for a discrete parameter space

Theorem. If the parameter space Θ is finite and Pr(θ = θ0) > 0, then Pr(θ = θ0|y) → 1 as n → ∞, where θ0 is the value of θ that minimizes the Kullback-Leibler divergence (B.1).

Proof. We will show that p(θ|y) → 0 as n → ∞ for all θ ≠ θ0. Consider the log posterior odds relative to θ0:

log( p(θ|y) / p(θ0|y) ) = log( p(θ) / p(θ0) ) + Σi log( p(yi|θ) / p(yi|θ0) ),   (B.2)

where the sum runs over i = 1, …, n.

The second term on the right is a sum of n independent identically distributed random variables, if θ and θ0 are considered fixed and the yi’s are random with distribution f. Each term in the summation has mean

E( log( p(yi|θ) / p(yi|θ0) ) ) = KL(θ0) − KL(θ),

which is zero if θ = θ0 and negative otherwise, as long as θ0 is the unique minimizer of KL(θ).

Thus, if θ ≠ θ0, the second term on the right of (B.2) is the sum of n independent identically distributed random variables with negative mean. By the law of large numbers, the sum approaches −∞ as n → ∞. As long as the first term on the right of (B.2) is finite (that is, as long as p(θ0) > 0), the whole expression approaches −∞ in the limit, so that p(θ|y)/p(θ0|y) → 0 and hence p(θ|y) → 0. Since all probabilities sum to 1, p(θ0|y) → 1.
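The behavior described in the proof can be simulated directly. The sketch below uses an assumed Bernoulli model with a three-point parameter space {0.2, 0.5, 0.8}, a uniform prior, and data generated with true value 0.5; none of these choices come from the text, but they show the posterior mass at θ0 = 0.5 approaching 1 as n grows.

# A simulation sketch of the discrete-parameter theorem (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
thetas = np.array([0.2, 0.5, 0.8])    # assumed finite parameter space
prior = np.array([1/3, 1/3, 1/3])     # uniform prior, so p(theta0) > 0
theta_true = 0.5                      # assumed true value

for n in [10, 100, 1000, 10000]:
    y = rng.binomial(1, theta_true, size=n)
    # log p(theta) + sum_i log p(y_i|theta), kept on the log scale for stability
    log_post = (np.log(prior)
                + y.sum() * np.log(thetas)
                + (n - y.sum()) * np.log(1 - thetas))
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    print(n, np.round(post, 4))       # posterior mass at theta = 0.5 tends to 1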

Convergence of the posterior distribution for a continuous parameter space

If θ has a continuous distribution, then p(θ0|y) is always zero for any finite sample, and so the above theorem cannot apply. We can, however, show that the posterior probability distribution of θ becomes more and more concentrated about θ0 as n → ∞. Define a neighborhood of θ0 as the open set of all points in Θ within a fixed nonzero distance of θ0.

Theorem. If θ is defined on a compact set Θ and A is a neighborhood of θ0 with nonzero prior probability, then Pr(θ ∈ A|y) → 1 as n → ∞, where θ0 is the value of θ that minimizes (B.1).

Proof. The theorem can be proved by placing a small neighborhood about each point in Θ, with A being the only neighborhood that includes θ0, and then covering Θ with a finite subset of these neighborhoods. Because Θ is compact, such a finite subcovering can always be obtained. The proof of the convergence of the posterior distribution to a point is then adapted to show that the posterior probability of every neighborhood in the subcovering other than A approaches zero as n → ∞, and thus Pr(θ ∈ A|y) → 1.
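For a concrete (and conjugate) illustration of the continuous-parameter statement, the sketch below assumes Bernoulli data with true value 0.5, a Beta(1, 1) prior, and the neighborhood A = (0.45, 0.55); the posterior probability of A tends to 1 as n increases. These modeling choices are illustrative assumptions, not part of the theorem.

# A sketch of Pr(theta in A | y) -> 1 for a conjugate Beta-Bernoulli example.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
theta_true, a, b = 0.5, 1.0, 1.0      # assumed true value and Beta(1, 1) prior

for n in [10, 100, 1000, 10000]:
    y = rng.binomial(1, theta_true, size=n)
    post = stats.beta(a + y.sum(), b + n - y.sum())   # exact conjugate posterior
    pr_A = post.cdf(0.55) - post.cdf(0.45)            # Pr(theta in A | y)
    print(n, round(pr_A, 4))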

Convergence of the posterior distribution to normality

We just showed that by increasing n, we can put as much of the mass of the posterior distribution as we like in any arbitrary neighborhood of θ0. Obtaining the limiting posterior distribution requires two more steps. The first is to show that the posterior mode is consistent; that is, that the mode of the posterior distribution falls within the neighborhood where almost all the mass lies. The second step is a normal approximation centered at the posterior mode.

Theorem. Under some regularity conditions (notably that θ0 not be on the boundary of Θ), as n → ∞, the posterior distribution of θ approaches normality with mean θ0 and variance (nJ(θ0))⁻¹, where θ0 is the value that minimizes the Kullback-Leibler divergence (B.1) and J is the Fisher information (2.20).

Proof. For convenience in exposition, we first derive the result for a scalar θ. Define θ̂ as the posterior mode. The proof of the consistency of the maximum likelihood estimate (see the bibliographic note at the end of this appendix) can be mimicked to show that θ̂ is also consistent; that is, θ̂ → θ0 as n → ∞.

Given the consistency of the posterior mode, we approximate the log posterior density by a Taylor expansion centered about θ̂, confident that (for large n) the neighborhood near θ̂ contains almost all the mass of the posterior distribution. The normal approximation for θ is a quadratic approximation for the log posterior density of θ, a form that we derive via a Taylor series expansion of log p(θ|y) centered at θ̂:

log p(θ|y) = log p(θ̂|y) + (1/2)(θ − θ̂)² [d²/dθ² log p(θ|y)]θ=θ̂ + ⋯
(The linear term in the expansion is zero because the log posterior density has zero derivative at its interior mode.)

Consider the above equation as a function of θ. The first term is a constant. The coefficient of the second term can be written as

[d²/dθ² log p(θ|y)]θ=θ̂ = [d²/dθ² log p(θ)]θ=θ̂ + Σi [d²/dθ² log p(yi|θ)]θ=θ̂,

which is a constant plus the sum of n independent identically distributed random variables with negative mean (once again, it is the yi’s that are considered random here). If f(y) ≡ p(y|θ0) for some θ0, then these terms each have mean −J(θ0). If the true data distribution f(y) is not in the model class, then each term has mean equal to the negative of the second derivative of the Kullback-Leibler divergence KL(θ) evaluated at θ = θ0, which is negative because θ0 is defined as the point at which KL(θ) is minimized. Thus, the coefficient of the second term in the Taylor expansion grows in absolute value with order n. A similar argument shows that the coefficients of the third- and higher-order terms grow no faster than order n.

We can now prove that the posterior distribution approaches normality. As n → ∞, the mass of the posterior distribution p(θ|y) becomes concentrated in smaller and smaller neighborhoods of θ0, and the distance |θ̂ − θ0| also approaches zero. Thus, in considering the Taylor expansion about the posterior mode, we can focus on smaller and smaller neighborhoods about θ̂. As |θ − θ̂| → 0, the third-order and succeeding terms of the Taylor expansion fade in importance relative to the quadratic term, so that the distance between the quadratic approximation and the log posterior density approaches 0, and the normal approximation becomes increasingly accurate.
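The scalar argument corresponds computationally to a Laplace-type approximation: locate the posterior mode, evaluate the negative second derivative of the log posterior density there, and use a normal density with that precision. The sketch below assumes a Bernoulli model with a uniform prior so that the exact posterior (a Beta density) is available for comparison; all specific values are illustrative assumptions.

# A sketch of the scalar normal approximation (illustrative assumptions).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, theta_true = 500, 0.3              # assumed sample size and true value
y = rng.binomial(1, theta_true, size=n)
s = y.sum()

def log_post(theta):
    # log p(theta|y) up to a constant, with a uniform prior on (0, 1)
    return s * np.log(theta) + (n - s) * np.log(1 - theta)

# Posterior mode and negative second derivative of the log posterior at the mode
mode = s / n
info = s / mode**2 + (n - s) / (1 - mode)**2     # -d^2/dtheta^2 log p(theta|y)
approx = stats.norm(loc=mode, scale=np.sqrt(1 / info))
exact = stats.beta(s + 1, n - s + 1)             # exact posterior, uniform prior

for t in [mode - 0.04, mode, mode + 0.04]:
    # exact and approximate posterior densities agree closely near the mode
    print(round(t, 3), round(exact.pdf(t), 2), round(approx.pdf(t), 2))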

Multivariate form

If θ is a vector, the Taylor expansion becomes

log p(θ|y) = log p(θ̂|y) + (1/2)(θ − θ̂)ᵀ [d²/dθ² log p(θ|y)]θ=θ̂ (θ − θ̂) + ⋯,

where the second derivative of the log posterior density is now a matrix whose expectation is the negative of a positive definite matrix; if f(y) ≡ p(y|θ0) for some θ0, this positive definite matrix is (to leading order in n) nJ(θ0), with J the Fisher information matrix (2.20).
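In the vector case the same computation uses the Hessian matrix of the log posterior density at the mode. The sketch below assumes a normal data model with parameters θ = (μ, log σ) and a flat prior, finds the mode numerically, and inverts a finite-difference Hessian of the negative log posterior to obtain the covariance matrix of the approximating multivariate normal; the model and the numerical-differencing details are illustrative assumptions.

# A multivariate sketch of the normal approximation (illustrative assumptions).
import numpy as np
from scipy import optimize

rng = np.random.default_rng(3)
y = rng.normal(5.0, 2.0, size=200)    # data from assumed true values (5, log 2)

def neg_log_post(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    # flat prior, so this is just the negative log likelihood
    return 0.5 * np.sum((y - mu) ** 2) / sigma**2 + len(y) * log_sigma

mode = optimize.minimize(neg_log_post, x0=np.array([0.0, 0.0])).x

# Hessian of the negative log posterior at the mode, by central differences
eps = 1e-4
H = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        e_i, e_j = np.eye(2)[i] * eps, np.eye(2)[j] * eps
        H[i, j] = (neg_log_post(mode + e_i + e_j) - neg_log_post(mode + e_i - e_j)
                   - neg_log_post(mode - e_i + e_j) + neg_log_post(mode - e_i - e_j)) / (4 * eps**2)

cov = np.linalg.inv(H)                # covariance of the approximating normal
print(mode)                           # roughly (5, log 2)
print(cov)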

B.1    Bibliographic note

The asymptotic normality of the posterior distribution was known to Laplace (1810) but first proved rigorously by Le Cam (1953); a general survey of previous and subsequent theoretical results in this area is given by Le Cam and Yang (1990). Like the central limit theorem for sums of random variables, the consistency and asymptotic normality of the posterior distribution also hold under far more general conditions than independent and identically distributed data. The key condition is that there be ‘replication’ at some level, as, for example, when the data form a time series whose correlations decay to zero.

The Kullback-Leibler divergence comes from Kullback and Leibler (1951). Chernoff (1972, Sections 6 and 9.4) has a clear presentation of consistency and limiting normality results for the maximum likelihood estimate. Both proofs can be adapted to the posterior distribution. DeGroot (1970, Chapter 10) derives the asymptotic distribution of the posterior in more detail; Shen and Wasserman (2001) provide more recent results in this area.
