Chapter 3

Introduction to multiparameter models

Virtually every practical problem in statistics involves more than one unknown or unobservable quantity. It is in dealing with such problems that the simple conceptual framework of the Bayesian approach reveals its principal advantages over other methods of inference. Although a problem can include several parameters of interest, conclusions will often be drawn about one, or only a few, parameters at a time. In this case, the ultimate aim of a Bayesian analysis is to obtain the marginal posterior distribution of the particular parameters of interest. In principle, the route to achieving this aim is clear: we first require the joint posterior distribution of all unknowns, and then we integrate this distribution over the unknowns that are not of immediate interest to obtain the desired marginal distribution. Or equivalently, using simulation, we draw samples from the joint posterior distribution and then look at the parameters of interest and ignore the values of the other unknowns. In many problems there is no interest in making inferences about many of the unknown parameters, although they are required in order to construct a realistic model. Parameters of this kind are often called nuisance parameters. A classic example is the scale of the random errors in a measurement problem.

We begin this chapter with a general treatment of nuisance parameters and then cover the normal distribution with unknown mean and variance in Section 3.2. Sections 3.4 and 3.5 present inference for the multinomial and multivariate normal distributions—the simplest models for discrete and continuous multivariate data, respectively. The chapter concludes with an analysis of a nonconjugate logistic regression model, using numerical computation of the posterior density on a grid.

3.1   Averaging over ‘nuisance parameters’

To express the ideas of joint and marginal posterior distributions mathematically, suppose θ has two parts, each of which can be a vector, θ = (θ₁, θ₂), and further suppose that we are only interested (at least for the moment) in inference for θ₁, so θ₂ may be considered a ‘nuisance’ parameter. For instance, in the simple example,

y | μ, σ² ~ N(μ, σ²),

in which both μ and σ² are unknown, interest commonly centers on μ.

We seek the conditional distribution of the parameter of interest given the observed data; in this case, p(θ₁ | y). This is derived from the joint posterior density,

p(θ₁, θ₂ | y) ∝ p(y | θ₁, θ₂) p(θ₁, θ₂),

by averaging over θ₂:

p(θ₁ | y) = ∫ p(θ₁, θ₂ | y) dθ₂.

Alternatively, the joint posterior density can be factored to yield

p(θ₁ | y) = ∫ p(θ₁ | θ₂, y) p(θ₂ | y) dθ₂,    (3.1)

which shows that the posterior distribution of interest, p(θ₁ | y), is a mixture of the conditional posterior distributions given the nuisance parameter, θ₂, where p(θ₂ | y) is a weighting function for the different possible values of θ₂. The weights depend on the posterior density of θ₂ and thus on a combination of evidence from the data and the prior model. The averaging over nuisance parameters θ₂ can be interpreted generally; for example, θ₂ can include a discrete component representing different possible sub-models.

We rarely evaluate the integral (3.1) explicitly, but it suggests an important practical strategy for both constructing and computing with multiparameter models. Posterior distributions can be computed by marginal and conditional simulation, first drawing θ2 from its marginal posterior distribution and then θ1 from its conditional posterior distribution, given the drawn value of θ2. In this way the integration embodied in (3.1) is performed indirectly. A canonical example of this form of analysis is provided by the normal model with unknown mean and variance, to which we now turn.

3.2   Normal data with a noninformative prior distribution

As the prototype example of estimating the mean of a population from a sample, we consider a vector y of n independent observations from a univariate normal distribution, N(μ, σ2); the generalization to the multivariate normal distribution appears in Section 3.5. We begin by analyzing the model under a noninformative prior distribution, with the understanding that this is no more than a convenient assumption for the purposes of exposition and is easily extended to informative prior distributions.

A noninformative prior distribution

We saw in Chapter 2 that a sensible vague prior density for μ and σ, assuming prior independence of location and scale parameters, is uniform on (μ, log σ) or, equivalently,

p(μ, σ²) ∝ (σ²)^{−1}.

The joint posterior distribution, p(μ, σ² | y)

Under this conventional improper prior density, the joint posterior distribution is proportional to the likelihood function multiplied by the factor 1/σ²:

p(μ, σ² | y) ∝ σ^{−n−2} exp( −(1/(2σ²)) [ Σ_{i=1}^{n} (yᵢ − ȳ)² + n(ȳ − μ)² ] )
            = σ^{−n−2} exp( −(1/(2σ²)) [ (n−1)s² + n(ȳ − μ)² ] ),    (3.2)

where

s² = (1/(n−1)) Σ_{i=1}^{n} (yᵢ − ȳ)²

is the sample variance of the yᵢ’s. The sufficient statistics are ȳ and s².

The conditional posterior distribution, p(μ | σ², y)

In order to factor the joint posterior density as in (3.1), we consider first the conditional posterior density, p(μ | σ², y), and then the marginal posterior density, p(σ² | y). To determine the posterior distribution of μ, given σ², we simply use the result derived in Section 2.5 for the mean of a normal distribution with known variance and a uniform prior distribution:

μ | σ², y ~ N(ȳ, σ²/n).    (3.3)

The marginal posterior distribution, p(σ² | y)

To determine p(σ² | y), we must average the joint distribution (3.2) over μ:

p(σ² | y) ∝ ∫ σ^{−n−2} exp( −(1/(2σ²)) [ (n−1)s² + n(ȳ − μ)² ] ) dμ.

Integrating this expression over μ requires evaluating the integral of the factor exp( −n(ȳ − μ)²/(2σ²) ), which is a simple normal integral; thus,

p(σ² | y) ∝ σ^{−n−2} √(2πσ²/n) exp( −(n−1)s²/(2σ²) )
          ∝ (σ²)^{−(n+1)/2} exp( −(n−1)s²/(2σ²) ),    (3.4)

which is a scaled inverse-χ² density:

σ² | y ~ Inv-χ²(n−1, s²).    (3.5)

We have thus factored the joint posterior density (3.2) as the product of conditional and marginal posterior densities: p(μ, σ² | y) = p(μ | σ², y) p(σ² | y).

This marginal posterior distribution for σ² has a remarkable similarity to the analogous sampling theory result: conditional on σ² (and μ), the distribution of the appropriately scaled sufficient statistic, (n−1)s²/σ², is χ²_{n−1}. Considering our derivation of the reference prior distribution for the scale parameter in Section 2.8, however, this result is not surprising.

Sampling from the joint posterior distribution

It is easy to draw samples from the joint posterior distribution: first draw σ² from (3.5), then draw μ from (3.3). We also derive some analytical results for the posterior distribution, since this is one of the few multiparameter problems simple enough to solve in closed form.
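To make the two-step simulation concrete, here is a minimal sketch in Python; the function name and the illustrative values of n, ȳ, and s² are our own choices, not part of the text.

```python
import numpy as np

def draw_mu_sigma2(n, ybar, s2, size=1000, rng=None):
    """Draw (mu, sigma^2) from the joint posterior under the
    noninformative prior p(mu, sigma^2) proportional to 1/sigma^2."""
    rng = np.random.default_rng() if rng is None else rng
    # sigma^2 | y ~ Inv-chi^2(n-1, s^2): (n-1)s^2 divided by a chi^2_{n-1} draw
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=size)
    # mu | sigma^2, y ~ N(ybar, sigma^2/n)
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    return mu, sigma2

# illustrative values only
mu, sigma2 = draw_mu_sigma2(n=20, ybar=5.0, s2=4.0)
```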

Analytic form of the marginal posterior distribution of μ

The population mean, μ, is typically the estimand of interest, and so the objective of the Bayesian analysis is the marginal posterior distribution of μ, which can be obtained by integrating σ² out of the joint posterior distribution. The representation (3.1) shows that the posterior distribution of μ can be regarded as a mixture of normal distributions, mixed over the scaled inverse-χ² distribution for the variance, σ². We can derive the marginal posterior density for μ by integrating the joint posterior density over σ²:

p(μ | y) = ∫₀^∞ p(μ, σ² | y) dσ².

This integral can be evaluated using the substitution

z = A/(2σ²),   where A = (n−1)s² + n(μ − ȳ)²,

and recognizing that the result is an unnormalized gamma integral:

p(μ | y) ∝ A^{−n/2} ∫₀^∞ z^{(n−2)/2} exp(−z) dz
         ∝ [ (n−1)s² + n(μ − ȳ)² ]^{−n/2}
         ∝ [ 1 + n(μ − ȳ)²/((n−1)s²) ]^{−n/2}.

This is the t_{n−1}(ȳ, s²/n) density (see Appendix A).

To put it another way, we have shown that, under the noninformative uniform prior distribution on (μ, log σ), the posterior distribution of μ has the form

(μ − ȳ)/(s/√n) | y ~ t_{n−1},

where t_{n−1} denotes the standard t density (location 0, scale 1) with n−1 degrees of freedom. This marginal posterior distribution provides another interesting comparison with sampling theory. Under the sampling distribution, p(y | μ, σ²), the following relation holds:

(ȳ − μ)/(s/√n) | μ, σ² ~ t_{n−1}.

The sampling distribution of the pivotal quantity (ȳ − μ)/(s/√n) does not depend on the nuisance parameter σ², and its posterior distribution does not depend on data. In general, a pivotal quantity for the estimand is defined as a nontrivial function of the data and the estimand whose sampling distribution is independent of all parameters and data.

Posterior predictive distribution for a future observation

The posterior predictive distribution for a future observation, ỹ, can be written as a mixture, p(ỹ | y) = ∫∫ p(ỹ | μ, σ², y) p(μ, σ² | y) dμ dσ². The first of the two factors in the integral is just the normal distribution for the future observation given the values of (μ, σ²), and does not depend on y at all. To draw from the posterior predictive distribution, first draw μ, σ² from their joint posterior distribution and then simulate ỹ ~ N(μ, σ²).

In fact, the posterior predictive distribution of ỹ is a t distribution with location ȳ, scale (1 + 1/n)^{1/2} s, and n−1 degrees of freedom. This analytic form is obtained using the same techniques as in the derivation of the posterior distribution of μ. Specifically, the distribution can be obtained by integrating out the parameters μ, σ² according to their joint posterior distribution. We can identify the result more easily by noticing that the factorization p(ỹ | σ², y) = ∫ p(ỹ | μ, σ², y) p(μ | σ², y) dμ leads to ỹ | σ², y ~ N(ȳ, (1 + 1/n)σ²), which is the same, up to a changed scale factor, as the distribution of μ | σ², y.

Example. Estimating the speed of light

Simon Newcomb set up an experiment in 1882 to measure the speed of light. Newcomb measured the amount of time required for light to travel a distance of 7442 meters. A histogram of Newcomb’s 66 measurements is shown in Figure 3.1. There are two unusually low measurements and then a cluster of measurements that are approximately symmetrically distributed. We (inappropriately) apply the normal model, assuming that all 66 measurements are independent draws from a normal distribution with mean μ and variance σ². The main substantive goal is posterior inference for μ. The outlying measurements do not fit the normal model; we discuss Bayesian methods for measuring the lack of fit for these data in Section 6.3. The mean of the 66 measurements is ȳ = 26.2, and the sample standard deviation is s = 10.8. Assuming the noninformative prior distribution p(μ, σ²) ∝ (σ²)^{−1}, a 95% central posterior interval for μ is obtained from the t₆₅ marginal posterior distribution of μ as ȳ ± 1.997 s/√66 = [23.6, 28.8].

The posterior interval can also be obtained by simulation. Following the factorization of the posterior distribution given by (3.5) and (3.3), we first draw a random value of σ² ~ Inv-χ²(65, s²) as 65s² divided by a random draw from the χ²₆₅ distribution (see Appendix A). Then given this value of σ², we draw μ from its conditional posterior distribution, N(26.2, σ²/66). Based on 1000 simulated values of (μ, σ²), we estimate the posterior median of μ to be 26.2 and a 95% central posterior interval for μ to be [23.6, 28.9], close to the analytically calculated interval.
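A short Python sketch of this simulation, using only the sufficient statistics reported in the text (n = 66, ȳ = 26.2, s = 10.8); the seed is arbitrary and the exact interval endpoints will vary slightly from run to run.

```python
import numpy as np

rng = np.random.default_rng(3)   # seed chosen arbitrarily
n, ybar, s = 66, 26.2, 10.8      # sufficient statistics from the text

# sigma^2 ~ Inv-chi^2(65, s^2), then mu | sigma^2 ~ N(ybar, sigma^2/66)
sigma2 = (n - 1) * s**2 / rng.chisquare(n - 1, size=1000)
mu = rng.normal(ybar, np.sqrt(sigma2 / n))

print(np.median(mu))                     # roughly 26.2
print(np.percentile(mu, [2.5, 97.5]))    # roughly [23.6, 28.9]
# a posterior predictive draw for a single future measurement
y_tilde = rng.normal(mu, np.sqrt(sigma2))
```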

Incidentally, based on the currently accepted value of the speed of light, the ‘true value’ for μ in Newcomb’s experiment is 33.0, which falls outside our 95% interval. This reinforces the fact that posterior inferences are only as good as the model and the experiment that produced the data.

Figure 3.1  Histogram of Simon Newcomb’s measurements for estimating the speed of light, from Stigler (1977). The data are recorded as deviations from 24,800 nanoseconds.

3.3   Normal data with a conjugate prior distribution

A family of conjugate prior distributions

A first step toward a more general model is to assume a conjugate prior distribution for the two-parameter univariate normal sampling model in place of the noninformative prior distribution just considered. The form of the likelihood displayed in (3.2) and the subsequent discussion shows that the conjugate prior density must also have the product form p(σ²) p(μ | σ²), where the marginal distribution of σ² is scaled inverse-χ² and the conditional distribution of μ given σ² is normal (so that marginally μ has a t distribution). A convenient parameterization is given by the following specification:

μ | σ² ~ N(μ₀, σ²/κ₀)
σ² ~ Inv-χ²(ν₀, σ₀²),

which corresponds to the joint prior density

p(μ, σ²) ∝ σ^{−1} (σ²)^{−(ν₀/2+1)} exp( −(1/(2σ²)) [ ν₀σ₀² + κ₀(μ₀ − μ)² ] ).    (3.6)

We label this the N-Inv-χ²(μ₀, σ₀²/κ₀; ν₀, σ₀²) density; its four parameters can be identified as the location and scale of μ and the degrees of freedom and scale of σ², respectively.

The appearance of σ² in the conditional distribution of μ | σ² means that μ and σ² are necessarily dependent in their joint conjugate prior density: for example, if σ² is large, then a high-variance prior distribution is induced on μ. This dependence is notable, considering that conjugate prior distributions are used largely for convenience. Upon reflection, however, it often makes sense for the prior variance of the mean to be tied to σ², which is the sampling variance of the observation y. In this way, prior belief about μ is calibrated by the scale of measurement of y and is equivalent to κ₀ prior measurements on this scale.

The joint posterior distribution, p(μ, σ² | y)

Multiplying the prior density (3.6) by the normal likelihood yields the posterior density

p(μ, σ² | y) = N-Inv-χ²(μₙ, σₙ²/κₙ; νₙ, σₙ²),    (3.7)

where, after some algebra (see Exercise 3.9), it can be shown that

μₙ = (κ₀/(κ₀+n)) μ₀ + (n/(κ₀+n)) ȳ
κₙ = κ₀ + n
νₙ = ν₀ + n
νₙσₙ² = ν₀σ₀² + (n−1)s² + (κ₀n/(κ₀+n)) (ȳ − μ₀)².

The parameters of the posterior distribution combine the prior information and the information contained in the data. For example, μₙ is a weighted average of the prior mean and the sample mean, with weights determined by the relative precision of the two pieces of information. The posterior degrees of freedom, νₙ, is the prior degrees of freedom plus the sample size. The posterior sum of squares, νₙσₙ², combines the prior sum of squares, the sample sum of squares, and the additional uncertainty conveyed by the difference between the sample mean and the prior mean.

The conditional posterior distribution, p(μ | σ², y)

The conditional posterior density of μ, given σ², is proportional to the joint posterior density (3.7) with σ² held constant,

μ | σ², y ~ N(μₙ, σ²/κₙ) = N( ((κ₀/σ²)μ₀ + (n/σ²)ȳ) / (κ₀/σ² + n/σ²), 1/(κ₀/σ² + n/σ²) ),    (3.8)

which agrees, as it must, with the analysis in Section 2.5 of μ with σ considered fixed.

The marginal posterior distribution, p(σ² | y)

The marginal posterior density of σ², from (3.7), is scaled inverse-χ²:

σ² | y ~ Inv-χ²(νₙ, σₙ²).    (3.9)

Sampling from the joint posterior distribution

To sample from the joint posterior distribution, just as in the previous section, we first draw σ² from its marginal posterior distribution (3.9), then draw μ from its normal conditional posterior distribution (3.8), using the simulated value of σ².
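A minimal sketch of the conjugate update and simulation in Python; the data vector and prior values below are hypothetical, chosen only for illustration.

```python
import numpy as np

def draw_conjugate_normal(y, mu0, kappa0, nu0, sigma0_sq, size=1000, rng=None):
    """Draw (mu, sigma^2) from the N-Inv-chi^2 posterior, equations (3.7)-(3.9)."""
    rng = np.random.default_rng() if rng is None else rng
    n, ybar, s_sq = len(y), np.mean(y), np.var(y, ddof=1)
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    nu_sigma_sq_n = (nu0 * sigma0_sq + (n - 1) * s_sq
                     + kappa0 * n / kappa_n * (ybar - mu0) ** 2)
    # sigma^2 | y ~ Inv-chi^2(nu_n, sigma_n^2), then mu | sigma^2, y ~ N(mu_n, sigma^2/kappa_n)
    sigma2 = nu_sigma_sq_n / rng.chisquare(nu_n, size=size)
    mu = rng.normal(mu_n, np.sqrt(sigma2 / kappa_n))
    return mu, sigma2

# hypothetical data and prior values, for illustration only
y = np.array([9.3, 10.1, 11.4, 10.8, 9.9])
mu, sigma2 = draw_conjugate_normal(y, mu0=10.0, kappa0=1.0, nu0=1.0, sigma0_sq=1.0)
```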

Analytic form of the marginal posterior distribution of μ

Integration of the joint posterior density with respect to σ², in a precisely analogous way to that used in the previous section, shows that the marginal posterior density for μ is

p(μ | y) ∝ [ 1 + κₙ(μ − μₙ)²/(νₙσₙ²) ]^{−(νₙ+1)/2}
         = t_{νₙ}(μ | μₙ, σₙ²/κₙ).    (3.10)

3.4   Multinomial model for categorical data

The binomial distribution that was emphasized in Chapter 2 can be generalized to allow more than two possible outcomes. The multinomial sampling distribution is used to describe data for which each observation is one of k possible outcomes. If y is the vector of counts of the number of observations of each outcome, then

p(y | θ) ∝ ∏_{j=1}^{k} θⱼ^{yⱼ},

where the sum of the probabilities, Σ_{j=1}^{k} θⱼ, is 1. The distribution is typically thought of as implicitly conditioning on the number of observations, Σ_{j=1}^{k} yⱼ = n. The conjugate prior distribution is a multivariate generalization of the beta distribution known as the Dirichlet,

p(θ | α) ∝ ∏_{j=1}^{k} θⱼ^{αⱼ−1},

where the distribution is restricted to nonnegative θⱼ’s with Σ_{j=1}^{k} θⱼ = 1; see Appendix A for details. The resulting posterior distribution for the θⱼ’s is Dirichlet with parameters αⱼ + yⱼ.

The prior distribution is mathematically equivalent to a likelihood resulting from Σ_{j=1}^{k} αⱼ observations, with αⱼ observations of the jth outcome category. As in the binomial, there are several plausible noninformative Dirichlet prior distributions. A uniform density is obtained by setting αⱼ = 1 for all j; this distribution assigns equal density to any vector θ satisfying Σ_{j=1}^{k} θⱼ = 1. Setting αⱼ = 0 for all j results in an improper prior distribution that is uniform in the log(θⱼ)’s. The resulting posterior distribution is proper if there is at least one observation in each of the k categories, so that each component of y is positive. The bibliographic note at the end of this chapter points to other suggested noninformative prior distributions for the multinomial model.

Example. Pre-election polling

For a simple example of a multinomial model, we consider a sample survey question with three possible responses. In late October, 1988, a survey was conducted by CBS News of 1447 adults in the United States to find out their preferences in the upcoming presidential election. Out of 1447 persons, y1 = 727 supported George Bush, y2 = 583 supported Michael Dukakis, and y3 = 137 supported other candidates or expressed no opinion. Assuming no other information on the respondents, the 1447 observations are exchangeable. If we also assume simple random sampling (that is, 1447 names ‘drawn out of a hat’), then the data (y1, y2, y3) follow a multinomial distribution, with parameters (θ1, θ2, θ3), the proportions of Bush supporters, Dukakis supporters, and those with no opinion in the survey population. An estimand of interest is θ1 - θ2, the population difference in support for the two major candidates.

With a noninformative uniform prior distribution on θ, α1 = α2 = α3 = 1, the posterior distribution for (θ1, θ2, θ3) is Dirichlet(728, 584, 138). We could compute the posterior distribution of θ1 - θ2 by integration, but it is simpler just to draw 1000 points (θ1, θ2, θ3) from the posterior Dirichlet distribution and then compute θ1 - θ2 for each. The result is displayed in Figure 3.2. All of the 1000 simulations had θ1 > θ2; thus, the estimated posterior probability that Bush had more support than Dukakis in the survey population is over 99.9%.
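A minimal sketch of this simulation in Python; the seed is arbitrary, and the printed summaries will vary slightly across runs.

```python
import numpy as np

rng = np.random.default_rng(1988)           # seed chosen arbitrarily
alpha_post = np.array([728, 584, 138])      # Dirichlet posterior parameters from the text

theta = rng.dirichlet(alpha_post, size=1000)    # draws of (theta1, theta2, theta3)
diff = theta[:, 0] - theta[:, 1]                # theta1 - theta2 for each draw

print(np.mean(diff > 0))                        # estimated Pr(theta1 > theta2 | y)
print(np.percentile(diff, [2.5, 50, 97.5]))     # posterior summary of the difference
```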

In fact, the CBS survey does not use independent random sampling but rather uses a variant of a stratified sampling plan. We discuss an improved analysis of this survey, using some knowledge of the sampling scheme, in Section 8.3 (see Table 8.2 on page 207).

Figure 3.2  Histogram of values of (θ1 - θ2) for 1000 simulations from the posterior distribution for the election polling example.

In complicated problems—for example, analyzing the results of many survey questions simultaneously—the number of multinomial categories, and thus parameters, becomes so large that it is hard to usefully analyze a dataset of moderate size without additional structure in the model. Formally, additional information can enter the analysis through the prior distribution or the sampling model. An informative prior distribution might be used to improve inference in complicated problems, using the ideas of hierarchical modeling introduced in Chapter 5. Alternatively, loglinear models can be used to impose structure on multinomial parameters that result from cross-classifying several survey questions; Section 16.7 provides details and an example.

3.5   Multivariate normal model with known variance

Here we give a somewhat formal account of the distributional results of Bayesian inference for the parameters of a multivariate normal distribution. In many ways, these results parallel those already given for the univariate normal model, but there are some important new aspects that play a major role in the analysis of linear models, which is the central activity of much applied statistical work (see Chapters 5, 14, and 15). This section can be viewed at this point as reference material for future chapters.

Multivariate normal likelihood

The basic model to be discussed concerns an observable vector y of d components, with the multivariate normal distribution,

y ~ N(μ, Σ),

where μ is a (column) vector of length d and Σ is a d × d variance matrix, which is symmetric and positive definite. The likelihood function for a single observation is

p(y | μ, Σ) ∝ |Σ|^{−1/2} exp( −½ (y − μ)ᵀ Σ^{−1} (y − μ) ),

and for a sample of n independent and identically distributed observations, y₁, …, yₙ, is

p(y₁, …, yₙ | μ, Σ) ∝ |Σ|^{−n/2} exp( −½ Σ_{i=1}^{n} (yᵢ − μ)ᵀ Σ^{−1} (yᵢ − μ) )
                    = |Σ|^{−n/2} exp( −½ tr(Σ^{−1} S₀) ),    (3.11)

where S₀ is the matrix of ‘sums of squares’ relative to μ,

S₀ = Σ_{i=1}^{n} (yᵢ − μ)(yᵢ − μ)ᵀ.
Conjugate analysis

As with the univariate normal model, we analyze the multivariate normal model by first considering the case of known Σ.

Conjugate prior distribution for μ with known Σ. The log-likelihood is a quadratic form in μ, and therefore the conjugate prior distribution for μ is the multivariate normal distribution, which we parameterize as μ ~ N(μ₀, Λ₀).

Posterior distribution for μ with known Σ. The posterior distribution of μ is

p(μ | y, Σ) ∝ exp( −½ [ (μ − μ₀)ᵀ Λ₀^{−1} (μ − μ₀) + Σ_{i=1}^{n} (yᵢ − μ)ᵀ Σ^{−1} (yᵢ − μ) ] ),

which is an exponential of a quadratic form in μ. Completing the quadratic form and pulling out constant factors (see Exercise 3.13) gives

p(μ | y, Σ) = N(μ | μₙ, Λₙ),    (3.12)

where

μₙ = (Λ₀^{−1} + nΣ^{−1})^{−1} (Λ₀^{−1} μ₀ + nΣ^{−1} ȳ)   and   Λₙ^{−1} = Λ₀^{−1} + nΣ^{−1}.    (3.13)

These are similar to the results for the univariate normal model in Section 2.5, the posterior mean being a weighted average of the data and the prior mean, with weights given by the data and prior precision matrices, nΣ^{−1} and Λ₀^{−1}, respectively. The posterior precision is the sum of the prior and data precisions.
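The update (3.13) is a few lines of linear algebra; here is a minimal Python sketch, with a hypothetical two-dimensional example (the function name, prior values, and simulated data are ours).

```python
import numpy as np

def posterior_mu_known_sigma(y, mu0, Lambda0, Sigma):
    """Posterior mean and variance of mu given known Sigma, equations (3.12)-(3.13)."""
    y = np.asarray(y)                      # n x d data matrix
    n, ybar = y.shape[0], y.mean(axis=0)
    prec0 = np.linalg.inv(Lambda0)         # prior precision
    prec_data = n * np.linalg.inv(Sigma)   # data precision
    Lambda_n = np.linalg.inv(prec0 + prec_data)
    mu_n = Lambda_n @ (prec0 @ mu0 + prec_data @ ybar)
    return mu_n, Lambda_n

# hypothetical two-dimensional example
rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
y = rng.multivariate_normal([1.0, -1.0], Sigma, size=50)
mu_n, Lambda_n = posterior_mu_known_sigma(y, mu0=np.zeros(2), Lambda0=10 * np.eye(2), Sigma=Sigma)
```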

Posterior conditional and marginal distributions of subvectors of μ with known Σ. It follows from the properties of the multivariate normal distribution (see Appendix A) that the marginal posterior distribution of a subset of the parameters, μ⁽¹⁾ say, is also multivariate normal, with mean vector equal to the appropriate subvector of the posterior mean vector μₙ and variance matrix equal to the appropriate submatrix of Λₙ. Also, the conditional posterior distribution of a subset μ⁽¹⁾ given the values of a second subset μ⁽²⁾ is multivariate normal. If we write superscripts in parentheses to indicate appropriate subvectors and submatrices, then

μ⁽¹⁾ | μ⁽²⁾, y ~ N( μₙ⁽¹⁾ + β^{1|2}(μ⁽²⁾ − μₙ⁽²⁾), Λ^{1|2} ),

where the regression coefficients β^{1|2} and conditional variance matrix Λ^{1|2} are defined by

β^{1|2} = Λₙ^{(12)} (Λₙ^{(22)})^{−1},   Λ^{1|2} = Λₙ^{(11)} − Λₙ^{(12)} (Λₙ^{(22)})^{−1} Λₙ^{(21)}.
Posterior predictive distribution for new data. We now work out the analytic form of the posterior predictive distribution for a new observation ỹ ~ N(μ, Σ). As with the univariate normal, we first note that ỹ and μ have a joint normal posterior distribution, and so the marginal posterior distribution of ỹ is (multivariate) normal. We are still assuming the variance matrix Σ is known. As in the univariate case, we can determine the posterior mean and variance of ỹ using (2.7) and (2.8):

E(ỹ | y) = E( E(ỹ | μ, y) | y ) = E(μ | y) = μₙ,

and

var(ỹ | y) = E( var(ỹ | μ, y) | y ) + var( E(ỹ | μ, y) | y ) = Σ + Λₙ,

so that ỹ | y ~ N(μₙ, Σ + Λₙ).
To sample from the posterior distribution or the posterior predictive distribution, refer to Appendix A for a method of generating random draws from a multivariate normal distribution with specified mean and variance matrix.

Noninformative prior density for μ. A noninformative uniform prior density for μ is p(μ) ∝ constant, obtained in the limit as the prior precision tends to zero in the sense Λ₀^{−1} → 0; in the limit of infinite prior variance (zero prior precision), the prior mean is irrelevant. The posterior density is then proportional to the likelihood (3.11). This is a proper posterior distribution only if n ≥ d, that is, if the sample size is greater than or equal to the dimension of the multivariate normal; otherwise the matrix S₀ is not full rank. If n ≥ d, the posterior distribution for μ, given the uniform prior density, is μ | Σ, y ~ N(ȳ, Σ/n).

3.6   Multivariate normal with unknown mean and variance

Conjugate inverse-Wishart family of prior distributions

Recall that the conjugate distribution for the univariate normal with unknown mean and variance is the normal-inverse-χ² distribution (3.6). We can use the inverse-Wishart distribution, a multivariate generalization of the scaled inverse-χ², to describe the prior distribution of the matrix Σ. The conjugate prior distribution for (μ, Σ), the normal-inverse-Wishart, is conveniently parameterized in terms of hyperparameters (μ₀, Λ₀/κ₀; ν₀, Λ₀):

Σ ~ Inv-Wishart_{ν₀}(Λ₀^{−1})
μ | Σ ~ N(μ₀, Σ/κ₀),

which corresponds to the joint prior density

p(μ, Σ) ∝ |Σ|^{−((ν₀+d)/2 + 1)} exp( −½ tr(Λ₀ Σ^{−1}) − (κ₀/2)(μ − μ₀)ᵀ Σ^{−1} (μ − μ₀) ).

The parameters ν₀ and Λ₀ describe the degrees of freedom and the scale matrix for the inverse-Wishart distribution on Σ. The remaining parameters are the prior mean, μ₀, and the number of prior measurements, κ₀, on the Σ scale. Multiplying the prior density by the normal likelihood results in a posterior density of the same family with parameters

μₙ = (κ₀/(κ₀+n)) μ₀ + (n/(κ₀+n)) ȳ
κₙ = κ₀ + n
νₙ = ν₀ + n
Λₙ = Λ₀ + S + (κ₀n/(κ₀+n)) (ȳ − μ₀)(ȳ − μ₀)ᵀ,

where S is the sum of squares matrix about the sample mean,

S = Σ_{i=1}^{n} (yᵢ − ȳ)(yᵢ − ȳ)ᵀ.

Other results from the univariate normal easily generalize to the multivariate case. The marginal posterior distribution of μ is multivariate t_{νₙ−d+1}(μₙ, Λₙ/(κₙ(νₙ−d+1))). The posterior predictive distribution of a new observation ỹ is also multivariate t with an additional factor of κₙ + 1 in the numerator of the scale matrix. Samples from the joint posterior distribution of (μ, Σ) are easily obtained using the following procedure: first, draw Σ | y ~ Inv-Wishart_{νₙ}(Λₙ^{−1}), then draw μ | Σ, y ~ N(μₙ, Σ/κₙ). See Appendix A for drawing from inverse-Wishart and multivariate normal distributions. To draw from the posterior predictive distribution of a new observation, draw ỹ ~ N(μ, Σ), given the already drawn values of μ and Σ.
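A minimal Python sketch of this two-step draw using SciPy; the data, prior values, and function name are hypothetical, and we assume that SciPy's inverse-Wishart scale matrix Λₙ corresponds to the book's Inv-Wishart_{νₙ}(Λₙ⁻¹) parameterization.

```python
import numpy as np
from scipy.stats import invwishart

def draw_mu_Sigma(y, mu0, kappa0, nu0, Lambda0, size=1000, rng=None):
    """Draw (mu, Sigma) from the normal-inverse-Wishart posterior."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y)
    n, d = y.shape
    ybar = y.mean(axis=0)
    S = (y - ybar).T @ (y - ybar)
    kappa_n, nu_n = kappa0 + n, nu0 + n
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    Lambda_n = Lambda0 + S + kappa0 * n / kappa_n * np.outer(ybar - mu0, ybar - mu0)
    draws = []
    for _ in range(size):
        Sigma = invwishart.rvs(df=nu_n, scale=Lambda_n, random_state=rng)
        mu = rng.multivariate_normal(mu_n, Sigma / kappa_n)
        draws.append((mu, Sigma))
    return draws

# hypothetical bivariate data with a weak prior
rng = np.random.default_rng(7)
y = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=30)
draws = draw_mu_Sigma(y, mu0=np.zeros(2), kappa0=1.0, nu0=4.0, Lambda0=np.eye(2), size=100)
```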

Different noninformative prior distributions

Inverse-Wishart with d + 1 degrees of freedom. Setting Σ ~ Inv-Wishart_{d+1}(I) has the appealing feature that each of the correlations in Σ has, marginally, a uniform prior distribution. (The joint distribution is not uniform, however, because of the constraint that the correlation matrix be positive definite.)

Inverse-Wishart with d − 1 degrees of freedom. Another proposed noninformative prior distribution is the multivariate Jeffreys prior density,

p(μ, Σ) ∝ |Σ|^{−(d+1)/2},

which is the limit of the conjugate prior density as κ₀ → 0, ν₀ → −1, |Λ₀| → 0. The corresponding posterior distribution can be written as

Σ | y ~ Inv-Wishart_{n−1}(S^{−1})
μ | Σ, y ~ N(ȳ, Σ/n).

Results for the marginal distribution of μ and the posterior predictive distribution of ỹ, assuming that the posterior distribution is proper, follow from the previous paragraph. For example, the marginal posterior distribution of μ is multivariate t_{n−d}(ȳ, S/(n(n−d))).

Dose, xi (log g/ml) Number of animals, ni Number of deaths, yi
-0.86 5 0
-0.30 5 1
-0.05 5 3
0.73 5 5

Table 3.1: Bioassay data from Racine et al. (1986).

Scaled inverse-Wishart model

When modeling covariance matrices it can help to extend the inverse-Wishart model by multiplying by a set of scale parameters that can be modeled separately. This gives flexibility in modeling and allows one to set up a uniform or weak prior distribution on correlations without overly constraining the variance parameters. The scaled inverse-Wishart model for Σ has the form,

Σ = Diag(ξ) Σ_η Diag(ξ),

where Σ_η is given an inverse-Wishart prior distribution (one choice is Inv-Wishart_{d+1}(I), so that the marginal distributions of the correlations are uniform) and then the scale parameters ξ can be given weakly informative priors themselves. We discuss this further in Section 15.4 in the context of varying-intercept, varying-slope hierarchical regression models.
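A minimal sketch of one prior draw under this construction; the lognormal prior on the scale parameters ξ is an arbitrary weakly informative choice for illustration, not a recommendation from the text.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(42)
d = 3
# Sigma = Diag(xi) Sigma_eta Diag(xi)
Sigma_eta = invwishart.rvs(df=d + 1, scale=np.eye(d), random_state=rng)
xi = rng.lognormal(mean=0.0, sigma=1.0, size=d)   # separately modeled scale parameters
Sigma = np.diag(xi) @ Sigma_eta @ np.diag(xi)
```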

3.7   Example: analysis of a bioassay experiment

Beyond the normal distribution, few multiparameter sampling models allow simple explicit calculation of posterior distributions. Data analysis for such models is possible using the computational methods described in Part III of this book. Here we present an example of a nonconjugate model for a bioassay experiment, drawn from the literature on applied Bayesian statistics. The model is a two-parameter example from the broad class of generalized linear models to be considered more thoroughly in Chapter 16. We use a particularly simple simulation approach, approximating the posterior distribution by a discrete distribution supported on a two-dimensional grid of points, that provides sufficiently accurate inferences for this two-parameter example.

The scientific problem and the data

In the development of drugs and other chemical compounds, acute toxicity tests or bioassay experiments are commonly performed on animals. Such experiments proceed by administering various dose levels of the compound to batches of animals. The animals’ responses are typically characterized by a dichotomous outcome: for example, alive or dead, tumor or no tumor. An experiment of this kind gives rise to data of the form

where xi represents the ith of k dose levels (often measured on a logarithmic scale) given to ni animals, of which yi subsequently respond with positive outcome. An example of real data from such an experiment is shown in Table 3.1: twenty animals were tested, five at each of four dose levels.

Modeling the dose—response relation

Given what we have seen so far, we must model the outcomes of the five animals within each group i as exchangeable, and it seems reasonable to model them as independent with equal probabilities, which implies that the data points yᵢ are binomially distributed:

yᵢ | θᵢ ~ Bin(nᵢ, θᵢ),
where θi is the probability of death for animals given dose xi. (An example of a situation in which independence and the binomial model would not be appropriate is if the deaths were caused by a contagious disease.) For this experiment, it is also reasonable to treat the outcomes in the four groups as independent of each other, given the parameters θ1, …, θ4.

The simplest analysis would treat the four parameters θᵢ as exchangeable in their prior distribution, perhaps using a noninformative density such as p(θ₁, …, θ₄) ∝ 1, in which case the parameters θᵢ would have independent beta posterior distributions. The exchangeable prior model for the θᵢ parameters has a serious flaw, however; we know the dose level xᵢ for each group i, and one would expect the probability of death to vary systematically as a function of dose.

The simplest model of the dose-response relation—that is, the relation of θᵢ to xᵢ—is linear: θᵢ = α + βxᵢ. Unfortunately, this model has the flaw that at low or high doses, xᵢ approaches ±∞ (recall that the dose is measured on the log scale), whereas θᵢ, being a probability, must be constrained to lie between 0 and 1. The standard solution is to use a transformation of the θ’s, such as the logistic, in the dose-response relation:

logit(θᵢ) = α + βxᵢ,    (3.15)
where logit(θi) = log(θi/(1 - θi)) as defined in (1.10). This is called a logistic regression model.

The likelihood

Under the model (3.15), we can write the sampling distribution, or likelihood, for each group i in terms of the parameters α and β as

p(yᵢ | α, β, nᵢ, xᵢ) ∝ [logit^{−1}(α + βxᵢ)]^{yᵢ} [1 − logit^{−1}(α + βxᵢ)]^{nᵢ − yᵢ}.

The model is characterized by the parameters α and β, whose joint posterior distribution is

p(α, β | y, n, x) ∝ p(α, β | n, x) ∏_{i=1}^{k} p(yᵢ | α, β, nᵢ, xᵢ).    (3.16)
We consider the sample sizes ni and dose levels xi as fixed for this analysis and suppress the conditioning on (n, x) in subsequent notation.

The prior distribution

We present an analysis based on a prior distribution for (α, β) that is independent and locally uniform in the two parameters; that is, p(α, β) ∝ 1. In practice, we might use a uniform prior distribution if we really have no prior knowledge about the parameters, or if we want to present a simple analysis of this experiment alone. If the analysis using the noninformative prior distribution is insufficiently precise, we may consider using other sources of substantive information (for example, from other bioassay experiments) to construct an informative prior distribution.

Figure 3.3  (a) Contour plot for the posterior density of the parameters in the bioassay example. Contour lines are at 0.05, 0.15, …, 0.95 times the density at the mode. (b) Scatterplot of 1000 draws from the posterior distribution.

A rough estimate of the parameters

We will compute the joint posterior distribution (3.16) at a grid of points (α, β), but before doing so, it is a good idea to get a rough estimate of (α, β) so we know where to look. To obtain the rough estimate, we use existing software to perform a logistic regression; that is, finding the maximum likelihood estimate of (α, β) in (3.16) for the four data points in Table 3.1. The estimate is (α̂, β̂) = (0.8, 7.7), with standard errors of 1.0 and 4.9 for α and β, respectively.

Obtaining a contour plot of the joint posterior density

We are now ready to compute the posterior density at a grid of points (α, β). After some experimentation, we use the range (α, β) ∈ [-5, 10] × [-10, 40], which captures almost all the mass of the posterior distribution. The resulting contour plot appears in Figure 3.3a; a general justification for setting the lowest contour level at 0.05 for two-dimensional plots appears on page 85 in Section 4.1.

Sampling from the joint posterior distribution

Having computed the unnormalized posterior density at a grid of values that cover the effective range of (α, β), we can normalize by approximating the distribution as a step function over the grid and setting the total probability in the grid to 1. We sample 1000 random draws (α^s, β^s) from the posterior distribution using the following procedure.

1.  Compute the marginal posterior distribution of α by numerically summing over β in the discrete distribution computed on the grid of Figure 3.3a.

2.  For s = 1, …, 1000:

(a)   Draw α^s from the discretely computed p(α | y); this can be viewed as a discrete version of the inverse cdf method described in Section 1.9.

(b)   Draw β^s from the discrete conditional distribution, p(β | α, y), given the just-sampled value of α.

(c)   For each of the sampled α and β, add a uniform random jitter centered at zero with a width equal to the spacing of the sampling grid. This gives the simulation draws a continuous distribution.
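A minimal Python sketch of this grid computation and sampling, using the data of Table 3.1; the 200 × 200 grid resolution, the random seed, and the variable names are our choices, not specifications from the text.

```python
import numpy as np

# Bioassay data from Table 3.1
x = np.array([-0.86, -0.30, -0.05, 0.73])
n = np.array([5, 5, 5, 5])
y = np.array([0, 1, 3, 5])

# grid over the range used in the text; the 200x200 resolution is arbitrary
a_grid = np.linspace(-5, 10, 200)
b_grid = np.linspace(-10, 40, 200)
A, B = np.meshgrid(a_grid, b_grid, indexing="ij")

# log posterior on the grid (uniform prior), computed on the log scale for stability
eta = A[..., None] + B[..., None] * x                 # shape (200, 200, 4)
logp = np.sum(y * eta - n * np.log1p(np.exp(eta)), axis=-1)
post = np.exp(logp - logp.max())                      # subtract max before exponentiating
post /= post.sum()                                    # normalize over the grid

rng = np.random.default_rng(4)
p_alpha = post.sum(axis=1)                            # 1. marginal of alpha, summing over beta
idx_a = rng.choice(len(a_grid), size=1000, p=p_alpha) # 2(a). draw alpha indices
da, db = a_grid[1] - a_grid[0], b_grid[1] - b_grid[0]
draws = []
for ia in idx_a:
    p_b = post[ia] / post[ia].sum()                   # 2(b). beta | alpha on the grid
    ib = rng.choice(len(b_grid), p=p_b)
    draws.append((a_grid[ia] + rng.uniform(-da / 2, da / 2),   # 2(c). jitter within cells
                  b_grid[ib] + rng.uniform(-db / 2, db / 2)))
alpha, beta = np.array(draws).T
```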

Figure 3.4  Histogram of the draws from the posterior distribution of the LD50 (on the scale of log dose in g/ml) in the bioassay example, conditional on the parameter β being positive.

The 1000 draws (α^s, β^s) are displayed on a scatterplot in Figure 3.3b. The scale of the plot, which is the same as the scale of Figure 3.3a, has been set large enough that all the 1000 draws would fit on the graph.

There are a number of practical considerations when applying this two-dimensional grid approximation. There can be difficulty finding the correct location and scale for the grid points. A grid that is defined on too small an area may miss important features of the posterior distribution that fall outside the grid. A grid defined on a large area with wide intervals between points can miss important features that fall between the grid points. It is also important to avoid overflow and underflow operations when computing the posterior distribution. It is usually a good idea to compute the logarithm of the unnormalized posterior distribution and subtract off the maximum value before exponentiating. This creates an unnormalized discrete approximation with maximum value 1, which can then be normalized (by setting the total probability in the grid to 1).

The posterior distribution of the LD50

A parameter of common interest in bioassay studies is the LD50—the dose level at which the probability of death is 50%. In our logistic model, a 50% survival rate means

E(yᵢ/nᵢ) = logit^{−1}(α + βxᵢ) = 0.5;

thus, α + βxᵢ = logit(0.5) = 0, and the LD50 is xᵢ = −α/β. Computing the posterior distribution of any summaries in the Bayesian approach is straightforward, as discussed at the end of Section 1.9. Given what we have done so far, simulating the posterior distribution of the LD50 is trivial: we just compute −α/β for the 1000 draws of (α, β) pictured in Figure 3.3b.

Difficulties with the LD50 parameterization if the drug is beneficial. In the context of this example, LD50 is a meaningless concept if β ≤ 0, in which case increasing the dose does not cause the probability of death to increase. If we were certain that the drug could not cause the tumor rate to decrease, we should constrain the parameter space to exclude values of β less than 0. However, it seems more reasonable here to allow the possibility of β ≤ 0 and just note that LD50 is hard to interpret in this case.

We summarize the inference on the LD50 scale by reporting two results: (1) the posterior probability that β > 0—that is, that the drug is harmful—and (2) the posterior distribution for the LD50 conditional on β > 0. All of the 1000 simulation draws had positive values of β, so the posterior probability that β > 0 is roughly estimated to exceed 0.999. We compute the LD50 for the simulation draws with positive values of β (which happen to be all 1000 draws for this example); a histogram is displayed in Figure 3.4. This example illustrates that the marginal posterior mean is not always a good summary of inference about a parameter. We are not, in general, interested in the posterior mean of the LD50, because the posterior mean includes the cases in which the dose—response relation is negative.
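Continuing the sketch shown after the grid-sampling procedure above (and assuming the arrays alpha and beta produced there), the two summaries can be computed directly from the simulation draws:

```python
import numpy as np

pos = beta > 0                    # draws for which the drug is estimated to be harmful
print(pos.mean())                 # estimated Pr(beta > 0 | y)

ld50 = -alpha[pos] / beta[pos]    # LD50 = -alpha/beta, conditional on beta > 0
print(np.percentile(ld50, [2.5, 50, 97.5]))   # posterior summary of the LD50
```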

3.8   Summary of elementary modeling and computation

The lack of multiparameter models permitting easy calculation of posterior distributions is not a major practical handicap for three main reasons. First, when there are few parameters, posterior inference in nonconjugate multiparameter models can be obtained by simple simulation methods, as we have seen in the bioassay example. Second, sophisticated models can often be represented in a hierarchical or conditional manner, as we shall see in Chapter 5, for which effective computational strategies are available (as we discuss in general in Part III). Finally, as we discuss in Chapter 4, we can often apply a normal approximation to the posterior distribution, and therefore the conjugate structure of the normal model can play an important role in practice, well beyond its application to explicitly normal sampling models.

Our successful analysis of the bioassay example suggests the following strategy for computation of simple Bayesian posterior distributions. What follows is not truly a general approach, but it summarizes what we have done so far and foreshadows the general methods—based on successive approximations—presented in Part III.

1.  Write the likelihood part of the model, p(y | θ), ignoring any factors that are free of θ.

2.  Write the posterior density, p(θ | y) ∝ p(θ)p(y | θ). If prior information is well-formulated, include it in p(θ). Otherwise use a weakly informative prior distribution or temporarily set p(θ) ∝ constant, with the understanding that the prior density can be altered later to include additional information or structure.

3.  Create a crude estimate of the parameters, θ, for use as a starting point and a comparison to the computation in the next step.

4.  Draw simulations θ¹, …, θ^S from the posterior distribution. Use the sample draws to compute the posterior density of any functions of θ that may be of interest.

5.  If any predictive quantities, ỹ, are of interest, simulate ỹ¹, …, ỹ^S by drawing each ỹ^s from the sampling distribution conditional on the drawn value θ^s. In Chapter 6, we discuss how to use posterior simulations of θ and ỹ to check the fit of the model to data and substantive knowledge.

For nonconjugate models, step 4 above can be difficult. Various methods have been developed to draw posterior simulations in complicated models, as we discuss in Part III. Occasionally, high-dimensional problems can be solved by combining analytical and numerical simulation methods. If θ has only one or two components, it is possible to draw simulations by computing on a grid, as we illustrated in the previous section for the bioassay example.

3.9   Bibliographic note

Chapter 2 of Box and Tiao (1973) thoroughly treats the univariate and multivariate normal distribution problems and also some related problems such as estimating the difference between two means and the ratio between two variances. At the time that book was written, computer simulation methods were much less convenient than they are now, and so Box and Tiao, and other Bayesian authors of the period, restricted their attention to conjugate families and devoted much effort to deriving analytic forms of marginal posterior densities.

Table 3.2  Number of respondents in each preference category from ABC News pre- and post-debate surveys in 1988.

Many textbooks on multivariate analysis discuss the unique mathematical features of the multivariate normal distribution, such as the property that all marginal and conditional distributions of components of a multivariate normal vector are normal; for example, see Mardia, Kent, and Bibby (1979).

Simon Newcomb’s data, along with a discussion of his experiment, appear in Stigler (1977).

The multinomial model and corresponding informative and noninformative prior distributions are discussed by Good (1965) and Fienberg (1977); also see the bibliographic note on loglinear models at the end of Chapter 16.

The data and model for the bioassay example appear in Racine et al. (1986), an article that presents several examples of simple Bayesian analyses that have been useful in the pharmaceutical industry.

3.10   Exercises

1.  Binomial and multinomial models: suppose data (y₁, …, y_J) follow a multinomial distribution with parameters (θ₁, …, θ_J). Also suppose that θ = (θ₁, …, θ_J) has a Dirichlet prior distribution. Let α = θ₁/(θ₁ + θ₂).

(a)   Write the marginal posterior distribution for α.

(b)   Show that this distribution is identical to the posterior distribution for α obtained by treating y1 as an observation from the binomial distribution with probability α and sample size y1 + y2, ignoring the data y3, …, yJ.

This result justifies the application of the binomial distribution to multinomial problems when we are only interested in two of the categories; for example, see the next problem.

2.  Comparison of two multinomial observations: on September 25, 1988, the evening of a presidential campaign debate, ABC News conducted a survey of registered voters in the United States; 639 persons were polled before the debate, and 639 different persons were polled after. The results are displayed in Table 3.2. Assume the surveys are independent simple random samples from the population of registered voters. Model the data with two different multinomial distributions. For j = 1, 2, let αj be the proportion of voters who preferred Bush, out of those who had a preference for either Bush or Dukakis at the time of survey j. Plot a histogram of the posterior density for α2 - α1. What is the posterior probability that there was a shift toward Bush?

3.  Estimation from two independent experiments: an experiment was performed on the effects of magnetic fields on the flow of calcium out of chicken brains. Two groups of chickens were involved: a control group of 32 chickens and an exposed group of 36 chickens. One measurement was taken on each chicken, and the purpose of the experiment was to measure the average flow μc in untreated (control) chickens and the average flow μt in treated chickens. The 32 measurements on the control group had a sample mean of 1.013 and a sample standard deviation of 0.24. The 36 measurements on the treatment group had a sample mean of 1.173 and a sample standard deviation of 0.20.

(a)   Assuming the control measurements were taken at random from a normal distribution with mean μc and variance σc2, what is the posterior distribution of μc? Similarly, use the treatment group measurements to determine the marginal posterior distribution of μt. Assume a uniform prior distribution on (μc, μt, log σc, log σt).

(b)   What is the posterior distribution for the difference, μt - μc? To get this, you may sample from the independent t distributions you obtained in part (a) above. Plot a histogram of your samples and give an approximate 95% posterior interval for μt - μc.

The problem of estimating two normal means with unknown ratio of variances is called the Behrens—Fisher problem.

4.  Inference for a 2 × 2 table: an experiment was performed to estimate the effect of beta-blockers on mortality of cardiac patients. A group of patients were randomly assigned to treatment and control groups: out of 674 patients receiving the control, 39 died, and out of 680 receiving the treatment, 22 died. Assume that the outcomes are independent and binomially distributed, with probabilities of death of p0 and p1 under the control and treatment, respectively. We return to this example in Section 5.6.

(a)   Set up a noninformative prior distribution on (p0, p1) and obtain posterior simulations.

(b)   Summarize the posterior distribution for the odds ratio, (p1/(1 - p1))/(p0/(1 - p0)).

(c)   Discuss the sensitivity of your inference to your choice of noninformative prior density.

5.  Rounded data: it is a common problem for measurements to be observed in rounded form (for a review, see Heitjan, 1989). For a simple example, suppose we weigh an object five times and measure weights, rounded to the nearest pound, of 10, 10, 12, 11, 9. Assume the unrounded measurements are normally distributed with a noninformative prior distribution on the mean μ and variance σ2.

(a)   Give the posterior distribution for (μ, σ2) obtained by pretending that the observations are exact unrounded measurements.

(b)   Give the correct posterior distribution for (μ, σ2) treating the measurements as rounded.

(c)   How do the incorrect and correct posterior distributions differ? Compare means, variances, and contour plots.

(d)   Let z = (z1, …, z5) be the original, unrounded measurements corresponding to the five observations above. Draw simulations from the posterior distribution of z. Compute the posterior mean of (z1 - z2)2.

6.  Binomial with unknown probability and sample size: some of the difficulties with setting prior distributions in multiparameter models can be illustrated with the simple binomial distribution. Consider data y1, …, yn modeled as independent Bin(N, θ), with both N and θ unknown. Defining a convenient family of prior distributions on (N, θ) is difficult, partly because of the discreteness of N.

Raftery (1988) considers a hierarchical approach based on assigning the parameter N a Poisson distribution with unknown mean μ. To define a prior distribution on (θ, N), Raftery defines λ = μθ and specifies a prior distribution on (λ, θ). The prior distribution is specified in terms of λ rather than μ because ‘it would seem easier to formulate prior information about λ, the unconditional expectation of the observations, than about μ, the mean of the unobserved quantity N.’

(a)   A suggested noninformative prior distribution is p(λ, θ) ∝ λ-1. What is a motivation for this noninformative distribution? Is the distribution improper? Transform to determine p(N, θ).

(b)   The Bayesian method is illustrated on counts of waterbuck obtained by remote photography on five separate days in Kruger Park in South Africa. The counts were 53, 57, 66, 67, and 72. Perform the Bayesian analysis on these data and display a scatterplot of posterior simulations of (N, θ). What is the posterior probability that N > 100?

(c)   Why not simply use a Poisson with fixed μ as a prior distribution for N?

Type of street Bike route? Counts of bicycles/other vehicles
Residential yes 16/58, 9/90, 10/48, 13/57, 19/103, 20/57, 18/86, 17/112, 35/273, 55/64
Residential no 12/113, 1/18, 2/14, 4/44, 9/208, 7/67, 9/29, 8/154
Fairly busy yes 8/29, 35/415, 31/425, 19/42, 38/180, 47/675, 44/620, 44/437, 29/47, 18/462
Fairly busy no 10/557, 43/1258, 5/499, 14/601, 58/1163, 15/700, 0/90, 47/1093, 51/1459, 32/1086
Busy yes 60/1545, 51/1499, 58/1598, 59/503, 53/407, 68/1494, 68/1558, 60/1706, 71/476, 63/752
Busy no 8/1248, 9/1246, 6/1596, 9/1765, 19/1290, 61/2498, 31/2346, 75/3101, 14/1918, 25/2318

Table 3.3  Counts of bicycles and other vehicles in one hour in each of 10 city blocks in each of six categories. (The data for two of the residential blocks were lost.) For example, the first block had 16 bicycles and 58 other vehicles, the second had 9 bicycles and 90 other vehicles, and so on. Streets were classified as ‘residential,’ ‘fairly busy,’ or ‘busy’ before the data were gathered.

7.  Poisson and binomial distributions: a student sits on a street corner for an hour and records the number of bicycles b and the number of other vehicles v that go by. Two models are considered:

•  The outcomes b and v have independent Poisson distributions, with unknown means θb and θv.

•  The outcome b has a binomial distribution, with unknown probability p and sample size b + v.

Show that the two models have the same likelihood if we define p = θ_b/(θ_b + θ_v).

8.  Analysis of proportions: a survey was done of bicycle and other vehicular traffic in the neighborhood of the campus of the University of California, Berkeley, in the spring of 1993. Sixty city blocks were selected at random; each block was observed for one hour, and the numbers of bicycles and other vehicles traveling along that block were recorded. The sampling was stratified into six types of city blocks: busy, fairly busy, and residential streets, with and without bike routes, with ten blocks measured in each stratum. Table 3.3 displays the number of bicycles and other vehicles recorded in the study. For this problem, restrict your attention to the first four rows of the table: the data on residential streets.

(a)   Let y₁, …, y₁₀ and z₁, …, z₈ be the observed proportions of traffic that was on bicycles in the residential streets with bike lanes and with no bike lanes, respectively (so y1 = 16/(16 + 58) and z1 = 12/(12 + 113), for example). Set up a model so that the yi’s are independent and identically distributed given parameters θy and the zi’s are independent and identically distributed given parameters θz.

(b)   Set up a prior distribution that is independent in θy and θz.

(c)   Determine the posterior distribution for the parameters in your model and draw 1000 simulations from the posterior distribution. (Hint: θy and θz are independent in the posterior distribution, so they can be simulated independently.)

(d)   Let μy = E(yi | θy) be the mean of the distribution of the yi’s; μy will be a function of θy. Similarly, define μz. Using your posterior simulations from (c), plot a histogram of the posterior simulations of μy - μz, the expected difference in proportions in bicycle traffic on residential streets with and without bike lanes.

We return to this example in Exercise 5.13.

9.  Conjugate normal model: suppose y is an independent and identically distributed sample of size n from the distribution N(μ, σ²), where (μ, σ²) have the N-Inv-χ²(μ₀, σ₀²/κ₀; ν₀, σ₀²) prior distribution (that is, σ² ~ Inv-χ²(ν₀, σ₀²) and μ | σ² ~ N(μ₀, σ²/κ₀)). The posterior distribution, p(μ, σ² | y), is also normal-inverse-χ²; derive explicitly its parameters in terms of the prior parameters and the sufficient statistics of the data.

10.  Comparison of normal variances: for j = 1, 2, suppose that

y_{j1}, …, y_{jn_j} | μ_j, σ_j² ~ iid N(μ_j, σ_j²),   p(μ_j, σ_j²) ∝ σ_j^{−2},

and (μ₁, σ₁²) are independent of (μ₂, σ₂²) in the prior distribution. Show that the posterior distribution of (s₁²/σ₁²)/(s₂²/σ₂²) is F with (n₁ − 1) and (n₂ − 1) degrees of freedom. (Hint: to show the required form of the posterior density, you do not need to carry along all the normalizing constants.)

11.  Computation: in the bioassay example, replace the uniform prior density by a joint normal prior distribution on (α, β), with α ~ N(0, 2²), β ~ N(10, 10²), and corr(α, β) = 0.5.

(a)   Repeat all the computations and plots of Section 3.7 with this new prior distribution.

(b)   Check that your contour plot and scatterplot look like a compromise between the prior distribution and the likelihood (as displayed in Figure 3.3).

(c)   Discuss the effect of this hypothetical prior information on the conclusions in the applied context.

12.  Poisson regression model: expand the model of Exercise 2.13(a) by assuming that the number of fatal accidents in year t follows a Poisson distribution with mean α + βt. You will estimate α and β, following the example of the analysis in Section 3.7.

(a)   Discuss various choices for a ‘noninformative’ prior for (α, β). Choose one.

(b)   Discuss what would be a realistic informative prior distribution for (α, β). Sketch its contours and then put it aside. Do parts (c)—(h) of this problem using your noninformative prior distribution from (a).

(c)   Write the posterior density for (α, β). What are the sufficient statistics?

(d)   Check that the posterior density is proper.

(e)   Calculate crude estimates and uncertainties for (α, β) using linear regression.

(f)   Plot the contours and take 1000 draws from the joint posterior density of (α, β).

(g)   Using your samples of (α, β), plot a histogram of the posterior density for the expected number of fatal accidents in 1986, α + 1986β.

(h)   Create simulation draws and obtain a 95% predictive interval for the number of fatal accidents in 1986.

(i)   How does your hypothetical informative prior distribution in (b) differ from the posterior distribution in (f) and (g), obtained from the noninformative prior distribution and the data? If they disagree, discuss.

13.  Multivariate normal model: derive equations (3.13) by completing the square in vector-matrix notation.

14.  Improper prior and proper posterior distributions: prove that the posterior density (3.16) for the bioassay example has a finite integral over the range (α, β) ∈ (-∞, ∞) × (-∞, ∞).

15.  Joint distributions: The autoregressive time-series model y₁, y₂, … with mean level 0, autocorrelation 0.8, residual standard deviation 1, and normal errors can be written as y_t | y_{t−1} ~ N(0.8 y_{t−1}, 1) for all t.

(a)   Prove that the distribution of y_t, given the observations at all other integer time points t, depends only on y_{t−1} and y_{t+1}.

(b)   What is the distribution of y_t, given y_{t−1} and y_{t+1}?
