Chapter 17

Models for robust inference

So far, we have relied primarily upon the normal, binomial, and Poisson distributions, and hierarchical combinations of these, for modeling data and parameters. The use of a limited class of distributions results, however, in a limited and potentially inappropriate class of inferences. Many problems fall outside the range of convenient models, and models should be chosen to fit the underlying science and data, not simply for their analytical or computational convenience. As illustrated in Chapter 5, often the most useful approach for creating realistic models is to work hierarchically, combining simple univariate models. If, for convenience, we use simplistic models, it is important to answer the following question: in what ways does the posterior inference depend on extreme data points and on unassessable model assumptions? We have already discussed, in Chapter 6, the latter part of this question, which is essentially the subject of sensitivity analysis; here we return to the topic in greater detail, using more advanced computational methods.

17.1   Aspects of robustness

Robustness of inferences to outliers

Models based on the normal distribution are notoriously ‘nonrobust’ to outliers, in the sense that a single aberrant data point can strongly affect the inference for all the parameters in the model, even those with little substantive connection to the outlying observation.

For example, in the educational testing example of Section 5.5, our estimates for the eight treatment effects were obtained by shifting the individual school means toward the grand mean (or, in other words, shifting toward the prior information that the true effects came from a common normal distribution), with the proportionate shifting for each school j determined only by its sampling error, σj, and the variation τ between school effects. Suppose that the observation for the eighth school in the study, y8 in Table 5.2 on page 120, had been 100 instead of 12, so that the eight observations were 28, 8, -3, 7, -1, 1, 18, and 100, with the same standard errors as reported in Table 5.2. If we were to apply the hierarchical normal model to this dataset, our posterior distribution would tell us that τ has a high value, and thus each estimate would be essentially equal to its observed effect yj; see equation (5.17) and Figure 5.6. But does this make sense in practice? After all, given these hypothetical observations, the eighth school would seem to have an extremely effective SAT coaching program, or maybe the 100 is just the result of a data recording error. In either case, it would not seem right for the single observation y8 to have such a strong influence on how we estimate θ1, …, θ7.

In the Bayesian framework, we can reduce the influence of the aberrant eighth observation by replacing the normal population model for the θj’s by a longer-tailed family of distributions, which allows for the possibility of extreme observations. By long-tailed, we mean a distribution with relatively high probability content far away from the center, where the scale of ‘far away’ is determined, for example, relative to the diameter of a region containing 50% of the probability in the distribution. Examples of long-tailed distributions include the family of t distributions, of which the most extreme case is the Cauchy or t1, and also (finite) mixture models, which generally use a simple distribution such as the normal for the bulk of values but allow a discrete probability of observations or parameter values from an alternative distribution that can have a different center and generally has a much larger spread. In the hypothetical modification of the SAT coaching example, performing an analysis using a long-tailed distribution for the θj’s would result in the observation 100 being interpreted as arising from an extreme draw from the long-tailed distribution rather than as evidence that the normal distribution of effects has a high variance. The resulting analysis would shrink the eighth observation somewhat toward the others, but not nearly as much (relative to its distance from the overall mean) as the first seven are shrunk toward each other. (Given this hypothetical dataset, the posterior probability Pr(θ8 > 100 | y) should presumably be somewhat less than 0.5, and this justifies some shrinkage.)

As our hypothetical example indicates, we do not have to abandon Bayesian principles to handle outliers. For example, a long-tailed model such as a Cauchy distribution or even a two-component mixture (see Exercise 17.1) is still an exchangeable prior distribution for (θ1, …, θ8), as is appropriate when there is no prior information distinguishing among the eight schools. The choice of exchangeable prior model affects the manner in which the estimates of the θj’s are shrunk, and we can thereby reduce the effect of an outlying observation without having to treat it in a fundamentally different way in the analysis. (This should not replace careful examination of the data and checking for possible recording errors in outlying values.) A distinction is sometimes made between methods that search for outliers—possibly to remove them from the analysis—and robust procedures that are invulnerable to outliers. In the Bayesian framework, the two approaches should not be distinguished. For instance, using mixture models (either finite mixture models as in Chapter 22 or overdispersed versions of standard models) not only results in categorizing extreme observations as arising from high-variance mixture components (rather than simply surprising ‘outliers’) but also implies that these points have less influence on inferences for estimands such as population means and medians.

Sensitivity analysis

In addition to compensating for outliers, robust models can be used to assess the sensitivity of posterior inferences to model assumptions. For example, one can use a robust model that applies the t in place of a normal distribution to assess sensitivity to the normal assumption by varying the degrees of freedom from large to small. As discussed in Chapter 6, the basic idea of sensitivity analysis is to try a variety of different distributions (for likelihood and prior models) and see how posterior inferences vary for estimands and predictive quantities of interest. Once samples have already been drawn from the posterior distribution under one model, it is often straightforward to draw from alternative models using importance resampling with enough accuracy to detect major differences in inferences between the models (see Section 17.3). If the posterior distribution of estimands of interest is highly sensitive to the model assumptions, iterative simulation methods might be required for more accurate computation.

In a sense, much of the analysis of the SAT coaching experiments in Section 5.5, especially Figures 5.6 and 5.7, is a sensitivity analysis, in which the parameter τ is allowed to vary from 0 to ∞. As discussed in Section 5.5, the observed data are actually consistent with the model of all equal effects (that is, τ = 0), but that model makes no substantive sense, so we fit the model allowing τ to be any positive value. The result is summarized in the marginal posterior distribution for τ (shown in Figure 5.5), which describes a range of values of τ that are supported by the data.

17.2 Overdispersed versions of standard probability models

Sometimes it will appear natural to use one of the standard models—binomial, normal, Poisson, exponential—except that the data are too dispersed. For example, the normal distribution should not be used to fit a large sample in which 10% of the points lie a distance more than 1.5 times the interquartile range away from the median. In the hypothetical example of the previous section we suggested that the prior or population model for the θj’s should have longer tails than the normal. For each of the standard models, there is in fact a natural extension in which a single parameter is added to allow for overdispersion. Each of the extended models has an interpretation as a mixture distribution.

A feature of all these distributions is that they can never be underdispersed. This makes sense in light of formulas (2.7) and (2.8) and the mixture interpretations: the mean of the generalized distribution is equal to that of the underlying family, but the variance is higher. If the data are believed to be underdispersed relative to the standard distribution, different models should be used.

The t distribution in place of the normal

The t distribution has a longer tail than the normal and can be used for accommodating (1) occasional unusual observations in a data distribution or (2) occasional extreme parameters in a prior distribution or hierarchical model. The t family of distributions—tν(μ, σ2)—is characterized by three parameters: center μ, scale σ, and a ‘degrees of freedom’ parameter ν that determines the shape of the distribution. The t densities are symmetric, and ν must fall in the range (0, ∞). At ν = 1, the t is equivalent to the Cauchy distribution, which is so long-tailed it has infinite mean and variance, and as ν → ∞, the t approaches the normal distribution. If the t distribution is part of a probability model attempting accurately to fit a long-tailed distribution, based on a reasonably large quantity of data, then it is generally appropriate to include the degrees of freedom as an unknown parameter. In applications for which the t is chosen simply as a robust alternative to the normal, the degrees of freedom can be fixed at a small value to allow for outliers, but no smaller than prior understanding dictates. For example, t’s with one or two degrees of freedom have infinite variance and are not usually realistic in the far tails.

Mixture interpretation. Recall from Sections 3.2 and 12.1 that the tν(μ, σ2) distribution can be interpreted as a mixture of normal distributions with a common mean and variances distributed as scaled inverse-χ2. For example, the model yi ~ tν(μ, σ2) is equivalent to

yi | Vi ~ N(μ, Vi)
Vi ~ Inv-χ2(ν, σ2),        (17.1)

an expression we have already introduced as (12.1) on page 294 to illustrate the computational methods of auxiliary variables and parameter expansion. Statistically, the observations with high variance can be considered the outliers in the distribution. A similar interpretation holds when modeling exchangeable parameters θj.
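
As a quick numerical check of this equivalence (our own illustration, not part of the original text; the values of ν, μ, σ and all variable names are arbitrary), the following short simulation draws from the mixture representation and directly from the t distribution and compares quantiles:

import numpy as np

rng = np.random.default_rng(1)
nu, mu, sigma = 4.0, 5.0, 2.0
n = 200000

# Mixture construction: V_i ~ Inv-chi^2(nu, sigma^2), then y_i | V_i ~ N(mu, V_i)
V = nu * sigma**2 / rng.chisquare(nu, size=n)
y_mix = rng.normal(mu, np.sqrt(V))

# Direct construction: y_i = mu + sigma * (standard t_nu draw)
y_t = mu + sigma * rng.standard_t(nu, size=n)

# The two sets of draws should have essentially the same quantiles
print(np.percentile(y_mix, [5, 25, 50, 75, 95]))
print(np.percentile(y_t, [5, 25, 50, 75, 95]))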

Negative binomial alternative to Poisson

A common difficulty in applying the Poisson model to data is that the Poisson model requires that the variance equal the mean; in practice, distributions of counts often are overdispersed, with variance greater than the mean. We have already discussed overdispersion in the context of generalized linear models (see Section 16.1), and Section 16.4 gives an example of a hierarchical normal model for overdispersed Poisson regression.

Another way to model overdispersed count data is using the negative binomial distribution, a two-parameter family that allows the mean and variance to be fitted separately, with variance at least as great as the mean. Data y1, …, yn that follow a Neg-bin(α, β) distribution can be thought of as Poisson observations with means λ1, …, λn, which follow a Gamma(α, β) distribution. The variance of the negative binomial distribution is (α/β)(β + 1)/β, which is always greater than the mean, α/β, in contrast to the Poisson, whose variance is always equal to its mean. In the limit as β → ∞ with α/β remaining constant, the underlying gamma distribution approaches a spike, and the negative binomial distribution approaches the Poisson.
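
As a small check of the mixture interpretation and of these moment formulas (our own illustration; the values of α and β are arbitrary), one can simulate the gamma-Poisson mixture directly:

import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 3.0, 0.5
n = 500000

lam = rng.gamma(shape=alpha, scale=1/beta, size=n)  # lambda_i ~ Gamma(alpha, beta), beta = rate
y = rng.poisson(lam)                                # y_i | lambda_i ~ Poisson(lambda_i)

print(y.mean(), alpha/beta)                         # mean of Neg-bin(alpha, beta) is alpha/beta
print(y.var(), (alpha/beta)*(beta + 1)/beta)        # variance is (alpha/beta)(beta+1)/beta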

Beta-binomial alternative to binomial

Similarly, the binomial model for discrete data has the practical limitation of having only one free parameter, which means the variance is determined by the mean. A standard robust alternative is the beta-binomial distribution, which, as the name suggests, is a beta mixture of binomials. The beta-binomial is used, for example, to model educational testing data, where a ‘success’ is a correct response, and individuals vary greatly in their probabilities of getting a correct response. Here, the data yi—the number of correct responses for each individual i = 1, …, n—are modeled with a Beta-bin(m, α, β) distribution and are thought of as binomial observations with a common number of trials m and unequal probabilities π1, …, πn that follow a Beta(α, β) distribution. The variance of the beta-binomial with mean probability α/(α + β) is greater by a factor of (α + β + m)/(α + β + 1) than that of the binomial with the same probability; see Table A.1 in Appendix A. When m = 1, no information is available to distinguish between the beta and binomial variation, and the two models have equal variances.
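
A short simulation (ours; the values of α, β, and m are arbitrary) confirms the variance-inflation factor by drawing from the beta-binomial as a beta mixture of binomials:

import numpy as np

rng = np.random.default_rng(3)
alpha, beta, m = 4.0, 6.0, 20
n = 500000

pi = rng.beta(alpha, beta, size=n)     # pi_i ~ Beta(alpha, beta)
y = rng.binomial(m, pi)                # y_i | pi_i ~ Bin(m, pi_i)

p = alpha / (alpha + beta)             # mean probability
binom_var = m * p * (1 - p)            # variance of Bin(m, p)
factor = (alpha + beta + m) / (alpha + beta + 1)
print(y.var(), binom_var * factor)     # the simulated variance should match binom_var * factor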

The t distribution alternative to logistic and probit regression

Logistic and probit regressions can be nonrobust in the sense that for large absolute values of the linear predictor Xβ, the inverse logit or probit transformations give probabilities close to 0 or 1. Such models could be made more robust by allowing the occasional misprediction for large values of Xβ. This form of robustness is defined not in terms of the data y—which equal 0 or 1 in binary regression—but with respect to the predictors X. A more robust model allows the discrete regression model to fit most of the data while occasionally making isolated errors.

A robust model, robit regression, can be implemented using the latent-variable formulation of discrete-data regression models (see page 408), replacing the logistic or normal distribution of the latent continuous data u with the t model ui ~ tν((Xβ)i, 1). In realistic settings it is impractical to estimate ν from the data—since the latent data ui are never directly observed, it is essentially impossible to form inference about the shape of their continuous underlying distribution—so it is set at a low value to ensure robustness. Setting ν = 7 yields a distribution that is close to the logistic, and as ν → ∞, the model approaches the probit. Computation for the binary t regression can be performed using the EM algorithm and Gibbs sampler with the normal-mixture formulation (17.1) for the t distribution of the latent data u. In that approach, ui and the variance of each ui are treated as missing data.
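
The following sketch (our own, not from the text; the grid of linear-predictor values and the choice ν = 4 are arbitrary) shows how the robit probabilities follow from the latent-variable formulation, Pr(yi = 1) = Pr(ui > 0) = Fν((Xβ)i), where Fν is the tν cumulative distribution function, and compares them with the probit:

import numpy as np
from scipy import stats

nu = 4                              # fixed low degrees of freedom for robustness
eta = np.linspace(-6, 6, 7)         # illustrative values of the linear predictor X*beta
p_probit = stats.norm.cdf(eta)      # probit: Pr(y=1) = Phi(eta)
p_robit = stats.t.cdf(eta, df=nu)   # robit: Pr(y=1) = Pr(t_nu > -eta) = F_nu(eta)

# For large |eta| the robit probabilities approach 0 and 1 much more slowly than the
# probit, so an occasional misclassified observation with extreme X*beta is less influential.
print(np.column_stack([eta, p_probit, p_robit]))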

Why ever use a nonrobust model?

The t family includes the normal as a special case, so why do we ever use the normal at all, or the binomial, Poisson, or other standard models? To start with, each of the standard models has a logical status that makes it plausible for many applied problems. The binomial and multinomial distributions apply to discrete counts for independent, identically distributed outcomes with a fixed total number of counts. The Poisson and exponential distributions fit the number of events and the waiting time for a Poisson process, which is a natural model for independent discrete events indexed by time. Finally, the central limit theorem tells us that the normal distribution is an appropriate model for data that are formed as the sum of a large number of independent components. In the educational testing example in Section 5.5, each of the observed effects, yj, is an average of adjusted test scores with nj ≈ 60 (that is, the estimated treatment effect is based on about 60 students in school j). We can thus accurately approximate the sampling distribution of yj by normality: yj | θj ~ N(θj, σj2).

Even when they are not naturally implied by the structure of a problem, the standard models are computationally convenient, since conjugate prior distributions often allow direct calculation of posterior means and variances and easy simulation. That is why it is easy to fit a normal population model to the θj’s in the educational testing example and why it is common to fit a normal model to the logarithm of all-positive data or the logit of data that are constrained to lie between 0 and 1. When a model is assigned in this more or less arbitrary manner, it is advisable to check the fit of the data using the posterior predictive distribution, as discussed in Chapter 6. But if we are worried that an assumed model is not robust, then it makes sense to perform a sensitivity analysis and see how much the posterior inference changes if we switch to a larger family of distributions, such as the t distributions in place of the normal.

17.3   Posterior inference and computation

As always, we can draw samples from the posterior distribution (or distributions, in the case of sensitivity analysis) using the methods described in Part III. In this section, we briefly describe the use of Gibbs sampling under the mixture formulation of a robust model. The approach is illustrated for a hierarchical normal-t model in Section 17.4. When expanding a model, however, we have the possibility of a less time-consuming approximation as an alternative: we can use the draws from the original posterior distribution as a starting point for simulations from the new models. In this section, we also describe two techniques that can be useful for robust models and sensitivity analysis: importance weighting for computing the marginal posterior density in a sensitivity analysis, and importance resampling (Section 10.4) for approximating a robust analysis.

Notation for robust model as expansion of a simpler model

We use the notation p0(θ | y) for the posterior distribution from the original model, which we assume has already been fitted to the data, and φ for the hyperparameter(s) characterizing the expanded model used for robustness or sensitivity analysis. Our goal is to sample from

p(θ | φ, y) ∝ p(θ | φ) p(y | θ, φ),        (17.2)

using either a pre-specified value of φ (such as ν = 4 for a robust t model) or for a range of values of φ. In the latter case, we also wish to compute the marginal posterior distribution of the sensitivity analysis parameter, p(φ | y).

The robust family of distributions can enter the model (17.2) through the distribution of the parameters, p(θ | φ), or the data distribution, p(y | θ, φ). For example, Section 17.2 focuses on robust data distributions, and our reanalysis of the SAT coaching experiments in Section 17.4 uses a robust distribution for model parameters. We must then set up a joint prior distribution, p(θ, φ), which can require some care because it captures the prior dependence between θ and φ.

Gibbs sampling using the mixture formulation

Markov chain simulation can be used to draw from the posterior distributions, p(θ | φ, y). This can be done using the mixture formulation, by sampling from the joint posterior distribution of θ and the extra unobserved scale parameters (Vi’s in the t model, λi’s in the negative binomial, and πi’s in the beta-binomial).

For a simple example, consider the tν(μ, σ2) distribution fitted to data y1, …, yn, with μ and σ unknown. Given ν, we have already discussed in Section 12.1 how to program the Gibbs sampler in terms of the parameterization (17.1) involving μ, σ2, V1, …, Vn. If ν is itself unknown, the Gibbs sampler must be expanded to include a step for sampling from the conditional posterior distribution of ν. No simple method exists for this step, but a Metropolis step can be used instead. Another complication is that such models commonly have multimodal posterior densities, with different modes corresponding to different observations in the tails of the t distributions, meaning that additional work is required to search for modes initially and jump between modes in the simulation, for example using simulated tempering (see Section 12.3).
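
As a concrete illustration of these steps with ν held fixed, here is a minimal sketch of the Gibbs sampler (our own, not the book's code; it assumes the noninformative prior p(μ, log σ) ∝ 1 and uses the conditional distributions from Section 12.1; the data, starting values, and iteration counts are arbitrary):

import numpy as np

def t_gibbs(y, nu, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    mu, sigma2 = np.mean(y), np.var(y)                # crude starting values
    draws = np.empty((n_iter, 2))
    for s in range(n_iter):
        # V_i | mu, sigma^2, y ~ Inv-chi^2(nu + 1, (nu*sigma^2 + (y_i - mu)^2)/(nu + 1))
        V = (nu * sigma2 + (y - mu)**2) / rng.chisquare(nu + 1, size=n)
        # mu | V, sigma^2, y ~ N(weighted mean of y, 1/sum(1/V))
        w = 1.0 / V
        mu = rng.normal(np.sum(w * y) / np.sum(w), np.sqrt(1.0 / np.sum(w)))
        # sigma^2 | V, mu, y ~ Gamma(shape = n*nu/2, rate = (nu/2)*sum(1/V))
        sigma2 = rng.gamma(shape=n * nu / 2, scale=2.0 / (nu * np.sum(w)))
        draws[s] = (mu, np.sqrt(sigma2))
    return draws                                      # columns: mu, sigma

# Illustration with simulated data; in practice run several chains and discard warmup.
y = 5 + 2 * np.random.default_rng(1).standard_t(4, size=100)
print(t_gibbs(y, nu=4)[1000:].mean(axis=0))           # should be near (5, 2)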

Sampling from the posterior predictive distribution for new data

To perform sensitivity analysis and robust inference for predictions ỹ, follow the usual procedure of first drawing θ from the posterior distribution, p(θ | φ, y), and then drawing ỹ from the predictive distribution, p(ỹ | θ, φ). To simulate data from a mixture model, first draw the mixture indicators for each future observation, then draw ỹ given the mixture parameters. For example, to draw ỹ from a tν(μ, σ2) distribution, first draw Ṽ ~ Inv-χ2(ν, σ2), then draw ỹ ~ N(μ, Ṽ).

Computing the marginal posterior distribution of the hyperparameters by importance weighting

During a check for model robustness or sensitivity to assumptions, we might like to avoid the additional programming effort required to apply Markov chain simulation to a robust model. If we have simulated draws from p0(θ | y), then it is possible to obtain approximate inference under the robust model using importance weighting and importance resampling. We assume in the remainder of this section that simulation draws θs, s = 1, …, S, have already been obtained from p0(θ | y). We can use importance weighting to evaluate the marginal posterior distribution, p(φ | y), using identity (13.11) on page 326, which in our current notation becomes

p(φ | y) ∝ ∫ p(θ, φ) p(y | θ, φ) dθ
         ∝ ∫ [ p(θ, φ) p(y | θ, φ) / (p0(θ) p(y | θ)) ] p0(θ | y) dθ.        (17.3)

In the first line above, the proportionality constant is 1/p(y), whereas in the second it is p0(y)/p(y). For any φ, the value of p(φ | y), up to a constant factor, can be estimated by the average importance ratio for the simulations θs,

(1/S) Σs=1,…,S  [ p(θs, φ) p(y | θs, φ) ] / [ p0(θs) p(y | θs) ],

which can be evaluated, using a fixed set of S simulations, at each of a range of values of φ, and then graphed as a function of φ.
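
In code, this estimate is just the average of the importance ratios, most safely computed on the log scale. The sketch below is our own illustration (not the book's implementation): log_ratio is a user-supplied function returning log p(θ, φ) + log p(y | θ, φ) − log p0(θ) − log p(y | θ), up to an additive constant that does not depend on φ.

import numpy as np

def marginal_posterior(theta_draws, phi_grid, log_ratio):
    # Average importance ratio at each value of phi, computed on the log scale.
    log_p = np.empty(len(phi_grid))
    for k, phi in enumerate(phi_grid):
        lr = np.array([log_ratio(theta, phi) for theta in theta_draws])
        m = lr.max()
        log_p[k] = m + np.log(np.mean(np.exp(lr - m)))
    p = np.exp(log_p - log_p.max())
    return p / p.sum()     # normalized over the grid, suitable for plotting against phi

In the eight-schools reanalysis of Section 17.4, for example, log_ratio would reduce to the sum over schools of log tν(θj | μ, τ) − log N(θj | μ, τ), because the likelihood and hyperprior factors cancel.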

Approximating the robust posterior distributions by importance resampling

To perform importance resampling, it is best to start with a large number of draws, say S = 5000, from the original posterior distribution, p0(θ | y). Now, for each distribution in the expanded family indexed by φ, draw a smaller subsample, say k = 500, from the S draws, without replacement, using importance resampling, in which each of the k samples is drawn with probability proportional to its importance ratio,

p(θs | φ, y) / p0(θs | y).

A new set of subsamples must be drawn for each value of φ, but the same set of S original draws may be used. Details are given in Section 10.4. This procedure is effective as long as the largest importance ratios are plentiful and not too variable; if they do vary greatly, this is an indication of potential sensitivity because p(θ | φ, y)/p0(θ | y) is sensitive to the drawn values of θ. If the importance weights are too variable for importance resampling to be considered accurate, and accurate inferences under the robust alternatives are desired, then we must rely on Markov chain simulation.

17.4   Robust inference and sensitivity analysis for the eight schools

Consider the hierarchical model for SAT coaching effects based on the data in Table 5.2 in Section 5.5. Given the large sample sizes in the eight original experiments, there should be little concern about assuming the data model that has yj ~ N(θj, σj2), with the variances σj2 known. The population model, θj ~ N(μ, τ2), is more difficult to justify, although the model checks in Section 6.5 suggest that it is adequate for the purposes of obtaining posterior intervals for the school effects. In general, however, posterior inferences can be highly sensitive to the assumed model, even when the model provides a good fit to the observed data. To illustrate methods for robust inference and sensitivity analysis, we explore an alternative family of models that fit t distributions to the population of school effects:

θj ~ tν(μ, τ2),   j = 1, …, 8.        (17.4)

We use the notation p(θ, μ, τ | ν, y) ∝ p(θ, μ, τ | ν)p(y | θ, μ, τ, ν) for the posterior distribution under the tν model and p0(θ, μ, τ | y) ≡ p(θ, μ, τ | ν = ∞, y) for the posterior distribution under the normal model evaluated in Section 5.5.

Robust inference based on a t4 population distribution

As discussed at the beginning of this chapter, one might be concerned that the normal population model causes the most extreme estimated school effects to be pulled too much toward the grand mean. Perhaps the coaching program in school A, for example, is different enough from the others that its estimate should not be shrunk so much to the average. A related concern would be that the largest observed effect, in school A, may be exerting undue influence on estimation of the population variance, τ2, and thereby also on the Bayesian estimates of the other effects. From a modeling standpoint, there is a great variety of different SAT coaching programs, and the population of their effects might be better fitted by a long-tailed distribution. To assess the importance of these concerns, we perform a robust analysis, replacing the normal population distribution by the t model (17.4) with ν = 4 and leaving the rest of the model unchanged; that is, the likelihood is still yj | θj ~ N(θj, σj2), and the hyperprior distribution is still p(μ, τ) ∝ 1.

Gibbs sampling. We carry out Gibbs sampling using the approach described in Section 12.1 with ν = 4. (See Appendix C for details on fitting such a model in Stan or performing the Gibbs sampler in R.) The resulting inferences for the eight schools, based on 2500 draws from the posterior distribution (the last halves of five chains, each of length 1000), are provided in Table 17.1. The results are essentially identical, for practical purposes, to the inferences under the normal model displayed in Table 5.3 on page 123, with just slightly less shrinkage for the more extreme schools such as school A.

Table 17.1  Summary of 2500 simulations of the treatment effects in the eight schools, using the t4 population distribution in place of the normal. Results are similar to those obtained under the normal model and displayed in Table 5.3.

Computation using importance resampling. Though we have already done the Markov chain simulation, we discuss briefly how to apply importance resampling to approximate the posterior distribution with ν = 4. First, we sample 5000 draws of (θ, μ, τ) from p0(θ, μ, τ | y), the posterior distribution under the normal model, as described in Section 5.4. Next, we compute the importance ratio for each draw:

p(θ, μ, τ | ν = 4, y) / p0(θ, μ, τ | y) ∝ ∏j=1,…,8  t4(θj | μ, τ2) / N(θj | μ, τ2).

The factors for the likelihood and hyperprior density cancel in the importance ratio, leaving only the ratio of the population densities. We sample 500 draws of (θ, μ, τ), without replacement, from the sample of 5000, using importance resampling. In this case the approximation is probably sufficient for assessing robustness, but the long tail of the distribution of the logarithms of the importance ratios (not shown) does indicate serious problems for obtaining accurate inferences using importance resampling.
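
In code, the ratio and the resampling step might look as follows (a sketch of our own, not the book's implementation; it assumes a 5000 × 8 array theta and length-5000 vectors mu and tau holding the posterior draws under the normal model):

import numpy as np
from scipy import stats

def log_importance_ratio(theta, mu, tau, nu=4):
    # log prod_j [ t_nu(theta_j | mu, tau^2) / N(theta_j | mu, tau^2) ] for each draw;
    # the likelihood and hyperprior factors cancel and are omitted.
    log_num = stats.t.logpdf(theta, df=nu, loc=mu[:, None], scale=tau[:, None]).sum(axis=1)
    log_den = stats.norm.logpdf(theta, loc=mu[:, None], scale=tau[:, None]).sum(axis=1)
    return log_num - log_den

def importance_resample(theta, mu, tau, k=500, seed=0):
    rng = np.random.default_rng(seed)
    log_w = log_importance_ratio(theta, mu, tau)
    w = np.exp(log_w - log_w.max())                  # unnormalized importance weights
    idx = rng.choice(len(w), size=k, replace=False, p=w / w.sum())
    return theta[idx], mu[idx], tau[idx]

A histogram of log_w provides the check mentioned above: if a few draws dominate the weights, the resampling approximation is suspect.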

Sensitivity analysis based on tν distributions with varying values of ν

A slightly different concern from robustness is the sensitivity of the posterior inference to the prior assumption of a normal population distribution. To study the sensitivity, we now fit a range of t distributions, with 1, 2, 3, 5, 10, and 30 degrees of freedom. We have already fitted infinite degrees of freedom (the normal model) and 4 degrees of freedom (the robust model above).

For each value of ν, we perform a Markov chain simulation to obtain draws from p(θ, μ, τ | ν, y). Instead of displaying a table of posterior summaries such as Table 17.1 for each value of ν, we summarize the results by the posterior mean and standard deviation of each of the eight school effects θj. Figure 17.1 displays the results as a function of 1/ν. The parameterization in terms of 1/ν rather than ν has the advantage of including the normal distribution at 1/ν = 0 and encompassing the entire range from normal to Cauchy distributions in the finite interval [0, 1]. There is some variation in the figures but no apparent systematic sensitivity of inferences to the hyperparameter, ν.

Figure 17.1  Posterior means and standard deviations of treatment effects as functions of ν, on the scale of 1/ν, for the sensitivity analysis of the educational testing example. The values at 1/ν=0 come from the simulations under the normal distribution in Section 5.5. Much of the scatter in the graphs is due to simulation variability.

Treating ν as an unknown parameter

Finally, we consider the sensitivity analysis parameter, ν, as an unknown quantity and average over it in the posterior distribution. In general, this computation is a key step, because we are typically only concerned with sensitivity to models that are supported by the data. In this particular example, inferences are so insensitive to ν that computing the marginal posterior distribution is unnecessary; we include it here as an illustration of the general method.

Prior distribution. Before computing the posterior distribution for ν, we must assign it a prior distribution. We try a uniform density on 1/ν for the range [0, 1] (that is, from the normal to the Cauchy distributions). This prior distribution favors long-tailed models, with half of the prior probability falling between the t1 (Cauchy) and t2 distributions.

In addition, the conditional prior distributions, p(μ, τ | ν) ∝ 1, are improper, so we must specify their dependence on ν; we use the notation p(μ, τ | ν) ∝ g(ν). In the t family, the parameters μ and τ characterize the median and the second derivative of the density function at the median, not the mean and variance, of the distribution of the θj’s. The parameter μ seems to have a reasonably invariant meaning (and in fact is equal to the mean except in the limiting case of the Cauchy where the mean does not exist), but the interquartile range would perhaps be a more reasonable parameter than the curvature for setting up a prior distribution. We cannot parameterize the tν distributions in terms of their variance, because the variance is infinite for ν ≤ 2. The interquartile range varies mildly as a function of ν, and so for simplicity we use the convenient parameterization in terms of (μ, τ) and set g(ν) ∝ 1. Combining this with our prior distribution on ν yields an improper joint uniform prior density on (μ, τ, 1/ν). If our posterior inferences under this model turn out to depend strongly on ν, we should consider refining this prior distribution.

Posterior inference. To treat ν as an unknown parameter, we modify the Gibbs sampling simulation used in the robust analyses to include a Metropolis step for sampling from the conditional distribution of 1/ν. An example of the implementation of such an approach can be found in Appendix C. Figure 17.2 displays a histogram of the simulations of 1/ν. An alternative to extending the model is to approximate the marginal posterior density using importance sampling and (17.3).

The sensitivity analysis showed that ν has only minor effects on the posterior inference; the results in Section 5.5 are thus not strongly dependent on the normal assumption for the population distribution of the parameters θj. If Figure 17.1 had shown a strong dependence on ν—as Figures 5.5–5.7 showed dependence on τ—then it might make sense to include ν as a hyperparameter, after thinking more seriously about a joint prior distribution for the parameters that are currently assigned noninformative prior distributions, (μ, τ, ν).

Figure 17.2  Posterior simulations of 1/ν from the Gibbs-Metropolis computation of the robust model for the educational testing example, with ν treated as unknown.

Discussion

Robustness and sensitivity to modeling assumptions depend on the estimands being studied. In the SAT coaching example, posterior medians, 50%, and 95% intervals for the eight school effects are insensitive to the assumption of a normal population distribution (at least as compared to the t family). In contrast, it may be that 99.9% intervals are strongly dependent on the tails of the distributions and sensitive to the degrees of freedom in the t distribution—fortunately, these extreme tails are unlikely to be of substantive interest in this example.

17.5    Robust regression using t-distributed errors

As with other models based on the normal distribution, inferences under the normal linear regression model of Chapter 14 are sensitive to unusual or outlying values. Robust regression analyses are obtained by considering robust alternatives to the normal distribution for regression errors. Robust error distributions, such as the t with few degrees of freedom, treat observations far from the regression line as high-variance observations, yielding results similar to those obtained by downweighting outliers. (Recall that the ‘weights’ in weighted linear regression are inverse variances.)

Iterative weighted linear regression and the EM algorithm

To illustrate robust regression calculations, we consider the tν regression model with fixed degrees of freedom as an alternative to the normal linear regression model. The conditional distribution of the individual response variable yi given the vector of explanatory variables Xi is yi ~ tν(Xiβ, σ2). The tν distribution can be expressed as a mixture as in equation (17.1) with Xiβ as the mean. As a first step in the robust analysis, we find the mode of the posterior distribution given the vector y consisting of n observations. Here we assume that a noninformative prior distribution is used, p(β, log σ | ν) ∝ 1; more substantial information about the regression parameters can be incorporated exactly as in Section 14.8 and Chapter 15. The posterior mode of p(β, log σ | ν, y) under the t model can be obtained directly using Newton’s method (Section 13.1) or any other mode-finding technique. Alternatively, we can take advantage of the mixture form of the t model and use the EM algorithm with the variances Vi treated as ‘missing data’ (that is, parameters to be averaged over); in the notation of Section 13.4, γ = (V1, …, Vn). The E-step of the EM algorithm computes the expected value of the sufficient statistics for the normal regression model, given the current parameter estimates (βold, σold) and averaging over (V1, …, Vn). It is sufficient to note that the conditional posterior distribution of each variance is scaled inverse-χ2,

Vi | βold, σold, ν, y ~ Inv-χ2(ν + 1, (νσold2 + (yi − Xiβold)2)/(ν + 1)),        (17.6)

and that

Eold(1/Vi) = (ν + 1)/(νσold2 + (yi − Xiβold)2).
The M-step of the EM algorithm is a weighted linear regression with diagonal weight matrix W whose diagonal elements are proportional to the conditional expectations Eold(1/Vi); explicitly, W = diag(w1, …, wn) with wi = σold2 Eold(1/Vi) = (ν + 1)/(ν + ((yi − Xiβold)/σold)2). The updated parameter estimates are

βnew = (XᵀW X)^−1 XᵀW y,        σnew2 = (1/n)(y − Xβnew)ᵀ W (y − Xβnew),

where X is the n × p matrix of explanatory variables. The iterations of the EM algorithm are equivalent to those performed in an iterative weighted least squares algorithm. Given initial estimates of the regression parameters, weights are computed for each case, with those cases having large residuals given less weight. Improved estimates of the regression parameters are then obtained by weighted linear regression.
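
A compact sketch of this iteration (our own illustration under the assumptions above: ν fixed and the flat prior p(β, log σ | ν) ∝ 1; the simulated data, the outlier, and all names are only for illustration):

import numpy as np

def t_regression_em(X, y, nu, n_iter=50):
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # start from ordinary least squares
    sigma2 = np.mean((y - X @ beta)**2)
    for _ in range(n_iter):
        r = y - X @ beta
        w = (nu + 1) / (nu + r**2 / sigma2)              # E-step: weights proportional to E(1/V_i)
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y) # M-step: weighted least squares
        r = y - X @ beta
        sigma2 = np.mean(w * r**2)                       # M-step: weighted residual variance
    return beta, np.sqrt(sigma2)

# Illustration with simulated data containing one gross outlier
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)
y[0] += 30                                               # an aberrant observation
print(t_regression_em(X, y, nu=4))                       # should stay close to (1, 2)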

When the degrees of freedom parameter, ν, is treated as unknown, the ECME algorithm can be applied, with an additional step added to the iteration for updating the degrees of freedom.

Other robust models. Iterative weighted linear regression, or equivalently the EM algorithm, can be used to obtain the posterior mode for a number of robust alternative models. Changing the probability model used for the observation variances, Vi, creates alternative robust models. For example, a two-point distribution can be used to model a regression with contaminated errors. The computations for robust models of this form are as described above, except that the E-step is modified to reflect the appropriate posterior conditional mean.

Gibbs sampler and Metropolis algorithm

Posterior draws from robust regression models can be obtained using Gibbs sampling and the Metropolis algorithm, as with the linear and generalized linear models discussed in Chapters 14–16. Using the mixture parameterization of the tν distribution, we can obtain draws from the posterior distribution p(β, σ2, V1, …, Vn | ν, y) by alternately sampling from p(β, σ2 | V1, …, Vn, ν, y) using the usual posterior distribution from weighted linear regression, and sampling from p(V1, …, Vn | β, σ2, ν, y), a set of independent scaled inverse-χ2 distributions as in equation (17.6). It can be even more effective to use parameter expansion as explained in Section 12.1.

If the degrees of freedom parameter, ν, is included as an unknown parameter in the model, then an additional Metropolis step is required in each iteration. In practice, these computations can be difficult to implement, because with low degrees of freedom ν, the posterior distribution can have many modes, and the Gibbs sampler and Metropolis algorithms can get stuck. It is important to run many simulations with overdispersed starting points for complicated models of this form.

17.6   Bibliographic note

Mosteller and Wallace (1964) use the negative binomial distribution, instead of the Poisson, for count data, and extensively study the sensitivity of their conclusions to model assumptions. Box and Tiao (1968) provide another early discussion of Bayesian robustness, in the context of outliers in normal models. Smith (1983) extends Box’s approach and also discusses the t family using the same parameterization (inverse degrees of freedom) as we have. A review of models for overdispersion in binomial data, from a non-Bayesian point of view, is given by Anderson (1988), who cites many further references. Gaver and O’Muircheartaigh (1987) discuss the use of hierarchical Poisson models for robust Bayesian inference. O’Hagan (1979) and Gelman (1992a) discuss the connection between the tails of the population distribution of a hierarchical model and the shrinkage in the associated Bayesian posterior distribution.

In a series of papers, Berger and coworkers have explored theoretical aspects of Bayesian robustness, examining, for example, families of prior distributions that provide maximum robustness against the influence of aberrant observations; see for instance Berger (1984, 1990) and Berger and Berliner (1986). Related work appears in Wasserman (1992). An earlier overview from a pragmatic point of view close to ours was provided by Dempster (1975). Rubin (1983a) provides an illustration of the limitations of data in assessing model fit and the resulting inevitable sensitivity of some conclusions to untestable assumptions.

With the recent advances in computation, modeling with the t distribution has become increasingly common in statistics. Dempster, Laird, and Rubin (1977) show how to apply the EM algorithm to t models, and Liu and Rubin (1995) and Meng and van Dyk (1997) discuss faster computational methods using extensions of EM. Lange, Little, and Taylor (1989) discuss the use of the t distribution in a variety of statistical contexts. Raghunathan and Rubin (1990) present an example using importance resampling. Tipping and Lawrence (2005) apply a factorized variational approximation, Vanhatalo, Jylanki, and Vehtari (2009) apply Laplace’s method, and Jylanki, Vanhatalo, and Vehtari (2011) apply expectation propagation to t models. Liu (2004) presents the ‘robit’ model as an alternative to logistic and probit regression.

Rubin (1983b) and Lange and Sinsheimer (1993) review the connections between robust regression, the t and related distributions, and iterative regression computations.

Taplin and Raftery (1994) present an example of an application of a finite mixture model for robust Bayesian analysis of agricultural experiments.

17.7   Exercises

1.  Prior distributions and shrinkage: in the educational testing experiments, suppose we think that most coaching programs are almost useless, but some are strongly effective; a corresponding population distribution for the school effects is a mixture, with most of the mass near zero but some mass extending far in the positive direction; for example,

p(θj) = λ1 N(θj | μ1, τ12) + λ2 N(θj | μ2, τ22).

All these parameters could be estimated from the data (as long as we restrict the parameter space, for example by setting μ1 < μ2), but to fix ideas, suppose that μ1 = 0, τ1 = 10, μ2 = 15, τ2 = 25, λ1 = 0.9, and λ2 = 0.1.

(a)   Compute the posterior distribution of (θ1, …, θ8) under this model for the data in Table 5.2.

(b)   Graph the posterior distribution for θ8 under this model for y8 = 0, 25, 50, and 100, with the same standard deviation σ8 as given in Table 5.2. Describe qualitatively the effect of the two-component mixture prior distribution.

2.  Poisson and negative binomial distributions: as part of their analysis of the Federalist papers, Mosteller and Wallace (1964) recorded the frequency of use of various words in selected articles by Alexander Hamilton and James Madison. The articles were divided into 247 blocks of about 200 words each, and the number of instances of various words in each block were recorded. Table 17.2 displays the results for the word ‘may.’

Table 17.2  Observed distribution of the word ‘may’ in papers of Hamilton and Madison, from Mosteller and Wallace (1964). Out of the 247 blocks of Hamilton’s text studied, 128 had no instances of ‘may,’ 67 had one instance of ‘may,’ and so forth, and similarly for Madison.

(a)   Fit the Poisson model to these data, with different parameters for each author and a noninformative prior distribution. Plot the posterior density of the Poisson mean parameter for each author.

(b)   Fit the negative binomial model to these data with different parameters for each author. What is a reasonable noninformative prior distribution to use? For each author, make a contour plot of the posterior density of the two parameters and a scatterplot of the posterior simulations.

3.  Model checking with the Poisson and negative binomial distributions: we examine the fit of the models in the previous exercise using posterior predictive checks.

(a)   Considering the nature of the data and of likely departures from the model, what would be appropriate test statistics?

(b)   Compare the observed test statistics to their posterior predictive distribution (see Section 6.3) to test the fit of the Poisson model.

(c)   Perform the same test for the negative binomial model.

4.  Robust models and model checking: fit a robust model to Newcomb’s speed of light data (Figure 3.1). Check the fit of the model using appropriate techniques from Chapter 6.

5.  Contamination models: construct and fit a normal mixture model to the dataset used in the previous exercise.

6.  Robust models:

(a)   Choose a dataset from one of the examples or exercises earlier in the book and analyze it using a robust model.

(b)   Check the fit of the model using the posterior predictive distribution and appropriate test variables.

(c)   Discuss how inferences changed under the robust model.

7.  Computation for the t model: consider the model y1, …, yn ~ iid tν(μ, σ2), with ν fixed and a uniform prior distribution on (μ, log σ).

(a)   Work out the steps of the EM algorithm for finding posterior modes of (μ, log σ), using the specification (17.1) and averaging over V1, …, Vn. Clearly specify the joint posterior density, its logarithm, the function Eold log p(μ, log σ, V1, …, Vn | y), and the updating equations for the M-step.

(b)   Work out the Gibbs sampler for drawing posterior simulations of (μ, log σ, V1, …, Vn).

(c)   Illustrate the analysis with the speed of light data of Figure 3.1, using a t2 model.

8.  Robustness and sensitivity analysis: repeat the computations of Section 17.4 with the dataset altered as described on page 435 so that the observation y8 is replaced by 100. Verify that, in this case, inferences are sensitive to ν. Which values of ν have highest marginal posterior density?
