Chapter 20

Basis function models

Chapter 19 considered nonlinear models with E(yi | Xi, β) = μ(Xi, β), where μ is a prespecified parametric nonlinear function of the predictors with unknown parameters β. In this and the following chapters we consider models in which the functional form of μ is also a priori unknown. In later chapters we consider more flexible approaches that also allow the residual density to be unknown and potentially changing with the predictors.

20.1   Splines and weighted sums of basis functions

To allow the mean to vary nonlinearly with predictors, one can replace Xiβ with μ(Xi), where μ(·) falls in some class of nonlinear functions. A variety of approaches are available for modeling this μ, including the use of basis function expansions and Gaussian processes (discussed in the following chapter).

To illustrate the basis function approach, we start with one-dimensional regression models in which μ(x) is modeled as a sum,

μ(x) = Σ_{h=1}^H βh bh(x),

where b = {b1, …, bH} is a prespecified set of basis functions and β = (β1, …, βH) is a vector of basis coefficients. The Taylor series expansion is a familiar example in which the basis functions are polynomials of increasing degree, with which one can represent a function as a (possibly) infinite sum of terms. In practice, Taylor series expansions can require a huge number of terms to model a function well globally and, for statistical applications, typically have horrible properties near the boundary. By a more appropriate choice of a finite set of basis functions it should be possible to more accurately model functions that arise in practice. It has been found useful to use local basis functions, which are centered on different locations: each basis function bh has a centering point xh, and bh(x) diminishes to zero when x is far from xh.

An often-used simple choice is the family of Gaussian radial basis functions,

bh(x) = exp( -(x - xh)² / (2 l²) ),

where the xh are the centers of the basis functions and l is a common width parameter. The number of basis functions and the width parameter l together control the scale at which the model can vary as a function of x.
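
As a small illustration (not from the book), the following R snippet builds a Gaussian radial basis design matrix and plots curves drawn from the prior; the number of centers, the width l, and the standard normal prior on the coefficients are all illustrative choices.

```r
# Sketch: Gaussian radial basis design matrix and prior draws of mu(x).
rbf_design <- function(x, centers, l) {
  # n x H matrix with entries b_h(x_i) = exp(-(x_i - x_h)^2 / (2 l^2))
  outer(x, centers, function(x, xh) exp(-(x - xh)^2 / (2 * l^2)))
}

x <- seq(0, 1, length.out = 200)
centers <- seq(0, 1, length.out = 10)   # H = 10 equally spaced centers
B <- rbf_design(x, centers, l = 0.1)
beta <- matrix(rnorm(10 * 5), 10, 5)    # five draws of beta ~ N(0, I)
matplot(x, B %*% beta, type = "l", lty = 1,
        ylab = "mu(x)")                 # realizations of mu(x) = sum_h beta_h b_h(x)
```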

Another commonly used family of basis functions is the B-spline, which is a piecewise continuous function that is defined conditional on some set of knots. Assuming uniform knot locations xh+k = xh + δk, the cubic B-spline basis function is defined as the following piecewise cubic polynomial:

bh(x) = u³/6                     for x in [xh, xh+1],   with u = (x - xh)/δ,
        (1 + 3u + 3u² - 3u³)/6   for x in [xh+1, xh+2], with u = (x - xh+1)/δ,
        (4 - 6u² + 3u³)/6        for x in [xh+2, xh+3], with u = (x - xh+2)/δ,
        (1 - u)³/6               for x in [xh+3, xh+4], with u = (x - xh+3)/δ,
        0                        otherwise.

Figure 20.1  Single Gaussian (solid line) and cubic B-spline (dashed line) basis functions scaled to have the same width. The X marks the center of the Gaussian basis function, and the circles mark the location of knots for the cubic B-spline.

Here the width of the basis function is determined by the distance δ between knots, and the maximum flexibility of the model is controlled by the number of knots uniformly placed in the data range. Knot locations can also be set nonuniformly. B-splines have a more complex definition than Gaussian radial basis functions, but each B-spline basis function has compact support, so the design matrix of the linear model is sparse, which can be exploited in computation.
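
For B-splines, the design matrix can be constructed directly with the base R splines package; a minimal sketch, with illustrative knot choices:

```r
# Sketch: cubic B-spline design matrix with equally spaced interior knots.
library(splines)
x <- seq(0, 1, length.out = 200)
interior_knots <- seq(0, 1, length.out = 18)[-c(1, 18)]   # 16 interior knots
B <- bs(x, knots = interior_knots, degree = 3, intercept = TRUE)
dim(B)   # 200 x 20: H = 16 interior knots + degree + 1 basis functions
# Each column has compact support, so most entries of B are exactly zero.
```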

Figure 20.1 shows single Gaussian and cubic B-spline basis functions. Both have smooth bell shapes. A weighted sum of such shapes (in which weights can be positive, negative, or zero) can be used to model smooth functions. Although the basis functions look very similar, Gaussian radial basis functions will produce smoother functions as they are infinitely differentiable, while the cubic B-spline is only twice continuously differentiable.

Figure 20.2 shows a set of B-spline basis functions and realizations from the model obtained by sampling random weights βh from a Gaussian prior distribution for these weights. The number of splines H impacts the flexibility of the resulting model for μ(x), as one cannot characterize finer-scale features in μ(x) than the chosen splines can represent. For example, if there is a spike in μ(x) that is narrower than the basis functions in Figure 20.2, then that spike will be oversmoothed.

Conditional on the selected basis b, the model is linear in the parameters. Hence, we can simply re-express the model as yi = wiβ + εi, where wi = (b1(xi), …, bH(xi)) is the vector of basis functions evaluated at xi. Because the resulting model is linear in the parameters β, model fitting can proceed as in linear regression models. For example, a multivariate normal-inverse-χ² prior for (β, σ2) is conjugate, so that the posterior of (β, σ2) given the data (xi, yi), i = 1, …, n, is also multivariate normal-inverse-χ². However, even though the model is linear in the parameters β, a rich class of functions can be accurately approximated as linear combinations of basis functions. It is often useful to center the basis function model on a linear model, as illustrated in the example below.
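
To make the conjugate computation concrete, here is a minimal sketch (not the book's code) of the joint update for (β, σ2), written in the equivalent normal-inverse-gamma parametrization; W stacks the rows wi, and the prior values are placeholders.

```r
# Sketch: conjugate posterior for (beta, sigma^2) in the basis-expanded linear model,
# with prior beta | sigma^2 ~ N(beta0, sigma^2 V0) and sigma^2 ~ Inv-gamma(a0, b0).
conj_posterior <- function(y, W, beta0, V0, a0, b0) {
  n <- length(y)
  Vn_inv <- solve(V0) + crossprod(W)                     # posterior precision (up to sigma^2)
  Vn <- solve(Vn_inv)
  betan <- Vn %*% (solve(V0, beta0) + crossprod(W, y))   # posterior mean of beta
  an <- a0 + n / 2
  bn <- b0 + 0.5 * as.numeric(sum(y^2) + t(beta0) %*% solve(V0, beta0) -
                                t(betan) %*% Vn_inv %*% betan)
  list(betan = drop(betan), Vn = Vn, an = an, bn = bn)
}
# Posterior simulation: draw sigma^2 ~ Inv-gamma(an, bn), then beta ~ N(betan, sigma^2 Vn).
```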

Figure 20.2  (a) A set of cubic B-splines with equally spaced knots. (b) A set of random draws from the B-spline prior for μ(x) based on the basis functions in the left graph, assuming independent standard normal priors for the basis coefficients.

Figure 20.3  A small dataset of concentration of chloride over time in a biology experiment. Data points are circles, the linear regression estimate is shown with a dotted line, and the posterior mean curve using B-splines is the curved solid line.

Example. Chloride concentration

We illustrate with a small dataset from a biology experiment containing 54 measurements of the concentration of chloride taken over a short time interval; see Figure 20.3, which shows the raw data, a fitted straight-line regression, and the posterior mean of the regression function (that is, E(μ(x) | y) as a function of x, averaging over the posterior distribution of the parameters β) from a fitted B-spline model. The data are close to linear but with some notable local deviations. In this case, there are 21 coefficients to be estimated (β1, …, β21) but only 54 data points, so it becomes problematic to estimate all of the basis coefficients without incorporating prior information. There are a variety of strategies that can be taken to accommodate such data sparsity. One possibility is to center the nonparametric prior for the curve μ(x) on a parametric function, such as a linear model.

Let β ~ N(β0, σβ2 I) and σ2 ~ Inv-gamma(a0, b0). This implies that the prior expectation for the curve at predictor value x is E(μ(x)) = Σ_{h=1}^H β0h bh(x) = μ0(x). Supposing that μ0(x) = α + ψx, so that the prior mean is linear, we can use least squares to estimate the values of β0 producing μ0(x) as close as possible to α + ψx; we find that for H = 21 in this application, μ0(x) is indistinguishable from α + ψx using this approach. For simplicity, one can plug in the least squares estimates for α and ψ. The resulting posterior mean is

E(β | y, σ2) = (I/σβ2 + WᵀW/σ2)^{-1} (β0/σβ2 + Wᵀy/σ2),

with μ0(x) = α̂ + ψ̂x the estimated least squares regression line and W = (w1, …, wn)ᵀ. The posterior mean will be shrunk towards the linear regression estimate, addressing the data sparsity issue while allowing nonparametric deviations from the linear regression fit. For a more complete analysis, one can place hyperpriors on α, ψ or can choose a smoothing prior which favors similar values for adjacent basis coefficients; first-order autoregressive priors are often used, leading to Bayesian penalized (P-)splines.

In applying splines, an important aspect of the specification is the number of knots and their locations. In many applications, it works well to choose sufficiently many knots, such as H = 21 in the above example, while also carefully choosing the prior for the basis coefficients to limit problems with over-fitting and data sparsity. However, several different Bayesian approaches are available for accommodating uncertainty in basis function specification. The first is to consider a free knot approach with a prior on the number and locations of knots in a kernel or spline model, using reversible jump MCMC (see Section 12.3) for posterior computation. The resulting posterior distribution for (μ, σ) will allow for uncertainty through model averaging over the posterior for the number and locations of knots. Although this approach is conceptually appealing, the computational implementation is a major hurdle. In particular, designing efficient reversible jump algorithms can be challenging. A second possibility is to relax the variable selection problem by choosing priors that do not set the βh coefficients exactly equal to zero but instead shrink many of the coefficients to near-zero values, while having heavy tails to avoid over-shrinking the coefficients for the important basis functions. A shrinkage prior that is concentrated near zero with heavy tails can be thought to provide a continuous analogue to variable selection priors, with the shrinkage priors having conceptual and computational advantages by not having to jump discontinuously between zero and nonzero values.

In this chapter, we first describe the variable selection approach, including basic details for how to proceed with posterior computation and inferences. We will then outline methods for shrinkage, which have some practical advantages over the formal variable selection approach. Extensions to multivariate regression with p > 1 will be described, and we will provide an introduction to the use of Gaussian process priors as an alternative to explicit basis representations.

20.2   Basis selection and shrinkage of coefficients

Focusing on the nonparametric regression model with Gaussian residuals and letting b = {bh}h=1H denote a prespecified collection of potential basis functions, we have

yi = Σ_{h=1}^H βh bh(xi) + εi,   εi ~ N(0, σ2).

In practice, there is typically uncertainty in which basis functions are really needed. To allow basis functions to be excluded from the model using Bayesian variable selection, we introduce a model index γ = (γ1, …, γH), with γh = 1 denoting that basis function bh should be included and γh = 0 otherwise. Here, γ ranges over a model space Γ containing the 2^H possibilities, from exclusion of all basis functions (denoted γ = 0H) to inclusion of all basis functions (denoted γ = 1H).

To complete a Bayesian specification, we require a prior over the model space Γ as well as a prior for the nonzero coefficients in each model γ. A simple prior specification relies on embedding all of the models in Γ in the full model by letting

βh ~ πh δ0 + (1 - πh) N(0, σ2/κh),   independently for h = 1, …, H,   (20.3)

with δ0 denoting a degenerate distribution with all its mass at zero. Prior (20.3) sets βh = 0 with probability πh, and otherwise draws a nonzero coefficient from a normal prior. This implies γh ~ Bernoulli(1 - πh) independently for h = 1, …, H, with Hγ = Σ_h γh the number of basis functions in model γ. Model (20.3) can be called a variable selection mixture prior.

In the absence of prior knowledge that certain basis coefficients are more likely to be included than others, one can let πh = π and then choose a hyperprior π ~ Beta(aπ, bπ) to allow the data to inform more strongly about the model size. Such a prior also induces an automatic Bayesian multiplicity adjustment, which leads to an increasing tendency to set coefficients to zero as more unnecessary basis coefficients are added. This adjustment is clear from the full conditional posterior distribution for π, which has the simple form π | γ ~ Beta(aπ + Σ_h (1 - γh), bπ + Σ_h γh). To induce a heavy-tailed Cauchy prior for the coefficients of the basis functions that are included, let κh ~ Gamma(0.5, 0.5) independently for h = 1, …, H. This relies on the expression of the t distribution as a scale mixture of normal densities, with an inverse-gamma mixing prior on the variance. An improper prior should not be chosen for the nonzero regression coefficients, as this leads to high posterior probability on the null model excluding all the basis functions (see Section 7.4). However, an improper noninformative prior can be chosen for the variance by letting the inverse-gamma hyperparameters a, b → 0, as σ is a parameter common to all the possible models.
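
As a one-line illustration of this full conditional (not from the book), with a_pi and b_pi the Beta hyperparameters and gamma the current vector of inclusion indicators:

```r
# Sketch: conjugate Beta update for the exclusion probability pi given gamma.
update_pi <- function(gamma, a_pi = 1, b_pi = 1) {
  rbeta(1, a_pi + sum(1 - gamma), b_pi + sum(gamma))
}
```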

A convenient characteristic of the above prior specification is that, assuming fixed π and κh = κ for simplicity, the full joint posterior distribution is conjugate with the posterior model probabilities available analytically as

Pr(γ | y, X) = p(γ) p(y | X, γ) / Σ_{γ* ∈ Γ} p(γ*) p(y | X, γ*),   (20.4)

where p(y | X, γ) is the marginal likelihood of the data under model γ,

p(y | X, γ) = ∫∫ N(y | Wγ βγ, σ2 I) p(βγ | σ2) p(σ2) dβγ dσ2,

with Wγ the design matrix containing only those basis functions bh for which γh = 1, and βγ the corresponding coefficients. This marginal likelihood under model γ is simply the marginal likelihood for a normal linear regression model under a jointly conjugate multivariate normal-gamma prior, and hence an analytic form is available. In addition, the posterior distribution of βγ, σ2 given γ is multivariate normal-inverse-gamma.

Unfortunately, even though the posterior is available analytically, the posterior probability of model γ cannot be calculated unless the number of potential basis functions H is small, since there is otherwise an enormous number (2^H) of different models to sum across in the denominator of (20.4). For example, when H = 50 there are 2^50 ≈ 1.1 × 10^15 models under consideration. Hence, except in low-dimensional cases, approximations must be used. One possibility is to rely on an MCMC-based stochastic search algorithm to identify high posterior probability models in Γ, and model-average across these models.

Another possibility is to use Gibbs sampling to update γh from its Bernoulli full conditional posterior distribution given γ(-h) = (γ1, …, γh-1, γh+1, …, γH), with

Pr(γh = 1 | y, X, γ(-h)) = (1 - π) p(y | X, γh = 1, γ(-h)) / [ (1 - π) p(y | X, γh = 1, γ(-h)) + π p(y | X, γh = 0, γ(-h)) ],

which can be calculated easily. One cycle of the Gibbs sampler would update γh given γ(-h) for h = 1, …, H. This would be repeated for a large number of iterations. After warm-up to allow convergence, the samples represent posterior draws over the model space Γ.
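
A minimal sketch of this computation (not the book's code): the first function evaluates the conjugate log marginal likelihood p(y | X, γ) under the illustrative prior βγ | σ2 ~ N(0, (σ2/κ)I), σ2 ~ Inv-gamma(a0, b0), and the second performs the Bernoulli update for a single γh; the default hyperparameter values are placeholders.

```r
# Sketch: conjugate log marginal likelihood and a single Gibbs update for gamma_h.
log_marg_lik <- function(y, W, kappa = 1, a0 = 1, b0 = 1) {
  n <- length(y)
  an <- a0 + n / 2
  if (ncol(W) == 0) {                               # null model: y ~ N(0, sigma^2 I)
    bn <- b0 + sum(y^2) / 2
    return(-(n / 2) * log(2 * pi) + a0 * log(b0) - an * log(bn) +
             lgamma(an) - lgamma(a0))
  }
  H <- ncol(W)
  V_inv <- crossprod(W) + kappa * diag(H)           # posterior precision (up to sigma^2)
  mu <- solve(V_inv, crossprod(W, y))               # conditional posterior mean of beta
  bn <- b0 + 0.5 * (sum(y^2) - sum(mu * (V_inv %*% mu)))
  -(n / 2) * log(2 * pi) + (H / 2) * log(kappa) -
    0.5 * as.numeric(determinant(V_inv, logarithm = TRUE)$modulus) +
    a0 * log(b0) - an * log(bn) + lgamma(an) - lgamma(a0)
}

update_gamma_h <- function(h, gamma, y, W_full, pi0 = 0.5, ...) {
  g1 <- replace(gamma, h, 1)
  g0 <- replace(gamma, h, 0)
  l1 <- log_marg_lik(y, W_full[, g1 == 1, drop = FALSE], ...)
  l0 <- log_marg_lik(y, W_full[, g0 == 1, drop = FALSE], ...)
  # pi0 is the prior exclusion probability; Pr(gamma_h = 1 | rest) from the odds ratio
  p1 <- 1 / (1 + exp(log(pi0) - log(1 - pi0) + l0 - l1))
  rbinom(1, 1, p1)
}
```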

From these draws, one can potentially conduct model selection in order to obtain a simplified model that discards unnecessary basis functions. Under a 0-1 loss function, which assigns a loss of 1 if the incorrect model is selected and 0 otherwise, the Bayes optimal model corresponds to the γ having the highest posterior probability. Unfortunately, unless H is small, it tends to be the case that there is a large number of models having posterior probabilities similar to that of the highest posterior probability model, so that it is misleading to base inferences on any selected model. For this reason, model averaging across the posterior on γ is preferred to better represent uncertainty in basis selection in estimating posterior summaries of the regression function μ and in conducting predictions. If there is interest in selecting a single model, then a better alternative to the maximum posterior model may be the median probability model that includes all predictors (basis functions) having marginal inclusion probabilities Pr(γh = 1 | data) > 0.5. This model provides the best single-model approximation to Bayes model averaging for orthogonal basis functions.

Example. Chloride concentration (continued)

We repeated the above analysis of the chloride data using Bayesian variable selection to account for uncertainty in which B-spline basis functions are needed to characterize the curve. If all 21 basis functions are included and a weakly informative N(0, I) or N(0, 2²I) prior was used for the basis coefficients, we obtained an extremely poor fit, with the posterior mean curve dramatically shifted downwards away from the data and towards the horizontal line at zero where the prior is centered. We considered the model with all 21 basis functions as the full model, and assigned each basis function a prior inclusion probability of 0.5. The coefficients for the basis functions that are included were given independent N(0, 2²) priors, while the residual variance σ2 was independently assigned an Inv-gamma(1, 1) prior. With this specification, we implemented a Gibbs sampling algorithm to sample from the full conditional distributions of each of the βh's and σ, with a different subset of the βh's automatically set to zero at each iteration. We ran the Gibbs sampler to approximate convergence; all of this took only a few seconds in R. The posterior mean for the number of included basis functions is 12.0 with a 95% posterior interval of [8.0, 16.0]. The posterior mean of the residual standard deviation is 0.27, with 95% interval [0.23, 0.33], suggesting that the measurement error variance is small.

A potential drawback to using Bayesian variable selection to account for uncertainty in selection from among a prespecified collection of basis functions is that there may be some sensitivity to the initial choice of basis. For example, using H = 21 prespecified cubic B-splines conveys some implicit prior information that the curve is quite smooth, and there are not sharp changes and spikes; in many applications, this is well justified but when spike functions are expected a priori one may want to use wavelets or another choice of basis. One can potentially include multiple types of basis functions in the initial collection of potential basis functions, with Bayesian variable selection used to select the subset of basis functions doing the best job at parsimoniously characterizing the curve. However, whenever possible prior information should strongly inform basis choice as well as the choice of prior on the coefficients.

Shrinkage priors

Allowing basis functions to drop out of the model adaptively by allowing their coefficients to be zero with positive probability is conceptually appealing, but comes with a computational price. When the number of models 2^H in Γ is enormous, MCMC algorithms cannot realistically be said to converge, in that only a small percentage of the models will be visited even in several hundreds of thousands of iterations. In addition, there can be slow mixing due to the one-at-a-time updating of the elements of γ. Although block updating is possible, the size of the blocks is limited due to computational constraints. In addition, when nonconjugate priors are used, efficient computation becomes even more difficult. These problems did not arise in the applications to the chloride data; indeed the computation time was much less than a minute for enough MCMC iterations that the mixing was excellent in every case we considered. Nonetheless, issues may arise in considering extensions to accommodate multiple predictors.

One possible solution to this problem, which also has philosophical appeal, is to avoid setting any of the coefficients equal to exactly zero but instead use a regularization or shrinkage prior such as discussed in Section 14.6. An appropriate prior would have high density at zero, corresponding to basis functions that can be effectively excluded as their coefficients are close to zero, while having heavy tails to avoid over-shrinking the signal. Most useful shrinkage priors can be expressed as scale mixtures of Gaussians as follows:

βh | τh ~ N(0, τh),   τh ~ G,

with G corresponding to a mixing distribution for the variances. For example, one can obtain a t distribution centered at 0 with ν degrees of freedom by setting G = Inv-gamma(ν/2, ν/2). In the machine learning literature, a common prior for shrinkage of basis coefficients in nonparametric regression corresponds to letting the degrees of freedom ν in the t distribution approach 0. In this limiting case, one obtains a normal-Jeffreys prior. Although the posterior is improper and hence Bayesian inferences are meaningless, the posterior mode of τ = (τ1, …, τH) can contain values τh = 0, and the resulting empirical Bayes posterior for βh is concentrated at zero. This induces a type of basis function selection, though uncertainty in selection and estimation of the coefficients is not accommodated.

To obtain a proper posterior and accommodate uncertainty, a common approach is to instead choose ν equal to a small nonzero value, such as ν = 10^-6. For ν > 0, the posterior mode will not be exactly zero but the posterior for βh can still be concentrated at zero for unnecessary basis functions as long as the number of degrees of freedom is sufficiently small. The commonly used default of ν = 1, corresponding to a Cauchy prior, often yields good performance in estimating the function μ and performing predictions.

The class of scale mixtures of normal distributions also includes the Laplace (double exponential) prior that is related to the lasso method (Section 14.6). The Laplace prior induces sparsity in the posterior mode, in that the modal estimates of some coefficients can be exactly equal to zero, and it is the prior with the heaviest tails that still produces a computationally convenient unimodal posterior density (assuming also a log-concave likelihood). However, none of the draws from the posterior distribution will be equal to zero, and in many cases the Laplace prior does not have heavy enough tails to avoid overshrinking the nonzero coefficients.

An alternative is to use a generalized double Pareto prior distribution on the regression coefficients, which resembles the double exponential near the origin while having arbitrarily heavy tails. The density has the form,

p(βh | ξ, α) = (1/(2ξ)) (1 + |βh|/(αξ))^{-(α+1)},

where ξ > 0 is a scale parameter and α > 0 is a shape parameter. One can sample from the generalized double Pareto by instead drawing βh | τh ~ N(0, τh), τh | λh ~ Expon(λh²/2), and λh ~ Gamma(α, η), where ξ = η/α. Hence the generalized double Pareto also admits a scale mixture of normals representation and thus retains the computational convenience associated with such mixtures. As a typical default specification for the hyperparameters, one can let α = η = 1, which leads to Cauchy-like tails. Using the generalized double Pareto density as a shrinkage prior on the basis coefficients in nonparametric regression, we let

βh ~ GDP(ξ = η/α, α), independently for h = 1, …, H,

which is equivalent to βh | τh ~ N(0, τh) and τh | λh ~ Expon(λh²/2), with λh ~ Gamma(α, η). Placing the prior p(σ) ∝ 1/σ on the error variance, we then obtain a simple block Gibbs sampler having the following conditional posterior distributions:

β | τ, σ, y ~ N(ΣβWᵀy/σ2, Σβ),   with Σβ = (WᵀW/σ2 + Λ^{-1})^{-1},
λh | βh ~ Gamma(α + 1, η + |βh|),
1/τh | βh, λh ~ Inv-Gaussian(λh/|βh|, λh²)  (mean and shape parametrization),
σ2 | β, y ~ Inv-gamma(n/2, (y - Wβ)ᵀ(y - Wβ)/2),

where Λ = diag(τ1, …, τH) and W = (w1, …, wn)ᵀ is the matrix of basis functions evaluated at the observed xi.

This Gibbs sampler tends to have good convergence and mixing properties in our experience, perhaps due largely to the block updating of β. After convergence, one can obtain draws from the posterior distribution for the nonparametric regression curve μ(x), which is expressed as a linear combination of basis functions, with the coefficients on the basis functions shrunk towards zero via the generalized double Pareto prior. In high-dimensional settings involving large numbers of potential basis functions, the tendency will be to set the coefficients for many of these bases close to zero while not shrinking the coefficients for the more important bases much at all. For nonorthogonal bases in which there is some redundancy, the specific bases having coefficients away from a small neighborhood of zero may vary across the iterations.
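
As a small illustration (not from the book), coefficients can be simulated from the generalized double Pareto prior via the scale-mixture hierarchy above; the defaults α = η = 1 give the Cauchy-like special case.

```r
# Sketch: draws from the generalized double Pareto prior via its hierarchical form.
rgdp <- function(H, alpha = 1, eta = 1) {
  lambda <- rgamma(H, shape = alpha, rate = eta)   # lambda_h ~ Gamma(alpha, eta)
  tau <- rexp(H, rate = lambda^2 / 2)              # tau_h | lambda_h ~ Expon(lambda_h^2 / 2)
  rnorm(H, mean = 0, sd = sqrt(tau))               # beta_h | tau_h ~ N(0, tau_h)
}
beta <- rgdp(10000)
quantile(abs(beta), c(0.5, 0.9, 0.99))   # most draws near zero, occasional very large values
```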

20.3   Non-normal models and multivariate regression surfaces

Other error distributions

The above discussion has focused on continuous response variables yi with Gaussian distributed residuals. It is straightforward to modify the methods to accommodate heavier-tailed residual densities that allow outliers by instead using a scale mixture of normals. In particular, we could let

yi ~ N(μ(xi), λiσ2),   λi ~ Inv-gamma(ν/2, ν/2),

which induces a tν distribution for the residual density. For low ν, the t density is substantially heavier-tailed than the normal density, automatically downweighting the influence of outliers on the posterior distribution of μ(x) without needing to discard outlying points. The inverse-gamma scale mixture of normals representation of the residual density is highly convenient in terms of posterior computation; we can simply modify the MCMC code developed in the Gaussian residual case to include an additional step for sampling from the inverse-gamma conditional posterior distribution of λi, while also modifying the other sampling steps to replace σ2 with λiσ2. We can additionally include a Metropolis-Hastings step to allow unknown degrees of freedom ν, or simply fix it in advance at an elicited value.
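
The extra Gibbs step is a one-liner; this sketch (not the book's code) draws each λi from its inverse-gamma full conditional, Inv-gamma((ν + 1)/2, (ν + (yi - μi)²/σ2)/2), given the current fitted values and residual variance.

```r
# Sketch: update the latent scales lambda_i for t_nu residuals via the inverse-gamma
# full conditional (drawn as the reciprocal of a gamma random variable).
update_lambda <- function(y, mu, sigma2, nu) {
  resid2 <- (y - mu)^2
  1 / rgamma(length(y), shape = (nu + 1) / 2, rate = (nu + resid2 / sigma2) / 2)
}
```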

Example. Chloride concentration (continued)

To assess robustness to outliers, we randomly contaminated one of the observations in the chloride data by adding a normal random variable having 10 times the standard deviation of the residual estimated in the above analysis; the contaminated observation was y47 = 32.4. Rerunning the Gibbs sampler that accounts for uncertainty in basis function selection while assuming Gaussian residuals, the posterior mean of σ increased from 0.27 to 0.61, and the posterior intervals around the curve were substantially wider, but the estimate did not change appreciably. However, repeating this exercise using 100 times the standard deviation to obtain y40 = 46.3, the results were poor, with the posterior mean of σ equal to 22.4 and the estimated curve pulled dramatically towards the horizontal line at zero and away from the data. Repeating the analysis allowing t residuals with degrees of freedom fixed at 4, we obtained results that were quite close to the results for the Gaussian model applied to the uncontaminated data, with the curve estimate only slightly pulled up and posterior intervals only slightly wider.

One can allow outcomes in the exponential family by relying on the same framework as above but with ηi = wiβ as the linear predictor in a generalized linear model (see Chapter 16). In non-Gaussian cases, the marginal likelihoods needed for posterior computation will not in general be available analytically, and it is common in practice to rely on Laplace approximations. An alternative strategy that can be used for categorical responses in probit models is to rely on data augmentation incorporating a latent Gaussian continuous response, with augmented data marginal likelihoods available in closed form for the latent variables. Such an approach would add a step to the MCMC algorithm for sampling the underlying Gaussian variables from their full conditional posterior distributions.
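
A minimal sketch of the corresponding augmentation step (not the book's code): with μi = wiβ the current linear predictor, each latent zi is drawn from a unit-variance normal truncated to be positive when yi = 1 and nonpositive when yi = 0, using inverse-CDF sampling.

```r
# Sketch: truncated-normal draws of the latent continuous responses in a probit model.
update_z <- function(y, mu) {
  lo <- ifelse(y == 1, pnorm(0, mean = mu), 0)   # lower bound of the uniform on the CDF scale
  hi <- ifelse(y == 1, 1, pnorm(0, mean = mu))   # upper bound
  mu + qnorm(runif(length(y), lo, hi))
}
# Given z, updating beta reduces to the conjugate normal linear regression step.
```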

Multivariate regression surfaces

Until this point, we have focused on regression models with a single predictor. In considering generalizations to accommodate multiple predictors, one must keep in mind the curse of dimensionality. This curse can arise in two ways. First, computational methods that work for a single predictor or a small number of predictors may not scale well as predictors are added. This is certainly the case if one attempts to prespecify sufficiently many potential basis functions to characterize an unconstrained multivariate regression surface, and then relies on Bayesian variable selection or shrinkage to effectively remove the unnecessary bases. For more than a few predictors, the number of basis functions needed increases rapidly and quickly becomes prohibitive. A second problem with multiple predictors is that, even putting aside any computational issues, it may be necessary to have enormous amounts of data to reliably estimate a multivariate regression surface without parametric assumptions, substantial prior information, or some restrictions. As the number of predictors p increases for a given sample size n, observations become much more sparsely distributed across the domain of the predictors, and hence there are typically subregions of the predictor space having few observations.

In such settings, the choice of prior for μ is crucial in developing Bayesian approaches for producing accurate interpolations across sparse data regions. One commonly used approach is to assume additivity, so that the multivariate regression surface μ, mapping from the predictor space to the real numbers, is characterized as a sum of univariate regression functions as follows:

μ(x) = β0 + Σ_{j=1}^p βj(xj),   βj(xj) = Σ_{h=1}^{Hj} θjh bjh(xj),   (20.5)

where βj(·) is an unknown coefficient function for the jth predictor, which is expressed as a linear combination of a prespecified set of basis functions bj = {bjh}h=1Hj. For example, bj may correspond to B-splines as above. Focusing on the additive model case, Bayesian variable selection or shrinkage priors can be applied exactly as described above in the p = 1 case without complications. In particular, MCMC algorithms permit a divide and conquer approach, using a modified response variable yi* = yi - β0 - Σ_{k≠j} βk(xik) that subtracts the contributions of the intercept and the other predictors in updating the unknowns characterizing the regression function for the jth predictor.

Additive models can often reduce the curse of dimensionality, and efficiency can be further improved by including prior information. In many applications, prior information takes the form of shape constraints on the regression function. For example, it may be reasonable to assume a priori that the mean of the response variable is nondecreasing in one or more of the predictors leading to a nondecreasing constraint on certain βj(xj) functions in the additive expansion. Such constraints are easy to incorporate within a Bayesian approach by using piecewise linear or monotone splines bj and then constraining the regression coefficients to be nonnegative. One way to encourage sparse models is by using a prior distribution that is a mixture of a point mass at zero and a truncated normal distribution on the θjh’s, leading to nondecreasing βj functions that can be flat across regions of the predictor space. Allowing flat functions serves the dual purpose of reducing bias in overestimating the slope of the regression function and permitting inferences on regions across which the predictor has no impact. As the Bayesian approach leads to uncertainty in the locations of these flat regions, one can use such models to estimate posterior distributions of threshold predictor levels corresponding to the first value such that there is an increase in the response mean. More involved shape restrictions, such as unimodality and convexity, can also be incorporated through an appropriate prior.
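
To make the piecewise-linear monotone construction concrete, here is a small sketch (not the book's code): each basis function accumulates length within one inter-knot interval, so any nonnegative coefficient vector gives a nondecreasing, possibly flat, curve; the knots and coefficients below are illustrative.

```r
# Sketch: piecewise-linear monotone basis; b_j(x) has slope 1 on the j-th interval,
# is 0 to its left, and is constant to its right.
monotone_design <- function(x, knots) {
  delta <- diff(knots)
  sapply(seq_along(delta), function(j) pmin(pmax(x - knots[j], 0), delta[j]))
}

x <- seq(0, 10, length.out = 200)
B <- monotone_design(x, knots = 0:10)
beta <- c(0, 0, 0.5, 1, 0, 0, 0.2, 0, 0, 0)      # nonnegative slopes; zeros give flat regions
plot(x, B %*% beta, type = "l", ylab = "f(x)")   # nondecreasing curve with flat stretches
```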

Example. A nonparametric regression function that is constrained to be nondecreasing

Data from pregnant women in the U.S. Collaborative Perinatal Project were used to study the impact of DDE, a persistent metabolite of the pesticide DDT, on the risk of premature delivery. Out of 2380 pregnancies in the dataset, there were 361 preterm births. Serum DDE concentration in mg/L was measured for each woman, along with potentially confounding maternal characteristics including cholesterol and triglyceride levels, age, BMI and smoking status (yes or no). The aim of our analysis was to incorporate a nondecreasing constraint on the regression function relating level of DDE to the probability of preterm birth in order to improve efficiency in assessing the dose response trend. As for other potentially adverse environmental exposures, it was reasonable to believe a priori that covariate-adjusted premature delivery risk is nondecreasing in dose of DDE. Without restricting the curve to be nondecreasing, one relies too much on the data and may obtain artifactual bumps that are not believable. However, we also do not want to impose a strictly increasing relationship a priori, as there may be no impact on risk at low doses, or even across the whole range observed in the study.

We focused on the following semiparametric probit additive model:

Pr(yi = 1 | xi, zi) = Φ(f(xi) + ziᵀα),

where yi is an indicator of preterm birth, xi is DDE level, zi is a vector of the five predictors in the order listed above, and Φ() is the standard normal cumulative distribution function. The covariate adjustment is parametric, while f(x) is characterized nonparametrically as a nondecreasing but potentially flat curve using splines with a carefully structured prior on the basis coefficients. We chose diffuse N(0, 10²) priors independently for the α values, though a more informative prior could easily be elicited in this application, as premature delivery studies are routinely conducted, with similar covariates measured in different studies. For f(x), we simply used a piecewise linear function with a dense set of knots and βj representing the slope within the jth interval. By choosing a prior for the βj's that does not allow negative values, we enforce the nondecreasing constraint.

Figure 20.4  Estimated probability of preterm birth as a function of DDE dose. The solid line is the posterior mean based on a Bayesian nonparametric regression constrained to be nondecreasing, and the dashed lines are 95% posterior intervals for the probability at each point. The dotted line is the maximum likelihood estimate for the unconstrained generalized additive model.

In order to borrow information across the adjacent intervals, while enforcing the constraint and placing positive probability at βj = 0 to accommodate flat regions, we use a latent threshold prior. In particular, we defined a first-order normal random walk autoregressive prior for latent slope parameters, βj* ~ N(βj-1*, σβ2), with σβ2 assigned an inverse-gamma hyperprior to allow the data to inform about the level of smoothness. Then, to link the latent βj*'s to the slopes characterizing the function, we let βj = βj* 1(βj* > δ), with δ a small positive threshold parameter, which is assigned a gamma hyperprior. As δ increases, it becomes more likely to sample βj = 0 and the resulting function has more flat regions.

We implemented an MCMC algorithm for posterior computation. The estimated curve f(x) is shown in Figure 20.4. The estimated posterior probability of the global null hypothesis that the curve f(x) is 0 across the observed range of DDE in the sample was less than 0.01, in contrast to the results obtained fitting the same model using a simple classical approach with no constraints, which led to a p-value of 0.23. Using the Bayesian posterior simulations, we also estimated the first dose level at which there is an increasing slope. The posterior mean for this threshold dose is 7 with a 95% interval of [3, 21]. Such thresholds are of broad interest in many applications, but we recognize that they are approximations, given that the underlying function is presumably continuous and always increasing to some extent.

Although appealing in making the class of possible multivariate regression functions μ(x) more manageable, the additivity assumption is clearly violated in many applications. For example, violations of additivity arise when there are interactions among the predictors, so that the shape and slope of the regression function in the jth predictor depends on the values of the other predictors. One alternative approach, which also attempts to address the curse of dimensionality, relies on tensor product specifications, such as

μ(x) = Σ_{h1=1}^H … Σ_{hp=1}^H θ_{h1…hp} b_{1h1}(x1) ⋯ b_{php}(xp),

where bj = {bjh} is a prespecified set of basis functions for the jth predictor, as in (20.5) above but with Hj = H for simplicity, and θ = (θ_{h1…hp}) is a p-way array (tensor) containing unknown coefficients. The number of coefficients in the tensor θ can be large, particularly as p grows. However, we can use Bayesian variable selection or shrinkage priors to favor many elements of θ close to zero. This enables effective collapsing on a lower-dimensional representation of the multidimensional regression surface μ. For example, one can drop out predictors entirely or remove interactions. Conditionally on the basis functions, we have linearity, so that efficient computation is possible using Gibbs sampling.
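
As a small illustration (not from the book) for p = 2, the tensor product design matrix is the row-wise Kronecker product of the two marginal basis matrices:

```r
# Sketch: row i of the result holds all products b_{1h}(x_{i1}) * b_{2k}(x_{i2}).
tensor_design <- function(B1, B2) {
  stopifnot(nrow(B1) == nrow(B2))
  t(vapply(seq_len(nrow(B1)),
           function(i) kronecker(B1[i, ], B2[i, ]),
           numeric(ncol(B1) * ncol(B2))))
}
# Example: combine two univariate B-spline bases into an H1*H2-column design matrix.
# W <- tensor_design(bs(x1, df = 6), bs(x2, df = 6))
```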

20.4   Bibliographic note

Bishop (2006) provides a useful review of basis function models. Some key references on the reversible jump MCMC approach to basis function selection in nonparametric regression include Biller (2000) and DiMatteo, Genovese, and Kass (2001). The Bayesian variable selection approach to allowing uncertainty in basis selection was suggested by Smith and Kohn (1996). This approach applies stochastic search for posterior computation in Bayesian variable selection problems (George and McCulloch, 1993, 1997). Justification for median probability models is provided in Barbieri and Berger (2004).

Park and Casella (2008) discuss inference using the Laplace prior distribution using posterior simulations, and Seeger (2008) considers expectation propagation for this model. References on generalized double Pareto shrinkage include Armagan, Dunson and Lee (2011) and Armagan et al. (2013).

Some references on monotone Bayesian nonparametric regression include Ramsay and Silverman (2005), Neelon and Dunson (2004), Dunson (2005), and Hazelton and Turlach (2011), with Hannah and Dunson (2011) recently developing efficient methods for multivariate convex regression. Pati and Dunson (2011) use tensor product nonparametric regression for surface estimation.

The chloride example comes from Bates and Watts (1988). The DDE study comes from Neelon and Dunson (2004).

20.5   Exercises

1.  Basis function model: The file at naes04.csv contains age, sex, race, and attitude on three gay-related questions from the 2004 National Annenberg Election Survey. The three questions are whether the respondent favors a constitutional amendment banning same-sex marriage, whether the respondent supports a state law allowing same-sex marriage, and whether the respondent knows any gay people. Figure 20.5 shows the data for the latter two questions (averaged over all sex and race categories).

For this exercise, you will only need to consider the outcome as a function of age, and for simplicity you should use the normal approximation to the binomial distribution for the proportion of Yes responses for each age.

(a)   Set up a Bayesian basis function model to estimate the percentage of people in the population who believe they know someone gay (in 2004), as a function of age. Write the model in statistical notation (the full model, including prior distributions), and write the (unnormalized) joint posterior density. As noted above, use a normal model for the data.

(b)   Program the log of the unnormalized joint posterior density as an R function.

(c)   Fit the model. You can use MCMC, variational Bayes, expectation propagation, Stan, or any other method. But your fit must be Bayesian.

(d)   Graph your estimate along with the data (plotting multiple graphs on a single page).

2.  Basis function model with binary data: Repeat the previous exercise but this time using the binomial model for the Yes/No responses. The computation will be more complicated but your results should be similar. Discuss any differences compared to the results from the previous exercise.

Figure 20.5  Proportion of survey respondents who reported knowing someone gay, and who supported a law allowing same-sex marriage, as a function of age. Can you fit curves through these points using splines or Gaussian processes?

3.  Basis function model with multiple predictors: Repeat the previous exercise but this time estimating the percentage of people in the population who believe they know someone gay (in 2004), as a function of three predictors: age, sex, and race.

4.  Basis function model for binary data: Table 19.1 presents data on the success rate of putts by professional golfers.

(a)   Fit a basis function model for the probability of success (using the binomial likelihood) as a function of distance. Compare results to your solution of Exercise 19.2.

(b)   Use posterior predictive checks to assess the fit of the model.

5.  Hierarchical modeling and splines: The file Pollster_Data.csv gives percentage support for Barack Obama and Mitt Romney in a series of opinion polls in the 2012 election campaign. Different polls are conducted by different survey organizations using different modes of interviewing, with different populations and different sample sizes. Estimate a time series of support for each candidate, adjusting for all these factors and smoothing the curve using a spline model for the time pattern and a hierarchical model for polling organization effects and for poll-to-poll variation. Compare to the smoothed average of the unadjusted approval numbers from this series and comment on any differences.

6.  Consider a nonparametric regression model with k cubic B-spline basis functions, and the basis coefficients βh drawn independently from a generalized double Pareto shrinkage prior.

(a)   For different choices of k, sample and plot realizations from the prior for μ.

(b)   What is the prior expectation for μ(x) and how does it depend on k and x?

(c)   What is the prior variance of μ(x) and how does it depend on k and x?

(d)   Describe a modification of the generalized double Pareto prior to let E(μ(x)) ≈ x and var(μ(x)) ≈ 2 for all x while maintaining the prior independence assumption on the βh's.
