Chapter 1

Probability and inference

1.1   The three steps of Bayesian data analysis

This book is concerned with practical methods for making inferences from data using probability models for quantities we observe and for quantities about which we wish to learn. The essential characteristic of Bayesian methods is their explicit use of probability for quantifying uncertainty in inferences based on statistical data analysis.

The process of Bayesian data analysis can be idealized by dividing it into the following three steps:

1.  Setting up a full probability model—a joint probability distribution for all observable and unobservable quantities in a problem. The model should be consistent with knowledge about the underlying scientific problem and the data collection process.

2.  Conditioning on observed data: calculating and interpreting the appropriate posterior distribution—the conditional probability distribution of the unobserved quantities of ultimate interest, given the observed data.

3.  Evaluating the fit of the model and the implications of the resulting posterior distribution: how well does the model fit the data, are the substantive conclusions reasonable, and how sensitive are the results to the modeling assumptions in step 1? In response, one can alter or expand the model and repeat the three steps.

Great advances in all these areas have been made in the last forty years, and many of these are reviewed and used in examples throughout the book. Our treatment covers all three steps, the second involving computational methodology and the third a delicate balance of technique and judgment, guided by the applied context of the problem. The first step remains a major stumbling block for much Bayesian analysis: just where do our models come from? How do we go about constructing appropriate probability specifications? We provide some guidance on these issues and illustrate the importance of the third step in retrospectively evaluating the fit of models. Along with the improved techniques available for computing conditional probability distributions in the second step, advances in carrying out the third step alleviate to some degree the need to assume correct model specification at the first attempt. In particular, the much-feared dependence of conclusions on ‘subjective’ prior distributions can be examined and explored.

A primary motivation for Bayesian thinking is that it facilitates a common-sense interpretation of statistical conclusions. For instance, a Bayesian (probability) interval for an unknown quantity of interest can be directly regarded as having a high probability of containing the unknown quantity, in contrast to a frequentist (confidence) interval, which may strictly be interpreted only in relation to a sequence of similar inferences that might be made in repeated practice. Recently in applied statistics, increased emphasis has been placed on interval estimation rather than hypothesis testing, and this provides a strong impetus to the Bayesian viewpoint, since it seems likely that most users of standard confidence intervals give them a common-sense Bayesian interpretation. One of our aims in this book is to indicate the extent to which Bayesian interpretations of common simple statistical procedures are justified.

Rather than argue the foundations of statistics—see the bibliographic note at the end of this chapter for references to foundational debates—we prefer to concentrate on the pragmatic advantages of the Bayesian framework, whose flexibility and generality allow it to cope with complex problems. The central feature of Bayesian inference, the direct quantification of uncertainty, means that there is no impediment in principle to fitting models with many parameters and complicated multilayered probability specifications. In practice, the problems are ones of setting up and computing with such large models, and a large part of this book focuses on recently developed and still developing techniques for handling these modeling and computational challenges. The freedom to set up complex models arises in large part from the fact that the Bayesian paradigm provides a conceptually simple method for coping with multiple parameters, as we discuss in detail from Chapter 3 on.

1.2   General notation for statistical inference

Statistical inference is concerned with drawing conclusions, from numerical data, about quantities that are not observed. For example, a clinical trial of a new cancer drug might be designed to compare the five-year survival probability in a population given the new drug to that in a population under standard treatment. These survival probabilities refer to a large population of patients, and it is neither feasible nor ethically acceptable to experiment on an entire population. Therefore inferences about the true probabilities and, in particular, their differences must be based on a sample of patients. In this example, even if it were possible to expose the entire population to one or the other treatment, it is never possible to expose anyone to both treatments, and therefore statistical inference would still be needed to assess the causal inference—the comparison between the observed outcome in each patient and that patient’s unobserved outcome if exposed to the other treatment.

We distinguish between two kinds of estimands—unobserved quantities for which statistical inferences are made—first, potentially observable quantities, such as future observations of a process, or the outcome under the treatment not received in the clinical trial example; and second, quantities that are not directly observable, that is, parameters that govern the hypothetical process leading to the observed data (for example, regression coefficients). The distinction between these two kinds of estimands is not always precise, but is generally useful as a way of understanding how a statistical model for a particular problem fits into the real world.

Parameters, data, and predictions

As general notation, we let θ denote unobservable vector quantities or population parameters of interest (such as the probabilities of survival under each treatment for randomly chosen members of the population in the example of the clinical trial), y denote the observed data (such as the numbers of survivors and deaths in each treatment group), and ỹ denote unknown, but potentially observable, quantities (such as the outcomes of the patients under the other treatment, or the outcome under each of the treatments for a new patient similar to those already in the trial). In general these symbols represent multivariate quantities. We generally use Greek letters for parameters, lower case Roman letters for observed or observable scalars and vectors (and sometimes matrices), and upper case Roman letters for observed or observable matrices. When using matrix notation, we consider vectors as column vectors throughout; for example, if u is a vector with n components, then uᵀu is a scalar and uuᵀ an n × n matrix.
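
The column-vector convention is easy to check numerically in R (the package we use for computation; see Section 1.9); the values of u here are arbitrary:

u <- c(1, 2, 3)   # a vector with n = 3 components, treated as a column vector
t(u) %*% u        # uᵀu: a 1 x 1 matrix containing the scalar 14
u %*% t(u)        # uuᵀ: an n x n (here 3 x 3) matrix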

Observational units and variables

In many statistical studies, data are gathered on each of a set of n objects or units, and we can write the data as a vector, y = (y1, …, yn). In the clinical trial example, we might label yi as 1 if patient i is alive after five years or 0 if the patient dies. If several variables are measured on each unit, then each yi is actually a vector, and the entire dataset y is a matrix (usually taken to have n rows). The y variables are called the ‘outcomes’ and are considered ‘random’ in the sense that, when making inferences, we wish to allow for the possibility that the observed values of the variables could have turned out otherwise, due to the sampling process and the natural variation of the population.

Exchangeability

The usual starting point of a statistical analysis is the (often tacit) assumption that the n values yi may be regarded as exchangeable, meaning that we express uncertainty as a joint probability density p(y1, …, yn) that is invariant to permutations of the indexes. A nonexchangeable model would be appropriate if information relevant to the outcome were conveyed in the unit indexes rather than by explanatory variables (see below). The idea of exchangeability is fundamental to statistics, and we return to it repeatedly throughout the book.

We commonly model data from an exchangeable distribution as independently and identically distributed (iid) given some unknown parameter vector θ with distribution p(θ). In the clinical trial example, we might model the outcomes yi as iid, given θ, the unknown probability of survival.

Explanatory variables

It is common to have observations on each unit that we do not bother to model as random. In the clinical trial example, such variables might include the age and previous health status of each patient in the study. We call this second class of variables explanatory variables, or covariates, and label them x. We use X to denote the entire set of explanatory variables for all n units; if there are k explanatory variables, then X is a matrix with n rows and k columns. Treating X as random, the notion of exchangeability can be extended to require the distribution of the n values of (x, y)i to be unchanged by arbitrary permutations of the indexes. It is always appropriate to assume an exchangeable model after incorporating sufficient relevant information in X that the indexes can be thought of as randomly assigned. It follows from the assumption of exchangeability that the distribution of y, given x, is the same for all units in the study in the sense that if two units have the same value of x, then their distributions of y are the same. Any of the explanatory variables x can be moved into the y category if we wish to model them. We discuss the role of explanatory variables (also called predictors) in detail in Chapter 8 in the context of analyzing surveys, experiments, and observational studies, and in the later parts of this book in the context of regression models.

Hierarchical modeling

In Chapter 5 and subsequent chapters, we focus on hierarchical models (also called multilevel models), which are used when information is available on several different levels of observational units. In a hierarchical model, it is possible to speak of exchangeability at each level of units. For example, suppose two medical treatments are applied, in separate randomized experiments, to patients in several different cities. Then, if no other information were available, it would be reasonable to treat the patients within each city as exchangeable and also treat the results from different cities as themselves exchangeable. In practice it would make sense to include, as explanatory variables at the city level, whatever relevant information we have on each city, as well as the explanatory variables mentioned before at the individual level, and then the conditional distributions given these explanatory variables would be exchangeable.

1.3   Bayesian inference

Bayesian statistical conclusions about a parameter θ, or unobserved data ỹ, are made in terms of probability statements. These probability statements are conditional on the observed value of y, and in our notation are written simply as p(θ|y) or p(ỹ|y). We also implicitly condition on the known values of any covariates, x. It is at the fundamental level of conditioning on observed data that Bayesian inference departs from the approach to statistical inference described in many textbooks, which is based on a retrospective evaluation of the procedure used to estimate θ (or ỹ) over the distribution of possible y values conditional on the true unknown value of θ. Despite this difference, it will be seen that in many simple analyses, superficially similar conclusions result from the two approaches to statistical inference. However, analyses obtained using Bayesian methods can be easily extended to more complex problems. In this section, we present the basic mathematics and notation of Bayesian inference, followed in the next section by an example from genetics.

Probability notation

Some comments on notation are needed at this point. First, p(·|·) denotes a conditional probability density with the arguments determined by the context, and similarly for p(·), which denotes a marginal distribution. We use the terms ‘distribution’ and ‘density’ interchangeably. The same notation is used for continuous density functions and discrete probability mass functions. Different distributions in the same equation (or expression) will each be denoted by p(·), as in (1.1) below, for example. Although an abuse of standard mathematical notation, this method is compact and similar to the standard practice of using p(·) for the probability of any discrete event, where the sample space is also suppressed in the notation. Depending on context, to avoid confusion, we may use the notation Pr(·) for the probability of an event; for example, Pr(θ > 2). When using a standard distribution, we use a notation based on the name of the distribution; for example, if θ has a normal distribution with mean μ and variance σ², we write θ ~ N(μ, σ²) or p(θ) = N(θ|μ, σ²) or, to be even more explicit, p(θ|μ, σ²) = N(θ|μ, σ²). Throughout, we use notation such as N(μ, σ²) for random variables and N(θ|μ, σ²) for density functions. Notation and formulas for several standard distributions appear in Appendix A.

We also occasionally use the following summaries for all-positive random variables θ: the coefficient of variation, defined as sd(θ)/E(θ); the geometric mean, exp(E[log(θ)]); and the geometric standard deviation, exp(sd[log(θ)]).
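
These summaries are easy to compute by simulation; in this R sketch, the lognormal distribution for θ is purely an illustrative assumption:

theta <- rlnorm(1e6, meanlog = 0, sdlog = 0.5)   # an arbitrary all-positive distribution
sd(theta) / mean(theta)   # coefficient of variation, sd(theta)/E(theta)
exp(mean(log(theta)))     # geometric mean, exp(E[log(theta)])
exp(sd(log(theta)))       # geometric standard deviation, exp(sd[log(theta)])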

Bayes’ rule

In order to make probability statements about θ given y, we must begin with a model providing a joint probability distribution for θ and y. The joint probability mass or density function can be written as a product of two densities that are often referred to as the prior distribution p(θ) and the sampling distribution (or data distribution) p(y|θ), respectively:

p(θ, y) = p(θ) p(y|θ).

Simply conditioning on the known value of the data y, using the basic property of conditional probability known as Bayes’ rule, yields the posterior density:

p(θ|y) = p(θ, y)/p(y) = p(θ) p(y|θ)/p(y),   (1.1)

where p(y) = Σθ p(θ) p(y|θ), and the sum is over all possible values of θ (or p(y) = ∫ p(θ) p(y|θ) dθ in the case of continuous θ). An equivalent form of (1.1) omits the factor p(y), which does not depend on θ and, with fixed y, can thus be considered a constant, yielding the unnormalized posterior density, which is the right side of (1.2):

p(θ|y) ∝ p(θ) p(y|θ).   (1.2)

The second term in this expression, p(y|θ), is taken here as a function of θ, not of y. These simple formulas encapsulate the technical core of Bayesian inference: the primary task of any specific application is to develop the model p(θ, y) and perform the computations to summarize p(θ|y) in appropriate ways.
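
As a minimal computational sketch of (1.1) and (1.2), the following R code normalizes an unnormalized posterior over a grid of θ values; the binomial likelihood, the data, and the uniform prior are illustrative assumptions, not part of the development above:

theta <- seq(0.001, 0.999, length.out = 1000)   # grid of values for a continuous theta
prior <- rep(1, length(theta))                  # uniform prior density, p(theta)
lik <- dbinom(7, size = 10, prob = theta)       # p(y|theta) for hypothetical data y
unnorm <- prior * lik                           # right side of (1.2)
posterior <- unnorm / sum(unnorm)               # normalize, approximating division by p(y)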

Prediction

To make inferences about an unknown observable, often called predictive inferences, we follow a similar logic. Before the data y are considered, the distribution of the unknown but observable y is

p(y) = ∫ p(y, θ) dθ = ∫ p(θ) p(y|θ) dθ.   (1.3)

This is often called the marginal distribution of y, but a more informative name is the prior predictive distribution: prior because it is not conditional on a previous observation of the process, and predictive because it is the distribution for a quantity that is observable.

After the data y have been observed, we can predict an unknown observable, ỹ, from the same process. For example, y = (y1, …, yn) may be the vector of recorded weights of an object weighed n times on a scale, θ = (μ, σ²) may be the unknown true weight of the object and the measurement variance of the scale, and ỹ may be the yet to be recorded weight of the object in a planned new weighing. The distribution of ỹ is called the posterior predictive distribution, posterior because it is conditional on the observed y and predictive because it is a prediction for an observable ỹ:

p(ỹ|y) = ∫ p(ỹ, θ|y) dθ
       = ∫ p(ỹ|θ, y) p(θ|y) dθ
       = ∫ p(ỹ|θ) p(θ|y) dθ.   (1.4)

The second and third lines display the posterior predictive distribution as an average of conditional predictions over the posterior distribution of θ. The last step follows from the assumed conditional independence of y and ỹ given θ.
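
The averaging in (1.4) translates directly into simulation: draw θ from its posterior distribution, then draw ỹ given θ. Here is an R sketch under purely illustrative assumptions (a Beta(8, 4) posterior for a survival probability θ and a single future binary outcome ỹ):

theta_draws <- rbeta(10000, 8, 4)               # draws from an assumed posterior p(theta|y)
ytilde_draws <- rbinom(10000, 1, theta_draws)   # for each draw, simulate ytilde given theta
mean(ytilde_draws)                              # estimates Pr(ytilde = 1 | y); here about 2/3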

Likelihood

Using Bayes’ rule with a chosen probability model means that the data y affect the posterior inference (1.2) only through p(y|θ), which, when regarded as a function of θ, for fixed y, is called the likelihood function. In this way Bayesian inference obeys what is sometimes called the likelihood principle, which states that for a given sample of data, any two probability models p(y|θ) that have the same likelihood function yield the same inference for θ.

The likelihood principle is reasonable, but only within the framework of the model or family of models adopted for a particular analysis. In practice, one can rarely be confident that the chosen model is correct. We shall see in Chapter 6 that sampling distributions (imagining repeated realizations of our data) can play an important role in checking model assumptions. In fact, our view is that an applied Bayesian statistician is one who is willing to apply Bayes’ rule under a variety of possible models.

Likelihood and odds ratios

The ratio of the posterior density p(θ|y) evaluated at the points θ1 and θ2 under a given model is called the posterior odds for θ1 compared to θ2. The most familiar application of this concept is with discrete parameters, with θ2 taken to be the complement of θ1. Odds provide an alternative representation of probabilities and have the attractive property that Bayes’ rule takes a particularly simple form when expressed in terms of them:

p(θ1|y) / p(θ2|y) = (p(θ1) / p(θ2)) × (p(y|θ1) / p(y|θ2)).

In words, the posterior odds are equal to the prior odds multiplied by the likelihood ratio, p(y|θ1)/p(y|θ2).

1.4   Discrete probability examples: genetics and spell checking

We next demonstrate Bayes’ theorem with two examples in which the immediate goal is inference about a particular discrete quantity rather than the estimation of a parameter that describes an entire population. These discrete examples allow us to see the prior, likelihood, and posterior probabilities directly.

Inference about a genetic status

Human males have one X-chromosome and one Y-chromosome, whereas females have two X-chromosomes, each chromosome being inherited from one parent. Hemophilia is a disease that exhibits X-chromosome-linked recessive inheritance, meaning that a male who inherits the gene that causes the disease on the X-chromosome is affected, whereas a female carrying the gene on only one of her two X-chromosomes is not affected. The disease is generally fatal for women who inherit two such genes, and this is rare, since the frequency of occurrence of the gene is low in human populations.

Prior distribution. Consider a woman who has an affected brother, which implies that her mother must be a carrier of the hemophilia gene with one ‘good’ and one ‘bad’ hemophilia gene. We are also told that her father is not affected; thus the woman herself has a fifty-fifty chance of having the gene. The unknown quantity of interest, the state of the woman, has just two values: the woman is either a carrier of the gene (θ = 1) or not (θ = 0). Based on the information provided thus far, the prior distribution for the unknown θ can be expressed simply as

Pr(θ = 1) = Pr(θ = 0) = 1/2.

Data model and likelihood. The data used to update the prior information consist of the affection status of the woman’s sons. Suppose she has two sons, neither of whom is affected. Let yi = 1 or 0 denote an affected or unaffected son, respectively. The outcomes of the two sons are exchangeable and, conditional on the unknown θ, are independent; we assume the sons are not identical twins. The two items of independent data generate the following likelihood function:

Pr(y1 = 0, y2 = 0 | θ = 1) = (0.5)(0.5) = 0.25
Pr(y1 = 0, y2 = 0 | θ = 0) = (1)(1) = 1.

These expressions follow from the fact that if the woman is a carrier, then each of her sons will have a 50% chance of inheriting the gene and so being affected, whereas if she is not a carrier then there is a probability close to 1 that a son of hers will be unaffected. (In fact, there is a nonzero probability of being affected even if the mother is not a carrier, but this risk—the mutation rate—is small and can be ignored for this example.)

Posterior distribution. Bayes’ rule can now be used to combine the information in the data with the prior probability; in particular, interest is likely to focus on the posterior probability that the woman is a carrier. Using y to denote the joint data (y1, y2), this is simply

Pr(θ = 1 | y) = p(y | θ = 1) Pr(θ = 1) / [p(y | θ = 1) Pr(θ = 1) + p(y | θ = 0) Pr(θ = 0)]
             = (0.25)(0.5) / [(0.25)(0.5) + (1.0)(0.5)] = 0.125/0.625 = 0.20.

Intuitively it is clear that if a woman has unaffected children, it is less probable that she is a carrier, and Bayes’ rule provides a formal mechanism for determining the extent of the correction. The results can also be described in terms of prior and posterior odds. The prior odds of the woman being a carrier are 0.5/0.5 = 1. The likelihood ratio based on the information about her two unaffected sons is 0.25/1 = 0.25, so the posterior odds are 1 · 0.25 = 0.25. Converting back to a probability, we obtain 0.25/(1+ 0.25) = 0.2, as before.

Adding more data. A key aspect of Bayesian analysis is the ease with which sequential analyses can be performed. For example, suppose that the woman has a third son, who is also unaffected. The entire calculation does not need to be redone; rather we use the previous posterior distribution as the new prior distribution, to obtain:

Pr(θ = 1 | y1, y2, y3) = (0.5)(0.20) / [(0.5)(0.20) + (1)(0.80)] = 0.111.

Alternatively, if we suppose that the third son is affected, it is easy to check that the posterior probability of the woman being a carrier becomes 1 (again ignoring the possibility of a mutation).
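
The same calculations, including the sequential update, take a few lines in R (values follow the text above):

prior <- c(carrier = 0.5, noncarrier = 0.5)         # Pr(theta = 1), Pr(theta = 0)
lik <- c(carrier = 0.5 * 0.5, noncarrier = 1 * 1)   # Pr(y1 = 0, y2 = 0 | theta)
post <- prior * lik / sum(prior * lik)              # Bayes' rule; post["carrier"] is 0.20
lik3 <- c(carrier = 0.5, noncarrier = 1)            # a third, unaffected son
post3 <- post * lik3 / sum(post * lik3)             # old posterior as new prior; 0.111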

Spelling correction

Classification of words is a problem of managing uncertainty. For example, suppose someone types ‘radom.’ How should that be read? It could be a misspelling or mistyping of ‘random’ or ‘radon’ or some other alternative, or it could be the intentional typing of ‘radom’ (as in its first use in this paragraph). What is the probability that ‘radom’ actually means random? If we label y as the data and θ as the word that the person was intending to type, then

p(θ | y = ‘radom’) ∝ p(θ) Pr(y = ‘radom’ | θ).   (1.5)

This product is the unnormalized posterior density. In this case, if for simplicity we consider only three possibilities for the intended word, θ (random, radon, or radom), we can compute the posterior probability of interest by first computing the unnormalized density for all three values of θ and then normalizing:

p(θ1 | y = ‘radom’) = p(θ1) Pr(‘radom’ | θ1) / [Σj p(θj) Pr(‘radom’ | θj)],   (1.6)

where θ1 = random, θ2 = radon, and θ3 = radom, and the sum in the denominator is over j = 1, 2, 3. The prior probabilities p(θj) can most simply come from frequencies of these words in some large database, ideally one that is adapted to the problem at hand (for example, a database of recent student emails if the word in question is appearing in such a document). The likelihoods p(y|θj) can come from some modeling of spelling and typing errors, perhaps fit using some study in which people were followed up after writing emails to identify any questionable words.

Prior distribution. Without any other context, it makes sense to assign the prior probabilities p(θj) based on the relative frequencies of these three words in some databases. Here are probabilities supplied by researchers at Google:

θ         p(θ)
random    7.60 × 10⁻⁵
radon     6.05 × 10⁻⁶
radom     3.12 × 10⁻⁷

Since we are considering only these possibilities, we could renormalize the three numbers to sum to 1 (p(random) = 0.923, etc.) but there is no need, as the adjustment would merely be absorbed into the proportionality constant in (1.6).

Returning to the table above, we were surprised to see the probability of ‘radom’ in the corpus being as high as it was. We looked up the word in Wikipedia and found that it is a medium-sized city: home to ‘the largest and best-attended air show in Poland … also the popular unofficial name for a semiautomatic 9 mm Para pistol of Polish design …’ For the documents that we encounter, the relative probability of ‘radom’ seems much too high. If the probabilities above do not seem appropriate for our application, this implies that we have prior information or beliefs that have not yet been included in the model. We shall return to this point after first working out the model’s implications for this example.

Likelihood. Here are some conditional probabilities from Google’s model of spelling and typing errors:

θ         Pr(‘radom’ | θ)
random    0.00193
radon     0.000143
radom     0.975

We emphasize that this likelihood function is not a probability distribution. Rather, it is a set of conditional probabilities of a particular outcome (‘radom’) from three different probability distributions, corresponding to three different possibilities for the unknown parameter θ.

These particular values look reasonable enough—a 97% chance that this particular five-letter word will be typed correctly, a 0.2% chance of obtaining this character string by mistakenly dropping a letter from ‘random,’ and a much lower chance of obtaining it by mistyping the final letter of ‘radon.’ We have no strong intuition about these probabilities and will trust the Google engineers here.

Posterior distribution. We multiply the prior probability and the likelihood to get joint probabilities and then renormalize to get posterior probabilities:

θ         p(θ) Pr(‘radom’ | θ)    p(θ | y = ‘radom’)
random    1.47 × 10⁻⁷            0.325
radon     8.65 × 10⁻¹⁰           0.002
radom     3.04 × 10⁻⁷            0.673

Thus, conditional on the model, the typed word ‘radom’ is about twice as likely to be correct as to be a typographical error for ‘random,’ and it is very unlikely to be a mistaken instance of ‘radon.’ A fuller analysis would include possibilities beyond these three words, but the basic idea is the same.
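
The table above can be reproduced by a few lines of R, using the prior and likelihood values quoted earlier:

prior <- c(random = 7.60e-5, radon = 6.05e-6, radom = 3.12e-7)   # p(theta)
lik <- c(random = 0.00193, radon = 0.000143, radom = 0.975)      # Pr('radom' | theta)
joint <- prior * lik                # unnormalized posterior, p(theta) Pr(y | theta)
round(joint / sum(joint), 3)        # posterior probabilities: 0.325, 0.002, 0.673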

Decision making, model checking, and model improvement. We can envision two directions to go from here. The first approach is to accept the two-thirds probability that the word was typed correctly or even to simply declare ‘radom’ as correct on first pass. The second option would be to question this probability by saying, for example, that ‘radom’ looks like a typo and that the estimated probability of it being correct seems much too high.

When we dispute the claims of a posterior distribution, we are saying that the model does not fit the data or that we have additional prior information not included in the model so far. In this case, we are only examining one word so lack of fit is not the issue; thus a dispute over the posterior must correspond to a claim of additional information, either in the prior or the likelihood.

For this problem we have no particular grounds on which to criticize the likelihood. The prior probabilities, on the other hand, are highly context dependent. The word ‘random’ is of course highly frequent in our own writing on statistics, ‘radon’ occurs occasionally (see Section 9.4), while ‘radom’ was entirely new to us. Our surprise at the high probability of ‘radom’ represents additional knowledge relevant to our particular problem.

The model can be elaborated most immediately by including contextual information in the prior probabilities. For example, if the document under study is a statistics book, then it becomes more likely that the person intended to type ‘random.’ If we label x as the contextual information used by the model, the Bayesian calculation then becomes,

p(θ | x, y) ∝ p(θ | x) p(y | θ, x).

To first approximation, we can simplify that last term to p(y|θ), so that the probability of any particular error (that is, the probability of typing a particular string y given the intended word θ) does not depend on context. This is not a perfect assumption but could reduce the burden of modeling and computation.

The practical challenges in Bayesian inference involve setting up models to estimate all these probabilities from data. At that point, as shown above, Bayes’ rule can be easily applied to determine the implications of the model for the problem at hand.

1.5   Probability as a measure of uncertainty

We have already used concepts such as probability density, and indeed we assume that the reader has a fair degree of familiarity with basic probability theory (although in Section 1.8 we provide a brief technical review of some probability calculations that often arise in Bayesian analysis). But since the uses of probability within a Bayesian framework are much broader than within non-Bayesian statistics, it is important to consider at least briefly the foundations of the concept of probability before considering more detailed statistical examples. We take for granted a common understanding on the part of the reader of the mathematical definition of probability: that probabilities are numerical quantities, defined on a set of ‘outcomes,’ that are nonnegative, additive over mutually exclusive outcomes, and sum to 1 over all possible mutually exclusive outcomes.

In Bayesian statistics, probability is used as the fundamental measure or yardstick of uncertainty. Within this paradigm, it is equally legitimate to discuss the probability of ‘rain tomorrow’ or of a Brazilian victory in the soccer World Cup as it is to discuss the probability that a coin toss will land heads. Hence, it becomes as natural to consider the probability that an unknown estimand lies in a particular range of values as it is to consider the probability that the mean of a random sample of 10 items from a known fixed population of size 100 will lie in a certain range. The first of these two probabilities is of more interest after data have been acquired whereas the second is more relevant beforehand. Bayesian methods enable statements to be made about the partial knowledge available (based on data) concerning some situation or ‘state of nature’ (unobservable or as yet unobserved) in a systematic way, using probability as the yardstick. The guiding principle is that the state of knowledge about anything unknown is described by a probability distribution.

What is meant by a numerical measure of uncertainty? For example, the probability of ‘heads’ in a coin toss is widely agreed to be 1/2. Why is this so? Two justifications seem to be commonly given:

1.  Symmetry or exchangeability argument:

probability = number of favorable cases / number of possibilities,

assuming equally likely possibilities. For a coin toss this is really a physical argument, based on assumptions about the forces at work in determining the manner in which the coin will fall, as well as the initial physical conditions of the toss.

2.  Frequency argument: probability = relative frequency obtained in a long sequence of tosses, assumed to be performed in an identical manner, physically independently of each other.

Both the above arguments are in a sense subjective, in that they require judgments about the nature of the coin and the tossing procedure, and both involve semantic arguments about the meaning of equally likely events, identical measurements, and independence. The frequency argument may be perceived to have certain special difficulties, in that it involves the hypothetical notion of a long sequence of identical tosses. If taken strictly, this point of view does not allow a statement of probability for a single coin toss that does not happen to be embedded, at least conceptually, in a long sequence of identical events.

The following examples illustrate how probability judgments can be increasingly subjective. First, consider the following modified coin experiment. Suppose that a particular coin is stated to be either double-headed or double-tailed, with no further information provided. Can one still talk of the probability of heads? It seems clear that in common parlance one certainly can. It is less clear, perhaps, how to assess this new probability, but many would agree on the same value of 1/2, perhaps based on the exchangeability of the labels ‘heads’ and ‘tails.’

Now consider some further examples. Suppose Colombia plays Brazil in soccer tomorrow: what is the probability of Colombia winning? What is the probability of rain tomorrow? What is the probability that Colombia wins, if it rains tomorrow? What is the probability that a specified rocket launch will fail? Although each of these questions seems reasonable in a common-sense way, it is difficult to contemplate strong frequency interpretations for the probabilities being referenced. Frequency interpretations can usually be constructed, however, and this is an extremely useful tool in statistics. For example, one can consider the future rocket launch as a sample from the population of potential launches of the same type, and look at the frequency of past launches that have failed (see the bibliographic note at the end of this chapter for more details on this example). Doing this sort of thing scientifically means creating a probability model (or, at least, a ‘reference set’ of comparable events), and this brings us back to a situation analogous to the simple coin toss, where we must consider the outcomes in question as exchangeable and thus equally likely.

Why is probability a reasonable way of quantifying uncertainty? The following reasons are often advanced.

1.  By analogy: physical randomness induces uncertainty, so it seems reasonable to describe uncertainty in the language of random events. Common speech uses many terms such as ‘probably’ and ‘unlikely,’ and it appears consistent with such usage to extend a more formal probability calculus to problems of scientific inference.

2.  Axiomatic or normative approach: related to decision theory, this approach places all statistical inference in the context of decision-making with gains and losses. Then reasonable axioms (ordering, transitivity, and so on) imply that uncertainty must be represented in terms of probability. We view this normative rationale as suggestive but not compelling.

3.  Coherence of bets. Define the probability p attached (by you) to an event E as the fraction (p ∈ [0, 1]) at which you would exchange (that is, bet) $p for a return of $1 if E occurs. That is, if E occurs, you gain $(1 − p); if the complement of E occurs, you lose $p. For example:

•  Coin toss: thinking of the coin toss as a fair bet suggests even odds corresponding to p = 1/2.

•  Odds for a game: if you are willing to bet on team A to win a game at 10 to 1 odds against team B (that is, you bet 1 to win 10), your ‘probability’ for team A winning is at least 1/11.

The principle of coherence states that your assignment of probabilities to all possible events should be such that it is not possible to make a definite gain by betting with you. It can be proved that probabilities constructed under this principle must satisfy the basic axioms of probability theory.

The betting rationale has some fundamental difficulties:

•  Exact odds are required, on which you would be willing to bet in either direction, for all events. How can you assign exact odds if you are not sure?

•  If a person is willing to bet with you, and has information you do not, it might not be wise for you to take the bet. In practice, probability is an incomplete (necessary but not sufficient) guide to betting.

All of these considerations suggest that probabilities may be a reasonable approach to summarizing uncertainty in applied statistics, but the ultimate proof is in the success of the applications. The remaining chapters of this book demonstrate that probability provides a rich and flexible framework for handling uncertainty in statistical applications.

Subjectivity and objectivity

All statistical methods that use probability are subjective in the sense of relying on mathematical idealizations of the world. Bayesian methods are sometimes said to be especially subjective because of their reliance on a prior distribution, but in most problems, scientific judgment is necessary to specify both the ‘likelihood’ and the ‘prior’ parts of the model. For example, linear regression models are generally at least as suspect as any prior distribution that might be assumed about the regression parameters. A general principle is at work here: whenever there is replication, in the sense of many exchangeable units observed, there is scope for estimating features of a probability distribution from data and thus making the analysis more ‘objective.’ If an experiment as a whole is replicated several times, then the parameters of the prior distribution can themselves be estimated from data, as discussed in Chapter 5. In any case, however, certain elements requiring scientific judgment will remain, notably the choice of data included in the analysis, the parametric forms assumed for the distributions, and the ways in which the model is checked.

1.6   Example of probability assignment: football point spreads

As an example of how probabilities might be assigned using empirical data and plausible substantive assumptions, we consider methods of estimating the probabilities of certain outcomes in professional (American) football games. This is an example only of probability assignment, not of Bayesian inference. A number of approaches to assigning probabilities for football game outcomes are illustrated: making subjective assessments, using empirical probabilities based on observed data, and constructing a parametric probability model.

Figure 1.1  Scatterplot of actual outcome vs. point spread for each of 672 professional football games. The x and y coordinates are jittered by adding uniform random numbers to each point’s coordinates (between −0.1 and 0.1 for the x coordinate; between −0.2 and 0.2 for the y coordinate) in order to display multiple values but preserve the discrete-valued nature of each.

Football point spreads and game outcomes

Football experts provide a point spread for every football game as a measure of the difference in ability between the two teams. For example, team A might be a 3.5-point favorite to defeat team B. The implication of this point spread is that the proposition that team A, the favorite, defeats team B, the underdog, by 4 or more points is considered a fair bet; in other words, the probability that A wins by more than 3.5 points is 1/2. If the point spread is an integer, then the implication is that team A is as likely to win by more points than the point spread as it is to win by fewer points than the point spread (or to lose); there is positive probability that A will win by exactly the point spread, in which case neither side is paid off. The assignment of point spreads is itself an interesting exercise in probabilistic reasoning; one interpretation is that the point spread is the median of the distribution of the gambling population’s beliefs about the possible outcomes of the game. For the rest of this example, we treat point spreads as given and do not worry about how they were derived.

The point spread and actual game outcome for 672 professional football games played during the 1981, 1983, and 1984 seasons are graphed in Figure 1.1. (Much of the 1982 season was canceled due to a labor dispute.) Each point in the scatterplot displays the point spread, x, and the actual outcome (favorite’s score minus underdog’s score), y. (In games with a point spread of zero, the labels ‘favorite’ and ‘underdog’ were assigned at random.) A small random jitter is added to the x and y coordinate of each point on the graph so that multiple points do not fall exactly on top of each other.

Assigning probabilities based on observed frequencies

It is of interest to assign probabilities to particular events: Pr(favorite wins), Pr(favorite wins | point spread is 3.5 points), Pr(favorite wins by more than the point spread), Pr(favorite wins by more than the point spread | point spread is 3.5 points), and so forth. We might report a subjective probability based on informal experience gathered by reading the newspaper and watching football games. The probability that the favored team wins a game should certainly be greater than 0.5, perhaps between 0.6 and 0.75. More complex events require more intuition or knowledge on our part. A more systematic approach is to assign probabilities based on the data in Figure 1.1. Counting a tied game as one-half win and one-half loss, and ignoring games for which the point spread is zero (and thus there is no favorite), we obtain empirical estimates such as:

Figure 1.2  (a) Scatterplot of (actual outcome − point spread) vs. point spread for each of 672 professional football games (with uniform random jitter added to x and y coordinates). (b) Histogram of the differences between the game outcome and the point spread, with the N(0, 14²) density superimposed.

•  Pr(favorite wins) = 410.5/655 = 0.63

•  Pr(favorite wins | x = 3.5) = 36/59 = 0.61

•  Pr(favorite wins by more than the point spread) = 308/655 = 0.47

•  Pr(favorite wins by more than the point spread | x = 3.5) = 32/59 = 0.54.

These empirical probability assignments all seem sensible in that they match the intuition of knowledgeable football fans. However, such probability assignments are problematic for events with few directly relevant data points. For example, 8.5-point favorites won five out of five times during this three-year period, whereas 9-point favorites won thirteen out of twenty times. However, we realistically expect the probability of winning to be greater for a 9-point favorite than for an 8.5-point favorite. The small sample size with point spread 8.5 leads to imprecise probability assignments. We consider an alternative method using a parametric model.

A parametric model for the difference between outcome and point spread

Figure 1.2a displays the differences y − x between the observed game outcome and the point spread, plotted versus the point spread, for the games in the football dataset. (Once again, random jitter was added to both coordinates.) This plot suggests that it may be roughly reasonable to model the distribution of y − x as independent of x. (See Exercise 6.10.) Figure 1.2b is a histogram of the differences y − x for all the football games, with a fitted normal density superimposed. This plot suggests that it may be reasonable to approximate the marginal distribution of the random variable d = y − x by a normal distribution. The sample mean of the 672 values of d is 0.07, and the sample standard deviation is 13.86, suggesting that the results of football games are approximately normal with mean equal to the point spread and standard deviation nearly 14 points (two converted touchdowns). For the remainder of the discussion we take the distribution of d to be independent of x and normal with mean zero and standard deviation 14 for each x; that is,

p(d | x) = N(d | 0, 14²),

as displayed in Figure 1.2b. The assigned probability model is not perfect: it does not fit the data exactly, and, as is often the case with real data, neither football scores nor point spreads are continuous-valued quantities.

Assigning probabilities using the parametric model

Nevertheless, the model provides a convenient approximation that can be used to assign probabilities to events. If d has a normal distribution with mean zero and is independent of the point spread, then the probability that the favorite wins by more than the point spread is 1/2, conditional on any value of the point spread, and therefore unconditionally as well. Denoting probabilities obtained by the normal model as Prnorm, the probability that an x-point favorite wins the game can be computed, assuming the normal model, as follows:

Prnorm(y > 0 | x) = Prnorm(d > −x | x) = 1 − Φ(−x/14) = Φ(x/14),

where Φ is the standard normal cumulative distribution function. For example,

•  Prnorm(favorite wins | x = 3.5) = 0.60

•  Prnorm(favorite wins | x = 8.5) = 0.73

•  Prnorm(favorite wins | x = 9.0) = 0.74.

The probability for a 3.5-point favorite agrees with the empirical value given earlier, whereas the probabilities for 8.5- and 9-point favorites make more intuitive sense than the empirical values based on small samples.
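
These values come directly from the standard normal cumulative distribution function; in R, pnorm plays the role of Φ, so each probability above is a one-line check:

pnorm(3.5 / 14)   # Prnorm(favorite wins | x = 3.5) = 0.60
pnorm(8.5 / 14)   # Prnorm(favorite wins | x = 8.5) = 0.73
pnorm(9.0 / 14)   # Prnorm(favorite wins | x = 9.0) = 0.74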

1.7   Example: estimating the accuracy of record linkage

We emphasize the essentially empirical (not ‘subjective’ or ‘personal’) nature of probabilities with another example in which they are estimated from data.

Record linkage refers to the use of an algorithmic technique to identify records from different databases that correspond to the same individual. Record-linkage techniques are used in a variety of settings. The work described here was formulated and first applied in the context of record linkage between the U.S. Census and a large-scale post-enumeration survey, which is the first step of an extensive matching operation conducted to evaluate census coverage for subgroups of the population. The goal of this first step is to declare as many records as possible ‘matched’ by computer without an excessive rate of error, thereby avoiding the cost of the resulting manual processing for all records not declared ‘matched.’

Existing methods for assigning scores to potential matches

Much attention has been paid in the record-linkage literature to the problem of assigning ‘weights’ to individual fields of information in a multivariate record and obtaining a composite ‘score,’ which we call y, that summarizes the closeness of agreement between two records. Here, we assume that this step is complete in the sense that these rules have been chosen. The next step is the assignment of candidate matched pairs, where each pair of records consists of the best potential match for each other from the respective databases. The specified weighting rules then order the candidate matched pairs. In the motivating problem at the Census Bureau, a binary choice is made between the alternatives ‘declare matched’ vs. ‘send to followup,’ where a cutoff score is needed above which records are declared matched. The false-match rate is then defined as the number of falsely matched pairs divided by the number of declared matched pairs.

Particularly relevant for any such decision problem is an accurate method for assessing the probability that a candidate matched pair is a correct match as a function of its score. Simple methods exist for converting the scores into probabilities, but these lead to extremely inaccurate, typically grossly optimistic, estimates of false-match rates. For example, a manual check of a set of records with nominal false-match probabilities ranging from 10⁻³ to 10⁻⁷ (that is, pairs deemed almost certain to be matches) found actual false-match rates closer to the 1% range. Records with nominal false-match probabilities of 1% had an actual false-match rate of 5%.

Figure 1.3  Histograms of weight scores y for true and false matches in a sample of records from the 1988 test Census. Most of the matches in the sample are true (because a pre-screening process has already picked these as the best potential match for each case), and the two distributions are mostly, but not completely, separated.

We would like to use Bayesian methods to recalibrate these to obtain objective probabilities of matching for a given decision rule—in the same way that in the football example, we used past data to estimate the probabilities of different game outcomes conditional on the point spread. Our approach is to work with the scores y and empirically estimate the probability of a match as a function of y.

Estimating match probabilities empirically

We obtain accurate match probabilities using mixture modeling, a topic we discuss in detail in Chapter 22. The distribution of previously obtained scores for the candidate matches is considered a ‘mixture’ of a distribution of scores for true matches and a distribution for non-matches. The parameters of the mixture model are estimated from the data. The estimated parameters allow us to calculate an estimate of the probability of a false match (a pair declared matched that is not a true match) for any given decision threshold on the scores. In the procedure that was actually used, some elements of the mixture model (for example, the optimal transformation required to allow a mixture of normal distributions to apply) were fit using ‘training’ data with known match status (separate from the data to which we apply our calibration procedure), but we do not describe those details here. Instead we focus on how the method would be used with a set of data with unknown match status.

Support for this approach is provided in Figure 1.3, which displays the distribution of scores for the matches and non-matches in a particular dataset obtained from 2300 records from a ‘test Census’ survey conducted in a single local area two years before the 1990 Census. The two distributions, p(y|match) and p(y|non-match), are mostly distinct—meaning that in most cases it is possible to identify a candidate as a match or not given the score alone—but with some overlap.

Figure 1.4  Lines show expected false-match rate (and 95% bounds) as a function of the proportion of cases declared matches, based on the mixture model for record linkage. Dots show the actual false-match rate for the data.

In our application dataset, we do not know the match status. Thus we are faced with a single combined histogram from which we estimate the two component distributions and the proportion of the population of scores that belong to each component. Under the mixture model, the distribution of scores can be written as,

p(y) = Pr(match) p(y|match) + Pr(non-match) p(y|non-match).   (1.7)

The mixture probability (Pr(match)) and the parameters of the distributions of matches (p(y|match)) and non-matches (p(y|non-match)) are estimated using the mixture model approach (as described in Chapter 22) applied to the combined histogram from the data with unknown match status.

To use the method to make record-linkage decisions, we construct a curve giving the false-match rate as a function of the decision threshold, the score above which pairs will be ‘declared’ a match. For a given decision threshold, the probability distributions in (1.7) can be used to estimate the probability of a false match, a score y above the threshold originating from the distribution p(y|non-match). The lower the threshold, the more pairs we will declare as matches. As we declare more matches, the proportion of errors increases. The approach described here should provide an objective error estimate for each threshold. (See the validation in the next paragraph.) Then a decision maker can determine the threshold that provides an acceptable balance between the goals of declaring more matches automatically (thus reducing the clerical labor) and making fewer mistakes.
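
To sketch the false-match-rate calculation in R, suppose (purely as illustrative stand-ins, not the fitted Census values) that scores are distributed N(5, 1) for true matches and N(0, 1) for non-matches, with Pr(match) = 0.9. Then the curve of false-match rate against proportion declared matched can be traced as follows:

lambda <- 0.9                                         # assumed Pr(match)
cutoff <- seq(-2, 8, 0.1)                             # candidate decision thresholds
p_match <- pnorm(cutoff, 5, 1, lower.tail = FALSE)    # Pr(y > cutoff | match)
p_non <- pnorm(cutoff, 0, 1, lower.tail = FALSE)      # Pr(y > cutoff | non-match)
declared <- lambda * p_match + (1 - lambda) * p_non   # proportion declared matched, from (1.7)
fmr <- (1 - lambda) * p_non / declared                # false-match rate among declared matches
plot(declared, fmr, type = "l",
     xlab = "proportion declared matches", ylab = "false-match rate")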

External validation of the probabilities using test data

The approach described above was externally validated using data for which the match status is known. The method was applied to data from three different locations of the 1988 test Census, and so three tests of the methods were possible. We provide detailed results for one; results for the other two were similar. The mixture model was fitted to the scores of all the candidate pairs at a test site. Then the estimated model was used to create the lines in Figure 1.4, which show the expected false-match rate (and uncertainty bounds) in terms of the proportion of cases declared matched, as the threshold varies from high (thus allowing no matches) to low (thus declaring almost all the candidate pairs to be matches). The false-match proportion is an increasing function of the number of declared matches, which makes sense: as we move rightward on the graph, we are declaring weaker and weaker cases to be matches.

Figure 1.5  Expansion of Figure 1.4 in the region where the estimated and actual match rates change rapidly. In this case, it would seem a good idea to match about 88% of the cases and send the rest to followup.

The lines on Figure 1.4 display the expected proportion of false matches and 95% posterior bounds for the false-match rate as estimated from the model. (These bounds give the estimated range within which there is 95% posterior probability that the false-match rate lies. The concept of posterior intervals is discussed in more detail in the next chapter.) The dots in the graph display the actual false-match proportions, which track well with the model. In particular, the model would suggest a recommendation of declaring something less than 90% of cases as matched and giving up on the other 10% or so, so as to avoid most of the false matches, and the dots show a similar pattern.

It is clearly possible to match large proportions of the files with little or no error. Also, the quality of candidate matches becomes dramatically worse at some point where the false-match rate accelerates. Figure 1.5 takes a magnifying glass to the previous display to highlight the behavior of the calibration procedure in the region of interest where the false-match rate accelerates. The predicted false-match rate curves bend upward, close to the points where the observed false-match rate curves rise steeply, which is a particularly encouraging feature of the calibration method. The calibration procedure performs well from the standpoint of providing predicted probabilities that are close to the true probabilities and interval estimates that are informative and include the true values. By comparison, the original estimates of match probabilities, constructed by multiplying weights without empirical calibration, were highly inaccurate.

1.8   Some useful results from probability theory

We assume the reader is familiar with elementary manipulations involving probabilities and probability distributions. In particular, basic probability background that must be well understood for key parts of the book includes the manipulation of joint densities, the definition of simple moments, the transformation of variables, and methods of simulation. In this section we briefly review these assumed prerequisites and clarify some further notational conventions used in the remainder of the book. Appendix A provides information on some commonly used probability distributions.

As introduced in Section 1.3, we generally represent joint distributions by their joint probability mass or density function, with dummy arguments reflecting the name given to each variable being considered. Thus for two quantities u and v, we write the joint density as p(u, v); if specific values need to be referenced, this notation will be further abused as with, for example, p(u, v = 1).

In Bayesian calculations relating to a joint density p(u, v), we will often refer to a conditional distribution or density function such as p(u|v) and a marginal density such as p(u) = ∫ p(u, v) dv. In this notation, either or both u and v can be vectors. Typically it will be clear from the context that the range of integration in the latter expression refers to the entire range of the variable being integrated out. It is also often useful to factor a joint density as a product of marginal and conditional densities; for example, p(u, v, w) = p(u|v, w) p(v|w) p(w).

Some authors use different notations for distributions on parameters and observables—for example, π(θ) for a prior distribution and f(y|θ) for a sampling distribution—but this obscures the fact that all probability distributions have the same logical status in Bayesian inference. We must always be careful, though, to indicate appropriate conditioning; for example, p(y|θ) is different from p(y). In the interests of conciseness, however, our notation hides the conditioning on hypotheses that hold throughout—no probability judgments can be made in a vacuum—and to be more explicit one might use a notation such as the following:

p(θ, y | H) = p(θ | H) p(y | θ, H),

where H refers to the set of hypotheses or assumptions used to define the model. Also, we sometimes suppress explicit conditioning on known explanatory variables, x.

We use the standard notations, E(·) and var(·), for mean and variance, respectively:

E(u) = ∫ u p(u) du,    var(u) = ∫ (u − E(u))² p(u) du.

For a vector parameter u, the expression for the mean is the same, and the covariance matrix is defined as

var(u) = ∫ (u − E(u)) (u − E(u))ᵀ p(u) du,

where u is considered a column vector. (We use the terms ‘variance matrix’ and ‘covariance matrix’ interchangeably.) This notation is slightly imprecise, because E(u) and var(u) are really functions of the distribution function, p(u), not of the variable u. In an expression involving an expectation, any variable that does not appear explicitly as a conditioning variable is assumed to be integrated out in the expectation; for example, E(u|v) refers to the conditional expectation of u with v held fixed—that is, the conditional expectation as a function of v—whereas E(u) is the expectation of u, averaging over v (as well as u).

Modeling using conditional probability

Useful probability models often express the distribution of observables conditionally or hierarchically rather than through more complicated unconditional distributions. For example, suppose y is the height of a university student selected at random. The marginal distribution p(y) is (essentially) a mixture of two approximately normal distributions centered around 160 and 175 centimeters. A more useful description of the distribution of y would be based on the joint distribution of height and sex: p(male) ≈ p(female) ≈ 1/2, along with the conditional specifications that p(yfemale) and p(ymale) are each approximately normal with means 160 and 175 cm, respectively. If the conditional variances are not too large, the marginal distribution of y is bimodal. In general, we prefer to model complexity with a hierarchical structure using additional variables rather than with complicated marginal distributions, even when the additional variables are unobserved or even unobservable; this theme underlies mixture models, as discussed in Chapter 22. We repeatedly return to the theme of conditional modeling throughout the book.
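
The height example can be simulated directly from its conditional specification; in this R sketch, the within-group standard deviation of 5 cm is an illustrative assumption not given in the text:

n <- 1e5
sex <- rbinom(n, 1, 0.5)   # Pr(male) = Pr(female) = 1/2; 1 codes male
y <- rnorm(n, mean = ifelse(sex == 1, 175, 160), sd = 5)   # conditional normals
hist(y, breaks = 60)       # the marginal distribution of y is bimodal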

Means and variances of conditional distributions

It is often useful to express the mean and variance of a random variable u in terms of the conditional mean and variance given some related quantity v. The mean of u can be obtained by averaging the conditional mean over the marginal distribution of v,

    E(u) = E(E(u|v)),    (1.8)

where the inner expectation averages over u, conditional on v, and the outer expectation averages over v. Identity (1.8) is easy to derive by writing the expectation in terms of the joint distribution of u and v and then factoring the joint distribution:

    E(u) = ∫∫ u p(u, v) du dv = ∫∫ u p(u|v) du p(v) dv = ∫ E(u|v) p(v) dv = E(E(u|v)).

The corresponding result for the variance includes two terms, the mean of the conditional variance and the variance of the conditional mean:

    var(u) = E(var(u|v)) + var(E(u|v)).    (1.9)

This result can be derived by expanding the terms on the right side of (1.9):

    E(var(u|v)) + var(E(u|v)) = E(E(u²|v) - (E(u|v))²) + E((E(u|v))²) - (E(E(u|v)))²
                              = E(u²) - E((E(u|v))²) + E((E(u|v))²) - (E(u))²
                              = E(u²) - (E(u))²
                              = var(u).

Identities (1.8) and (1.9) also hold if u is a vector, in which case E(u) is a vector and var(u) a matrix.
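These identities are easily checked by simulation; here is a minimal R sketch for an assumed example in which v ~ N(0, 1) and u|v ~ N(v, 1), so that E(u|v) = v and var(u|v) = 1:

    # Check E(u) = E(E(u|v)) and var(u) = E(var(u|v)) + var(E(u|v))
    v <- rnorm(1e6)                # v ~ N(0, 1)
    u <- rnorm(1e6, mean = v)      # u|v ~ N(v, 1)
    c(mean(u), mean(v))            # E(u) and E(E(u|v)): both approx 0
    c(var(u), 1 + var(v))          # var(u) and E(var(u|v)) + var(E(u|v)): both approx 2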

Transformation of variables

It is common to transform a probability distribution from one parameterization to another. We review the basic result here for a probability density on a transformed space. For clarity, we use subscripts here instead of our usual generic notation, p(·). Suppose pu(u) is the density of the vector u, and we transform to v = f(u), where v has the same number of components as u.

If pu is a discrete distribution, and f is a one-to-one function, then the density of v is given by

    pv(v) = pu(f⁻¹(v)).

If f is a many-to-one function, then a sum of terms appears on the right side of this expression for pv(v), with one term corresponding to each of the branches of the inverse function.

If pu is a continuous distribution, and v = f(u) is a one-to-one transformation, then the joint density of the transformed vector is

    pv(v) = |J| pu(f⁻¹(v)),

where |J| is the absolute value of the determinant of the Jacobian of the transformation u = f⁻¹(v) as a function of v; the Jacobian J is the square matrix of partial derivatives (with dimension given by the number of components of u), with the (i, j)th entry equal to ∂ui/∂vj. Once again, if f is many-to-one, then pv(v) is a sum or integral of terms.

In one dimension, we commonly use the logarithm to transform the parameter space from (0, ∞) to (-∞, ∞). When working with parameters defined on the open unit interval, (0, 1), we often use the logistic transformation:

    v = logit(u) = log(u/(1 - u)),

whose inverse transformation is

    u = logit⁻¹(v) = e^v/(1 + e^v).

Another common choice is the probit transformation, Φ⁻¹(u), where Φ is the standard normal cumulative distribution function, to transform from (0, 1) to (-∞, ∞).
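The Jacobian calculation can be verified numerically. The following R sketch (with u ~ Beta(2, 2), an arbitrary choice) transforms random draws to the logit scale and overlays the density implied by the transformation formula:

    logit     <- function(u) log(u / (1 - u))
    inv_logit <- function(v) exp(v) / (1 + exp(v))
    v <- logit(rbeta(1e5, 2, 2))           # draws of v = logit(u)
    hist(v, breaks = 100, freq = FALSE)
    # density of v:  pv(v) = pu(inv_logit(v)) |du/dv|,
    # where |du/dv| = inv_logit(v) * (1 - inv_logit(v))
    curve(dbeta(inv_logit(x), 2, 2) * inv_logit(x) * (1 - inv_logit(x)),
          add = TRUE)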

1.9   Computation and software

At the time of writing, the authors rely primarily on the software package R for graphs and basic simulations, fitting of classical simple models (including regression, generalized linear models, and nonparametric methods such as locally weighted regression), optimization, and some simple programming. We use the Bayesian inference package Stan (see Appendix C) for fitting most models, but for teaching purposes in this book we describe how to perform most of the computations from first principles. Even when using Stan, we typically work within R to plot and transform the data before model fitting, and to display inferences and model checks afterwards.

Specific computational tasks that arise in Bayesian data analysis include:

•  Vector and matrix manipulations (see Table 1.1)

•  Computing probability density functions (see Appendix A)

•  Drawing simulations from probability distributions (see Appendix A for standard distributions and Exercise 1.9 for an example of a simple stochastic process)

•  Structured programming (including looping and customized functions)

•  Calculating the linear regression estimate and variance matrix (see Chapter 14)

•  Graphics, including scatterplots with overlain lines and multiple graphs per page (see Chapter 6 for examples).

Our general approach to computation is to fit many models, gradually increasing the complexity. We do not recommend the strategy of writing a model and then letting the computer run overnight to estimate it perfectly. Rather, we prefer to fit each model relatively quickly, using inferences from the previously fitted simpler models as starting values, and displaying inferences and comparing to data before continuing.

We discuss computation in detail in Part III of this book after first introducing the fundamental concepts of Bayesian modeling, inference, and model checking. Appendix C illustrates how to perform computations in R and Stan in several different ways for a single example.

Summarizing inferences by simulation

Simulation forms a central part of much applied Bayesian analysis, because of the relative ease with which samples can often be generated from a probability distribution, even when the density function cannot be explicitly integrated. In performing simulations, it is helpful to consider the duality between a probability density function and a histogram of a set of random draws from the distribution: given a large enough sample, the histogram can provide practically complete information about the density, and in particular, various sample moments, percentiles, and other summary statistics provide estimates of any aspect of the distribution, to a level of precision that can be estimated. For example, to estimate the 95th percentile of the distribution of θ, draw a random sample of size S from p(θ) and use the 0.95Sth order statistic. For most purposes, S = 1000 is adequate for estimating the 95th percentile in this way.
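In R this calculation takes one line; in this minimal sketch a standard normal distribution stands in for p(θ):

    S <- 1000
    theta <- rnorm(S)         # stand-in for S draws from p(theta)
    sort(theta)[950]          # the 0.95*S-th order statistic; compare qnorm(0.95) = 1.64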

Another advantage of simulation is that extremely large or small simulated values often flag a problem with model specification or parameterization (for example, see Figure 4.2) that might not be noticed if estimates and probability statements were obtained in analytic form.

Generating values from a probability distribution is often straightforward with modern computing techniques based on (pseudo)random number sequences. A well-designed pseudorandom number generator yields a deterministic sequence that appears to have the same properties as a sequence of independent random draws from the uniform distribution on [0, 1]. Appendix A describes methods for drawing random samples from some commonly used distributions.

Sampling using the inverse cumulative distribution function

As an introduction to the ideas of simulation, we describe a method for sampling from discrete and continuous distributions using the inverse cumulative distribution function. The cumulative distribution function, or cdf, F, of a one-dimensional distribution, p(v), is defined by

    F(v*) = Pr(v ≤ v*),

which is the sum of p(v) over all values v ≤ v* if the distribution is discrete, or the integral of p(v) from -∞ to v* if it is continuous.

The inverse cdf can be used to obtain random samples from the distribution p, as follows. First draw a random value, U, from the uniform distribution on [0, 1], using a table of random numbers or, more likely, a random number function on the computer. Now let v = F⁻¹(U). The function F is not necessarily one-to-one—certainly not if the distribution is discrete—but F⁻¹(U) is unique with probability 1. The value v will be a random draw from p, and is easy to compute as long as F⁻¹(U) is simple. For a discrete distribution, F⁻¹ can simply be tabulated.

For a continuous example, suppose v has an exponential distribution with parameter λ (see Appendix A); then its cdf is F(v) = 1 - e^{-λv}, and the value of v at which F(v) = U is v = -log(1 - U)/λ. Then, recognizing that 1 - U also has the uniform distribution on [0, 1], we see that we can obtain random draws from the exponential distribution as -log(U)/λ. We discuss other methods of simulation in Part III of the book and Appendix A.
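Here is a minimal R sketch of both cases; the rate λ = 2 and the three-point discrete distribution are arbitrary choices for illustration:

    # Continuous case: exponential draws via the inverse cdf
    lambda <- 2
    U <- runif(1e5)               # uniform draws on [0, 1]
    v <- -log(U) / lambda         # draws from Exponential(lambda)
    c(mean(v), 1 / lambda)        # sample mean approx theoretical mean
    # Discrete case: tabulate the cdf and invert by lookup
    x <- c(0, 1, 2); p <- c(0.5, 0.3, 0.2)
    draws <- x[findInterval(runif(1e5), cumsum(p)) + 1]
    table(draws) / 1e5            # relative frequencies approx p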

Simulation of posterior and posterior predictive quantities

In practice, we are most often interested in simulating draws from the posterior distribution of the model parameters θ, and perhaps from the posterior predictive distribution of unknown observables ỹ. Results from a set of S simulation draws can be stored in the computer in an array, as illustrated in Table 1.1. We use the notation s = 1, …, S to index simulation draws; (θ^s, ỹ^s) is the corresponding joint draw of parameters and predicted quantities from their joint posterior distribution.

Table 1.1  Structure of posterior and posterior predictive simulations. The superscripts are indexes, not powers.

    Simulation    Parameters               Predictive quantities
    draw s        θ1      …     θk         ỹ1      …     ỹn
    1             θ1^1    …     θk^1       ỹ1^1    …     ỹn^1
    ⋮             ⋮              ⋮          ⋮              ⋮
    S             θ1^S    …     θk^S       ỹ1^S    …     ỹn^S

From these simulated values, we can estimate the posterior distribution of any quantity of interest, such as θ1/θ3, by just computing a new column in Table 1.1 using the existing S draws of (θ, ỹ). We can estimate the posterior probability of any event, such as Pr(ỹ1 + ỹ2 > e^{θ1}), by the proportion of the S simulations for which it is true. We are often interested in posterior intervals; for example, the central 95% posterior interval [a, b] for the parameter θj, for which Pr(θj < a) = 0.025 and Pr(θj > b) = 0.025. These values can be directly estimated by the appropriate simulated values of θj, for example, the 25th and 976th order statistics if S = 1000. We commonly summarize inferences by 50% and 95% intervals.
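In R, with the S draws stored as rows of a matrix as in Table 1.1, these summaries are immediate. The draws below are arbitrary stand-ins for an actual posterior simulation:

    S <- 1000
    sims <- cbind(theta1 = rnorm(S, 1, 0.2),
                  theta3 = rnorm(S, 2, 0.2))     # stand-in posterior draws
    ratio <- sims[, "theta1"] / sims[, "theta3"] # draws of theta1/theta3
    sort(ratio)[c(25, 976)]                      # central 95% posterior interval
    mean(ratio > 0.5)                            # posterior probability of an event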

We return to the accuracy of simulation inferences in Section 10.5 after we have gained some experience using simulations of posterior distributions in some simple examples.

1.10   Bayesian inference in applied statistics

A pragmatic rationale for the use of Bayesian methods is the inherent flexibility introduced by their incorporation of multiple levels of randomness and the resultant ability to combine information from different sources, while incorporating all reasonable sources of uncertainty in inferential summaries. Such methods naturally lead to smoothed estimates in complicated data structures and consequently have the ability to obtain better real-world answers.

Another reason for focusing on Bayesian methods is more psychological, and involves the relationship between the statistician and the client or specialist in the subject matter area who is the consumer of the statistician’s work. In many practical cases, clients will interpret interval estimates provided by statisticians as Bayesian intervals, that is, as probability statements about the likely values of unknown quantities conditional on the evidence in the data. Such direct probability statements require prior probability specifications for unknown quantities (or more generally, probability models for vectors of unknowns), and thus the kinds of answers clients will assume are being provided by statisticians, Bayesian answers, require full probability models—explicit or implicit.

Finally, Bayesian inferences are conditional on probability models that invariably contain approximations in their attempt to represent complicated real-world relationships. If the Bayesian answers vary dramatically over a range of scientifically reasonable assumptions that are unassailable by the data, then the resultant range of possible conclusions must be entertained as legitimate, and we believe that the statistician has the responsibility to make the client aware of this fact.

In this book, we focus on the construction of models (especially hierarchical ones, as discussed in Chapter 5 onward) to relate complicated data structures to scientific questions, checking the fit of such models, and investigating the sensitivity of conclusions to reasonable modeling assumptions. From this point of view, the strength of the Bayesian approach lies in (1) its ability to combine information from multiple sources (thereby in fact allowing greater ‘objectivity’ in final conclusions), and (2) its more encompassing accounting of uncertainty about the unknowns in a statistical problem.

Other important themes, many of which are common to much modern applied statistical practice, whether formally Bayesian or not, are the following:

•  a willingness to use many parameters

•  hierarchical structuring of models, which is the essential tool for achieving partial pooling of estimates and compromising in a scientific way between alternative sources of information

•  model checking—not only by examining the internal goodness of fit of models to observed and possible future data, but also by comparing inferences about estimands and predictions of interest to substantive knowledge

•  an emphasis on inference in the form of distributions or at least interval estimates rather than simple point estimates

•  the use of simulation as the primary method of computation; the modern computational counterpart to a ‘joint probability distribution’ is a set of randomly drawn values, and a key tool for dealing with missing data is the method of multiple imputation (computation and multiple imputation are discussed in more detail in later chapters)

•  the use of probability models as tools for understanding and possibly improving data-analytic techniques that may not explicitly invoke a Bayesian model

•  the importance of including in the analysis as much background information as possible, so as to approximate the goal that data can be viewed as a random sample, conditional on all the variables in the model

•  the importance of designing studies to have the property that inferences for estimands of interest will be robust to model assumptions.

1.11   Bibliographic note

Several good introductory books have been written on Bayesian statistics, beginning with Lindley (1965), and continuing through Hoff (2009). Berry (1996) presents, from a Bayesian perspective, many of the standard topics for an introductory statistics textbook. Gill (2002) and Jackman (2009) introduce applied Bayesian statistics for social scientists, Kruschke (2011) introduces Bayesian methods for psychology researchers, and Christensen et al. (2010) supply a general introduction. Carlin and Louis (2008) cover the theory and applications of Bayesian inference, focusing on biological applications and connections to classical methods. Some resources for teaching Bayesian statistics include Sedlmeier and Gigerenzer (2001) and Gelman (1998, 2008b).

The bibliographic notes at the ends of the chapters in this book refer to a variety of specific applications of Bayesian data analysis. Several review articles in the statistical literature, such as Breslow (1990) and Racine et al. (1986), have appeared that discuss, in general terms, areas of application in which Bayesian methods have been useful. The volumes edited by Gatsonis et al. (1993–2002) are collections of Bayesian analyses, including extensive discussions about choices in the modeling process and the relations between the statistical methods and the applications.

The foundations of probability and Bayesian statistics are an important topic that we treat only briefly. Bernardo and Smith (1994) give a thorough review of the foundations of Bayesian models and inference with a comprehensive list of references. Jeffreys (1961) is a self-contained book about Bayesian statistics that comprehensively presents an inductive view of inference; Good (1950) is another important early work. Jaynes (1983) is a collection of reprinted articles that present a deductive view of Bayesian inference that we believe is similar to ours. Both Jeffreys and Jaynes focus on applications in the physical sciences. Jaynes (2003) focuses on connections between statistical inference and the philosophy of science and includes several examples of physical probability.

Gigerenzer and Hoffrage (1995) discuss the connections between Bayesian probability and frequency probabilities from a perspective similar to ours, and provide evidence that people can typically understand and compute best with probabilities that are expressed in the form of relative frequency. Gelman (1998) presents some classroom activities for teaching Bayesian ideas.

De Finetti (1974) is an influential work that focuses on the crucial role of exchangeability. More approachable discussions of the role of exchangeability in Bayesian inference are provided by Lindley and Novick (1981) and Rubin (1978a, 1987a). The non-Bayesian article by Draper et al. (1993) makes an interesting attempt to explain how exchangeable probability models can be justified in data analysis. Berger and Wolpert (1984) give a comprehensive discussion and review of the likelihood principle, and Berger (1985, Sections 1.6, 4.1, and 4.12) reviews a range of philosophical issues from the perspective of Bayesian decision theory.

Our own philosophy of Bayesian statistics appears in Gelman (2011) and Gelman and Shalizi (2013); for some contrasting views, see the discussion of that article, along with Efron (1986) and the discussions following Gelman (2008a).

Pratt (1965) and Rubin (1984) discuss the relevance of Bayesian methods for applied statistics and make many connections between Bayesian and non-Bayesian approaches to inference. Further references on the foundations of statistical inference appear in Shafer (1982) and the accompanying discussion. Kahneman and Tversky (1972) and Alpert and Raiffa (1982) present the results of psychological experiments that assess the meaning of ‘subjective probability’ as measured by people’s stated beliefs and observed actions. Lindley (1971a) surveys many different statistical ideas, all from the Bayesian perspective. Box and Tiao (1973) is an early book on applied Bayesian methods. They give an extensive treatment of inference based on normal distributions, and their first chapter, a broad introduction to Bayesian inference, provides a good counterpart to Chapters 1 and 2 of this book.

The iterative process involving modeling, inference, and model checking that we present in Section 1.1 is discussed at length in the first chapter of Box and Tiao (1973) and also in Box (1980). Cox and Snell (1981) provide a more introductory treatment of these ideas from a less model-based perspective.

Many good books on the mathematical aspects of probability theory are available, such as Feller (1968) and Ross (1983); these are useful when constructing probability models and working with them. O’Hagan (1988) has written an interesting introductory text on probability from an explicitly Bayesian point of view.

Physical probability models for coin tossing are discussed by Keller (1986), Jaynes (2003), and Gelman and Nolan (2002b). The football example of Section 1.6 is discussed in more detail in Stern (1991); see also Harville (1980) and Glickman (1993) and Glickman and Stern (1998) for analyses of football scores not using the point spread. Related analyses of sports scores and betting odds appear in Stern (1997, 1998). For more background on sports betting, see Snyder (1975) and Rombola (1984).

An interesting real-world example of probability assignment arose with the explosion of the Challenger space shuttle in 1986; Martz and Zimmer (1992), Dalal, Fowlkes, and Hoadley (1989), and Lavine (1991) present and compare various methods for assigning probabilities for space shuttle failures. (At the time of writing we are not aware of similar contributions relating to the more recent space accident in 2003.) The record-linkage example in Section 1.7 appears in Belin and Rubin (1995b), who discuss the mixture models and calibration techniques in more detail. The Census problem that motivated the record linkage is described by Hogan (1992).

In all our examples, probabilities are assigned using statistical modeling and estimation, not by ‘subjective’ assessment. Dawid (1986) provides a general discussion of probability assignment, and Dawid (1982) discusses the connections between calibration and Bayesian probability assignment.

The graphical method of jittering, used in Figures 1.1 and 1.2 and elsewhere in this book, is discussed in Chambers et al. (1983). For information on the statistical packages R and Bugs, see Becker, Chambers, and Wilks (1988), R Project (2002), Fox (2002), Venables and Ripley (2002), and Spiegelhalter et al. (1994, 2003).

Norvig (2007) describes the principles and details of the Bayesian spelling corrector.

1.12   Exercises

1.  Conditional probability: suppose that if θ = 1, then y has a normal distribution with mean 1 and standard deviation σ, and if θ = 2, then y has a normal distribution with mean 2 and standard deviation σ. Also, suppose Pr(θ = 1) = 0.5 and Pr(θ = 2) = 0.5.

(a)  For σ = 2, write the formula for the marginal probability density for y and sketch it.

(b)   What is Pr(θ = 1 | y = 1), again supposing σ = 2?

(c)   Describe how the posterior density of θ changes in shape as σ is increased and as it is decreased.

2.  Conditional means and variances: show that (1.8) and (1.9) hold if u is a vector.

3.  Probability calculation for genetics (from Lindley, 1965): suppose that in each individual of a large population there is a pair of genes, each of which can be either x or X, that controls eye color: those with xx have blue eyes, while heterozygotes (those with Xx or xX) and those with XX have brown eyes. The proportion of blue-eyed individuals is p² and of heterozygotes is 2p(1 - p), where 0 < p < 1. Each parent transmits one of its own genes to the child; if a parent is a heterozygote, the probability that it transmits the gene of type X is 1/2. Assuming random mating, show that among brown-eyed children of brown-eyed parents, the expected proportion of heterozygotes is 2p/(1 + 2p). Suppose Judy, a brown-eyed child of brown-eyed parents, marries a heterozygote, and they have n children, all brown-eyed. Find the posterior probability that Judy is a heterozygote and the probability that her first grandchild has blue eyes.

4.  Probability assignment: we will use the football dataset to estimate some conditional probabilities about professional football games. There were twelve games with point spreads of 8 points; the outcomes in those games were: -7, -5, -3, -3, 1, 6, 7, 13, 15, 16, 20, and 21, with positive values indicating wins by the favorite and negative values indicating wins by the underdog. Consider the following conditional probabilities:

Pr(favorite wins | point spread = 8),

Pr(favorite wins by at least 8 | point spread = 8),

Pr(favorite wins by at least 8 | point spread = 8 and favorite wins).

(a)   Estimate each of these using the relative frequencies of games with a point spread of 8.

(b)   Estimate each using the normal approximation for the distribution of (outcome - point spread).

5.  Probability assignment: the 435 U.S. Congressmembers are elected to two-year terms; the number of voters in an individual congressional election varies from about 50,000 to 350,000. We will use various sources of information to estimate roughly the probability that at least one congressional election is tied in the next national election.

(a)   Use any knowledge you have about U.S. politics. Specify clearly what information you are using to construct this conditional probability, even if your answer is just a guess.

(b)   Use the following information: in the period 1900–1992, there were 20,597 congressional elections, out of which 6 were decided by fewer than 10 votes and 49 decided by fewer than 100 votes.

See Gelman, King, and Boscardin (1998), Mulligan and Hunter (2001), and Gelman, Katz, and Tuerlinckx (2002) for more on this topic.

6.  Conditional probability: approximately 1/125 of all births are fraternal twins and 1/300 of births are identical twins. Elvis Presley had a twin brother (who died at birth). What is the probability that Elvis was an identical twin? (You may approximate the probability of a boy or girl birth as 1/2.)

7.  Conditional probability: the following problem is loosely based on the television game show Let’s Make a Deal. At the end of the show, a contestant is asked to choose one of three large boxes, where one box contains a fabulous prize and the other two boxes contain lesser prizes. After the contestant chooses a box, Monty Hall, the host of the show, opens one of the two boxes containing smaller prizes. (In order to keep the conclusion suspenseful, Monty does not open the box selected by the contestant.) Monty offers the contestant the opportunity to switch from the chosen box to the remaining unopened box. Should the contestant switch or stay with the original choice? Calculate the probability that the contestant wins under each strategy. This is an exercise in being clear about the information that should be conditioned on when constructing a probability judgment. See Selvin (1975) and Morgan et al. (1991) for further discussion of this problem.

8.  Subjective probability: discuss the following statement. ‘The probability of event E is considered “subjective” if two rational persons A and B can assign unequal probabilities to E, PA(E) and PB(E). These probabilities can also be interpreted as “conditional”: PA(E) = P(E|IA) and PB(E) = P(E|IB), where IA and IB represent the knowledge available to persons A and B, respectively.’ Apply this idea to the following examples.

(a)   The probability that a ‘6’ appears when a fair die is rolled, where A observes the outcome of the die roll and B does not.

(b)   The probability that Brazil wins the next World Cup, where A is ignorant of soccer and B is a knowledgeable sports fan.

9.  Simulation of a queuing problem: a clinic has three doctors. Patients come into the clinic at random, starting at 9 a.m., according to a Poisson process with time parameter 10 minutes: that is, the time after opening at which the first patient appears follows an exponential distribution with expectation 10 minutes and then, after each patient arrives, the waiting time until the next patient is independently exponentially distributed, also with expectation 10 minutes. When a patient arrives, he or she waits until a doctor is available. The amount of time spent by each doctor with each patient is a random variable, uniformly distributed between 5 and 20 minutes. The office stops admitting new patients at 4 p.m. and closes when the last patient is through with the doctor.

(a)   Simulate this process once. How many patients came to the office? How many had to wait for a doctor? What was their average wait? When did the office close?

(b)   Simulate the process 100 times and estimate the median and 50% interval for each of the summaries in (a).
