Chapter 2

Probability and Statistics Review

In this chapter we briefly review, without proofs, some definitions and concepts in probability and statistics. Many introductory and more advanced texts can be recommended for review and reference. On introductory probability see e.g. Bean [23], Ghahramani [118], or Ross [232]. Mathematical statistics and probability books at an advanced undergraduate or first-year graduate level include e.g. DeGroot and Schervish [64], Freund (Miller and Miller) [201], Hogg, McKean and Craig [146], or Larsen and Marx [170]. Casella and Berger [39] or Bain and Engelhardt [16] are somewhat more advanced. Durrett [77] is a graduate probability text. Lehmann [172] and Lehmann and Casella [173] are graduate texts in statistical inference.

2.1 Random Variables and Probability

Distribution and Density Functions

The cumulative distribution function (cdf) of a random variable X is F_X, defined by

F_X(x) = P(X \le x), \quad x \in \mathbb{R}.

In this book P(·) denotes the probability of its argument. We will omit the subscript X and write F(x) if it is clear in context. The cdf has the following properties:

  1. FX is non-decreasing.
  2. FX is right-continuous; that is,

    \lim_{\epsilon \to 0^+} F_X(x + \epsilon) = F_X(x), \quad \text{for all } x.

  3. \lim_{x \to -\infty} F_X(x) = 0 and \lim_{x \to \infty} F_X(x) = 1.

A random variable X is continuous if FX is a continuous function. A random variable X is discrete if FX is a step function.

Discrete distributions can be specified by the probability mass function (pmf) p_X(x) = P(X = x). The discontinuities in the cdf are at the points where the pmf is positive, and p_X(x) = F_X(x) - F_X(x^-), where F_X(x^-) denotes the limit of F_X from the left at x.

If X is discrete, the cdf of X is

F_X(x) = P(X \le x) = \sum_{\{k \le x:\, p_X(k) > 0\}} p_X(k).

Continuous distributions do not have positive probability mass at any single point. For a continuous random variable X the probability density function (pdf) or density of X is f_X(x) = F_X'(x), provided that F_X is differentiable, and by the fundamental theorem of calculus

F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(t)\, dt.

The joint density of continuous random variables X and Y is f_{X,Y}(x, y), and the cdf of (X, Y) is

F_{X,Y}(x, y) = P(X \le x, Y \le y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{X,Y}(s, t)\, ds\, dt.

The marginal probability densities of X and Y are given by

f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy; \qquad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx.

The corresponding formulas for discrete random variables are similar, with sums replacing the integrals. In the remainder of this chapter, for simplicity fX(x) denotes either the pdf (if X is continuous) or the pmf (if X is discrete) of X.

The set of points {x : fX(x) > 0} is the support set of the random variable X. Similarly, the bivariate distribution of (X, Y) is supported on the set {(x, y) : fX,Y (x, y) > 0}.

Expectation, Variance, and Moments

The mean of a random variable X is the expected value or mathematical expectation of the variable, denoted E[X]. If X is continuous with density f, then the expected value of X is

E[X] = \int_{-\infty}^{\infty} x f(x)\, dx.

If X is discrete with pmf f(x), then

E[X] = \sum_{\{x:\, f_X(x) > 0\}} x f(x).

(The integrals and sums above are not necessarily finite. We implicitly assume that E|X|< ∞ whenever E[X] appears in formulas below.) The expected value of a function g(X) of a continuous random variable X with pdf f is defined by

E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\, dx.

Let μ_X = E[X]. Then μ_X is also called the first moment of X. The rth moment of X is E[X^r]. Hence if X is continuous,

E[X^r] = \int_{-\infty}^{\infty} x^r f_X(x)\, dx.

The variance of X is the second central moment,

Var(X) = E[(X - E[X])^2].

The identity E[(X − E[X])^2] = E[X^2] − (E[X])^2 provides an equivalent formula for the variance,

Var(X) = E[X^2] - (E[X])^2 = E[X^2] - \mu_X^2.

The variance of X is also denoted by σ_X^2. The square root of the variance is the standard deviation. The reciprocal of the variance is the precision.

The expected value of the product of continuous random variables X and Y with joint pdf fX,Y is

E[XY] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x y\, f_{X,Y}(x, y)\, dx\, dy.

The covariance of X and Y is defined by

Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y] = E[XY] - \mu_X \mu_Y.

The covariance of X and Y is also denoted by σ_{XY}. Note that Cov(X, X) = Var(X). The product-moment correlation is

\rho(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X)\, Var(Y)}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}.

Correlation can also be written as

\rho(X, Y) = E\left[ \left( \frac{X - \mu_X}{\sigma_X} \right) \left( \frac{Y - \mu_Y}{\sigma_Y} \right) \right].

Two variables X and Y are uncorrelated if ρ(X,Y) = 0.

Conditional Probability and Independence

In classical probability, the conditional probability of an event A given that event B has occurred is

P(A|B) = \frac{P(AB)}{P(B)},

where AB = A ∩ B is the intersection of events A and B. Events A and B are independent if P(AB) = P(A)P(B); otherwise they are dependent. The joint probability that both A and B occur can be written

P(AB)=P(A|B)P(B)=P(B|A)P(A).

If random variables X and Y have joint density fX,Y (x, y), then the conditional density of X given Y = y is

f_{X|Y=y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)}.

Similarly the conditional density of Y given X = x is

f_{Y|X=x}(y) = \frac{f_{X,Y}(x, y)}{f_X(x)}.

Thus, the joint density of (X, Y) can be written

f_{X,Y}(x, y) = f_{X|Y=y}(x)\, f_Y(y) = f_{Y|X=x}(y)\, f_X(x).

Independence

The random variables X and Y are independent if and only if

f_{X,Y}(x, y) = f_X(x)\, f_Y(y)

for all x and y; or equivalently, if and only if F_{X,Y}(x, y) = F_X(x) F_Y(y) for all x and y.

The random variables X1, ... , Xd are independent if and only if the joint pdf f of X1, ... , Xd is equal to the product of the marginal density functions. That is, X1, ... , Xd are independent if and only if

f(x_1, \ldots, x_d) = \prod_{j=1}^{d} f_j(x_j)

for all x = (x1, ... , xd)T in ℝd, where fj(xj) is the marginal density (or marginal pmf) of Xj.

The variables {X_1, ..., X_n} are a random sample from a distribution F_X if X_1, ..., X_n are independently and identically distributed with distribution F_X. In this case the joint density of {X_1, ..., X_n} is

f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_X(x_i).

If X and Y are independent, then Cov(X, Y) = 0 and ρ(X, Y) = 0. However, the converse is not true; uncorrelated variables are not necessarily independent. The converse is true in an important special case: if X and Y are jointly normally distributed, then Cov(X, Y) = 0 implies independence.

Properties of Expected Value and Variance

Suppose that X and Y are random variables, and a and b are constants. Then the following properties hold (provided the moments exist).

  1. E[aX + b] = aE[X] + b.
  2. E[X + Y] = E[X] + E[Y].
  3. If X and Y are independent, E[XY] = E[X]E[Y].
  4. Var(b) = 0.
  5. Var[aX + b] = a2Var(X).
  6. Var(X + Y) = Var(X) +Var(Y) + 2Cov(X,Y).
  7. If X and Y are independent, Var(X + Y) = Var(X) +Var(Y).

If {X1, ... , Xn} are independent and identically distributed (iid) we have

E[X_1 + \cdots + X_n] = n\mu_X, \qquad Var(X_1 + \cdots + X_n) = n\sigma_X^2,

so the sample mean \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i has expected value μ_X and variance σ_X^2/n. (Apply properties 2, 7, and 5 above.)

The conditional expected value of X given Y = y is

E[X|Y=y] = \int_{-\infty}^{\infty} x\, f_{X|Y=y}(x)\, dx,

when X is continuous with conditional density f_{X|Y=y}(x).

Two important results are the conditional expectation rule and the conditional variance formula:

E[X] = E[E[X|Y]] \qquad (2.1)

Var(X) = E[Var(X|Y)] + Var(E[X|Y]) \qquad (2.2)

See e.g. Ross [233, Ch. 3] for proofs of (2.1) and (2.2) and many applications.
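The identities (2.1) and (2.2) are easy to verify empirically. The following R sketch (the hierarchical model and all parameter values are arbitrary illustrative choices) uses Y ~ Poisson(λ) and X | Y = y ~ Binomial(y, p), for which E[X] = λp and Var(X) = λp:

    ## Empirical check of (2.1) and (2.2) under an assumed model:
    ## Y ~ Poisson(lambda), X | Y = y ~ Binomial(y, p)
    set.seed(1)
    lambda <- 4; p <- 0.3; m <- 100000
    y <- rpois(m, lambda)
    x <- rbinom(m, size = y, prob = p)
    mean(x)                              # approximately lambda * p = 1.2
    mean(y * p)                          # estimates E[E[X|Y]] = E[Yp]
    mean(y * p * (1 - p)) + var(y * p)   # E[Var(X|Y)] + Var(E[X|Y])
    var(x)                               # should approximately agree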

2.2 Some Discrete Distributions

Some important discrete distributions are the “counting distributions.” The counting distributions are used to model the frequency of events and waiting time for events in discrete time, for example. Three important counting distributions are the binomial (and Bernoulli), negative binomial (and geometric), and Poisson.

Several discrete distributions including the binomial, geometric, and negative binomial distributions can be formulated in terms of the outcomes of Bernoulli trials. A Bernoulli experiment has exactly two possible outcomes, “success” or “failure.” A Bernoulli random variable X has the probability mass function

P(X = 1) = p, P(X = 0) = 1 − p,

where p is the probability of success. It is easy to check that E[X] = p and Var(X) = p(1 − p). A sequence of Bernoulli trials is a sequence of outcomes X1, X2,... of iid Bernoulli experiments.

Binomial and Multinomial Distribution

Suppose that X records the number of successes in n iid Bernoulli trials with success probability p. Then X has the Binomial(n, p) distribution [abbreviated X ~ Bin(n, p)] with

P(X = x) = \binom{n}{x} p^x (1-p)^{n-x} = \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n.

The mean and variance formulas are easily derived by observing that the binomial variable is an iid sum of n Bernoulli(p) variables. Therefore

E[X] = np,       Var(X) = np(1 − p).
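As a quick numerical check of these formulas (the values of n and p below are arbitrary illustrative choices), one can compare simulated moments with the exact mean computed from the pmf:

    ## Binomial(n, p) moment check by simulation and from the pmf
    set.seed(1)
    n <- 10; p <- 0.4
    x <- rbinom(10000, size = n, prob = p)
    mean(x); n * p                   # sample mean vs np
    var(x);  n * p * (1 - p)         # sample variance vs np(1-p)
    sum((0:n) * dbinom(0:n, n, p))   # exact mean from the pmf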

A binomial distribution is a special case of a multinomial distribution. Suppose that there are k+1 mutually exclusive and exhaustive events A1,... , Ak+1 that can occur on any trial of an experiment, and each event occurs with probability P(Aj) = pj, j = 1, ... , k + 1. Let Xj record the number of times that event Aj occurs in n independent and identical trials of the experiment. Then X = (X1,..., Xk) has the multinomial distribution with joint pdf

f(x_1, \ldots, x_k) = \frac{n!}{x_1!\, x_2! \cdots x_{k+1}!}\, p_1^{x_1} p_2^{x_2} \cdots p_{k+1}^{x_{k+1}}, \quad 0 \le x_j \le n, \qquad (2.3)

where x_{k+1} = n - \sum_{j=1}^{k} x_j.

Geometric Distribution

Consider a sequence of Bernoulli trials, with success probability p. Let the random variable X record the number of failures until the first success is observed. Then

P(X = x) = p(1-p)^x, \quad x = 0, 1, 2, \ldots \qquad (2.4)

A random variable X with pmf (2.4) has the Geometric(p) distribution [abbreviated X ~ Geom(p)]. If X ~ Geom(p), then the cdf of X is

F_X(x) = P(X \le x) = 1 - (1-p)^{\lfloor x \rfloor + 1}, \quad x \ge 0,

and otherwise FX(x) = 0. The mean and variance of X are given by

E[X] = \frac{1-p}{p}; \qquad Var(X) = \frac{1-p}{p^2}.

Alternative formulation of Geometric distribution

The geometric distribution is sometimes formulated by letting Y be defined as the number of trials until the first success. Then Y = X + 1, where X is the random variable defined above with pmf (2.4). Under this model, we have P(Y = y) = p(1 − p)^{y−1}, y = 1, 2, ..., and

E[Y] = E[X + 1] = \frac{1-p}{p} + 1 = \frac{1}{p};

Var(Y) = Var(X + 1) = Var(X) = \frac{1-p}{p^2}.

However, as a counting distribution, or frequency model, the first formulation (2.4) given above is usually applied, because frequency models typically must include the possibility of a zero count.

Negative Binomial Distribution

The negative binomial frequency model applies in the same setting as a geometric model, except that the variable of interest is the number of failures until the rth success. Suppose that exactly X failures occur before the rth success. If X = x, then the rth success occurs on the (x + r)th trial. In the first x + r − 1 trials, there are r − 1 successes and x failures. This can happen in \binom{x+r-1}{r-1} = \binom{x+r-1}{x} ways, and each way has probability p^r q^x, where q = 1 − p. The probability mass function of the random variable X is given by

P(X = x) = \binom{x+r-1}{r-1} p^r q^x, \quad x = 0, 1, 2, \ldots \qquad (2.5)

The negative binomial distribution is defined for r > 0 and 0 < p < 1 as follows. The random variable X has a negative binomial distribution with parameters (r, p) if

P(X = x) = \frac{\Gamma(x+r)}{\Gamma(r)\,\Gamma(x+1)}\, p^r q^x, \quad x = 0, 1, 2, \ldots, \qquad (2.6)

where Γ(·) is the complete gamma function defined in (2.8). Note that (2.5) and (2.6) are equivalent when r is a positive integer. If X has pmf (2.6) we will write X ~ NegBin(r, p). The special case NegBin(r = 1, p) is the Geom(p) distribution.

Suppose that X ~ NegBin(r, p), where r is a positive integer. Then X is an iid sum of r Geom(p) variables. Therefore, the mean and variance of X, given by

E[X] = r\, \frac{1-p}{p}, \qquad Var(X) = r\, \frac{1-p}{p^2},

are simply r times the mean and variance of the Geom(p) variable in (2.4). These formulas are also valid for all r > 0.
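A short R sketch can confirm that the R function dnbinom implements the pmf (2.6), including non-integer r, and that the mean formula holds (the parameter values below are illustrative):

    ## Compare dnbinom with (2.6) and check the mean formula
    r <- 2.5; p <- 0.4; q <- 1 - p
    x <- 0:5
    dnbinom(x, size = r, prob = p)
    gamma(x + r) / (gamma(r) * gamma(x + 1)) * p^r * q^x   # same values
    xs <- 0:1000                       # truncated sum; tail is negligible
    sum(xs * dnbinom(xs, r, p)); r * q / p   # mean: r(1-p)/p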

Note that like the geometric random variable, there is an alternative formulation of the negative binomial model that counts the number of trials until the rth success.

Poisson Distribution

A random variable X has a Poisson distribution with parameter λ > 0 if the pmf of X is

p(x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \ldots

If X ~ Poisson(λ) then

E[X] = \lambda; \qquad Var(X) = \lambda.

A useful recursive formula for the pmf is p(x+1) = p(x)\, \frac{\lambda}{x+1}, \; x = 0, 1, 2, \ldots. The Poisson distribution has many important properties and applications (see e.g. [124, 158, 233]).
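The recursion is easy to verify numerically with the R function dpois (the value of λ below is an arbitrary illustrative choice):

    ## Check p(x+1) = p(x) * lambda / (x+1) against dpois
    lambda <- 3
    x <- 0:9
    p <- dpois(x, lambda)
    all.equal(p[-1], p[-length(p)] * lambda / (x[-length(x)] + 1))   # TRUE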

Examples

Example 2.1 (Geometric cdf)

The cdf of the geometric distribution with success probability p can be derived as follows. If q = 1− p, then at the points x = 0, 1, 2,... the cdf of X is given by

P(X \le x) = \sum_{k=0}^{x} p q^k = p(1 + q + q^2 + \cdots + q^x) = \frac{p(1 - q^{x+1})}{1 - q} = 1 - q^{x+1}.

Alternately, P(X \le x) = 1 - P(X \ge x + 1) = 1 - P(\text{first } x + 1 \text{ trials are failures}) = 1 - q^{x+1}.

Example 2.2 (Mean of the Poisson distribution)

If X ~ Poisson(λ), then

E[X] = \sum_{x=0}^{\infty} x\, \frac{e^{-\lambda} \lambda^x}{x!} = \lambda \sum_{x=1}^{\infty} \frac{e^{-\lambda} \lambda^{x-1}}{(x-1)!} = \lambda \sum_{x=0}^{\infty} \frac{e^{-\lambda} \lambda^x}{x!} = \lambda.

The last equality follows because the summand is the Poisson pmf and the total probability must sum to 1.

2.3 Some Continuous Distributions

Normal Distribution

The normal distribution with mean μ and variance σ^2 [abbreviated N(μ, σ^2)] is the continuous distribution with pdf

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2 \right\}, \quad -\infty < x < \infty.

The standard normal distribution N(0,1) has zero mean and unit variance, and the standard normal cdf is

\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt, \quad -\infty < z < \infty.

The normal distribution has several important properties. We summarize some of these properties, without proof. For more properties and characterizations see [156, Ch. 13], [210], or [270].

A linear transformation of a normal variable is also normally distributed. If X ~ N(μ, σ^2), then the distribution of Y = aX + b is N(aμ + b, a^2σ^2). It follows that if X ~ N(μ, σ^2), then

Z = \frac{X - \mu}{\sigma} \sim N(0, 1).

Linear combinations of normal variables are normal; if X_1, ..., X_k are independent, X_i ~ N(μ_i, σ_i^2), and a_1, ..., a_k are constants, then

Y = a_1 X_1 + \cdots + a_k X_k

is normally distributed with mean \mu = \sum_{i=1}^{k} a_i \mu_i and variance \sigma^2 = \sum_{i=1}^{k} a_i^2 \sigma_i^2.

Therefore, if X_1, ..., X_n is a random sample (X_1, ..., X_n are iid) from a N(μ, σ^2) distribution, the sum Y = X_1 + \cdots + X_n is normally distributed with E[Y] = nμ and Var(Y) = nσ^2. It follows that the sample mean \bar{X} = Y/n has the N(μ, σ^2/n) distribution if the sampled distribution is normal. (In case the sampled distribution is not normal, but the sample size is large, the Central Limit Theorem implies that the distribution of Y is approximately normal. See Section 2.5.)
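A small simulation sketch illustrates the distribution of the sample mean (the values of μ, σ, and n below are arbitrary illustrative choices):

    ## Sample means of N(mu, sigma^2) data have sd sigma/sqrt(n)
    set.seed(1)
    mu <- 10; sigma <- 2; n <- 25
    xbar <- replicate(10000, mean(rnorm(n, mu, sigma)))
    mean(xbar)                  # approximately mu
    sd(xbar); sigma / sqrt(n)   # compare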

Gamma and Exponential Distributions

A random variable X has a gamma distribution with shape parameter r > 0 and rate parameter λ > 0 if the pdf of X is

f(x) = \frac{\lambda^r}{\Gamma(r)}\, x^{r-1} e^{-\lambda x}, \quad x \ge 0, \qquad (2.7)

where Γ(r) is the complete gamma function, defined by

\Gamma(r) = \int_{0}^{\infty} t^{r-1} e^{-t}\, dt, \quad r \ne 0, -1, -2, \ldots \qquad (2.8)

Recall that Γ(n) = (n − 1)! for positive integers n.

The notation X ~ Gamma(r, λ) indicates that X has the density (2.7), with shape r and rate λ. If X ~ Gamma(r, λ) then

E[X] = \frac{r}{\lambda}; \qquad Var(X) = \frac{r}{\lambda^2}.

Gamma distributions can also be parameterized by the scale parameter θ = 1/λ instead of the rate parameter λ. In terms of (r, θ) the mean is rθ and the variance is rθ^2. An important special case of the gamma distribution is r = 1, which is the exponential distribution with rate parameter λ. The Exponential(λ) pdf is

f(x) = \lambda e^{-\lambda x}, \quad x \ge 0.

If X is exponentially distributed with rate λ [abbreviated X ~ Exp(λ)], then

E[X] = \frac{1}{\lambda}; \qquad Var(X) = \frac{1}{\lambda^2}.

It can be shown that the sum of iid exponentials has a gamma distribution. If X_1, ..., X_r are iid with the Exp(λ) distribution, then Y = X_1 + \cdots + X_r has the Gamma(r, λ) distribution.
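This fact can be checked empirically; the sketch below (with illustrative values of r and λ) compares simulated sums of exponentials with the Gamma(r, λ) cdf using a Kolmogorov-Smirnov test:

    ## Sums of r iid Exp(lambda) variables vs the Gamma(r, lambda) cdf
    set.seed(1)
    r <- 4; lambda <- 2
    y <- replicate(10000, sum(rexp(r, rate = lambda)))
    ks.test(y, "pgamma", shape = r, rate = lambda)   # large p-value expected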

Chisquare and t

The chisquare distribution with ν degrees of freedom is denoted by χ^2(ν). The pdf of a χ^2(ν) random variable X is

f(x) = \frac{1}{\Gamma(\nu/2)\, 2^{\nu/2}}\, x^{\nu/2 - 1} e^{-x/2}, \quad x > 0, \quad \nu = 1, 2, \ldots

Note that χ^2(ν) is a special case of the gamma distribution, with shape parameter ν/2 and rate parameter 1/2. The square of a standard normal variable has the χ^2(1) distribution. If Z_1, ..., Z_ν are iid standard normal, then Z_1^2 + \cdots + Z_ν^2 ~ χ^2(ν). If X ~ χ^2(ν_1) and Y ~ χ^2(ν_2) are independent, then X + Y ~ χ^2(ν_1 + ν_2). If X ~ χ^2(ν), then

E[X] = \nu, \qquad Var(X) = 2\nu.

The Student’s t distribution [256] is defined as follows. Let Z ~ N(0, 1) and V ~ χ^2(ν). If Z and V are independent, then the distribution of

T = \frac{Z}{\sqrt{V/\nu}}

has the Student’s t distribution with ν degrees of freedom, denoted t(ν). The density of a t(ν) random variable X is given by

f(x) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\, \frac{1}{\sqrt{\nu\pi}} \left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}, \quad x \in \mathbb{R}, \quad \nu = 1, 2, \ldots

The mean and variance of X ~ t(ν) are given by

E[X] = 0, \; \nu > 1; \qquad Var(X) = \frac{\nu}{\nu - 2}, \; \nu > 2.

In the special case ν = 1 the t(1) distribution is the standard Cauchy distribution. For small ν the t distribution has “heavy tails” compared to the normal distribution. For large ν, the t(ν) distribution is approximately normal, and t(ν) converges in distribution to standard normal as ν → ∞.

Beta and Uniform Distributions

A random variable X with density function

f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1} (1-x)^{\beta-1}, \quad 0 \le x \le 1, \; \alpha > 0, \; \beta > 0, \qquad (2.9)

has the Beta(α, β) distribution. The constant in the beta density is the reciprocal of the beta function, defined by

B(\alpha, \beta) = \int_{0}^{1} t^{\alpha-1} (1-t)^{\beta-1}\, dt = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}.

The continuous uniform distribution on (0,1) or Uniform(0,1) is the special case Beta(1,1).

The parameters α and β are shape parameters. When α = β the distribution is symmetric about 1/2. When α ≠ β the distribution is skewed, with the direction and amount of skewness depending on the shape parameters. The mean and variance are

E[X] = \frac{\alpha}{\alpha+\beta}; \qquad Var(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.

If X ~ Uniform(0, 1) = Beta(1, 1), then E[X] = 1/2 and Var(X) = 1/12.

In Bayesian analysis, a beta distribution is often chosen to model the distribution of a probability parameter, such as the probability of success in Bernoulli trials or a binomial experiment.

Lognormal Distribution

A random variable X has the Lognormal(μ, σ^2) distribution [abbreviated X ~ LogN(μ, σ^2)] if X = e^Y, where Y ~ N(μ, σ^2). That is, log X ~ N(μ, σ^2). The lognormal density function is

f_X(x) = \frac{1}{x\sqrt{2\pi}\,\sigma}\, e^{-(\log x - \mu)^2/(2\sigma^2)}, \quad x > 0.

The cdf can be evaluated by applying the normal cdf to log x, since log X ~ N(μ, σ^2); that is, the cdf of X ~ LogN(μ, σ^2) is given by

F_X(x) = \Phi\left( \frac{\log x - \mu}{\sigma} \right), \quad x > 0.

The moments are

E[X^r] = E[e^{rY}] = \exp\left\{ r\mu + \frac{1}{2} r^2 \sigma^2 \right\}, \quad r > 0. \qquad (2.10)

The mean and variance are

E[X] = e^{\mu + \sigma^2/2}, \qquad Var(X) = e^{2\mu + \sigma^2}\left(e^{\sigma^2} - 1\right).
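A quick simulation check of these formulas (the values of μ and σ below are arbitrary illustrative choices):

    ## Lognormal mean and variance formulas vs simulation
    set.seed(1)
    mu <- 0; sigma <- 0.5
    x <- rlnorm(100000, meanlog = mu, sdlog = sigma)
    mean(x); exp(mu + sigma^2 / 2)
    var(x);  exp(2 * mu + sigma^2) * (exp(sigma^2) - 1)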

Examples

Example 2.3 (Two-parameter exponential cdf)

The two-parameter exponential density is

f(x) = \lambda e^{-\lambda(x - \eta)}, \quad x \ge \eta, \qquad (2.11)

where λ and η are positive constants. Denote the distribution with density function (2.11) by Exp(λ, η). When η = 0 the density (2.11) is exponential with rate λ.

The cdf of the two-parameter exponential distribution is given by

F(x) = \int_{\eta}^{x} \lambda e^{-\lambda(t-\eta)}\, dt = \int_{0}^{x-\eta} \lambda e^{-\lambda u}\, du = 1 - e^{-\lambda(x-\eta)}, \quad x \ge \eta.

In the special case η = 0 we have the cdf of the Exp(λ) distribution,

F(x) = 1 - e^{-\lambda x}, \quad x \ge 0.
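Inverting the cdf gives a simple way to simulate from this distribution: solving u = F(x) yields x = η − log(1 − u)/λ. A minimal inverse-transform sketch, with illustrative parameter values:

    ## Inverse transform sampling from Exp(lambda, eta)
    set.seed(1)
    lambda <- 2; eta <- 3
    u <- runif(10000)
    x <- eta - log(1 - u) / lambda
    min(x)                      # close to (and above) eta
    mean(x); eta + 1 / lambda   # E[X] = eta + 1/lambda for this family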

Example 2.4 (Memoryless property of the exponential distribution)

The exponential distribution with rate parameter λ has the memoryless property. That is, if X ~ Exp(λ), then

P(X > s + t \mid X > s) = P(X > t), \quad \text{for all } s, t \ge 0.

The cdf of X is F(x) = 1 − exp(−λx), x ≥ 0 (see Example 2.3). Therefore, for all s, t ≥ 0 we have

P(X > s + t \mid X > s) = \frac{P(X > s + t)}{P(X > s)} = \frac{1 - F(s+t)}{1 - F(s)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = 1 - F(t) = P(X > t).

The first equality is simply the definition of conditional probability, P(A|B) = P(AB)/P(B).

2.4 Multivariate Normal Distribution

The bivariate normal distribution

Two continuous random variables X and Y have a bivariate normal distribution if the joint density of (X, Y) is the bivariate normal density function, which is given by

f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x-\mu_1}{\sigma_1}\right)\left(\frac{y-\mu_2}{\sigma_2}\right) + \left(\frac{y-\mu_2}{\sigma_2}\right)^2 \right] \right\}, \qquad (2.12)

(x, y) ∈ ℝ^2. The parameters are μ_1 = E[X], μ_2 = E[Y], σ_1^2 = Var(X), σ_2^2 = Var(Y), and ρ = Cor(X, Y). The notation (X, Y) ~ BVN(μ_1, μ_2, σ_1^2, σ_2^2, ρ) indicates that (X, Y) has the joint pdf (2.12). Some properties of the bivariate normal distribution (2.12) are:

  1. The marginal distributions of X and Y are normal; that is, X ~ N(μ_1, σ_1^2) and Y ~ N(μ_2, σ_2^2).
  2. The conditional distribution of Y given X = x is normal with mean μ_2 + ρ(σ_2/σ_1)(x − μ_1) and variance σ_2^2(1 − ρ^2).
  3. The conditional distribution of X given Y = y is normal with mean μ_1 + ρ(σ_1/σ_2)(y − μ_2) and variance σ_1^2(1 − ρ^2).
  4. X and Y are independent if and only if ρ = 0.

Suppose (X_1, X_2) ~ BVN(μ_1, μ_2, σ_1^2, σ_2^2, ρ). Let μ = (μ_1, μ_2)^T and

\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix},

where σ_{ij} = Cov(X_i, X_j). Then the bivariate normal pdf (2.12) of (X_1, X_2) can be written in matrix notation as

f(x_1, x_2) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\},

where x = (x_1, x_2)^T ∈ ℝ^2.

The multivariate normal distribution

The joint distribution of continuous random variables X_1, ..., X_d is multivariate normal or d-variate normal, denoted N_d(μ, Σ), if the joint pdf is given by

f(x_1, \ldots, x_d) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}, \qquad (2.13)

where Σ is the d × d nonsingular covariance matrix of (X_1, ..., X_d)^T, μ = (μ_1, ..., μ_d)^T is the mean vector, and x = (x_1, ..., x_d)^T ∈ ℝ^d.

The one-dimensional marginal distributions of a multivariate normal variable are normal with mean μ_i and variance σ_i^2, i = 1, ..., d, where σ_i^2 is the ith entry on the diagonal of Σ. In fact, all of the marginal distributions of a multivariate normal vector are multivariate normal (see e.g. Tong [273, Sec. 3.3]).

The normal random variables X_1, ..., X_d are independent if and only if the covariance matrix Σ is diagonal.

Linear transformations of multivariate normal random vectors are multivariate normal. That is, if C is an m × d matrix and b = (b_1, ..., b_m)^T ∈ ℝ^m, then Y = CX + b has the m-dimensional multivariate normal distribution with mean vector Cμ + b and covariance matrix CΣC^T.
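The linear transformation property suggests a standard way to generate multivariate normal samples from iid standard normals using a Cholesky factor of Σ. A minimal sketch in base R, with an illustrative mean vector and covariance matrix:

    ## Generate Nd(mu, Sigma) samples: if Sigma = t(C) %*% C (Cholesky),
    ## then rows of Z %*% C + mu have the Nd(mu, Sigma) distribution
    set.seed(1)
    mu <- c(0, 1)
    Sigma <- matrix(c(1, 0.8, 0.8, 2), nrow = 2)
    n <- 10000
    Z <- matrix(rnorm(n * 2), nrow = n)   # iid standard normals
    X <- Z %*% chol(Sigma) + matrix(mu, n, 2, byrow = TRUE)
    colMeans(X)   # approximately mu
    cov(X)        # approximately Sigma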

Applications and properties of the multivariate normal distribution are covered by Anderson [8] and Mardia et al. [188]. Refer to Tong [273] for properties and characterizations of the bivariate normal and multivariate normal distribution.

2.5 Limit Theorems

Laws of Large Numbers

The Weak Law of Large Numbers (WLLN or LLN) states that the sample mean converges in probability to the population mean. Suppose that X_1, X_2, ... are independent and identically distributed (iid), E|X_1| < ∞, and μ = E[X_1]. For each n let \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i. Then \bar{X}_n converges to μ in probability as n → ∞. That is, for every ε > 0,

\lim_{n \to \infty} P(|\bar{X}_n - \mu| < \epsilon) = 1.

For a proof, see e.g. Durrett [77].

The Strong Law of Large Numbers (SLLN) states that the sample mean converges almost surely to the population mean μ. Suppose that X_1, X_2, ... are pairwise independent and identically distributed, E|X_1| < ∞, and μ = E[X_1]. For each n let \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i. Then \bar{X}_n → μ almost surely as n → ∞. That is, for every ε > 0,

P\left( \lim_{n \to \infty} |\bar{X}_n - \mu| < \epsilon \right) = 1.

For Etemadi’s proof see Durrett [77].

Central Limit Theorem

The first version of the Central Limit Theorem was proved by de Moivre in the early 18th century for random samples of Bernoulli variables. The general proof was given independently by Lindeberg and Lévy in the early 1920s.

THEOREM 2.1 (The Central Limit Theorem)

If X1, ... , Xn is a random sample from a distribution with mean μ and finite variance σ2> 0, then the limiting distribution of

Z_n = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}

is the standard normal distribution.

See Durrett [77] for the proofs.
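The following simulation sketch illustrates the CLT for exponential (hence non-normal) data; the sample size and number of replicates are arbitrary illustrative choices:

    ## Standardized means of Exp(1) data (mu = sigma = 1) are
    ## approximately N(0, 1) for moderate n
    set.seed(1)
    n <- 30
    z <- replicate(10000, (mean(rexp(n, rate = 1)) - 1) / (1 / sqrt(n)))
    mean(z <= 1.96); pnorm(1.96)   # compare with Phi(1.96)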

2.6 Statistics

Unless otherwise stated, X_1, ..., X_n is a random sample from a distribution with cdf F_X(x) = P(X ≤ x), pdf or pmf f_X(x), mean E[X] = μ_X, and variance σ_X^2. The subscript X on F, f, μ, and σ is omitted when it is clear in context. Lowercase letters x_1, ..., x_n denote an observed random sample.

A statistic is a function T_n = T(X_1, ..., X_n) of a sample. Some examples of statistics are the sample mean, sample variance, etc. The sample mean is \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, and the sample variance is

S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1}.

The sample standard deviation is S = \sqrt{S^2}.

The empirical distribution function

An estimate of F(x) = P(X ≤ x) is the proportion of sample points that fall in the interval (−∞, x]. This estimate is called the empirical cumulative distribution function (ecdf) or empirical distribution function (edf). The ecdf of an observed sample x1, ... , xn is defined by

F_n(x) = \begin{cases} 0, & x < x_{(1)}, \\ i/n, & x_{(i)} \le x < x_{(i+1)}, \; i = 1, \ldots, n-1, \\ 1, & x_{(n)} \le x, \end{cases}

where x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)} is the ordered sample.

A quantile of a distribution is found by inverting the cdf. The cdf may not be strictly increasing, however, so the definition is as follows. The q quantile of a random variable X with cdf F(x) is

x_q = \inf_x \{ x : F(x) \ge q \}, \quad 0 < q < 1.

Quantiles can be estimated by the inverse ecdf of a random sample or other function of the order statistics. Methods for computing sample quantiles differ among statistical packages R, SAS, Minitab, SPSS, etc. (see Hyndman and Fan [148] and the quantile help topic in R).

R note 2.1 The default method of estimation used in the R quantile function assigns cumulative probability (k − 1)/(n − 1) to the kth order statistic. Thus, the empirical cumulative probabilities are defined as

0, \frac{1}{n-1}, \frac{2}{n-1}, \ldots, \frac{n-2}{n-1}, 1.

Note that this set of probabilities differs from the usual assignment \{k/n\}_{k=1}^{n} of the ecdf.
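For example (the data below are simulated only for illustration):

    ## The ecdf and sample quantiles in R
    set.seed(1)
    x <- rnorm(100)
    Fn <- ecdf(x)          # the ecdf as a step function
    Fn(0); mean(x <= 0)    # two ways to evaluate Fn at 0
    quantile(x, probs = 0.75)             # default method (type = 7)
    quantile(x, probs = 0.75, type = 6)   # an alternative definition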

Bias and Mean Squared Error

A statistic θ̂_n is an unbiased estimator of a parameter θ if E[θ̂_n] = θ. An estimator θ̂_n is asymptotically unbiased for θ if

\lim_{n \to \infty} E[\hat{\theta}_n] = \theta.

The bias of an estimator θ̂ for a parameter θ is defined by bias(θ̂) = E[θ̂] − θ. Clearly X̄ is an unbiased estimator of the mean μ = E[X]. It can be shown that E[S^2] = σ^2 = Var(X), so the sample variance S^2 is an unbiased estimator of σ^2. The maximum likelihood estimator of variance is

\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2,

which is a biased estimator of σ^2. However, the bias −σ^2/n tends to zero as n → ∞, so σ̂^2 is asymptotically unbiased for σ^2.
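A simulation sketch makes the bias visible for small samples; here the samples are N(0, 1), so σ^2 = 1, and n = 10 (illustrative choices):

    ## Compare the unbiased S^2 with the biased MLE of variance
    set.seed(1)
    n <- 10
    v  <- replicate(10000, var(rnorm(n)))   # S^2
    vm <- v * (n - 1) / n                   # MLE: divide by n
    mean(v)    # close to 1 (unbiased)
    mean(vm)   # close to 1 - 1/n = 0.9 (biased downward)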

The mean squared error (MSE) of an estimator θ^ for parameter θ is

MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2].

Notice that for an unbiased estimator the MSE is equal to the variance of the estimator. If θ̂ is biased for θ, however, the MSE is larger than the variance. In fact, the MSE can be split into two parts,

MSE(\hat{\theta}) = E[\hat{\theta}^2 - 2\theta\hat{\theta} + \theta^2] = E[\hat{\theta}^2] - 2\theta E[\hat{\theta}] + \theta^2 = E[\hat{\theta}^2] - (E[\hat{\theta}])^2 + (E[\hat{\theta}])^2 - 2\theta E[\hat{\theta}] + \theta^2 = Var(\hat{\theta}) + (E[\hat{\theta}] - \theta)^2,

so the MSE is the sum of variance and squared bias:

MSE(\hat{\theta}) = Var(\hat{\theta}) + [bias(\hat{\theta})]^2.

The standard error of an estimator θ̂ is the square root of its variance: se(θ̂) = \sqrt{Var(\hat{\theta})}. An important example is the standard error of the mean,

se(\bar{X}) = \sqrt{Var(\bar{X})} = \sqrt{\frac{Var(X)}{n}} = \frac{\sigma_X}{\sqrt{n}}.

A sample proportion p̂ is a sample mean of zero-one variables, so the standard error of a sample proportion is \sqrt{p(1-p)/n}. Note that se(p̂) \le 0.5/\sqrt{n}.

For each fixed x ∈ ℝ, the ecdf F_n(x) is an unbiased estimator of the cdf F(x). The standard error of F_n(x) is \sqrt{F(x)(1-F(x))/n} \le 0.5/\sqrt{n}.

The variance of the q sample quantile [63, 2.7] is

Var(\hat{x}_q) = \frac{q(1-q)}{n f(x_q)^2}, \qquad (2.14)

where f is the density of the sampled distribution. When quantiles are estimated, the density f is usually unknown, but (2.14) shows that larger samples are needed for estimates of quantiles in the part of the support set where the density is close to zero.

Method of Moments

The rth sample moment m_r = \frac{1}{n}\sum_{i=1}^{n} X_i^r, r = 1, 2, ..., is an unbiased estimator of the rth population moment E[X^r], provided that the rth moment exists. If X has density f(x; θ_1, ..., θ_k), then the method of moments estimator of θ = (θ_1, ..., θ_k) is given by the simultaneous solution θ̂ = (θ̂_1, ..., θ̂_k) of the equations

E[X^r] = m_r(x_1, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^{n} x_i^r, \quad r = 1, \ldots, k.
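As an illustration, consider the Gamma(r, λ) model: matching E[X] = r/λ and E[X^2] = r(r + 1)/λ^2 to the first two sample moments gives r̂ = m_1^2/(m_2 − m_1^2) and λ̂ = m_1/(m_2 − m_1^2). A sketch with simulated data (the true parameter values are illustrative):

    ## Method of moments for Gamma(r, lambda)
    set.seed(1)
    x <- rgamma(1000, shape = 3, rate = 2)
    m1 <- mean(x); m2 <- mean(x^2)
    r.hat      <- m1^2 / (m2 - m1^2)
    lambda.hat <- m1 / (m2 - m1^2)
    c(r.hat, lambda.hat)   # compare with (3, 2)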

The Likelihood Function

Suppose that the sample observations are iid from a distribution with density function f(x|θ), where θ is a parameter. The likelihood function is the conditional probability of observing the sample, given θ, which is given by

L(\theta) = \prod_{i=1}^{n} f(x_i|\theta). \qquad (2.15)

The parameter θ could be a vector of parameters, θ = (θ_1, ..., θ_p). The likelihood function is regarded as a function of θ, with the observed data held fixed. As L(θ) is a product, it is usually easier to work with the logarithm of L(θ), called the log-likelihood function,

\ell(\theta) = \log(L(\theta)) = \sum_{i=1}^{n} \log f(x_i|\theta). \qquad (2.16)

Maximum Likelihood Estimation

The method of maximum likelihood was introduced by R. A. Fisher. By maximizing the likelihood function L(θ) with respect to θ, we are looking for the most likely value of θ given the information available, namely the sample data. Suppose that Θ is the parameter space of possible values of θ. If the maximum of L(θ) exists and it occurs at a unique point θ̂ ∈ Θ, then θ̂ is called the maximum likelihood estimator (MLE) of θ. If the maximum exists but is not unique, then any of the points where the maximum is attained is an MLE of θ. For many problems, the MLE can be determined analytically. However, it is often the case that the optimization cannot be solved analytically, and in that case numerical optimization or other computational approaches can be applied.

Maximum likelihood estimators have an invariance property. This property states that if θ^ is an MLE of θ and τ is a function of θ, then τ(θ^) is an MLE of τ(θ).

Note that the maximum likelihood principle can also be applied in problems where the observed variables are not independent or identically distributed (the likelihood function (2.15) given above is for the iid case).

Example 2.5 (Maximum likelihood estimation of two parameters)

Find the maximum likelihood estimator of θ = (λ, η) for the two-parameter exponential distribution (see Example 2.3). Suppose that x1, ... , xn is a random sample from the Exp(λ, η) distribution. The likelihood function is

L(\theta) = L(\lambda, \eta) = \prod_{i=1}^{n} \lambda e^{-\lambda(x_i - \eta)} I(x_i \ge \eta),

where I(·) is the indicator function (I(A) = 1 on the set A and I(A) = 0 on the complement of A). Then if x_{(1)} = min{x_1, ..., x_n}, we have

L(\theta) = L(\lambda, \eta) = \lambda^n \exp\left\{ -\lambda \sum_{i=1}^{n} (x_i - \eta) \right\}, \quad x_{(1)} \ge \eta,

and the log-likelihood is given by

\ell(\theta) = \ell(\lambda, \eta) = n \log \lambda - \lambda \sum_{i=1}^{n} (x_i - \eta), \quad x_{(1)} \ge \eta.

Then ℓ(θ) is an increasing function of η for every fixed λ, and η ≤ x_{(1)}, so η̂ = x_{(1)}. To find the maximum of ℓ(θ) with respect to λ, solve

\frac{\partial \ell(\lambda, \eta)}{\partial \lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} (x_i - \eta) = 0,

to find the critical point λ = 1/(x̄ − η). The MLE of θ = (λ, η) is

(\hat{\lambda}, \hat{\eta}) = \left( \frac{1}{\bar{x} - x_{(1)}},\; x_{(1)} \right).
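The estimates are simple to compute from data. A sketch with a simulated sample (the true values λ = 2 and η = 3 are illustrative):

    ## MLE for the two-parameter exponential of Example 2.5
    set.seed(1)
    lambda <- 2; eta <- 3
    x <- eta + rexp(1000, rate = lambda)
    eta.hat    <- min(x)
    lambda.hat <- 1 / (mean(x) - min(x))
    c(lambda.hat, eta.hat)   # compare with (2, 3)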

Example 2.6 (Invariance property of MLE)

Find the maximum likelihood estimator of the α-quantile of the Exp(λ, η) distribution in Examples 2.3 and 2.5. From Example 2.3 we have

F(x) = 1 - e^{-\lambda(x - \eta)}, \quad x \ge \eta.

Therefore F(xα) = α implies that

x_\alpha = -\frac{1}{\lambda} \log(1 - \alpha) + \eta,

and by the invariance property of maximum likelihood, the MLE of x_α is

\hat{x}_\alpha = -(\bar{x} - x_{(1)}) \log(1 - \alpha) + x_{(1)}.

2.7 Bayes’ Theorem and Bayesian Statistics

The Law of Total Probability

If events A1,...,Ak partition a sample space S into mutually exclusive and exhaustive nonempty events, then the Law of Total Probability states that the total probability of an event B is given by

P(B) = P(A_1 B) + P(A_2 B) + \cdots + P(A_k B)
     = P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \cdots + P(B|A_k)P(A_k)
     = \sum_{j=1}^{k} P(B|A_j)\, P(A_j).

For continuous random variables X and Y we have the distributional form of the Law of Total Probability,

f_Y(y) = \int_{-\infty}^{\infty} f_{Y|X=x}(y)\, f_X(x)\, dx.

For discrete random variables X and Y we can write the distributional form of the Law of Total Probability as

f_Y(y) = P(Y = y) = \sum_{x} P(Y = y | X = x)\, P(X = x).

Bayes’ Theorem

Bayes’ Theorem provides a method for inverting conditional probabilities. In its simplest form, if A and B are events and P(B) > 0, then

P(A|B) = \frac{P(B|A)\, P(A)}{P(B)}.

Often the Law of Total Probability is applied to compute P(B) in the denominator. These formulas follow from the definitions of conditional and joint probability.

For continuous random variables X and Y the distributional form of Bayes’ Theorem is

f_{X|Y=y}(x) = \frac{f_{Y|X=x}(y)\, f_X(x)}{f_Y(y)} = \frac{f_{Y|X=x}(y)\, f_X(x)}{\int_{-\infty}^{\infty} f_{Y|X=x}(y)\, f_X(x)\, dx},

and for discrete random variables X and Y,

f_{X|Y=y}(x) = P(X = x | Y = y) = \frac{P(Y = y | X = x)\, P(X = x)}{\sum_{x} P(Y = y | X = x)\, P(X = x)}.

These formulas follow from the definitions of conditional and joint probability.

Bayesian Statistics

In the frequentist approach to statistics, the parameters of a distribution are considered to be fixed but unknown constants. The Bayesian approach views the unknown parameters of a distribution as random variables. Thus, in Bayesian analysis, probabilities can be computed for parameters as well as the sample statistics.

Bayes’ Theorem allows one to revise a prior belief about an unknown parameter based on observed data. The prior belief reflects the relative weights that one assigns to the possible values of the parameters. Suppose that X has the density f(x|θ). The conditional density of θ given the sample observations x_1, ..., x_n is called the posterior density, defined by

f_{\theta|x}(\theta) = \frac{f(x_1, \ldots, x_n | \theta)\, f_\theta(\theta)}{\int f(x_1, \ldots, x_n | \theta)\, f_\theta(\theta)\, d\theta},

where fθ(θ) is the pdf of the prior distribution of θ. The posterior distribution summarizes our modified beliefs about the unknown parameters, taking into account the data that has been observed. Then one is interested in computing posterior quantities such as posterior means, posterior modes, posterior standard deviations, etc.

Note that any constant in the likelihood function cancels out of the posterior density. The basic relation is

\text{posterior} \propto \text{prior} \times \text{likelihood},

which describes the shape of the posterior density up to a multiplicative constant. Often the evaluation of the constant is difficult and the integral cannot be obtained in closed form. However, Monte Carlo methods are available that do not require evaluation of the constant in order to sample from the posterior distribution and estimate posterior quantities of interest. See e.g. [44, 103, 106, 120, 228] on the development of Markov Chain Monte Carlo sampling.
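In conjugate families the constant is available in closed form. For example, with a Beta(a, b) prior on a success probability p and x successes observed in n Bernoulli trials, the posterior is Beta(a + x, b + n − x). A minimal sketch (the prior and data values below are illustrative):

    ## Conjugate beta-binomial posterior
    a <- 2; b <- 2    # Beta(a, b) prior
    n <- 20; x <- 14  # x successes in n trials
    a.post <- a + x; b.post <- b + n - x
    a.post / (a.post + b.post)               # posterior mean of p
    qbeta(c(0.025, 0.975), a.post, b.post)   # central 95% posterior interval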

Readers are referred to Lee [171] for an introductory presentation of Bayesian statistics. Albert [5] is a good introduction to computational Bayesian methods with R. A textbook covering probability and mathematical statistics from both a classical and Bayesian perspective at an advanced undergraduate level is DeGroot and Schervish [64].

2.8 Markov Chains

In this section we briefly review discrete time, discrete state space Markov chains. A basic understanding of Markov chains is necessary background for Chapter 9 on Markov Chain Monte Carlo methods. Readers are referred to Ross [234, Ch. 4] for an excellent introduction to Markov chains.

A Markov chain is a stochastic process {Xt} indexed by time t ≥ 0. Our goal is to generate a chain by simulation, so we consider discrete time Markov chains. The time index will be the nonnegative integers, so that the process starts in state X0 and makes successive transitions to X1, X2,... , Xt, .... The set of possible values of Xt is the state space.

Suppose that the state space of a Markov chain is finite or countable. Without loss of generality, we can suppose that the states are 0, 1, 2, .... The sequence {Xt|t ≥ 0} is a Markov chain if

P(X_{t+1} = j \mid X_0 = i_0, X_1 = i_1, \ldots, X_{t-1} = i_{t-1}, X_t = i) = P(X_{t+1} = j \mid X_t = i)

for all pairs of states (i, j),t ≥ 0. In other words, the transition probability depends only on the current state, and not on the past.

If the state space is finite, the transition probabilities P(X_{t+1} = j | X_t = i) can be represented by a transition matrix P = (p_{ij}), where the entry p_{ij} is the probability that the chain makes a transition to state j in one step starting from state i. The probability that the chain moves from state i to state j in k steps is p_{ij}(k), and the Chapman-Kolmogorov equations (see e.g. [234, Ch. 4]) provide that the k-step transition probabilities are the entries of the matrix P^k. That is, (p_{ij}(k)) = P^k, the kth power of the transition matrix.

A Markov chain is irreducible if all states communicate with all other states: given that the chain is in state i, there is a positive probability that the chain can enter state j in finite time, for all pairs of states (i, j). A state i is recurrent if the chain returns to i with probability 1; otherwise state i is transient. If the expected time until the chain returns to i is finite, then i is nonnull or positive recurrent. The period of a state i is the greatest common divisor of the lengths of paths starting and ending at i. In an irreducible chain, the periods of all states are equal, and the chain is aperiodic if the states all have period 1. Positive recurrent, aperiodic states are ergodic. In a finite-state Markov chain all recurrent states are positive recurrent.

In an irreducible, ergodic Markov chain the transition probabilities converge to a stationary distribution π on the state space, independent of the initial state of the chain.

In a finite-state Markov chain, irreducibility and aperiodicity imply that for all states j

\pi_j = \lim_{n \to \infty} p_{ij}(n)

exists and is independent of the initial state i. The probability distribution π = {πj} is called the stationary distribution, and π is the unique nonnegative solution to the system of equations

\pi_j = \sum_{i=0}^{\infty} \pi_i\, p_{ij}, \quad j \ge 0; \qquad \sum_{j=0}^{\infty} \pi_j = 1. \qquad (2.17)

We can interpret πj as the (limiting) proportion of time that the chain is in state j.

Example 2.7 (Finite state Markov chain)

Ross [234] gives the following example of a Markov chain model for mutations of DNA. A DNA nucleotide has four possible values. For each unit of time the model specifies that the nucleotide changes with probability 3α, for some 0 < α < 1/3. If it does change, then it is equally likely to change to any of the other three values. Thus p_{ii} = 1 − 3α and p_{ij} = 3α/3 = α, i ≠ j. If we number the states 1 to 4, the transition matrix is

P = \begin{bmatrix} 1-3\alpha & \alpha & \alpha & \alpha \\ \alpha & 1-3\alpha & \alpha & \alpha \\ \alpha & \alpha & 1-3\alpha & \alpha \\ \alpha & \alpha & \alpha & 1-3\alpha \end{bmatrix} \qquad (2.18)

where p_{ij} is the probability of a mutation from state i to state j. The ith row of a transition matrix is the conditional probability distribution P(X_{n+1} = j | X_n = i), j = 1, 2, 3, 4, of a transition to state j given that the process is currently in state i. Thus each row must sum to 1 (the matrix is row stochastic). This matrix happens to be doubly stochastic because the columns also sum to 1, but in general a transition matrix need only be row stochastic.

Suppose that α = 0.1. Then the two-step and the 16-step transition matrices are

P^2 = \begin{bmatrix} 0.52 & 0.16 & 0.16 & 0.16 \\ 0.16 & 0.52 & 0.16 & 0.16 \\ 0.16 & 0.16 & 0.52 & 0.16 \\ 0.16 & 0.16 & 0.16 & 0.52 \end{bmatrix}, \qquad P^{16} = \begin{bmatrix} 0.2626 & 0.2458 & 0.2458 & 0.2458 \\ 0.2458 & 0.2626 & 0.2458 & 0.2458 \\ 0.2458 & 0.2458 & 0.2626 & 0.2458 \\ 0.2458 & 0.2458 & 0.2458 & 0.2626 \end{bmatrix}.

The three-step transition matrix is P \cdot P^2 = P^3, etc. The probability of transition from state 1 to state 4 in two steps is p_{14}(2) = 0.16, and the probability that the process returns to state 2 from state 2 in 16 steps is p_{22}(16) = 0.2626.

All entries of P are positive, hence all states communicate; the chain is irreducible and ergodic. The transition probabilities in every row are converging to the same stationary distribution π on the four states. The stationary distribution is the solution of equations (2.17); in this case π_i = 1/4, i = 1, 2, 3, 4. (In this example, it can be shown that the limiting probabilities do not depend on α: p_{ii}(n) = \frac{1}{4} + \frac{3}{4}(1 - 4\alpha)^n \to \frac{1}{4} as n → ∞.)
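The k-step matrices above are easy to reproduce in R; base R has no built-in matrix power operator, so a small helper function is used (α = 0.1 as in the example):

    ## k-step transition matrices for Example 2.7
    alpha <- 0.1
    P <- matrix(alpha, 4, 4); diag(P) <- 1 - 3 * alpha
    mat.pow <- function(M, k) {      # repeated matrix multiplication
        R <- diag(nrow(M))
        for (i in seq_len(k)) R <- R %*% M
        R
    }
    mat.pow(P, 2)    # two-step transition probabilities
    mat.pow(P, 16)   # rows approach the stationary distribution (1/4, ...)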

Example 2.8 (Random walk)

An example of a discrete-time Markov chain with an infinite state space is the random walk. The state space is the set of all integers, and the transition probabilities are

p_{i,i+1} = p, \qquad p_{i,i-1} = 1 - p, \qquad i = 0, \pm 1, \pm 2, \ldots,
p_{i,j} = 0, \qquad j \notin \{i-1, i+1\}.

In the random walk model, at each transition a step of unit length is taken at random to the right with probability p or left with probability 1 − p. The state of the process at time n is the current location of the walker at time n. Another interpretation considers the gambler who bets $1 on a sequence of Bernoulli(p) trials and wins or loses $1 at each transition; if X0 = 0, the state of the process at time n is his gain or loss after n trials.

In the random walk model all states communicate, so the chain is irreducible. All states have period 2. For example, it is impossible to return to state 0 starting from 0 in an odd number of steps. The probability that the first return to 0 from state 0 occurs in exactly 2n steps is

p_{00}(2n) = \binom{2n}{n} p^n (1-p)^n = \frac{(2n)!}{n!\, n!} \left( p(1-p) \right)^n.

It can be shown that \sum_{n=1}^{\infty} p_{00}(2n) < \infty if and only if p \ne 1/2. Thus, the expected number of visits to 0 is finite if and only if p ≠ 1/2. Recurrence and transience are class properties, hence the chain is recurrent if and only if p = 1/2, and otherwise all states are transient. When p = 1/2 the process is called a symmetric random walk. The symmetric random walk is discussed in Example 3.26.
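A realization of the walk is easy to simulate (the number of steps below is an arbitrary illustrative choice):

    ## Simulating a symmetric random walk (p = 1/2) started at X0 = 0
    set.seed(1)
    n <- 1000
    steps <- sample(c(-1, 1), n, replace = TRUE)   # each step is +/- 1
    walk <- cumsum(steps)                          # X_1, ..., X_n
    walk[n]          # position after n steps
    sum(walk == 0)   # number of returns to 0 among the first n steps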
