Chapter 9

Essentials of General Probability Theory

In this chapter we present some measure-theoretic foundations of probability theory and random variables. This then leads us into the main tools and formulas for computing expectations and conditional expectations of random variables (most importantly, continuous random variables) under different probability measures. The main formulas provided here are used in later chapters for the understanding and quantitative modelling of continuous-time stochastic financial models.

9.1 Random Variables and Lebesgue Integration

Within the foundation of general probability theory, the mathematical expectation of any real-valued random variable X defined on a probability space (Ω, ℱ, ℙ) is a so-called Lebesgue integral w.r.t. a given probability measure ℙ over the sample space Ω. In order to define an integral of a random variable (i.e., a measurable function) w.r.t. some given measure, we need a measurable space and a measure. The measurable space here is the pair (Ω, ℱ), where ℱ is a σ-algebra of events in Ω, and the measure is the probability measure ℙ : ℱ → [0, 1]. Before providing a precise definition of such an integral, we make a couple of remarks. Namely, Ω is any abstract set having either a finite, countably infinite, or uncountable number of elements. So here we are dealing with a general probability space that includes the finite probability spaces studied in previous chapters as special cases. As usual, we denote every element in Ω by ω, where {ω} is a singleton set in Ω. Any random variable X has a positive part X⁺ ≔ max{X, 0} ≥ 0 and a negative part X⁻ ≔ max{−X, 0} ≥ 0, where X = X⁺ − X⁻ and |X| = X⁺ + X⁻. Note that both random variables X⁺ and X⁻ are nonnegative. The expectation of X w.r.t. the measure ℙ is defined as the difference of two nonnegative expectations:

$$E[X] = E[X^+] - E[X^-].$$

We write this, term by term, in the notation of a Lebesgue integral w.r.t. the measure ℙ as follows:

$$\underbrace{\int_\Omega X(\omega)\,\mathrm{d}\mathbb{P}(\omega)}_{E[X]} = \underbrace{\int_\Omega X^+(\omega)\,\mathrm{d}\mathbb{P}(\omega)}_{E[X^+]} - \underbrace{\int_\Omega X^-(\omega)\,\mathrm{d}\mathbb{P}(\omega)}_{E[X^-]}. \quad (9.1)$$

The absolute value of X has expectation E[|X|] = E[X⁺] + E[X⁻]. X is said to be integrable w.r.t. measure ℙ iff E[|X|] < ∞, i.e., E[X⁺] < ∞ and E[X⁻] < ∞, hence E[X] is finite. A common notation used to state this is to write X ∊ L¹(Ω, ℱ, ℙ). If E[X⁺] = ∞ and E[X⁻] < ∞, then we set E[X] = ∞; if E[X⁺] < ∞ and E[X⁻] = ∞, then we set E[X] = −∞; if E[X⁺] = E[X⁻] = ∞, then E[X] is not defined. Note that for strictly positive or nonnegative X, X⁺ ≡ X (X⁻ ≡ 0). For strictly negative or nonpositive X we have X⁻ ≡ −X (X⁺ ≡ 0). The Lebesgue integral of a random variable X over any subset A ∊ ℱ is defined as the Lebesgue integral (over Ω) of the random variable 𝕀_A X:

$$\int_A X(\omega)\,\mathrm{d}\mathbb{P}(\omega) := \int_\Omega \mathbb{I}_A(\omega)\,X(\omega)\,\mathrm{d}\mathbb{P}(\omega) \equiv E[\mathbb{I}_A X], \quad (9.2)$$

where 𝕀_A(ω) = 1 if ω ∊ A and 𝕀_A(ω) = 0 if ω ∉ A. Moreover, putting X ≡ 1 gives the probability under measure ℙ of any event A ∊ ℱ as a Lebesgue integral w.r.t. ℙ over A:

$$E[\mathbb{I}_A] = \mathbb{P}(A) = \int_A \mathrm{d}\mathbb{P}(\omega) = \int_\Omega \mathbb{I}_A(\omega)\,\mathrm{d}\mathbb{P}(\omega). \quad (9.3)$$

Since ℙ is an assumed probability measure, we must have ℙ(Ω) = E[𝕀_Ω] = 1.

We now develop the Lebesgue integral of X w.r.t. measure ℙ by first considering its definition for any (ℙ-a.s.) nonnegative real-valued¹ random variable X : Ω → [0, ∞). The symbol ℙ-a.s. stands for almost surely w.r.t. probability measure ℙ, i.e., saying that a relation holds ℙ-a.s. means that it holds with probability one. So X ≥ 0 (ℙ-a.s.) means that the set on which X ≥ 0 has probability one: ℙ({ω ∊ Ω : X(ω) ≥ 0}) = 1. We form a partition of the positive real line: 0 = y₀ < y₁ < y₂ < ... < y_k < y_{k+1} < ... < y_n, where y_n → ∞ as n → ∞. The partition is refined such that the maximum sub-interval spacing approaches zero, i.e., $\lim_{n\to\infty}\max_{0\le k\le n-1}(y_{k+1} - y_k) = 0$ is implied. Then, the sets in ℱ defined by $A_k := X^{-1}([y_k, y_{k+1})) \equiv \{\omega\in\Omega : y_k \le X(\omega) < y_{k+1}\}$, k = 0, 1, ..., n − 1, and $A_n := X^{-1}([y_n,\infty)) = \{\omega\in\Omega : X(\omega)\ge y_n\}$ are mutually exclusive and form a partition of the sample space Ω for every n ≥ 0. The Lebesgue integral is defined as the limit of a partial sum:

$$\int_\Omega X(\omega)\,\mathrm{d}\mathbb{P}(\omega) := \lim_{n\to\infty}\sum_{k=0}^{n} y_k\,\mathbb{P}(A_k) \equiv \sum_{k=0}^{\infty} y_k\,\mathbb{P}(A_k). \quad (9.4)$$

Assuming it exists, this sum gives a unique value (including cases where it may be infinite). This definition uses the lower value y_k of X on every set A_k. An equivalent definition replaces the lower value y_k with the upper value y_{k+1} for every k. For any real X = X⁺ − X⁻, we simply use (9.4) for each respective nonnegative Lebesgue integral of X⁺ and X⁻ and subtract to obtain the Lebesgue integral of X according to (9.1).
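To make the partial sum in (9.4) concrete, here is a minimal numerical sketch (in Python, assuming NumPy is available) for the hypothetical example Ω = [0, 1] with ℙ taken as the uniform (Lebesgue) measure and X(ω) = ω², so that E[X] = 1/3. Since X is increasing, each level set A_k has probability √y_{k+1} − √y_k, and the lower sum in (9.4) converges to 1/3 from below as the partition is refined:

```python
import numpy as np

# Sketch of the lower partial sum in (9.4) for X(omega) = omega**2 on
# Omega = [0, 1] with P = uniform (Lebesgue) measure, so E[X] = 1/3.
# Here X <= 1, so truncating the y-axis partition at y_n = 1 loses nothing.

def lebesgue_sum(n: int) -> float:
    y = np.linspace(0.0, 1.0, n + 1)           # partition 0 = y_0 < ... < y_n = 1
    prob_A = np.sqrt(y[1:]) - np.sqrt(y[:-1])  # P(A_k) = P(y_k <= X < y_{k+1})
    return float(np.sum(y[:-1] * prob_A))      # sum_k y_k P(A_k)

for n in (10, 100, 1000):
    print(n, lebesgue_sum(n))                  # increases toward 1/3
```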

Assume {A_i} is a countable collection of disjoint sets in ℱ. Then, applying the identity $\mathbb{I}_{\cup_i A_i} = \sum_i \mathbb{I}_{A_i}$ to (9.2) and using the linearity property of the Lebesgue integral gives

$$\int_{\cup_i A_i} X(\omega)\,\mathrm{d}\mathbb{P}(\omega) = \sum_i \int_\Omega X(\omega)\,\mathbb{I}_{A_i}(\omega)\,\mathrm{d}\mathbb{P}(\omega) = \sum_i \int_{A_i} X(\omega)\,\mathrm{d}\mathbb{P}(\omega), \quad (9.5)$$

i.e., $E[X\,\mathbb{I}_{\cup_i A_i}] = \sum_i E[X\,\mathbb{I}_{A_i}]$. For X ≡ 1 we have $E[\mathbb{I}_{\cup_i A_i}] = \sum_i E[\mathbb{I}_{A_i}]$:

$$\mathbb{P}\Big(\bigcup_i A_i\Big) = \sum_i \mathbb{P}(A_i).$$

This recovers the countable additivity property of ℙ, which must hold for ℙ to be a measure.

To see how the Lebesgue integral in (9.4) works, consider the class of random variables having the form of a finite sum of indicator random variables $X = \sum_{j=0}^{N} x_j\,\mathbb{I}_{C_j}$ with each x_j ∊ [0, ∞), C_j = {ω ∊ Ω : X(ω) = x_j} ∊ ℱ, and where $\{C_j\}_{j=0}^{N}$ forms a partition of Ω. This is called a simple function or simple random variable. Any random variable that can take only finitely many different values, x₀, ..., x_N (including possibly a zero value), has this form. Note also that any simple random variable can be represented as the difference of two nonnegative simple random variables. For each sufficiently small interval [y_k, y_{k+1}) we have $A_k = X^{-1}([y_k, y_{k+1})) = X^{-1}(\{x_j\}) = C_j$ if there is a value x_j ∊ [y_k, y_{k+1}) for some 0 ≤ j ≤ N, or otherwise $A_k = X^{-1}([y_k, y_{k+1})) = \varnothing$. For sufficiently large n, each x_j value will be contained in exactly one sub-interval and the nth partial sum in (9.4) does not change value as n is increased indefinitely, since only the finitely many intervals containing an x_j value (with probability measure ℙ(C_j)) contribute and the rest of the intervals yield A_k = ø (i.e., ℙ(A_k) = 0). Hence, the Lebesgue integral in (9.4) is given as the finite sum

$$\int_\Omega X(\omega)\,\mathrm{d}\mathbb{P}(\omega) = \sum_{j=0}^{N} x_j\,\mathbb{P}(C_j) = \sum_{j=0}^{N} x_j\,\mathbb{P}(X = x_j). \quad (9.6)$$

This corresponds to the formula in (6.17) for a discrete random variable on a finite state space Ω. However, here the simple random variable is discrete and defined more generally for uncountably infinite Ω. For a discrete random variable X taking on a countably infinite number of values {x₀, x₁, ...}, with countably infinite or uncountable Ω, it readily follows that the Lebesgue integral recovers the familiar expectation formula:

$$E[X] = \int_\Omega X(\omega)\,\mathrm{d}\mathbb{P}(\omega) = \sum_{j=0}^{\infty} x_j\,\mathbb{P}(X = x_j),$$

where the probabilities pj = ℙ(X = xj) define the probability mass function (PMF) of X in the measure ℙ.

Simple random variables are particularly useful and provide an alternative equivalent way of defining the Lebesgue integral. In standard measure and integration theory, one begins by taking (9.6) as the definition for the Lebesgue integral of any simple random variable. Then, for any X ≥ 0 (a.s.) the Lebesgue integral in (9.4) is equivalently defined as the supremum of Lebesgue integral values over the set of all nonnegative simple random variables having value not greater than X:

$$E[X] = \int_\Omega X(\omega)\,\mathrm{d}\mathbb{P}(\omega) := \sup\Big\{\int_\Omega X^*(\omega)\,\mathrm{d}\mathbb{P}(\omega) : X^*\ \text{is simple},\ 0\le X^*\le X\Big\}. \quad (9.7)$$

This last definition is rather abstract, but it gives rise to other more explicit representations for the Lebesgue integral upon using the fact that a nonnegative X can be expressed (a.s.) as a limiting sequence of nonnegative simple random variables. Then, the Lebesgue integral of any nonnegative X can be computed as the limit of the Lebesgue integrals corresponding to the sequence of nonnegative simple random variables that converges to X. To see how this arises, we need to first recall the concept of pointwise convergence of a sequence of random variables. We recall that a sequence of random variables, X_n, n = 1, 2, ..., is said to converge pointwise almost surely to some random variable X when X_n → X (a.s.) as n → ∞, i.e., $\mathbb{P}(\{\omega\in\Omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\}) = 1$. A sequence of nonnegative random variables X_n, n = 1, 2, ..., is said to converge pointwise monotonically to X (a.s.) if X_n → X and 0 ≤ X₁ ≤ X₂ ≤ ... ≤ X (a.s.), i.e., ℙ({ω ∊ Ω : X_n(ω) ↗ X(ω)}) = 1. If we have such a sequence, then each successive random variable in the sequence will approximate X better than the previous one and the approximation becomes exact in the limit n → ∞. It stands to reason that the corresponding sequence of expected values (Lebesgue integrals for each X_n) will be increasingly better approximations and will give the expected value (Lebesgue integral) of X in the limit n → ∞. This is summarized (without proof) in the following well-known theorem, appropriately called the Monotone Convergence Theorem (MCT) for random variables.

Theorem 9.1

(Monotone Convergence for random variables). If Xn,n = 1,2,..., is a sequence of nonnegative random variables converging pointwise monotonically to X (a.s.), then

$$\lim_{n\to\infty} E[X_n] = E[X], \quad \text{i.e.,} \quad \lim_{n\to\infty}\int_\Omega X_n(\omega)\,\mathrm{d}\mathbb{P}(\omega) = \int_\Omega X(\omega)\,\mathrm{d}\mathbb{P}(\omega).$$

In fact, we have monotonic convergence, i.e., E[Xn] ↗ E[X] as n → ∞.

The MCT has many applications. For instance, we recover the known continuity properties of a probability measure ℙ. That is, let A1, A2,..., An,... be subsets (events) in Ω.

If A₁ ⊂ A₂ ⊂ ..., then $\lim_{n\to\infty} A_n = \bigcup_{n=1}^{\infty} A_n$ and we have (monotone continuity from below):

$$\mathbb{P}\Big(\bigcup_{n=1}^{\infty} A_n\Big) = \lim_{n\to\infty}\mathbb{P}(A_n).$$

If A₁ ⊃ A₂ ⊃ ..., then $\lim_{n\to\infty} A_n = \bigcap_{n=1}^{\infty} A_n$ and we have (monotone continuity from above):

$$\mathbb{P}\Big(\bigcap_{n=1}^{\infty} A_n\Big) = \lim_{n\to\infty}\mathbb{P}(A_n).$$

These follow as a corollary of Theorem 9.1. For example, monotone continuity from above is obtained by defining the sequence of monotonically increasing random variables $X_n = 1 - \mathbb{I}_{A_n}$, where A_{n+1} ⊂ A_n for all n ≥ 1. Hence, $X_n \nearrow X \equiv 1 - \mathbb{I}_A$, $A = \bigcap_{n=1}^{\infty} A_n$. By MCT we have $\lim_{n\to\infty} E[X_n] = E[X]$, i.e.,

$$\lim_{n\to\infty} E[1 - \mathbb{I}_{A_n}] = E[1 - \mathbb{I}_A] \implies \lim_{n\to\infty}\mathbb{P}(A_n) = \mathbb{P}(A) = \mathbb{P}\Big(\bigcap_{n=1}^{\infty} A_n\Big).$$

We now apply MCT to obtain a more explicit formula for E[X] when X ≥ 0 (a.s.) by producing a monotonically increasing sequence of nonnegative simple random variables. There are many ways to do so. An explicit example is the sequence defined by

$$X_n(\omega) := \begin{cases} k/2^n & \text{if } \omega\in A_n^{(k)}, \\ 0 & \text{otherwise,} \end{cases} \quad (9.8)$$

with sets $A_n^{(k)} := \{\omega\in\Omega : k/2^n \le X(\omega) < (k+1)/2^n\}$, $k = 0, \ldots, 2^{2n}$. For every n ≥ 1, X_n is a simple random variable:

$$X_n(\omega) = \sum_{k=0}^{2^{2n}} \frac{k}{2^n}\,\mathbb{I}_{A_n^{(k)}}(\omega) = \sum_{k=0}^{2^{2n}} \frac{k}{2^n}\,\mathbb{I}_{\{X\in[\frac{k}{2^n},\frac{k+1}{2^n})\}}(\omega). \quad (9.9)$$

We leave it as an exercise for the reader to verify that Xn(ω) ↗ X(ω) for any X ≥ 0. Therefore, by using (9.6) as the expected value for every simple Xn and passing to the limit with the use of MCT, the Lebesgue integral (expectation) of any nonnegative random variable X is given by

$$E[X] = \lim_{n\to\infty} E[X_n] = \lim_{n\to\infty}\sum_{k=0}^{2^{2n}} \frac{k}{2^n}\,\mathbb{P}\Big(\frac{k}{2^n}\le X < \frac{k+1}{2^n}\Big). \quad (9.10)$$

This is equivalent to (9.4), but here the partitions are chosen such that the convergence is monotonic, whereas the series in (9.4) is generally not monotonic. For arbitrary X = X⁺ − X⁻ we use the above construction for the two separate nonnegative random variables X⁺ and X⁻ and subtract the two expectations, giving E[X], assuming we don't have ∞ − ∞.
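The dyadic construction in (9.8)–(9.10) is easy to test numerically. The following sketch (assuming NumPy; the choice X(ω) = ω² on Ω = [0, 1] with the uniform measure is our illustrative assumption) floors X to the grid k/2ⁿ, which is exactly the simple random variable in (9.9); for a fixed set of sampled ω's the sample means of X_n increase to E[X] = 1/3, in line with the MCT:

```python
import numpy as np

# Dyadic simple-function approximation (9.8)-(9.10) for X(omega) = omega**2
# on Omega = [0, 1] with the uniform measure, so E[X] = 1/3. Flooring X to
# the grid k/2^n gives X_n <= X with X_n increasing pointwise in n.

rng = np.random.default_rng(0)
omega = rng.uniform(0.0, 1.0, size=1_000_000)  # draws of omega from P
X = omega**2

for n in (1, 2, 4, 8, 16):
    X_n = np.floor(X * 2**n) / 2**n            # value k/2^n on {X in [k/2^n, (k+1)/2^n)}
    print(n, X_n.mean())                        # E[X_n] increases to 1/3 (MCT)
```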

The Lebesgue integral above was presented in the general context of an abstract state space Ω with a generally uncountable number of abstract elements ω and random variables X (i.e., measurable functions) on a probability space (Ω, ℱ, ℙ). It is important to consider the case where Ω ⊂ ℝ, i.e., every ω is a real number. Recall our discussion of the Borel σ-algebra ℬ(ℝ). The pair (ℝ, ℬ(ℝ)), i.e., Ω = ℝ, ℱ = ℬ(ℝ), is a measurable space. The Lebesgue measure² on ℝ, which we denote by m, is a measure that assigns a nonnegative real value or infinity to each Borel set B ∊ ℬ(ℝ), i.e., m : ℬ(ℝ) → [0, ∞], such that all intervals [a, b], a ≤ b, have measure equal to their length: m([a, b]) = b − a. All semi-infinite intervals (−∞, a], (−∞, a), [b, ∞), or (b, ∞) have infinite Lebesgue measure. A single point has Lebesgue measure zero, m({x}) = 0 for any point x ∊ ℝ. The empty set also has zero measure. Any set B that is a countable union of points also has zero measure since m is countably additive, i.e.,

$$m\Big(\bigcup_{n=1}^{\infty} B_n\Big) = \sum_{n=1}^{\infty} m(B_n)$$

for all disjoint Borel sets B_n, n ≥ 1. This also implies that finite additivity holds, where $m\big(\bigcup_{n=1}^{N} B_n\big) = \sum_{n=1}^{N} m(B_n)$ for any N ≥ 1. The set of rational numbers ℚ is countable with Lebesgue measure zero. The irrationals ℝ ∖ ℚ are uncountable. Since (ℝ ∖ ℚ) ∪ ℚ = ℝ, the Lebesgue measure of any Borel set B is unchanged if we remove all the rational numbers from it, i.e., as a disjoint union, B = (B ∖ ℚ) ∪ (B ∩ ℚ), hence m(B) = m(B ∖ ℚ) + m(B ∩ ℚ) = m(B ∖ ℚ). Note that B ∩ ℚ ⊂ ℚ so m(B ∩ ℚ) ≤ m(ℚ) = 0 ⇒ m(B ∩ ℚ) = 0.

There are also other more peculiar Borel sets that are uncountable but yet have Lebesgue measure zero. The well-known Cantor ternary set on [0, 1], which is discussed in detail in many textbooks on real analysis, is such an example. In the interest of space, we don't discuss this set in any detail here. It suffices to note that the points in this set are too sparsely distributed over the unit interval [0, 1] and hence do not accumulate any length measure. On the other hand, the points in the Cantor set cannot be counted (i.e., listed in a sequence in one-to-one correspondence with the integers). The Cantor set is constructed by starting with [0, 1] and removing the middle third interval (1/3, 2/3) and then repeating this process of removing the middle third for all remaining intervals in succession. One can then make a simple argument to show that the set is not countable, in essentially the same manner that the total number of branches in a binomial tree having an infinite number of time steps cannot be counted either.

By the above discussion, we see that the triplet (Ω, ℱ, ℙ) ≔ ([0, 1], ℬ([0, 1]), m) serves as an example of a probability space where m acts as a uniform probability measure on the set Ω = [0, 1] ≡ {x ∊ ℝ : 0 ≤ x ≤ 1}. By our previous notation, we may also write this as Ω = {ω ∊ ℝ : 0 ≤ ω ≤ 1}. Any real-valued random variable X : [0, 1] → ℝ is a Borel (measurable) function where X⁻¹(B) ∊ ℬ([0, 1]) for every B ∊ ℬ(ℝ). Its expected value is then the Lebesgue integral w.r.t. the Lebesgue (uniform probability) measure m:

$$E[X] = \int_{[0,1]} X(\omega)\,\mathrm{d}m(\omega). \quad (9.11)$$

For example, the outcome of the experiment of picking a real number in [0, 1] uniformly at random is captured by the value of the random variable X(ω) = ω. The probability that a number is chosen within some arbitrary subinterval [a, b] ⊂ [0, 1] must therefore be the length of the interval. In this case the event is represented as the set {X ∊ [a, b]} = {a ≤ ω ≤ b}. Its probability is the Lebesgue integral of the indicator random variable $\mathbb{I}_{\{X\in[a,b]\}}(\omega) = \mathbb{I}_{\{a\le\omega\le b\}}$:

$$\mathbb{P}(X\in[a,b]) = E[\mathbb{I}_{\{X\in[a,b]\}}] = \int_{[0,1]} \mathbb{I}_{[a,b]}(\omega)\,\mathrm{d}m(\omega) \equiv \int_{[a,b]} \mathrm{d}m(\omega) = m([a,b]) = b - a.$$

Note that ℙ(Ω) = ℙ(X ∊ [0, 1]) = m([0, 1]) = 1. Combining this with the countable additivity property of m and the fact that m(B) ∊ [0, 1] for any B ∊ ℬ([0, 1]) = {B ∩ [0, 1] : B ∊ ℬ(ℝ)} shows that the Lebesgue measure m restricted to the unit interval [0, 1] is a proper probability measure. Note that we also get that the probability of picking any finite or countable set of numbers in [0, 1] is zero, since the Lebesgue measure of such a set is zero. In fact, the probability of picking a rational number is zero:

$$\mathbb{P}(X\in[0,1]\cap\mathbb{Q}) = \int_{[0,1]\cap\mathbb{Q}} \mathrm{d}m(\omega) = m([0,1]\cap\mathbb{Q}) = 0.$$

The probability of picking an irrational number is 1, ℙ(X ∊ [0, 1] ∖ ℚ) = m([0, 1] ∖ ℚ) = 1. As well, the probability that we pick a number to be in the Cantor set C is zero since ℙ(X ∊ C) = m(C) = 0.

If we have a Lebesgue-measurable set Ω ⊂ ℝ with finite nonzero measure, 0 < m(Ω) < ∞, then the Lebesgue measure restricted to Ω, denoted by m_Ω, gives m_Ω(B) = m(B) for any set B ∊ ℬ(Ω). [Note: We assume that Ω is a Borel set since any Lebesgue-measurable set is either a Borel set or close to it in the sense that all sets that are Lebesgue-measurable and not Borel are null sets of Lebesgue measure zero.] The set Ω has "nonzero finite length." Typically, Ω is an interval or a combination of intervals, but need not be. In the above example, Ω = [0, 1] (i.e., the unit interval) and m_{[0,1]}(B) = m(B) for any B ∊ ℬ([0, 1]). Then, m_Ω : ℬ(Ω) → [0, c], c = m(Ω). So m_Ω is a uniform measure on the space (Ω, ℬ(Ω)). We simply normalize this measure to obtain a uniform probability measure defined by ℙ(B) ≔ (1/c)·m_Ω(B), for all B ∊ ℬ(Ω), with ℙ(Ω) = (1/c)·m_Ω(Ω) = c/c = 1. Hence, (Ω, ℬ(Ω), ℙ) is a probability space where ℬ(Ω) contains all the events.
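As a quick numerical illustration (a sketch in Python assuming NumPy; the choice Ω = [2, 5] is hypothetical), the normalized measure ℙ(B) = m_Ω(B)/c assigns to any subinterval its length relative to c = m(Ω), which a Monte Carlo draw of uniform points reproduces:

```python
import numpy as np

# Uniform probability measure P(B) = m(B)/c on Omega = [2, 5], c = m(Omega) = 3.
rng = np.random.default_rng(1)
a, b = 2.0, 5.0
c = b - a
omega = rng.uniform(a, b, size=1_000_000)      # uniform draws from Omega

lo, hi = 2.5, 4.0                               # test event B = [2.5, 4.0]
p_exact = (hi - lo) / c                         # P(B) = m(B)/c = 0.5
p_mc = np.mean((omega >= lo) & (omega <= hi))   # empirical frequency
print(p_exact, p_mc)
```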

At this point we recall (from real analysis) the Lebesgue integral over ℝ w.r.t. the Lebesgue measure m, which is defined regardless of any association to a probability space. The Lebesgue integral, w.r.t. m over ℝ, is defined for any Lebesgue-measurable function f, i.e., if f⁻¹(I) is a Lebesgue-measurable set for any interval I ⊂ ℝ. For our purposes, it suffices to assume that f is any Borel-measurable real-valued function, i.e., the set f⁻¹(B) ≡ {x ∊ ℝ : f(x) ∊ B} ∊ ℬ(ℝ) for any B ∊ ℬ(ℝ). The Lebesgue integral (w.r.t. m) over ℝ is first defined for a nonnegative Borel function f. By this we mean f is nonnegative almost everywhere (abbreviated a.e. or m-a.e.), i.e., the set of points where f < 0 has Lebesgue measure zero: m({x ∊ ℝ : f(x) < 0}) = 0. Essentially, by putting Ω = ℝ, ℱ = ℬ(ℝ), and replacing dℙ(ω) by dm(x) and X(ω) by f(x), definitions follow that are equivalent to those displayed above for the probability (expectation) Lebesgue integrals on the generally abstract measurable space (Ω, ℱ) with measure ℙ. In the Lebesgue integral the measurable space is (ℝ, ℬ(ℝ)) and the measure is the Lebesgue measure m. The Lebesgue integral of a nonnegative Borel function f (w.r.t. m over ℝ) can be defined by

$$\int_{\mathbb{R}} f(x)\,\mathrm{d}m(x) := \lim_{n\to\infty}\sum_{k=0}^{n} y_k\,m(B_k), \quad (9.12)$$

assuming the sum converges in ℝ or equals ∞. This is the analogue of the definition in (9.4). Here, the partition points y_k, k ≥ 0, are defined above (9.4) and $B_k := f^{-1}([y_k, y_{k+1})) \equiv \{x\in\mathbb{R} : y_k \le f(x) < y_{k+1}\}$, k = 0, 1, ..., n − 1, $B_n := f^{-1}([y_n,\infty)) = \{x\in\mathbb{R} : f(x)\ge y_n\}$.

An equivalent (and more standard) definition is to first define the integral for any simple function $\varphi(x) = \sum_{k=1}^{n} a_k\,\mathbb{I}_{A_k}(x)$, with the a_k's as real numbers and the A_k's as Lebesgue-measurable sets (for our purposes these are Borel sets in ℝ):

$$\int_{\mathbb{R}} \varphi(x)\,\mathrm{d}m(x) := \sum_{k=1}^{n} a_k\,m(A_k). \quad (9.13)$$

[Note that the usual convention 0 · ∞ = 0 is used throughout.] Then, we define the Lebesgue integral for any nonnegative f as the supremum over the set of Lebesgue integral values of all nonnegative simple functions that are less than or equal to f:

$$\int_{\mathbb{R}} f(x)\,\mathrm{d}m(x) := \sup\Big\{\int_{\mathbb{R}} \varphi(x)\,\mathrm{d}m(x) : \varphi\ \text{is a simple function},\ 0\le\varphi\le f\Big\}. \quad (9.14)$$

The Lebesgue integral, w.r.t. m over ℝ, of any Lebesgue-measurable (or Borel) function f = f⁺ − f⁻, with f^± ≥ 0, is then

$$\int_{\mathbb{R}} f(x)\,\mathrm{d}m(x) = \int_{\mathbb{R}} f^+(x)\,\mathrm{d}m(x) - \int_{\mathbb{R}} f^-(x)\,\mathrm{d}m(x). \quad (9.15)$$

As in the above case of the expectation of a random variable, f is integrable iff $\int_{\mathbb{R}} f^+\,\mathrm{d}m < \infty$ and $\int_{\mathbb{R}} f^-\,\mathrm{d}m < \infty$, i.e., |f| = f⁺ + f⁻ has a finite integral. If $\int_{\mathbb{R}} f^+\,\mathrm{d}m = \infty$ and $\int_{\mathbb{R}} f^-\,\mathrm{d}m < \infty$, then $\int_{\mathbb{R}} f\,\mathrm{d}m = \infty$; if $\int_{\mathbb{R}} f^+\,\mathrm{d}m < \infty$ and $\int_{\mathbb{R}} f^-\,\mathrm{d}m = \infty$, then $\int_{\mathbb{R}} f\,\mathrm{d}m = -\infty$; if $\int_{\mathbb{R}} f^+\,\mathrm{d}m = \int_{\mathbb{R}} f^-\,\mathrm{d}m = \infty$, then $\int_{\mathbb{R}} f\,\mathrm{d}m$ is not defined. Note that the Lebesgue integral of f over any measurable (Borel) set A ∊ ℬ(ℝ) is the integral of the (Borel) measurable function 𝕀_A f over ℝ:

$$\int_A f(x)\,\mathrm{d}m(x) = \int_{\mathbb{R}} \mathbb{I}_A(x)\,f(x)\,\mathrm{d}m(x). \quad (9.16)$$

[Note: f is assumed (Borel) measurable so that the function 𝕀_A f is also measurable for any measurable set A.] The positive and negative parts of 𝕀_A f are (𝕀_A f)^± = 𝕀_A f^±, giving

$$\int_A f(x)\,\mathrm{d}m(x) = \int_{\mathbb{R}} \mathbb{I}_A(x)\,f^+(x)\,\mathrm{d}m(x) - \int_{\mathbb{R}} \mathbb{I}_A(x)\,f^-(x)\,\mathrm{d}m(x) = \int_A f^+(x)\,\mathrm{d}m(x) - \int_A f^-(x)\,\mathrm{d}m(x). \quad (9.17)$$

The MCT for random variables (Theorem 9.1) has an obvious analogue for sequences of Lebesgue-measurable functions, which we now state for Borel-measurable functions. Recall that a sequence of functions, f_n, n = 1, 2, ..., converges pointwise almost everywhere (a.e.) to a function f when f_n(x) → f(x) (a.e.) as n → ∞. That is, the set for which convergence does not hold is a null set with Lebesgue measure zero: $m(\{x : \lim_{n\to\infty} f_n(x) \ne f(x)\}) = 0$. A sequence of nonnegative functions f_n, n = 1, 2, ..., is said to converge pointwise monotonically to f (a.e.) if f_n(x) → f(x) and 0 ≤ f₁(x) ≤ f₂(x) ≤ ... ≤ f(x) (a.e.), i.e., the set of values of x ∊ ℝ for which these relations do not hold has Lebesgue measure zero. If we have such a sequence, then each successive function will better approximate f and the approximation becomes exact in the limit n → ∞. Moreover, if the functions in the sequence are Borel functions, then their corresponding Lebesgue integrals converge to the Lebesgue integral of the limiting function f. This is summarized in the MCT (Monotone Convergence Theorem) for functions. Its proof is given in any standard textbook on real analysis.

Theorem 9.2

(Monotone Convergence for functions). If fn(x), n = 1, 2,..., x ∊ ℝ, is a sequence of nonnegative Borel functions converging pointwise monotonically (a.e.) to some function f, then

$$\lim_{n\to\infty}\int_{\mathbb{R}} f_n(x)\,\mathrm{d}m(x) = \int_{\mathbb{R}} f(x)\,\mathrm{d}m(x).$$

In fact, we have monotonic convergence, i.e., $\int_{\mathbb{R}} f_n(x)\,\mathrm{d}m(x) \nearrow \int_{\mathbb{R}} f(x)\,\mathrm{d}m(x)$ as n → ∞.

For every nonnegative measurable (or Borel) function f there exists a sequence of nonnegative simple functions, f_n, n ≥ 1, converging monotonically to f, i.e., f_n(x) ↗ f(x) (a.e.) as n → ∞. In fact, the analogue of the sequence in (9.9) is the sequence of simple functions:

$$f_n(x) = \sum_{k=0}^{2^{2n}} \frac{k}{2^n}\,\mathbb{I}_{f^{-1}([\frac{k}{2^n},\frac{k+1}{2^n}))}(x) \equiv \sum_{k=0}^{2^{2n}} \frac{k}{2^n}\,\mathbb{I}_{\{f(x)\in[\frac{k}{2^n},\frac{k+1}{2^n})\}}. \quad (9.18)$$

Using MCT (Theorem 9.2) and the simple formula in (9.13) gives us the following representation for the Lebesgue integral, w.r.t. m over ℝ, of any nonnegative function f:

$$\int_{\mathbb{R}} f(x)\,\mathrm{d}m(x) = \lim_{n\to\infty}\int_{\mathbb{R}} f_n(x)\,\mathrm{d}m(x) = \lim_{n\to\infty}\sum_{k=0}^{2^{2n}} \frac{k}{2^n}\,m(A_n^k), \quad (9.19)$$

where $A_n^k := f^{-1}([\frac{k}{2^n},\frac{k+1}{2^n})) \equiv \{x : \frac{k}{2^n}\le f(x) < \frac{k+1}{2^n}\}$. Such a series can be obtained for both f⁺ and f⁻, and then the Lebesgue integral of f = f⁺ − f⁻ w.r.t. m over ℝ is given by the difference, provided the result is defined (not ∞ − ∞). Note also that the Lebesgue integral of a nonnegative f over any measurable (Borel) set B, using (9.16), has the representation

$$\int_B f(x)\,\mathrm{d}m(x) = \lim_{n\to\infty}\int_{\mathbb{R}} \mathbb{I}_B(x)\,f_n(x)\,\mathrm{d}m(x) = \lim_{n\to\infty}\sum_{k=0}^{2^{2n}} \frac{k}{2^n}\,m(A_n^k\cap B). \quad (9.20)$$

The theory of Lebesgue integration w.r.t. m provides a framework for integrating a general class of real-valued functions. Lebesgue integration also forms the foundation for probability theory. The actual computation of many specific integrals is, in practice, a difficult task, as there are no general techniques for integrating arbitrary functions in closed form. However, most Lebesgue integrals that we encounter are equivalent to a corresponding Riemann integral. This then allows us to use all the known powerful techniques of elementary calculus to compute Riemann integrals. Consider a continuous function f : [a, b] → ℝ. Then, f is integrable and the function $F(x) := \int_{[a,x]} f(y)\,\mathrm{d}m(y) \equiv \int_a^x f(y)\,\mathrm{d}m(y)$ is differentiable for x ∊ (a, b), i.e., F′(x) = f(x). This is the fundamental theorem of calculus. The theorem just below relates the Lebesgue integral w.r.t. m and the Riemann integral of a bounded function over a finite interval. This theorem is proven in most standard textbooks on real analysis. Note that the statement "f is continuous (a.e.)" means that the set of points for which f is not continuous has Lebesgue measure zero.

Theorem 9.3

(Lebesgue versus Riemann Integration). Let f : [a, b] → ℝ be bounded. Then:

  (i) f is Riemann-integrable, i.e., $\int_a^b f(x)\,\mathrm{d}x$ is defined, if and only if f is continuous (a.e.).
  (ii) If f is Riemann-integrable, then the Lebesgue integral over [a, b] is also defined and the two integrals are the same, i.e., $\int_a^b f(x)\,\mathrm{d}x = \int_{[a,b]} f(x)\,\mathrm{d}m(x)$.

Hence, when computing the Lebesgue integral of a function over an interval with a well-defined Riemann integral, we can simply equate the Lebesgue integral with the corresponding Riemann integral. For example, we write $\int_a^b f(x)\,\mathrm{d}x \equiv \int_{[a,b]} f(x)\,\mathrm{d}m(x)$. As well, assuming the existence of Riemann integrals for semi-infinite or infinite intervals, we write $\int_a^\infty f(x)\,\mathrm{d}x \equiv \int_{[a,\infty)} f(x)\,\mathrm{d}m(x)$, $\int_{-\infty}^b f(x)\,\mathrm{d}x \equiv \int_{(-\infty,b]} f(x)\,\mathrm{d}m(x)$, $\int_{-\infty}^\infty f(x)\,\mathrm{d}x \equiv \int_{\mathbb{R}} f(x)\,\mathrm{d}m(x)$. The same goes for other well-defined improper Riemann integrals. More generally, if the integral is over some arbitrary Borel set B, it is also customary to use shorthand notation for the Lebesgue integral w.r.t. m over B as $\int_B f(x)\,\mathrm{d}x \equiv \int_B f(x)\,\mathrm{d}m(x)$. If the integrand function f (or f · 𝕀_B) is not continuous (a.e.), then the Riemann integral is not defined and the integral is understood to be the corresponding Lebesgue integral, assuming it exists.
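The two constructions can be compared directly on a simple example. The sketch below (Python with NumPy assumed; the test function f(x) = eˣ on [0, 1] is our own choice) computes the usual "vertical slicing" Riemann sum alongside the "horizontal slicing" Lebesgue sum (9.20), exploiting the fact that f is increasing so the level sets are intervals with explicitly known lengths; both approach e − 1:

```python
import numpy as np

# Riemann sum vs. the "horizontal" Lebesgue sum (9.20) for f(x) = exp(x)
# on [0, 1]; the exact value is e - 1. Since f is increasing with range
# [1, e], m({x in [0,1] : u <= f(x) < v}) = min(1, ln v) - max(0, ln u).

f, exact = np.exp, np.e - 1.0

def riemann_sum(n):
    x = np.linspace(0.0, 1.0, n, endpoint=False)    # left-endpoint rule
    return float(np.sum(f(x))) / n

def lebesgue_sum(n):
    total = 0.0
    for k in range(2 ** (2 * n)):                    # levels u = k/2^n
        u, v = k / 2.0**n, (k + 1) / 2.0**n
        if v <= 1.0 or u >= np.e:                    # empty level set
            continue
        lo = 0.0 if u <= 1.0 else np.log(u)
        hi = 1.0 if v >= np.e else np.log(v)
        total += u * max(hi - lo, 0.0)               # u * m(A_n^k on [0,1])
    return total

print(exact, riemann_sum(1000), lebesgue_sum(6))
```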

The expected value of a general real-valued random variable X defined on a probability space (Ω, ℱ, ℙ) is a Lebesgue integral w.r.t. ℙ over Ω. This construction is general, yet not always practical when dealing with a generally abstract sample space Ω. For discrete random variables, the expectation reduces to a sum involving the probability mass function. We also saw that, for a continuous uniform random variable, its expectation (Lebesgue integral w.r.t. ℙ) reduces to a Lebesgue integral w.r.t. m and hence the latter can also be expressed as a Riemann integral. The transformation from a Lebesgue integral w.r.t. measure ℙ over Ω into a Lebesgue (or Riemann) integral over ℝ makes the theory more practical. This allows E[X] to be expressed in terms of integrals over ℝ, rather than over Ω, as follows.

We begin by recalling what a distribution measure is for a random variable X. Let B be any Borel set in ℝ. The distribution measure of X w.r.t. a probability measure ℙ is defined by the set function

$$\mu_X(B) := \mathbb{P}(X^{-1}(B)) \equiv \mathbb{P}(X\in B). \quad (9.21)$$

It is important to note that μ_X measures subsets in ℝ, whereas ℙ measures subsets in Ω. Namely, the probability of the event {X ∊ B} is computed as a μ_X-measure of B. The cumulative distribution function (CDF), F_X, of the random variable X, w.r.t. ℙ, is then given in terms of this measure by

$$F_X(x) := \mu_X((-\infty, x]) = \mathbb{P}(X\in(-\infty, x]) \equiv \mathbb{P}(X\le x), \quad x\in\mathbb{R}. \quad (9.22)$$

Recall that any CDF is generally a right-continuous monotone nondecreasing function with limiting values limx→−∞ FX (x) ≡ FX (−∞) = 0 and limx→∞ FX(x) ≡ FX(∞) = 1.

Since the measure ℙ is countably additive, μ_X is countably additive. Indeed, for any countable collection of pairwise disjoint Borel sets {B_i} the corresponding pre-images {X⁻¹(B_i)} are pairwise disjoint sets in Ω. Hence,

$$\mu_X\Big(\bigcup_i B_i\Big) = \mathbb{P}\Big(X^{-1}\Big(\bigcup_i B_i\Big)\Big) = \mathbb{P}\Big(\bigcup_i X^{-1}(B_i)\Big) = \sum_i \mathbb{P}(X^{-1}(B_i)) = \sum_i \mu_X(B_i).$$

Moreover, the measure is normalized, μ_X(ℝ) = ℙ(X ∊ ℝ) = 1, so (ℝ, ℬ(ℝ), μ_X) is in fact a probability space.

Since (ℝ, ℬ(ℝ), μ_X) is a measure space, we can define a Lebesgue integral of a Borel function f(x) w.r.t. μ_X over ℝ in a similar manner as the Lebesgue integral w.r.t. the measure m. For a simple function $\varphi(x) = \sum_{k=1}^{n} a_k\,\mathbb{I}_{A_k}(x)$, with the A_k's as Borel sets in ℝ:

$$\int_{\mathbb{R}} \varphi(x)\,\mathrm{d}\mu_X(x) := \sum_{k=1}^{n} a_k\,\mu_X(A_k). \quad (9.23)$$

This is the analogue of (9.13). Then, the Lebesgue integral w.r.t. μX(x) over ℝ for any nonnegative f is defined as the supremum over Lebesgue integral values of all nonnegative simple functions that are less than or equal to f:

$$\int_{\mathbb{R}} f(x)\,\mathrm{d}\mu_X(x) := \sup\Big\{\int_{\mathbb{R}} \varphi(x)\,\mathrm{d}\mu_X(x) : \varphi\ \text{is a simple function},\ 0\le\varphi\le f\Big\}. \quad (9.24)$$

The Lebesgue integral, w.r.t. μ_X over ℝ, of any Borel function f = f⁺ − f⁻ is then given by the difference of the two nonnegative Lebesgue integrals:

$$\int_{\mathbb{R}} f(x)\,\mathrm{d}\mu_X(x) = \int_{\mathbb{R}} f^+(x)\,\mathrm{d}\mu_X(x) - \int_{\mathbb{R}} f^-(x)\,\mathrm{d}\mu_X(x), \quad (9.25)$$

and for any Borel set B in ℝ we have

$$\int_B f(x)\,\mathrm{d}\mu_X(x) = \int_{\mathbb{R}} \mathbb{I}_B(x)\,f(x)\,\mathrm{d}\mu_X(x) = \int_{\mathbb{R}} \mathbb{I}_B(x)\,f^+(x)\,\mathrm{d}\mu_X(x) - \int_{\mathbb{R}} \mathbb{I}_B(x)\,f^-(x)\,\mathrm{d}\mu_X(x). \quad (9.26)$$

Using MCT and the sequence of simple functions in (9.18), the Lebesgue integral of any nonnegative function f, w.r.t. μX over ℝ, is given by

$$\int_{\mathbb{R}} f(x)\,\mathrm{d}\mu_X(x) = \lim_{n\to\infty}\int_{\mathbb{R}} f_n(x)\,\mathrm{d}\mu_X(x) = \lim_{n\to\infty}\sum_{k=0}^{2^{2n}} \frac{k}{2^n}\,\mu_X(A_n^k), \quad (9.27)$$

where $A_n^k := f^{-1}([\frac{k}{2^n},\frac{k+1}{2^n}))$.

Based on the above construction, we have the following result, which gives the expected value of a random variable g(X) as a Lebesgue integral of the (ordinary) function g(x) w.r.t. the distribution measure of X over ℝ. Here we assume that g(X) is integrable, i.e., E[|g(X)|] < ∞.

Theorem 9.4.

Given a random variable X on (Ω, ℱ, ℙ) and a Borel function g : ℝ → ℝ,

$$E[g(X)] \equiv \int_\Omega g(X(\omega))\,\mathrm{d}\mathbb{P}(\omega) = \int_{\mathbb{R}} g(x)\,\mathrm{d}\mu_X(x). \quad (9.28)$$

Proof. In many analysis textbooks we find a standard way to prove this by first showing that (9.28) follows trivially for the simplest case of an indicator function g(x) = 𝕀_A(x), and then by the linearity property of integrals the result is shown to hold for any simple function $g(x) = \sum_{k=1}^{n} a_k\,\mathbb{I}_{A_k}(x)$. Equation (9.28) is then shown to hold for any nonnegative function by using MCT, and finally it follows for any g(x) = g⁺(x) − g⁻(x). As an alternate proof, it is instructive to see how (9.28) follows directly using (9.27) for nonnegative g with the sequence g_n defined as in (9.18) with f replaced by g:

$$\int_{\mathbb{R}} g(x)\,\mathrm{d}\mu_X(x) = \lim_{n\to\infty}\int_{\mathbb{R}} g_n(x)\,\mathrm{d}\mu_X(x) = \lim_{n\to\infty}\sum_{k=0}^{2^{2n}} \frac{k}{2^n}\,\mu_X(A_n^k) = \lim_{n\to\infty}\sum_{k=0}^{2^{2n}} \frac{k}{2^n}\,\mathbb{P}(X^{-1}(A_n^k)) = \lim_{n\to\infty}\sum_{k=0}^{2^{2n}} \frac{k}{2^n}\,\mathbb{P}\Big(g(X)\in\Big[\frac{k}{2^n},\frac{k+1}{2^n}\Big)\Big) \equiv \int_\Omega g(X(\omega))\,\mathrm{d}\mathbb{P}(\omega).$$

Here we used the definition in (9.21) for each set $A_n^k = g^{-1}([\frac{k}{2^n},\frac{k+1}{2^n}))$ and manipulated the set $X^{-1}(A_n^k) \equiv \{\omega\in\Omega : X(\omega)\in A_n^k\} = \{\omega\in\Omega : g(X(\omega))\in[\frac{k}{2^n},\frac{k+1}{2^n})\}$. Hence, (9.28) holds for both nonnegative parts g⁺ and g⁻ of g, i.e., it must hold for any Borel function g.

The above expectation formula is useful if the integral on the right-hand side of (9.28) can be computed more explicitly. This is still in the form of a Lebesgue integral w.r.t. the measure μX. However, we can reduce this to more familiar forms depending on the type of random variable X.

In the simplest case of a constant random variable X ≡ a, the distribution measure is the Dirac measure, μX(·) = δa(·), i.e., for any Borel set B,

$$\delta_a(B) := \begin{cases} 1 & \text{if } a\in B, \\ 0 & \text{if } a\notin B. \end{cases} \quad (9.29)$$

In particular, δ_a({a}) = 1 and δ_a({x}) = 0 for x ≠ a. By (9.22), the CDF is simply $F_X(x) := \mu_X((-\infty,x]) = \delta_a((-\infty,x]) = \mathbb{I}_{\{x\ge a\}}$. The expected value as a Lebesgue integral w.r.t. ℙ is trivially given since E[g(X)] = g(a)ℙ(Ω) = g(a). According to (9.28), this value must equal

$$E[g(X)] = \int_{\mathbb{R}} g(x)\,\mathrm{d}\delta_a(x) = g(a). \quad (9.30)$$

This gives us the formula for computing an integral w.r.t. the Dirac measure and is known as the sifting property, since the Dirac measure picks out only the integrand value at x = a. Extending this to any purely discrete random variable that can take on distinct values a_i with probability p_i > 0, i = 1, 2, ..., $\sum_i p_i = 1$, the distribution measure is a linear combination of Dirac measures at each point in the range of X: $\mu_X(B) = \sum_i p_i\,\delta_{a_i}(B)$. In particular, μ_X({a_i}) = p_i. Then, using (9.30) and the fact that the integral w.r.t. a linear combination of measures is the linear combination of integrals w.r.t. each measure,

$$E[g(X)] = \int_{\mathbb{R}} g(x)\,\mathrm{d}\mu_X(x) = \sum_i p_i\int_{\mathbb{R}} g(x)\,\mathrm{d}\delta_{a_i}(x) = \sum_i p_i\,g(a_i). \quad (9.31)$$

Using (9.22), the CDF is given as the piecewise constant (staircase function),

$$F_X(x) = \sum_i p_i\,\delta_{a_i}((-\infty,x]) = \sum_i p_i\,\mathbb{I}_{\{x\ge a_i\}}, \quad (9.32)$$

with jump discontinuities only at points x = ai, i.e., FX(ai) − FX(ai−) = pi and FX(x) − FX(x−) = 0 for all x values not equal to any ai. This is the familiar form for the CDF of a purely discrete random variable. Observe that, if g is continuous at the points ai, the expectation of g(X) is equal to the Riemann-Stieltjes integral of g with FX as integrator:

$$E[g(X)] = \int_{-\infty}^{\infty} g(x)\,\mathrm{d}F_X(x) = \sum_i g(a_i)\,\big(F_X(a_i) - F_X(a_i-)\big) = \sum_i g(a_i)\,p_i. \quad (9.33)$$
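A minimal sketch (Python, NumPy assumed; the support points and weights are arbitrary choices) of (9.31) and (9.33): the jumps of the staircase CDF (9.32) recover the PMF, and weighting g by either the PMF or the CDF jumps gives the same expectation:

```python
import numpy as np

# Discrete X with (hypothetical) support a_i and PMF p_i; g(x) = x**2.
a = np.array([-1.0, 0.5, 2.0])
p = np.array([0.2, 0.5, 0.3])                  # sums to 1
g = lambda x: x**2

F = np.cumsum(p)                               # staircase CDF values F_X(a_i)
jumps = np.diff(np.concatenate(([0.0], F)))    # F_X(a_i) - F_X(a_i-) = p_i
print(np.sum(g(a) * p), np.sum(g(a) * jumps))  # both equal E[g(X)] = 1.525
```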

We can also define a distribution measure for a random variable X given as a so-called mixture of random variables, each having its own distribution measure, i.e., $\mu_X(B) = \sum_i p_i\,\mu_i(B)$ where $p_i \ge 0$, i = 1, 2, ..., $\sum_i p_i = 1$, and each μ_i is a distribution measure on (ℝ, ℬ(ℝ)). Then, the expected value of g(X) is given as a linear combination of Lebesgue integrals w.r.t. each measure μ_i:

$$E[g(X)] = \int_{\mathbb{R}} g(x)\,\mathrm{d}\mu_X(x) = \sum_i p_i\int_{\mathbb{R}} g(x)\,\mathrm{d}\mu_i(x). \quad (9.34)$$

Let us now consider the application of Theorem 9.4 to the most common and important case of a continuous random variable. This is the case where the random variable has a probability density function (PDF) f_X(x). In this case there is a nonnegative integrable Borel function f_X such that

$$\mu_X(B) = \int_B f_X(x)\,\mathrm{d}m(x), \quad \text{for all Borel sets } B, \quad (9.35)$$

with CDF

$$F_X(b) := \int_{(-\infty,b]} f_X(x)\,\mathrm{d}m(x) \quad (9.36)$$

for all b ∊ ℝ. In this case, the CDF F_X is continuous and hence has no jumps, i.e., F_X(x) = F_X(x−) = F_X(x+) for all x. Moreover, F_X is not just continuous but in fact an absolutely continuous function. The reader may wish to consult a textbook on real analysis to learn more about this technical detail. It suffices here to point out that F_X is differentiable (a.e.) and its derivative is the PDF: F′_X = f_X (a.e.). The distribution measure is said to be absolutely continuous w.r.t. the Lebesgue measure m. In this case, X is an absolutely continuous random variable, but we simply say that it is a continuous random variable.³ Based on (9.35) it is easy to prove, using similar steps as in the above proof of (9.28) combined with the linearity property of the Lebesgue integral w.r.t. m and the fact that $\mu_X(A_n^k) = \int_{A_n^k} f_X(x)\,\mathrm{d}m(x)$, that the expectation is a Lebesgue integral of g f_X w.r.t. m:

$$E[g(X)] = \int_{\mathbb{R}} g(x)\,f_X(x)\,\mathrm{d}m(x). \quad (9.37)$$

In most (and for our purposes essentially all) applications, the density fX (when it exists) is bounded and continuous (a.e.) on ℝ, i.e., the CDF is the Riemann integral of the PDF,

$$F_X(b) := \int_{-\infty}^{b} f_X(x)\,\mathrm{d}x, \quad b\in\mathbb{R}. \quad (9.38)$$

Moreover, if g is continuous (a.e.) on ℝ, then (9.37) reduces to the familiar well-known formula for the expected value of g(X) as a Riemann integral:

$$E[g(X)] = \int_{-\infty}^{\infty} g(x)\,f_X(x)\,\mathrm{d}x, \quad (9.39)$$

assuming both g^± are Riemann-integrable and $E[|g(X)|] \equiv \int_{-\infty}^{\infty} |g(x)|\,f_X(x)\,\mathrm{d}x < \infty$, i.e., E[g⁺(X)] < ∞ and E[g⁻(X)] < ∞.

The standard normal X ~ Norm(0, 1) is an important example of a continuous random variable, having the positive Gaussian density $f_X(x) = n(x) := \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$ for all real x. The distribution measure μ_X is absolutely continuous w.r.t. the Lebesgue measure. In fact, f_X is bounded and continuous on ℝ with the CDF given by the Riemann integral:

$$F_X(b) := \mathbb{P}(X\le b) = \int_{-\infty}^{b} n(x)\,\mathrm{d}x \equiv \mathcal{N}(b), \quad b\in\mathbb{R}.$$

Clearly, $\mathcal{N}'(x) = n(x)$, where $F_X(x) = \mathcal{N}(x)$ is a proper CDF since it is monotonically increasing from $\mathcal{N}(-\infty) = 0$ to $\mathcal{N}(\infty) = 1$.
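These standard normal facts are easy to verify numerically. The following sketch (assuming NumPy and SciPy are available) checks that the Riemann integral of n(x) reproduces the CDF value 𝒩(b) and that (9.39) gives the second moment E[X²] = 1:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Standard normal density n(x); compare the quadrature CDF with norm.cdf,
# and compute E[X^2] via (9.39).
n_pdf = lambda x: np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)

b = 0.7
cdf_quad, _ = quad(n_pdf, -np.inf, b)
print(cdf_quad, norm.cdf(b))                              # agree

m2, _ = quad(lambda x: x**2 * n_pdf(x), -np.inf, np.inf)
print(m2)                                                 # ~1.0
```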

We now wish to make one last connection of the expectation in (9.28) to the so-called Lebesgue-Stieltjes integral encountered in real analysis. For any type of random variable X, we then realize that E[g(X)], when g is continuous (a.e.), is simply a Riemann-Stieltjes integral of g with CDF F_X as integrator. Beginning with (9.22), observe that for any semi-open interval (a, b] the probability ℙ(X ∊ (a, b]) is given equivalently by the distribution measure μ_X((a, b]) or the difference of the CDF values at the interval endpoints:

$$\mu_X((a,b]) = \mu_X((-\infty,b]) - \mu_X((-\infty,a]) = F_X(b) - F_X(a). \quad (9.40)$$

For any semi-open interval, $\ell_{F_X}((a,b]) := F_X(b) - F_X(a)$ defines its "length relative to F_X." Since $\ell_{F_X}((a,c]) = \ell_{F_X}((a,b]) + \ell_{F_X}((b,c])$ for a < b < c, this length is additive (cumulative) for adjoining intervals. In the special case that F_X(x) = x we recover the usual length b − a. In contrast to the usual length, an infinitesimal interval does not necessarily have length zero relative to F_X, since F_X is a CDF that may have jump discontinuities. In fact, a singleton set has length equal to the size of the jump discontinuity of the CDF:

$$\ell_{F_X}(\{x\}) = \lim_{\epsilon\searrow 0}\ell_{F_X}((x-\epsilon, x]) = F_X(x) - \lim_{\epsilon\searrow 0} F_X(x-\epsilon) = F_X(x) - F_X(x-).$$

For a purely continuous random variable X there are no jumps, so all points have zero such length, but for a random variable having a discrete part there is a nonzero length given by the PMF values p_i = F_X(x_i) − F_X(x_i−) at the points corresponding to the countable set of discrete values {x_i} of X. For the other types of intervals we have $\ell_{F_X}([a,b]) = F_X(b) - F_X(a-)$, $\ell_{F_X}((a,b)) = F_X(b-) - F_X(a)$, and $\ell_{F_X}([a,b)) = F_X(b-) - F_X(a-)$.

Based on the above definition of the length ℓ_F, for a given CDF F, the definition of the Lebesgue measure m is generalized to the Lebesgue-Stieltjes measure generated by F. This measure is denoted by m_F. The measure m_F : ℬ(ℝ) → [0, 1], defined for any Borel set B ∊ ℬ(ℝ), is the smallest total length relative to F of all countable unions of semi-open intervals in ℝ that contain B:

$$m_F(B) := \inf\Big\{\sum_{n=1}^{\infty}\ell_F(I_n) : I_n = (a_n, b_n],\ a_n\le b_n,\ B\subset\bigcup_{n=1}^{\infty} I_n\Big\}. \quad (9.41)$$

Hence, m_F is a measure that assigns a value in [0, 1] to each Borel set B, such that all semi-open intervals I_n = (a_n, b_n] have measure $m_F((a_n, b_n]) = \ell_F((a_n, b_n]) = F(b_n) - F(a_n)$. All intervals, including semi-infinite intervals, have finite measure; in particular, m_F(ℝ) = F(∞) − F(−∞) = 1 − 0 = 1. This measure is countably additive on ℬ(ℝ):

$$m_F\Big(\bigcup_{n=1}^{\infty} B_n\Big) = \sum_{n=1}^{\infty} m_F(B_n)$$

for all disjoint Borel sets B_n, n ≥ 1. Because of the equivalence relation in (9.40) and the fact that the σ-algebra generated by all semi-open intervals in ℝ is ℬ(ℝ), the distribution measure is the same as the Lebesgue-Stieltjes measure generated by the CDF, i.e., $\mu_X(B) = m_{F_X}(B)$ for every Borel set B. For example, X ≡ a has CDF $F_X(x) = \mathbb{I}_{\{x\ge a\}}$. So, the measure $m_{F_X} = \delta_a$ is the Dirac measure concentrated at a. For any purely discrete random variable, with CDF in (9.32), the measure $m_{F_X} = \sum_i p_i\,\delta_{a_i}$ is a weighted sum of the Dirac measures for each point with its corresponding PMF value, i.e., $m_{F_X}(B) = \sum_i p_i\,\delta_{a_i}(B)$.

The Lebesgue-Stieltjes integral of a function w.r.t. $m_{F_X}$ is defined in the same manner as in (9.23), (9.24), (9.25), and (9.26), with the notation μ_X replaced by $m_{F_X}$. In particular, the expectation in (9.28) is now recognized as the Lebesgue-Stieltjes integral of g w.r.t. $m_{F_X}$ over ℝ, denoted by $\int_{\mathbb{R}} g(x)\,\mathrm{d}m_{F_X}(x)$, where we write equivalently

$$E[g(X)] = \int_{\mathbb{R}} g(x)\,\mathrm{d}\mu_X(x) \equiv \int_{\mathbb{R}} g(x)\,\mathrm{d}m_{F_X}(x). \quad (9.42)$$

If g is continuous (a.e.) then it can be shown that the Lebesgue-Stieltjes integral is the same as the corresponding Riemann-Stieltjes integral of g with CDF FX as integrator function:

$$E[g(X)] = \int_{\mathbb{R}} g(x)\,\mathrm{d}m_{F_X}(x) = \int_{-\infty}^{\infty} g(x)\,\mathrm{d}F_X(x). \quad (9.43)$$

Since any CDF is a monotone (nondecreasing) bounded function, it is of bounded variation and hence can always be used as an integrator. If X is absolutely continuous, then dF_X(x) = F′_X(x) dx = f_X(x) dx and the Riemann-Stieltjes integral is the same as the usual Riemann integral in (9.39). For a purely discrete random variable with CDF in (9.32), the Riemann-Stieltjes integral in (9.43) is given by (9.33). The Riemann-Stieltjes integral in (9.43) also gives E[g(X)] for all other types of mixture random variables, as shown in Section 11.2.1, for example. Such random variables can have a discrete and continuous part (the continuous part being either absolutely continuous and/or singularly continuous).
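For a mixture with both parts, the Stieltjes integral splits into an atom sum plus a density integral. A small sketch (Python with SciPy assumed; the mixture itself, an atom of mass 1/2 at 0 plus a uniform density of total mass 1/2 on [0, 1], is a hypothetical example):

```python
import numpy as np
from scipy.integrate import quad

# E[g(X)] via (9.43) for a mixed CDF: dF_X = (1/2) delta_0 + (1/2) dx on [0, 1].
# For g(x) = exp(x): E[g(X)] = 0.5*g(0) + 0.5*int_0^1 exp(x) dx.
g = np.exp
atom_part = 0.5 * g(0.0)                  # jump of F_X at 0 times g(0)
cont_part = 0.5 * quad(g, 0.0, 1.0)[0]    # absolutely continuous part
print(atom_part + cont_part, 0.5 + 0.5 * (np.e - 1.0))
```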

Based on the above expectation formulas, one can also proceed to compute various quantities such as the moments E[Xⁿ], n ≥ 1, of a real-valued random variable X (assuming E[|X|ⁿ] < ∞); the moment generating function M_X(t) ≔ E[e^{tX}], which is either infinite or a function of the real parameter t on some interval of convergence about t = 0; and the characteristic function $\varphi_X(t) := E[e^{itX}]$, $i \equiv \sqrt{-1}$, which is bounded for all t ∊ ℝ, i.e., |φ_X(t)| ≤ E[|e^{itX}|] = 1. The characteristic function (or the moment generating function) is useful for computing the mean and variance as well as various moments of a random variable. The relevant formulas and theorems related to these functions are part of standard material that is covered in most textbooks on probability theory and are hence (in the interest of space) simply omitted here.

In closing this section we mention one other general result, which is the change of variable formula for an expectation given in (9.44) just below. This allows us to compute the expectation E[g(X)] as an integral w.r.t. the distribution measure μ_Y of the random variable defined by Y ≔ g(X). Note that, since X is a random variable on (Ω, ℱ, ℙ), then Y is also a random variable on (Ω, ℱ, ℙ). That is, g is a Borel function, g⁻¹(B) ∊ ℬ(ℝ), giving Y⁻¹(B) = X⁻¹(g⁻¹(B)) ∊ ℱ for every B ∊ ℬ(ℝ). Assuming Y is integrable, i.e., E[|Y|] < ∞, then

$$E[Y] := \int_\Omega Y(\omega)\,\mathrm{d}\mathbb{P}(\omega) = \int_{\mathbb{R}} g(x)\,\mathrm{d}\mu_X(x) = \int_{\mathbb{R}} y\,\mathrm{d}\mu_Y(y), \quad (9.44)$$

where μ_Y(B) ≔ ℙ(Y⁻¹(B)) ≡ ℙ(Y ∊ B), for every B ∊ ℬ(ℝ). The proof of this formula follows readily from the relation between the two distribution measures: μ_Y(B) ≔ ℙ(Y⁻¹(B)) = ℙ(X⁻¹(g⁻¹(B))) = μ_X(g⁻¹(B)). Hence, in the first equation line in the proof of Theorem 9.4 we have $\mu_X(A_n^k) = \mu_X(g^{-1}([\frac{k}{2^n},\frac{k+1}{2^n}))) = \mu_Y([\frac{k}{2^n},\frac{k+1}{2^n}))$, i.e.,

$$\int_{\mathbb{R}} g(x)\,\mathrm{d}\mu_X(x) = \lim_{n\to\infty}\sum_{k=0}^{2^{2n}} \frac{k}{2^n}\,\mu_X(A_n^k) = \lim_{n\to\infty}\sum_{k=0}^{2^{2n}} \frac{k}{2^n}\,\mu_Y\Big(\Big[\frac{k}{2^n},\frac{k+1}{2^n}\Big)\Big) = \int_{\mathbb{R}} y\,\mathrm{d}\mu_Y(y),$$

which proves the formula. In summary, we see that E[g(X)] can be evaluated in three different ways: (i) by integrating the random variable Yg(X) w.r.t. ℙ over Ω, (ii) by integrating the function g(x) w.r.t. distribution measure μX(x) over ℝ, or (iii) by integrating the function ƒ(y) = y w.r.t. distribution measure μY(y) over ℝ.
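The three evaluations can be compared numerically. A sketch (Python; NumPy and SciPy assumed) for the illustrative choice X ~ Norm(0, 1) and g(x) = x², in which case Y = g(X) has the chi-square distribution with one degree of freedom and all three routes return E[X²] = 1:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, chi2

# Three evaluations of E[g(X)] as in (9.44), with X ~ Norm(0,1), g(x) = x**2,
# Y = g(X) ~ chi-square(1). All should print ~1.
rng = np.random.default_rng(42)
omega_draws = rng.standard_normal(1_000_000)

print(np.mean(omega_draws**2))                                   # (i) over Omega
print(quad(lambda x: x**2 * norm.pdf(x), -np.inf, np.inf)[0])    # (ii) g d(mu_X)
print(quad(lambda y: y * chi2.pdf(y, df=1), 0.0, np.inf)[0])     # (iii) y d(mu_Y)
```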

9.2 Multidimensional Lebesgue Integration

The above integration theory for Borel functions of a single variable (and random variables defined as functions of a single random variable) extends to the general multidimensional case. The construction of Lebesgue integrals mirrors the above single-variable case. We recall the Borel sets in ℝⁿ, ℬⁿ ≡ ℬ(ℝⁿ), and Borel functions defined over ℝⁿ in Section 6.2.2. In ℝ², we denote the Lebesgue measure by m₂ : ℬ² → [0, ∞]. It can be defined formally as an extension of the definition given above for the Lebesgue measure m. Given any two intervals I₁, I₂, then m₂ measures the area of the rectangle I₁ × I₂: m₂(I₁ × I₂) = ℓ(I₁)ℓ(I₂). In terms of the Lebesgue measure in one dimension we have the product measure m₂(I₁ × I₂) = m(I₁)m(I₂). Note that the null sets (having zero measure w.r.t. m₂) include any countable union of points in ℝ² as well as some uncountable sets of the form A × {b}, A ⊂ ℝ, b ∊ ℝ, or {a} × B, a ∊ ℝ, B ⊂ ℝ. Also, any graph or curve in ℝ² has zero m₂ measure. The Lebesgue integral of a Borel function over ℝ² w.r.t. measure m₂ is defined in similar fashion to the single-variable case above. A simple Borel function $\varphi(x,y) = \sum_{k=1}^{n} a_k\,\mathbb{I}_{A_k}(x,y)$, a_k ∊ ℝ, with all A_k ∊ ℬ², has Lebesgue integral

$$\int_{\mathbb{R}^2} \varphi(x,y)\,\mathrm{d}m_2(x,y) := \sum_{k=1}^{n} a_k\,m_2(A_k). \quad (9.45)$$

The Lebesgue integral of any nonnegative Borel function ƒ is defined by

$$\int_{\mathbb{R}^2} f(x,y)\,\mathrm{d}m_2(x,y) := \sup\Big\{\int_{\mathbb{R}^2} \varphi(x,y)\,\mathrm{d}m_2(x,y) : \varphi\ \text{is a simple function},\ 0\le\varphi\le f\Big\}. \quad (9.46)$$

The Lebesgue integral, w.r.t. m₂ over ℝ², of any Borel function f = f⁺ − f⁻, with f^± ≥ 0, is then

$$\int_{\mathbb{R}^2} f\,\mathrm{d}m_2 \equiv \int_{\mathbb{R}^2} f(x,y)\,\mathrm{d}m_2(x,y) = \int_{\mathbb{R}^2} f^+(x,y)\,\mathrm{d}m_2(x,y) - \int_{\mathbb{R}^2} f^-(x,y)\,\mathrm{d}m_2(x,y). \quad (9.47)$$

f is integrable iff $\int_{\mathbb{R}^2} |f|\,\mathrm{d}m_2 < \infty$. We denote this by writing f ∊ L¹(ℝ², ℬ², m₂). The Lebesgue integral of f over any Borel set B ⊂ ℝ² is the Lebesgue integral of 𝕀_B f over ℝ²:

$$\int_B f\,\mathrm{d}m_2 = \int_{\mathbb{R}^2} \mathbb{I}_B(x,y)\,f(x,y)\,\mathrm{d}m_2(x,y). \quad (9.48)$$

Assuming an integrable function, f ∊ L¹(ℝ², ℬ², m₂), Fubini's Theorem can be applied for interchanging the order of integration:

$$\int_{\mathbb{R}^2} f\,\mathrm{d}m_2 = \int_{\mathbb{R}}\Big(\int_{\mathbb{R}} f(x,y)\,\mathrm{d}m(x)\Big)\mathrm{d}m(y) = \int_{\mathbb{R}}\Big(\int_{\mathbb{R}} f(x,y)\,\mathrm{d}m(y)\Big)\mathrm{d}m(x). \quad (9.49)$$

What is important for us is the case when f is continuous (m₂-a.e.) in (9.47) or 𝕀_B f is continuous in (9.48). The Lebesgue integral in (9.47) is then equal to the Riemann (double) integral over ℝ²:

$$\int_{\mathbb{R}^2} f\,\mathrm{d}m_2 = \iint_{\mathbb{R}^2} f(x,y)\,\mathrm{d}x\,\mathrm{d}y = \int_{-\infty}^{\infty}\Big(\int_{-\infty}^{\infty} f(x,y)\,\mathrm{d}x\Big)\mathrm{d}y = \int_{-\infty}^{\infty}\Big(\int_{-\infty}^{\infty} f(x,y)\,\mathrm{d}y\Big)\mathrm{d}x, \quad (9.50)$$

where we assume f ∊ L¹(ℝ², ℬ², m₂), i.e., the function is integrable. In all our applications this will be the case where, for fixed x ∊ ℝ, f is continuous in y and, for fixed y ∊ ℝ, f is continuous in x. The set B in (9.48) is usually a rectangular region [a, b] × [c, d], or of type a ≤ x ≤ b, h₁(x) ≤ y ≤ h₂(x), or g₁(y) ≤ x ≤ g₂(y), c ≤ y ≤ d, etc. Given a set B = B₁ × B₂ = {(x, y) ∊ ℝ² : x ∊ B₁, y ∊ B₂}, and assuming 𝕀_B f is continuous and integrable on ℝ², the Lebesgue integral in (9.48) is then

$$\int_{B_1\times B_2} f\,\mathrm{d}m_2 = \int_{B_2}\Big(\int_{B_1} f(x,y)\,\mathrm{d}x\Big)\mathrm{d}y = \int_{B_1}\Big(\int_{B_2} f(x,y)\,\mathrm{d}y\Big)\mathrm{d}x. \quad (9.51)$$

We note that when the integrals only have meaning as Lebesgue integrals, we interpret the Riemann integrals as convenient shorthand notation for the corresponding Lebesgue integrals.
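A quick numerical check of the interchange of integration order in (9.49)–(9.51) (a sketch assuming NumPy and SciPy; the integrand f(x, y) = e^{−x−2y} on [0, ∞) × [0, ∞) is our own choice, with exact iterated integral 1/2):

```python
import numpy as np
from scipy.integrate import quad

# Fubini check: iterated integrals of f(x, y) = exp(-x - 2y) over
# [0, inf) x [0, inf) agree in either order; exact value is 1 * 1/2 = 0.5.
f = lambda x, y: np.exp(-x - 2.0 * y)

inner_over_x = lambda y: quad(lambda x: f(x, y), 0.0, np.inf)[0]
inner_over_y = lambda x: quad(lambda y: f(x, y), 0.0, np.inf)[0]

print(quad(inner_over_x, 0.0, np.inf)[0])   # dx inside, dy outside
print(quad(inner_over_y, 0.0, np.inf)[0])   # dy inside, dx outside
```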

In ℝ³, the Lebesgue measure, m₃ : ℬ³ → [0, ∞], measures the volume of I₁ × I₂ × I₃, where m₃(I₁ × I₂ × I₃) = ℓ(I₁)ℓ(I₂)ℓ(I₃). For any B = B₁ × B₂ × B₃ ∊ ℬ³, we have the product measure m₃(B) = m(B₁)m(B₂)m(B₃). The null sets of m₃ include any countable union of points in ℝ³ as well as uncountable sets of the form A × B × {c}, A, B ⊂ ℝ, c ∊ ℝ, or A × {b} × C, A, C ⊂ ℝ, b ∊ ℝ, or {a} × B × C, B, C ⊂ ℝ, a ∊ ℝ, all surfaces and lines, etc. The Lebesgue integral of a Borel function f : ℝ³ → ℝ w.r.t. measure m₃ over ℝ³ is defined in analogy with the above construction in ℝ². If f(x₁, x₂, x₃) is integrable w.r.t. m₃ over ℝ³ (denoted as f ∊ L¹(ℝ³, ℬ³, m₃)) and is furthermore a continuous function (m₃-a.e.) of the three variables, then its Lebesgue integral is a Riemann (triple) integral over ℝ³:

$$\int_{\mathbb{R}^3} f\,\mathrm{d}m_3 = \iiint_{\mathbb{R}^3} f(x_1,x_2,x_3)\,\mathrm{d}x_1\,\mathrm{d}x_2\,\mathrm{d}x_3, \quad (9.52)$$

where we can also change the order of integration by successive application of Fubini's Theorem. For a Borel set B = B1 × B2 × B3 = {(x1, x2, x3) ∊ ℝ3 : x1B1, x2B2, x3B3} we have

$$\int_{B_1\times B_2\times B_3} f\,\mathrm{d}m_3 = \int_{B_3}\int_{B_2}\int_{B_1} f(x_1,x_2,x_3)\,\mathrm{d}m(x_1)\,\mathrm{d}m(x_2)\,\mathrm{d}m(x_3) = \int_{B_3}\int_{B_2}\int_{B_1} f(x_1,x_2,x_3)\,\mathrm{d}x_1\,\mathrm{d}x_2\,\mathrm{d}x_3 = \iiint_{\mathbb{R}^3} f(x_1,x_2,x_3)\,\mathbb{I}_B(x_1,x_2,x_3)\,\mathrm{d}x_1\,\mathrm{d}x_2\,\mathrm{d}x_3, \quad (9.53)$$

where the Riemann integral is used as shorthand for the Lebesgue integral and is equivalent to it when f(x₁, x₂, x₃)𝕀_B(x₁, x₂, x₃) is a continuous function of x₁, x₂, x₃.

More generally, in ℝⁿ the Lebesgue measure, m_n : ℬⁿ → [0, ∞], gives the n-dimensional volume of any n-dimensional cube: m_n(I₁ × ... × I_n) = ℓ(I₁) × ... × ℓ(I_n). For every Cartesian product B = B₁ × ... × B_n ∊ ℬⁿ, m_n(B) = m(B₁)m(B₂) ··· m(B_n) is a product measure. Null sets of m_n are sets having zero n-dimensional volume and these include any countable union of points in ℝⁿ, hyperplanes, lines, etc. The Lebesgue integral of a Borel function f : ℝⁿ → ℝ w.r.t. m_n over ℝⁿ for all n ≥ 2 is constructed as in the above case of n = 2. If f ∊ L¹(ℝⁿ, ℬⁿ, m_n), i.e., f(x) ≡ f(x₁, ..., x_n) is integrable w.r.t. m_n over ℝⁿ, and is furthermore a continuous function (m_n-a.e.) of the n variables x, then its Lebesgue integral is equal to its Riemann integral over ℝⁿ:

$$\int_{\mathbb{R}^n} f\,\mathrm{d}m_n = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} f(x_1,\ldots,x_n)\,\mathrm{d}x_1\cdots\mathrm{d}x_n. \quad (9.54)$$

For a Borel set B = B₁ × ... × B_n = {x ∊ ℝⁿ : x₁ ∊ B₁, ..., x_n ∊ B_n} we have

$$\int_{B_1\times\cdots\times B_n} f\,\mathrm{d}m_n = \int_{B_n}\cdots\int_{B_1} f(x_1,\ldots,x_n)\,\mathrm{d}m(x_1)\cdots\mathrm{d}m(x_n) = \int_{B_n}\cdots\int_{B_1} f(x_1,\ldots,x_n)\,\mathrm{d}x_1\cdots\mathrm{d}x_n = \int_{\mathbb{R}^n} f(\mathbf{x})\,\mathbb{I}_B(\mathbf{x})\,\mathrm{d}^n\mathbf{x}, \quad (9.55)$$

where the Riemann integral is used as shorthand for the Lebesgue integral and is equivalent to it when f(x)𝕀_B(x) is continuous on ℝⁿ.

9.3 Multiple Random Variables and Joint Distributions

Let us now see how distributions and expectations are formulated for multiple random variables (i.e., random vectors) by first considering a pair of random variables (X, Y) : Ω → ℝ² defined on the same probability space (Ω, ℱ, ℙ). The joint distribution measure

$$\mu_{X,Y} : \mathcal{B}^2 \to [0,1]$$

is the measure induced by the pair (X, Y) and defined by

$$\mu_{X,Y}(B) := \mathbb{P}((X,Y)\in B), \quad B\in\mathcal{B}^2. \quad (9.56)$$

This measure is countably additive and assigns a number in [0, 1] to a Borel set B in ℝ², which corresponds to the probability of the event {(X, Y) ∊ B} ≡ {ω ∊ Ω : (X(ω), Y(ω)) ∊ B}. This measure is normalized so that μ_{X,Y}(ℝ²) = ℙ((X, Y) ∊ ℝ²) = 1 and so the measure space (ℝ², ℬ², μ_{X,Y}) is also a probability space. Writing B = B₁ × B₂, B₁, B₂ ∊ ℬ(ℝ), then we see that

$$\mu_{X,Y}(B_1\times B_2) = \mathbb{P}(X^{-1}(B_1)\cap Y^{-1}(B_2)) = \mathbb{P}(X\in B_1, Y\in B_2) \quad (9.57)$$

gives the probability of the joint event {X ∊ B₁} ∩ {Y ∊ B₂} ≡ {X ∊ B₁, Y ∊ B₂}. This joint measure determines the univariate (marginal distribution) measures of X and Y by letting B₁ = ℝ or B₂ = ℝ:

$$\mu_{X,Y}(B\times\mathbb{R}) = \mathbb{P}(X\in B, Y\in\mathbb{R}) = \mathbb{P}(X\in B) = \mu_X(B), \quad (9.58)$$

$$\mu_{X,Y}(\mathbb{R}\times B) = \mathbb{P}(X\in\mathbb{R}, Y\in B) = \mathbb{P}(Y\in B) = \mu_Y(B), \quad (9.59)$$

for all B ∊ ℬ(ℝ).

Letting B1 = (−∞, x],B2 = (−∞, y] in (9.57) gives the joint CDF of (X,Y):

$$F_{X,Y}(x,y) := \mu_{X,Y}((-\infty,x]\times(-\infty,y]) = \mathbb{P}(X\le x, Y\le y), \quad x,y\in\mathbb{R}. \quad (9.60)$$

This CDF is right-continuous on ℝ², monotone in both x and y, and recovers the univariate (marginal) CDF of X or Y in the respective limits:

$$\lim_{y\to\infty} F_{X,Y}(x,y) \equiv F_{X,Y}(x,\infty) = \mathbb{P}(X\le x) = \mu_X((-\infty,x]) = F_X(x), \quad x\in\mathbb{R}, \quad (9.61)$$

$$\lim_{x\to\infty} F_{X,Y}(x,y) \equiv F_{X,Y}(\infty,y) = \mathbb{P}(Y\le y) = \mu_Y((-\infty,y]) = F_Y(y), \quad y\in\mathbb{R}. \quad (9.62)$$

Taking the limit of infinite argument in the marginal CDF (in either case) gives

$$\lim_{x,y\to\infty} F_{X,Y}(x,y) \equiv F_{X,Y}(\infty,\infty) = F_X(\infty) = F_Y(\infty) = 1.$$

Taking a decreasing sequence of numbers xn ↘ −∞ gives a decreasing sequence of sets approaching the empty set: (−∞, xn] × (−∞, y] ↘ ø. By monotone continuity (from above) of the measure, μX,Y ((−∞, xn] × (−∞, y]) ↘ μX,Y(ø) = 0, i.e., for any y ∊ ℝ,

$$F_{X,Y}(-\infty,y) \equiv \lim_{x\to-\infty} F_{X,Y}(x,y) = \lim_{n\to\infty}\mu_{X,Y}((-\infty,x_n]\times(-\infty,y]) = \mu_{X,Y}(\varnothing) = 0.$$

Similarly, $\lim_{y\to-\infty} F_{X,Y}(x,y) \equiv F_{X,Y}(x,-\infty) = 0$, x ∊ ℝ. These two relations must clearly hold since X and Y are in ℝ and so ℙ(X < −∞, Y ≤ y) = 0 and ℙ(X ≤ x, Y < −∞) = 0.

Let x1, x2, y1, y2 be real numbers such that x1 < x2 and y1 < y2; then the joint measure of the semi-open rectangle (x1, x2] × (y1, y2] is given by

$$\mu_{X,Y}((x_1,x_2]\times(y_1,y_2]) = \mathbb{P}(x_1 < X\le x_2,\ y_1 < Y\le y_2) = F_{X,Y}(x_2,y_2) - F_{X,Y}(x_1,y_2) - F_{X,Y}(x_2,y_1) + F_{X,Y}(x_1,y_1). \quad (9.63)$$

Based on this relation, the definition of a Lebesgue-Stieltjes measure for a single random variable, defined in (9.41), can be extended to a Lebesgue-Stieltjes measure generated by the joint CDF F_{X,Y} of (X, Y). The quantity in (9.63) can be viewed as a measure of an "area relative to the joint CDF" F_{X,Y} for any semi-open rectangle in ℝ². The Lebesgue-Stieltjes measure generated by F_{X,Y}, which we denote by $m_{F_{X,Y}}$, is the measure function $m_{F_{X,Y}} : \mathcal{B}^2 \to [0,1]$, defined for any Borel set B = B₁ × B₂ ∊ ℬ², that assigns the smallest total area relative to F_{X,Y} (using (9.63)) of all countable unions of semi-open rectangles I_k × J_l ≡ (a_k, b_k] × (c_l, d_l], a_k ≤ b_k, c_l ≤ d_l, in ℝ² that contain B:

$$m_{F_{X,Y}}(B) := \inf\Big\{\sum_{k=1}^{\infty}\sum_{l=1}^{\infty}\mu_{X,Y}(I_k\times J_l) : B_1\subset\bigcup_{k=1}^{\infty} I_k,\ B_2\subset\bigcup_{l=1}^{\infty} J_l\Big\}. \quad (9.64)$$

This measure is equivalent to the joint distribution measure, i.e., $m_{F_{X,Y}}(B) = \mu_{X,Y}(B)$.

Since (ℝ², ℬ², μ_{X,Y}) is a measure space, we can define the Lebesgue integral w.r.t. μ_{X,Y} over ℝ² (i.e., the Lebesgue-Stieltjes integral w.r.t. μ_{X,Y} or equivalently w.r.t. $m_{F_{X,Y}}$) in a very similar manner as was done above for the Lebesgue-Stieltjes integral w.r.t. μ_X in (9.23)–(9.26). For any simple function

$$\varphi(x,y) = \sum_{k=1}^{K}\sum_{l=1}^{L} a_{k,l}\,\mathbb{I}_{B_1^k\times B_2^l}(x,y)$$

with $B_1^k\times B_2^l\in\mathcal{B}^2$, $a_{k,l}\in\mathbb{R}$, its Lebesgue integral w.r.t. μ_{X,Y} over ℝ² is defined by

$$\int_{\mathbb{R}^2} \varphi(x,y)\,\mathrm{d}\mu_{X,Y}(x,y) := \sum_{k=1}^{K}\sum_{l=1}^{L} a_{k,l}\,\mu_{X,Y}(B_1^k\times B_2^l). \quad (9.65)$$

Based on this definition, the Lebesgue-Stieltjes integral w.r.t. μ_{X,Y} over ℝ² for any nonnegative Borel function f : ℝ² → ℝ is defined as the supremum over integral values of all nonnegative simple functions φ ≤ f:

$$\int_{\mathbb{R}^2} f(x,y)\,\mathrm{d}\mu_{X,Y}(x,y) := \sup\Big\{\int_{\mathbb{R}^2} \varphi(x,y)\,\mathrm{d}\mu_{X,Y}(x,y) : \varphi\ \text{is simple},\ 0\le\varphi\le f\Big\}. \quad (9.66)$$

For any Borel function f = f⁺ − f⁻, the Lebesgue-Stieltjes integral is given by the difference of the two nonnegative integrals:

$$\int_{\mathbb{R}^2} f(x,y)\,\mathrm{d}\mu_{X,Y}(x,y) = \int_{\mathbb{R}^2} f^+(x,y)\,\mathrm{d}\mu_{X,Y}(x,y) - \int_{\mathbb{R}^2} f^-(x,y)\,\mathrm{d}\mu_{X,Y}(x,y), \quad (9.67)$$

and for any Borel set B in ℝ² we have

$$\int_B f(x,y)\,\mathrm{d}\mu_{X,Y}(x,y) = \int_{\mathbb{R}^2} \mathbb{I}_B(x,y)\,f(x,y)\,\mathrm{d}\mu_{X,Y}(x,y). \quad (9.68)$$

We note that the MCT is a general property that also applies to all Lebesgue-Stieltjes integrals.

Based on the above construction, we have the following result for the expected value of h(X, Y)(ω) ≡ h(X(ω),Y(ω)), defined as a Borel function of two random variables (X, Y). Here we assume that h(X, Y) is integrable, i.e., E[|h(X, Y)|] < ∞.

Theorem 9.5.

Given a pair of random variables (X, Y) on (Ω, ℱ, ℙ) and a Borel function h : ℝ2 → ℝ,

$$E[h(X,Y)] \equiv \int_\Omega h(X(\omega),Y(\omega))\,\mathrm{d}\mathbb{P}(\omega) = \int_{\mathbb{R}^2} h(x,y)\,\mathrm{d}\mu_{X,Y}(x,y). \quad (9.69)$$

The proof of (9.69) is very similar to the proof of (9.28). In the special case h(x, y) = g(x), (9.69) recovers (9.28). The Lebesgue-Stieltjes integral in (9.69) is a very general representation for E[h(X, Y)], where h is a Borel function. That is, X or Y can be any type of random variable, i.e., any combination of discrete, absolutely continuous, or singularly continuous. The expectation in (9.69) reduces to various useful and familiar formulas for E[h(X, Y)] that a student learns in a standard course in probability theory. For our purposes, there are two main cases: discrete or continuous (we simply say continuous to mean absolutely continuous).

Assume that h is a continuous function (m₂-a.e.). This is virtually always the case in practice and certainly the case for all our applications in this text. The Lebesgue-Stieltjes integral is then a Riemann-Stieltjes integral over ℝ². Let us consider the simple case where both X and Y are discrete random variables and h(x, y) is continuous at all values (x, y) = (x_i, y_j) in the range of (X, Y); then the Riemann-Stieltjes integral simply recovers the summation formula in the joint PMF p_{X,Y}(x, y) ≡ ℙ(X = x, Y = y) of (X, Y) at the support values:

$$E[h(X,Y)] = \sum_{\text{all } x_i}\sum_{\text{all } y_j} p_{X,Y}(x_i,y_j)\,h(x_i,y_j),$$

assuming E[|h(X, Y)|] < ∞ (i.e., the summation converges for both negative and positive parts of h). Letting $h(X,Y) = \mathbb{I}_{\{X\le x, Y\le y\}}$, for any fixed real values (x, y), recovers the joint CDF as a (two-dimensional piecewise constant) staircase function in (x, y):

$$F_{X,Y}(x,y) = \mathbb{P}(X\le x, Y\le y) = E[\mathbb{I}_{\{X\le x, Y\le y\}}] = \sum_{x_i\le x}\sum_{y_j\le y} p_{X,Y}(x_i,y_j)$$

with jump discontinuities at only the support values (x, y) = (xi, yj) of the PMF. This recovers the formula in (9.32) when y → ∞.

Let us now consider the case where (X, Y) are continuous with joint density denoted by f_{X,Y}. In this case, every Borel set B ∊ ℬ² has joint measure given by the Lebesgue integral of a nonnegative integrable Borel function f_{X,Y} : ℝ² → ℝ (namely, the joint PDF) over B:

$$\mu_{X,Y}(B) = \int_B f_{X,Y}(x,y)\,\mathrm{d}m_2(x,y) \equiv \iint_{\mathbb{R}^2} \mathbb{I}_B(x,y)\,f_{X,Y}(x,y)\,\mathrm{d}x\,\mathrm{d}y. \quad (9.70)$$

Recall that we sometimes simply write the Lebesgue integral as a Riemann integral using the convention we adopted in the previous section. Of course, the two are equal if 𝕀_B f_{X,Y} is a continuous function in (x, y). Since (9.70) holds for all B ∊ ℬ², it holds for all sets of the form B = (−∞, a] × (−∞, b]. Using $\mathbb{I}_B(x,y) = \mathbb{I}_{\{x\le a,\, y\le b\}}$ and the definition (9.60) we have the joint CDF

$$F_{X,Y}(a,b) = \int_{-\infty}^{b}\int_{-\infty}^{a} f_{X,Y}(x,y)\,\mathrm{d}x\,\mathrm{d}y, \quad a,b\in\mathbb{R}. \quad (9.71)$$

Since F_{X,Y}(∞, ∞) = 1, f_{X,Y} integrates to unity on all of ℝ². In fact, (9.71) holds iff (9.70) holds for all Borel sets B ∊ ℬ². The joint CDF is continuous on ℝ² and related to the joint PDF by differentiating (9.71),

$$f_{X,Y}(x,y) = \frac{\partial^2}{\partial x\,\partial y} F_{X,Y}(x,y), \quad x,y\in\mathbb{R}.$$

The marginal CDFs of X and Y are given by (9.61)–(9.62), and taking either limit a → ∞ or b → ∞ in (9.71) gives

$$F_X(a) = \int_{-\infty}^{a}\Big(\int_{-\infty}^{\infty} f_{X,Y}(x,y)\,\mathrm{d}y\Big)\mathrm{d}x \quad \text{and} \quad F_Y(b) = \int_{-\infty}^{b}\Big(\int_{-\infty}^{\infty} f_{X,Y}(x,y)\,\mathrm{d}x\Big)\mathrm{d}y,$$

for all a, b ∊ ℝ. Hence, the existence of the joint PDF ƒX,Y implies the existence of the respective marginal densities of X and Y:

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,\mathrm{d}y \quad \text{and} \quad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,\mathrm{d}x. \quad (9.72)$$

We note that the converse is generally not true. Recall from our discussion of a single random variable that the marginal densities are nonnegative Borel functions that exist whenever (see (9.35) and (9.36))

$$\mu_X(B) = \int_B f_X(x)\,\mathrm{d}x \quad \text{and} \quad \mu_Y(B) = \int_B f_Y(y)\,\mathrm{d}y$$

for all Borel sets B ⊂ ℝ, or equivalently whenever

$$F_X(a) = \int_{-\infty}^{a} f_X(x)\,\mathrm{d}x \quad \text{and} \quad F_Y(b) = \int_{-\infty}^{b} f_Y(y)\,\mathrm{d}y$$

for all a, b ∊ ℝ. The expectations E[h1(X)] and E[h2(Y)], for single-variable Borel functions h1 and h2, are therefore given by the respective Riemann (Lebesgue) integrals over ℝ:

$$E[h_1(X)] = \int_{-\infty}^{\infty} h_1(x)\,f_X(x)\,\mathrm{d}x \quad \text{and} \quad E[h_2(Y)] = \int_{-\infty}^{\infty} h_2(y)\,f_Y(y)\,\mathrm{d}y. \quad (9.73)$$

For jointly continuous (X, Y), (9.70) holds, and it is readily proven (in the same manner that (9.37) or (9.39) is proven) that the expected value in (9.69) is given by the integral over ℝ² of the joint PDF multiplied by h, i.e.,

$$E[h(X,Y)] = \iint_{\mathbb{R}^2} h(x,y)\,f_{X,Y}(x,y)\,\mathrm{d}x\,\mathrm{d}y. \quad (9.74)$$
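As a numerical illustration of (9.72) and (9.74) (a sketch assuming NumPy and SciPy; the joint density f_{X,Y}(x, y) = 6e^{−2x−3y} on x, y ≥ 0 is our own example, factoring into Exp(2) and Exp(3) marginals with E[XY] = 1/6):

```python
import numpy as np
from scipy.integrate import quad, dblquad

# Joint density f_{X,Y}(x, y) = 6*exp(-2x - 3y), x, y >= 0.
f_joint = lambda x, y: 6.0 * np.exp(-2.0 * x - 3.0 * y)

# Marginal density (9.72): integrate out y; compare with f_X(x) = 2*exp(-2x).
x0 = 0.8
print(quad(lambda y: f_joint(x0, y), 0.0, np.inf)[0], 2.0 * np.exp(-2.0 * x0))

# Expectation (9.74): E[XY]; dblquad integrates its first argument (y) innermost.
exy = dblquad(lambda y, x: x * y * f_joint(x, y), 0.0, np.inf, 0.0, np.inf)[0]
print(exy)   # ~ (1/2)*(1/3) = 1/6
```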

Recall from Definition 6.13 that X and Y are mutually independent if

$$\mathbb{P}(X\in B_1, Y\in B_2) = \mathbb{P}(X\in B_1)\,\mathbb{P}(Y\in B_2) \quad (9.75)$$

for all B₁, B₂ ∊ ℬ(ℝ). That is, for a Borel rectangle B = B₁ × B₂ the joint distribution measure given by (9.57) is now a product of the marginal distribution measures:

$$\mu_{X,Y}(B_1\times B_2) = \mathbb{P}(X\in B_1)\,\mathbb{P}(Y\in B_2) = \mu_X(B_1)\,\mu_Y(B_2) := \mu_{X\times Y}(B_1\times B_2). \quad (9.76)$$

Hence, (9.75) and (9.76) are equivalent. From (9.60) we also have that independence is equivalent to

$$F_{X,Y}(x,y) = F_X(x)\,F_Y(y), \quad x,y\in\mathbb{R}. \quad (9.77)$$

Moreover, two continuous random variables (X, Y) are independent if and only if their joint PDF is the product of the marginal PDFs,

$$f_{X,Y}(x,y) = f_X(x)\,f_Y(y), \quad x,y\in\mathbb{R}. \quad (9.78)$$

This is easily proven. In particular, assuming (X, Y) are independent, then (9.71) gives

$$\int_{-\infty}^{b}\int_{-\infty}^{a} f_{X,Y}(x,y)\,\mathrm{d}x\,\mathrm{d}y = F_{X,Y}(a,b) = F_X(a)\,F_Y(b) = \int_{-\infty}^{a} f_X(x)\,\mathrm{d}x\int_{-\infty}^{b} f_Y(y)\,\mathrm{d}y = \int_{-\infty}^{b}\int_{-\infty}^{a} f_X(x)\,f_Y(y)\,\mathrm{d}x\,\mathrm{d}y$$

for all a, b ∊ ℝ. This implies f_{X,Y}(x, y) = f_X(x)f_Y(y). We leave the proof of the converse as an exercise for the reader. When (X, Y) are independent, the general expectation formula for all types of random variables, as given by (9.69), is now a Lebesgue-Stieltjes integral w.r.t. the above product measure μ_{X×Y}:

$$E[h(X,Y)] = \int_{\mathbb{R}^2} h(x,y)\,\mathrm{d}\mu_{X\times Y}(x,y) = \int_{\mathbb{R}}\int_{\mathbb{R}} h(x,y)\,\mathrm{d}\mu_X(x)\,\mathrm{d}\mu_Y(y), \quad (9.79)$$

where the order of integration in μX and μY is interchangeable according to Fubini's Theorem. In the case that h(X,Y) = h1(X)h2(Y),

$$E[h(X,Y)] = \Big(\int_{\mathbb{R}} h_1(x)\,\mathrm{d}\mu_X(x)\Big)\Big(\int_{\mathbb{R}} h_2(y)\,\mathrm{d}\mu_Y(y)\Big) = E[h_1(X)]\,E[h_2(Y)]. \quad (9.80)$$

Of course, for continuous (X, Y) this product of expectations is given by (9.73). Taking h1(x) = x, h2(y) = y, h(x, y) = xy shows that two mutually independent random variables have zero covariance

$$\mathrm{Cov}(X,Y) \equiv E[XY] - E[X]\,E[Y] = 0. \quad (9.81)$$

The converse is generally not true.
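A standard counterexample for the failed converse can be checked by simulation (a sketch assuming NumPy; the choice X ~ Norm(0, 1), Y = X² is the usual textbook example): Y is a deterministic function of X, hence strongly dependent, yet Cov(X, Y) = E[X³] − E[X]E[X²] = 0 by symmetry:

```python
import numpy as np

# Zero covariance without independence: X ~ Norm(0,1), Y = X**2.
rng = np.random.default_rng(7)
x = rng.standard_normal(2_000_000)
y = x**2

cov = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov)   # ~0, although Y is completely determined by X
```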

An important example of a jointly continuous random vector (X, Y) is the standard bivariate normal distribution where E[X] = E[Y] = 0, Var(X) = Var(Y) = 1, and Cov(X, Y) = ρ, |ρ| < 1. The well-known joint PDF is

$$f_{X,Y}(x,y) = n_2(x,y;\rho) := \frac{1}{2\pi\sqrt{1-\rho^2}}\exp\Big(-\frac{x^2 + y^2 - 2\rho xy}{2(1-\rho^2)}\Big), \quad x,y\in\mathbb{R}. \quad (9.82)$$

The joint distribution measure μ_{X,Y} is absolutely continuous w.r.t. Lebesgue measure m₂; i.e., for all B ∊ ℬ² we have

$$\mu_{X,Y}(B) = \int_B n_2(x,y;\rho)\,\mathrm{d}m_2(x,y) \equiv \iint_{\mathbb{R}^2} \mathbb{I}_B(x,y)\,n_2(x,y;\rho)\,\mathrm{d}x\,\mathrm{d}y. \quad (9.83)$$

The joint CDF is

$$F_{X,Y}(a,b) = \mathcal{N}_2(a,b;\rho) := \int_{-\infty}^{b}\int_{-\infty}^{a} n_2(x,y;\rho)\,\mathrm{d}x\,\mathrm{d}y, \quad a,b\in\mathbb{R}. \quad (9.84)$$

The functions n₂ and 𝒩₂ denote the standard bivariate normal PDF and CDF, respectively, where $n_2(x,y;\rho) = \frac{\partial^2}{\partial x\,\partial y}\mathcal{N}_2(x,y;\rho)$. We note also the symmetry: $n_2(x,y;\rho) = n_2(y,x;\rho)$ and $\mathcal{N}_2(x,y;\rho) = \mathcal{N}_2(y,x;\rho)$. The marginal CDFs of X and Y are the standard normal CDF and follow simply from the limiting values of the joint CDF (see (9.61)–(9.62)):

$$F_X(x) = F_{X,Y}(x,\infty) = \mathcal{N}_2(x,\infty;\rho) = \mathcal{N}(x), \qquad F_Y(y) = F_{X,Y}(\infty,y) = \mathcal{N}_2(\infty,y;\rho) = \mathcal{N}(y).$$

Here we used the integral definition of 𝒩₂ in (9.84). Hence, X and Y are identically distributed Norm(0, 1) random variables with standard normal (marginal) PDF

$$f_X(x) = \mathcal{N}'(x) = n(x), \qquad f_Y(y) = \mathcal{N}'(y) = n(y),$$

$n(z) := \frac{1}{\sqrt{2\pi}}e^{-z^2/2}$, z ∊ ℝ. [Note that these marginal PDFs also follow by integrating the joint PDF according to (9.72).] We observe that the pair (X, Y) is mutually independent if and only if the correlation coefficient ρ = 0, i.e.,

$$f_{X,Y}(x,y) = n_2(x,y;0) = \frac{1}{2\pi}e^{-(x^2+y^2)/2} = n(x)\,n(y) = f_X(x)\,f_Y(y), \quad x,y\in\mathbb{R},$$

and the integral in (9.84) factors into

$$F_{X,Y}(a,b) = \mathcal{N}_2(a,b;0) = \mathcal{N}(a)\,\mathcal{N}(b) = F_X(a)\,F_Y(b), \quad a,b\in\mathbb{R}.$$

Substituting the above joint PDF into (9.74) gives the expectation of a Borel function of the normal pair (X, Y) as

$$E[h(X,Y)] = \iint_{\mathbb{R}^2} h(x,y)\,n_2(x,y;\rho)\,\mathrm{d}x\,\mathrm{d}y. \quad (9.85)$$

For ρ = 0, n₂(x, y; 0) = n(x)n(y), and for h(x, y) = h₁(x)h₂(y) this expectation reduces to (9.80), where

$$E[h_1(X)] = \int_{\mathbb{R}} h_1(x)\,\mathrm{d}\mu_X(x) = \int_{-\infty}^{\infty} h_1(x)\,n(x)\,\mathrm{d}x, \qquad E[h_2(Y)] = \int_{\mathbb{R}} h_2(y)\,\mathrm{d}\mu_Y(y) = \int_{-\infty}^{\infty} h_2(y)\,n(y)\,\mathrm{d}y.$$
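A correlated standard bivariate normal pair is easy to simulate: under the usual construction X = Z₁, Y = ρZ₁ + √(1 − ρ²)Z₂ with Z₁, Z₂ i.i.d. Norm(0, 1), the marginals are standard normal and Cov(X, Y) = ρ. A sketch (assuming NumPy; ρ = 0.6 is an arbitrary choice):

```python
import numpy as np

# Sample the standard bivariate normal (9.82) with correlation rho.
rng = np.random.default_rng(3)
rho = 0.6
z1 = rng.standard_normal(1_000_000)
z2 = rng.standard_normal(1_000_000)
x, y = z1, rho * z1 + np.sqrt(1.0 - rho**2) * z2

print(x.mean(), x.std(), y.mean(), y.std())  # ~0, ~1, ~0, ~1 (standard marginals)
print(np.mean(x * y))                         # ~rho: Cov(X, Y) = E[XY] = 0.6
```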

The above formulation extends to the more general case of an n-dimensional real-valued random vector X = (X₁, ..., X_n) ∊ ℝⁿ, for all integers n ≥ 1. Each X_i is a random variable on (Ω, ℱ, ℙ) where $X_i^{-1}(B_i)\in\mathcal{F}$ for every B_i ∊ ℬ(ℝ), i = 1, ..., n. As a random vector X : Ω → ℝⁿ, for every Borel set B = B₁ × ... × B_n ∊ ℬⁿ,

$$\mathbf{X}^{-1}(B) \equiv \{\mathbf{X}\in B\} \equiv \{X_1\in B_1,\ldots,X_n\in B_n\}.$$

The joint distribution measure of X = (X1,..., Xn), which generalizes (9.56), is defined by

$$\mu_{\mathbf{X}}(B) \equiv \mu_{X_1,\ldots,X_n}(B) := \mathbb{P}(\mathbf{X}\in B) \equiv \mathbb{P}(X_1\in B_1,\ldots,X_n\in B_n). \quad (9.86)$$

This measure assigns a probability to a Borel set B in ℝⁿ which corresponds to the probability of the joint event {ω ∊ Ω : X₁(ω) ∊ B₁, ..., X_n(ω) ∊ B_n} = {X₁ ∊ B₁} ∩ ... ∩ {X_n ∊ B_n}. It is normalized, μ_X(ℝⁿ) = ℙ(X ∊ ℝⁿ) = 1, so (ℝⁿ, ℬⁿ, μ_X) is a probability space.

The joint (n-dimensional) measure μX determines all the univariate, bivariate, trivariate, etc., distribution measures for all single random variables Xi, pairs (Xi, Xj), triples (Xi, Xj, Xk), etc. This follows by setting some of the appropriate sets among B1,..., Bn equal to ℝ. For example, setting all sets Bj = ℝ, for all ji, and Bi = A, gives the univariate marginal distribution measures

\mu_X(B_1 \times \ldots \times B_n) = \mathbb{P}(X_i \in A) = \mu_{X_i}(A), \quad i = 1, \ldots, n.

The bivariate (marginal) distribution measure of a pair (X_i, X_j), i < j, is obtained by setting all sets B_k = ℝ, for all k ≠ i, k ≠ j, and B_i = A ∊ ℬ(ℝ), B_j = B ∊ ℬ(ℝ):

\mu_X(B_1 \times \ldots \times B_n) = \mathbb{P}(X_i \in A, X_j \in B) = \mu_{X_i,X_j}(A \times B), \quad 1 \le i < j \le n.

Letting B_1 = (−∞, x_1], B_2 = (−∞, x_2], ..., B_n = (−∞, x_n] in (9.86) gives the multivariate joint CDF of (X_1,..., X_n):

F_X(x) \equiv F_{(X_1,\ldots,X_n)}(x_1,\ldots,x_n) = \mathbb{P}(X_1 \le x_1, \ldots, X_n \le x_n), \qquad (9.87)

x = (x_1,..., x_n) ∊ ℝⁿ. This CDF is right-continuous on ℝⁿ and is nondecreasing in each of the variables x_1,..., x_n. The marginal CDFs of each X_i are recovered in the limit that x_j → ∞, for all j ≠ i in (9.87):

F_{X_i}(x_i) = \mathbb{P}(X_i \le x_i) = F_{(X_1,\ldots,X_n)}(\infty, \ldots, \infty, x_i, \infty, \ldots, \infty), \quad x_i \in \mathbb{R}. \qquad (9.88)

Similarly, the (marginal) joint CDF for each random vector pair (Xi, Xj), i < j, is obtained by letting xk → ∞ for all ki, k ≠ j:

F_{X_i,X_j}(x_i,x_j) \equiv \mathbb{P}(X_i \le x_i, X_j \le x_j) = \lim_{x_k \to \infty;\ k \ne i,\, k \ne j} F_{(X_1,\ldots,X_n)}(x_1,\ldots,x_n).

All other (marginal) joint CDFs are obtained in the appropriate limits. For example, we can consider any k-dimensional random vector such as (X1,...,Xk), for any 1 ≤ kn, having joint CDF

F_{X_1,\ldots,X_k}(x_1,\ldots,x_k) = F_{(X_1,\ldots,X_n)}(x_1,\ldots,x_k,\infty,\ldots,\infty).

More generally, the joint CDF F_{X_{i_1},...,X_{i_k}}(y_1,...,y_k), where (y_1,..., y_k) ∊ ℝᵏ, of any k-dimensional random vector (X_{i_1},...,X_{i_k}), 1 ≤ i_1 < ... < i_k ≤ n, 1 ≤ k ≤ n, taken from (X_1,..., X_n) is obtained by setting x_j = ∞ for all j ∉ {i_1, i_2,..., i_k} in the n-dimensional joint CDF F_{(X_1,...,X_n)}(x_1,...,x_n). This corresponds to the probability

F_{X_{i_1},\ldots,X_{i_k}}(y_1,\ldots,y_k) = \mathbb{P}(X_{i_1} \le y_1, \ldots, X_{i_k} \le y_k). \qquad (9.89)

As in the two-dimensional case, the joint CDF evaluates to zero when setting any one of its arguments to −∞. Setting all xi = ∞ gives unity: FX(∞,..., ∞) = ℙ(X ∊ ℝn) = 1.

The random vector Y = (Y_1,...,Y_k) := (X_{i_1},...,X_{i_k}), for any 1 ≤ k ≤ n, has a joint distribution measure defined by

\mu_Y(B_1 \times \ldots \times B_k) = \mathbb{P}(X_{i_1} \in B_1, \ldots, X_{i_k} \in B_k)

for any Borel set B_1 × ... × B_k ∊ ℬ(ℝᵏ). Hence, (ℝᵏ, ℬ(ℝᵏ), μ_Y) is a probability space for each k = 1,..., n.

The relation in (9.63) can be extended to any n-dimensional semi-open rectangle with the use of the n-dimensional joint CDF F_X(x). Moreover, the Lebesgue–Stieltjes measure defined in (9.64) can be extended into n dimensions accordingly, as generated by F_X. In fact, the n-dimensional joint distribution measure μ_X in (9.86) is the same Lebesgue–Stieltjes measure on ℝⁿ. The above construction of the Lebesgue–Stieltjes integral w.r.t. μ_{X,Y} (for dimension n = 2), provided by (9.65), (9.66), (9.67), and (9.68), extends in the obvious manner into dimension n ≥ 2. We write the Lebesgue–Stieltjes integral of a Borel function f : ℝⁿ → ℝ w.r.t. the joint distribution measure μ_X as the difference of two nonnegative integrals

\int_{\mathbb{R}^n} f(x)\,d\mu_X(x) = \int_{\mathbb{R}^n} f^+(x)\,d\mu_X(x) - \int_{\mathbb{R}^n} f^-(x)\,d\mu_X(x), \qquad (9.90)

and for any Borel set B in ℝn we have

\int_B f(x)\,d\mu_X(x) = \int_{\mathbb{R}^n} \mathbb{I}_B(x)\,f(x)\,d\mu_X(x). \qquad (9.91)

Here, f(x) ≡ f(x_1,...,x_n) and dμ_X(x) ≡ dμ_{X_1,...,X_n}(x_1,...,x_n) is shorthand vector notation.

Given a Borel function, h : ℝn → ℝ, of a random vector X = (X1,...,Xn) on a probability space (Ω, ℱ, ℙ), Theorem 9.5 is generalized to give the expected value of the random variable h(X) ≡ h(X1,..., Xn) as an integral of h(x) = h(x1,..., xn) w.r.t. the joint distribution measure μX over ℝn:

E[h(X)] \equiv \int_\Omega h(X(\omega))\,d\mathbb{P}(\omega) = \int_{\mathbb{R}^n} h(x)\,d\mu_X(x). \qquad (9.92)

This formula can be proven in the same manner as the proof of (9.69). This Lebesgue–Stieltjes integral is a general representation for the expected value E[h(X)], where h is a Borel function on ℝⁿ. So, each component of the vector (X_1,...,X_n) can be any type of random variable, i.e., any combination of discrete, absolutely continuous, or singularly continuous random variables. The two main types of random variables of interest to us are either discrete or continuous (i.e., absolutely continuous).

The case where all X_i are discrete (as in the binomial and multinomial financial models considered in previous chapters) simply generalizes the above double summation formulas in the case of two variables to multiple (n-fold) summation formulas involving the joint PMF p_{X_1,...,X_n}(x_1,...,x_n) ≡ ℙ(X_1 = x_1,..., X_n = x_n) at the support values:

E[h(X_1,\ldots,X_n)] = \sum_{\text{all } x_1} \ldots \sum_{\text{all } x_n} p_{X_1,\ldots,X_n}(x_1,\ldots,x_n)\,h(x_1,\ldots,x_n).

Here we assume E[|h(X_1,...,X_n)|] < ∞ (i.e., we assume the sums converge for both the negative and positive parts of h). Choosing the indicator function h(X) = 𝕀_{X_1≤a_1,...,X_n≤a_n} recovers the joint CDF:

F_{(X_1,\ldots,X_n)}(a_1,\ldots,a_n) = E\big[\mathbb{I}_{\{X_1 \le a_1,\ldots,X_n \le a_n\}}\big] = \sum_{x_1 \le a_1} \ldots \sum_{x_n \le a_n} p_{X_1,\ldots,X_n}(x_1,\ldots,x_n).

This is an (n-dimensional piecewise constant) staircase function in the variables (a_1,..., a_n) with jump discontinuities occurring only at the support values of the PMF.
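The following sketch evaluates these summation formulas for a small hypothetical joint PMF (all support points and probabilities below are made up purely for illustration):

```python
import numpy as np

# Expectation and joint CDF for a (hypothetical) discrete pair via the PMF sums.
# Support points and joint PMF p_{X1,X2}(x1, x2); rows index x1, columns x2.
x1_vals = np.array([0.0, 1.0])
x2_vals = np.array([-1.0, 0.0, 1.0])
pmf = np.array([[0.10, 0.20, 0.10],
                [0.15, 0.25, 0.20]])          # entries sum to 1

h = lambda x1, x2: (x1 + x2)**2               # any Borel function of (X1, X2)
X1, X2 = np.meshgrid(x1_vals, x2_vals, indexing="ij")
expectation = np.sum(pmf * h(X1, X2))         # double sum over the support
print(f"E[h(X1, X2)] = {expectation:.4f}")

# Joint CDF at (a1, a2): sum the PMF over support points with x1 <= a1, x2 <= a2.
a1, a2 = 1.0, 0.0
cdf = np.sum(pmf[(X1 <= a1) & (X2 <= a2)])
print(f"F(X1,X2)({a1}, {a2}) = {cdf:.4f}")
```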

The most important case for continuous random variables is when (X_1,..., X_n) are continuous with joint density f_{X_1,...,X_n}(x_1,...,x_n) ≡ f_X(x), a nonnegative integrable Borel function f_X : ℝⁿ → ℝ, i.e., when every Borel set B ∊ ℬ(ℝⁿ) has joint measure given by the Lebesgue integral of the joint density over B:

\mu_X(B) = \int_B f_X\,dm_n \equiv \int_{\mathbb{R}^n} \mathbb{I}_B(x)\,f_X(x)\,d^n x. \qquad (9.93)

This is the generalization of (9.70), where the Lebesgue integral is written as a Riemann integral on ℝⁿ. The Lebesgue and Riemann integrals are equal if 𝕀_B f_X is (m_n-a.e.) continuous on ℝⁿ. The joint CDF is obtained by setting B = (−∞, x_1] × ... × (−∞, x_n], where 𝕀_B(y_1,...,y_n) = 𝕀_{y_1≤x_1,...,y_n≤x_n}:

F_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = \int_{-\infty}^{x_n} \ldots \int_{-\infty}^{x_1} f_{X_1,\ldots,X_n}(y_1,\ldots,y_n)\,dy_1 \ldots dy_n, \qquad (9.94)

for all x_1,..., x_n ∊ ℝ. Note that the joint PDF f_X integrates to unity since μ_X(ℝⁿ) = ℙ(X ∊ ℝⁿ) = 1. As we proved for n = 2, the relation in (9.94) is equivalent to (9.93). The joint CDF is continuous on ℝⁿ and related to the joint PDF by differentiation,

f_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = \frac{\partial^n}{\partial x_1 \ldots \partial x_n} F_{X_1,\ldots,X_n}(x_1,\ldots,x_n), \quad x_1,\ldots,x_n \in \mathbb{R}. \qquad (9.95)

Note that (9.95) implies the existence of all marginal PDFs (densities) for all univariate X_i, bivariate (X_i, X_j), etc. In particular, all k-dimensional random vectors (X_{i_1},...,X_{i_k}), 1 ≤ i_1 < ... < i_k ≤ n, 1 ≤ k ≤ n, have CDF as in (9.89). Using this relation in (9.94) gives us the joint (marginal) PDF of (X_{i_1},...,X_{i_k}) as an (n−k)-dimensional integral of the joint PDF over ℝ^{n−k} in the integration variables y_j, for all j ∉ {i_1, i_2,..., i_k}. For example, the CDF of the random vector consisting of the first k variables, (X_1,..., X_k), is

F_{X_1,\ldots,X_k}(x_1,\ldots,x_k) = F_{X_1,\ldots,X_k,X_{k+1},\ldots,X_n}(x_1,\ldots,x_k,\infty,\ldots,\infty) \qquad (9.96)

= \int_{-\infty}^{x_k} \ldots \int_{-\infty}^{x_1} \left( \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} f_{X_1,\ldots,X_n}(y_1,\ldots,y_n)\,dy_{k+1} \ldots dy_n \right) dy_1 \ldots dy_k. \qquad (9.97)

The (n−k)-dimensional (inner) integral is the joint PDF of (X_1,..., X_k). This is an integrable Borel function f_{X_1,...,X_k} : ℝᵏ → ℝ, given by

f_{X_1,\ldots,X_k}(x_1,\ldots,x_k) = \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} f_{X_1,\ldots,X_n}(x_1,\ldots,x_k,x_{k+1},\ldots,x_n)\,dx_{k+1} \ldots dx_n.

Hence, for every k = 1,..., n, we have the marginal CDF and PDF relations:

F_{X_1,\ldots,X_k}(x_1,\ldots,x_k) = \int_{-\infty}^{x_k} \ldots \int_{-\infty}^{x_1} f_{X_1,\ldots,X_k}(y_1,\ldots,y_k)\,dy_1 \ldots dy_k, \qquad (9.98)

and

f_{X_1,\ldots,X_k}(x_1,\ldots,x_k) = \frac{\partial^k}{\partial x_1 \ldots \partial x_k} F_{X_1,\ldots,X_k}(x_1,\ldots,x_k), \quad x_1,\ldots,x_k \in \mathbb{R}. \qquad (9.99)

Based on (9.93), it can be proven that the expectation formula in (9.92) takes the form of an integral over ℝn involving the joint PDF:

E[h(X)] = \int_{\mathbb{R}^n} h(x)\,f_X(x)\,d^n x \equiv \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} h(x_1,\ldots,x_n)\,f_{X_1,\ldots,X_n}(x_1,\ldots,x_n)\,dx_1 \ldots dx_n. \qquad (9.100)

Note that (9.74) is a special case of this formula for n = 2 dimensions. All marginal CDFs in (9.89) are also conveniently expressed as expectations of indicator functions where

F_{X_{i_1},\ldots,X_{i_k}}(y_1,\ldots,y_k) = \mathbb{P}(X_{i_1} \le y_1, \ldots, X_{i_k} \le y_k) = E\big[\mathbb{I}_{\{X_{i_1} \le y_1,\ldots,X_{i_k} \le y_k\}}\big].
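As a numerical illustration (a sketch with a hypothetical Gaussian vector, chosen so the exact answer is known in closed form), such a marginal joint CDF value can be estimated as the sample mean of the corresponding indicator:

```python
import numpy as np
from math import erf, sqrt

# Monte Carlo estimate of a marginal joint CDF value as the expectation of an
# indicator: F_{X1,X3}(y1, y2) = E[ I{X1 <= y1, X3 <= y2} ].  The vector
# X = (X1, X2, X3) below is a hypothetical Gaussian vector for illustration.
rng = np.random.default_rng(seed=7)
n = 10**6
z = rng.standard_normal((n, 3))
x1 = z[:, 0]
x2 = 0.5 * z[:, 0] + np.sqrt(0.75) * z[:, 1]  # correlated with X1 (not used below)
x3 = z[:, 2]                                  # independent of (X1, X2)

y1, y2 = 0.0, 1.0
cdf_mc = np.mean((x1 <= y1) & (x3 <= y2))

# Since X1 and X3 are independent here, the exact value is N(y1) * N(y2).
N = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))
print(f"MC: {cdf_mc:.4f}  exact: {N(y1) * N(y2):.4f}")
```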

It is convenient to define Y = (Y_1,...,Y_k) := (X_{i_1},...,X_{i_k}). Now, differentiating all k arguments of the (marginal) joint CDF gives the (marginal) joint PDF for a continuous random vector Y,

f_{Y_1,\ldots,Y_k}(y_1,\ldots,y_k) = \frac{\partial^k}{\partial y_1 \ldots \partial y_k} F_{Y_1,\ldots,Y_k}(y_1,\ldots,y_k).

Hence, if h : ℝᵏ → ℝ is a Borel function of only (X_{i_1},...,X_{i_k}) ≡ (Y_1,...,Y_k), with 1 ≤ k ≤ n components from X, (9.100) reduces to a k-dimensional integral involving the (marginal) joint PDF of Y:

E[h(Y_1,\ldots,Y_k)] = \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} h(y_1,\ldots,y_k)\,f_{Y_1,\ldots,Y_k}(y_1,\ldots,y_k)\,dy_1 \ldots dy_k. \qquad (9.101)

Based on (9.101), and choosing appropriate functions for h, we can in principle compute several quantities of interest, such as moments, product moments, joint moment generating functions, joint characteristic functions, etc., as long as the integrals exist. In particular, the covariance between any two continuous random variables in X, say Xi and Xj, is computed by making use of the joint PDF of the pair (Xi, Xj) and the marginal densities of Xi and Xj:

\mathrm{Cov}(X_i,X_j) := E[X_iX_j] - E[X_i]\,E[X_j] = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} xy\,f_{X_i,X_j}(x,y)\,dx\,dy - \left(\int_{-\infty}^{\infty} x\,f_{X_i}(x)\,dx\right)\left(\int_{-\infty}^{\infty} y\,f_{X_j}(y)\,dy\right). \qquad (9.102)
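For instance, (9.102) can be evaluated by numerical quadrature. The sketch below (a hypothetical ρ, using scipy; for the standard bivariate normal pair of (9.82) the exact answer is Cov(X, Y) = ρ, since both means are zero) truncates the integration domain where the Gaussian tails are negligible:

```python
import numpy as np
from scipy.integrate import dblquad

# Numerical evaluation of (9.102) for the standard bivariate normal pair (9.82),
# where the exact answer is Cov(X, Y) = rho (both means are zero).
rho = 0.3  # hypothetical value

def n2(x, y):
    """Standard bivariate normal density n_2(x, y; rho) from (9.82)."""
    c = 1.0 / (2.0 * np.pi * np.sqrt(1.0 - rho**2))
    return c * np.exp(-(x**2 + y**2 - 2.0 * rho * x * y) / (2.0 * (1.0 - rho**2)))

# E[XY]: double integral of x*y*n2 over R^2, truncated to [-8, 8]^2; since
# E[X] = E[Y] = 0 by symmetry, Cov(X, Y) = E[XY].
exy, _ = dblquad(lambda y, x: x * y * n2(x, y),
                 -8.0, 8.0, lambda x: -8.0, lambda x: 8.0)
print(f"Cov(X, Y) ~ {exy:.4f} (exact: {rho})")
```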

Let us now consider the case where X1,...,Xn are independent. By property 1 of Definition 6.14, it follows that the joint distribution measure in (9.86) is now a product measure on ℝn:

\mu_{X_1,\ldots,X_n}(B) = \prod_{i=1}^{n} \mathbb{P}(X_i \in B_i) = \prod_{i=1}^{n} \mu_{X_i}(B_i) := (\mu_{X_1} \times \ldots \times \mu_{X_n})(B) \qquad (9.103)

for all Borel sets B = B1 × ... × Bn in ℝn. The joint CDF is then the product of marginal CDFs,

F_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = \prod_{i=1}^{n} \mathbb{P}(X_i \le x_i) = \prod_{i=1}^{n} F_{X_i}(x_i). \qquad (9.104)

For continuous random variables then, by differentiating (9.104) according to (9.95), the joint PDF is the product of marginal densities

f_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = \prod_{i=1}^{n} f_{X_i}(x_i). \qquad (9.105)

In fact, it can be shown that (9.105) and (9.104) are equivalent in the case of continuous random variables.

The expectation formula in (9.79) extends to n dimensions,

E[h(X)] = \int_{\mathbb{R}} \ldots \int_{\mathbb{R}} h(x)\,d\mu_{X_1}(x_1) \ldots d\mu_{X_n}(x_n), \qquad (9.106)

where the order of integration is interchangeable according to Fubini's Theorem. Similarly, in the case where h(X) = h_1(X_1)h_2(X_2) ··· h_n(X_n), (9.80) extends to a product of n expectations:

E[h_1(X_1)h_2(X_2) \cdots h_n(X_n)] = \prod_{i=1}^{n} \int_{\mathbb{R}} h_i(x_i)\,d\mu_{X_i}(x_i) = \prod_{i=1}^{n} E[h_i(X_i)], \qquad (9.107)

where we assume that all product functions are integrable, E[|h_i(X_i)|] < ∞, i = 1,..., n. For continuous random variables we have the usual formula for the expectation involving the marginal densities,

E[h_1(X_1)h_2(X_2) \cdots h_n(X_n)] = \prod_{i=1}^{n} E[h_i(X_i)] = \prod_{i=1}^{n} \int_{-\infty}^{\infty} h_i(x)\,f_{X_i}(x)\,dx. \qquad (9.108)
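A quick Monte Carlo check of (9.108) for n = 2 (a sketch; the distributions and test functions are hypothetical choices with known closed-form moments):

```python
import numpy as np

# Check of (9.108) for two independent variables: X1 ~ Unif(0,1), X2 ~ Exp(1),
# h1(x) = x^2 and h2(x) = cos(x), so E[h1(X1)] = 1/3 and E[h2(X2)] = 1/2.
rng = np.random.default_rng(seed=3)
n = 10**6
x1 = rng.uniform(0.0, 1.0, n)
x2 = rng.exponential(1.0, n)

lhs = np.mean(x1**2 * np.cos(x2))            # E[h1(X1) h2(X2)] sampled jointly
rhs = np.mean(x1**2) * np.mean(np.cos(x2))   # product of the two expectations
print(f"E[h1 h2] ~ {lhs:.4f}, E[h1] E[h2] ~ {rhs:.4f}, exact: {1/3 * 1/2:.4f}")
```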

The above formulas in the case of independence have analogues for any sub-collection of random variables, i.e., for any random vector Y = (Y_1,...,Y_k) := (X_{i_1},...,X_{i_k}), 1 ≤ i_1 < i_2 < ... < i_k ≤ n, 1 ≤ k ≤ n, as discussed above. If all components are independent, then the joint distribution measure of Y is simply the product measure on ℝᵏ:

\mu_{Y_1,\ldots,Y_k}(B) = \prod_{i=1}^{k} \mu_{Y_i}(B_i) := (\mu_{Y_1} \times \ldots \times \mu_{Y_k})(B)

for all Borel sets B = B1 × ... × Bk in ℝk. The joint CDF of Y is the product of the marginal CDFs

F_{Y_1,\ldots,Y_k}(y_1,\ldots,y_k) = \prod_{i=1}^{k} F_{Y_i}(y_i),

with joint PDF (for the case of a continuous random vector) as a product of the marginal densities

f_{Y_1,\ldots,Y_k}(y_1,\ldots,y_k) = \prod_{i=1}^{k} f_{Y_i}(y_i).

Note that in the case that X_i and X_j are independent, f_{X_i,X_j}(x,y) = f_{X_i}(x)f_{X_j}(y) and E[X_iX_j] = E[X_i]E[X_j], so (9.102) gives zero covariance, as required.

9.4 Conditioning

Now that we are equipped with general probability theory, we revisit the subject of conditioning and conditional expectations of random variables. We already covered the main topics in Section 6.3.2 of Chapter 6. We recall Definition 6.12 for the expectation of a random variable conditional on a σ-algebra. The definition was stated very generally using expectations. Since any expectation is in fact a Lebesgue integral w.r.t. a given probability measure ℙ, we can also state Definition 6.12 in an equivalent manner using Lebesgue integral notation. In particular, property (ii) in Definition 6.12 reads

\int_B X(\omega)\,d\mathbb{P}(\omega) = \int_B E[X \mid \mathcal{G}](\omega)\,d\mathbb{P}(\omega)

for every B ∊ 𝒢.

It is instructive to see how Proposition 6.7 on independence for two random variables follows in the case of continuous random variables. Let X and Y be jointly (absolutely) continuous and independent random variables possessing a joint PDF f_{X,Y}(x,y) = f_X(x)f_Y(y) with marginal PDFs f_X(x) and f_Y(y). The conditional PDF of Y given X = x is hence the marginal PDF of Y: f_{Y|X}(y|x) = f_{X,Y}(x,y)/f_X(x) = f_Y(y). Evaluating the conditional expectation in the usual manner gives

E[h(X,Y) \mid X=x] = E[h(x,Y) \mid X=x] = \int_{-\infty}^{\infty} h(x,y)\,f_{Y|X}(y|x)\,dy = \int_{-\infty}^{\infty} h(x,y)\,f_Y(y)\,dy = E[h(x,Y)].

Hence, as a random variable we have E[h(X,Y) | X] = ∫_ℝ h(X, y)f_Y(y) dy ≡ g(X), i.e., E[h(X, Y) | X](ω) = ∫_ℝ h(X(ω), y)f_Y(y) dy ≡ g(X(ω)), for each ω ∊ Ω.
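As a numerical sketch (hypothetical choices: independent X, Y ~ Norm(0, 1) and h(x, y) = (x + y)², so that g(x) = x² + 1 exactly), the integral defining g and the consistency check E[g(X)] = E[h(X, Y)] look like:

```python
import numpy as np

# g(x) = E[h(X, Y) | X = x] = integral of h(x, y) f_Y(y) dy for independent
# X, Y ~ Norm(0, 1), with h(x, y) = (x + y)^2, so that g(x) = x^2 + 1 exactly.
rng = np.random.default_rng(seed=5)

def g(x, n_grid=4001, y_max=10.0):
    """Approximate the y-integral on a truncated grid (Riemann sum)."""
    y = np.linspace(-y_max, y_max, n_grid)
    f_y = np.exp(-y**2 / 2.0) / np.sqrt(2.0 * np.pi)  # standard normal PDF
    return np.sum((x + y)**2 * f_y) * (y[1] - y[0])

x0 = 0.7
print(f"g({x0}) ~ {g(x0):.4f}, exact: {x0**2 + 1:.4f}")

# Consistency check via nested expectation: E[g(X)] = E[h(X, Y)] (= 2 here).
n = 10**6
x, y = rng.standard_normal(n), rng.standard_normal(n)
print(f"E[h(X,Y)] ~ {np.mean((x + y)**2):.4f}, E[g(X)] ~ {np.mean(x**2 + 1):.4f}")
```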

Now, to formally show that the above expectation formula is in fact the correct one, we need to verify properties (i) and (ii) of Definition 6.12 with 𝒢 = σ(X). Property (i) holds since g defined by the above integral is a Borel function and hence σ(g(X)) ⊂ σ(X), i.e., g(X) is σ(X)-measurable. For property (ii), we need to show that

E\big[\mathbb{I}_{\{X \in B\}}\,E[h(X,Y) \mid X]\big] \equiv E\big[\mathbb{I}_{\{X \in B\}}\,g(X)\big] = E\big[\mathbb{I}_{\{X \in B\}}\,h(X,Y)\big]

for every event {X ∊ B} in σ(X), i.e., for any Borel set B ∊ ℬ(ℝ). Expressing these expectations using the joint PDF gives

E\big[\mathbb{I}_{\{X \in B\}}\,g(X)\big] = \int_B g(x) \left(\int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy\right) dx = \int_B g(x)\,f_X(x)\,dx

and

E\big[\mathbb{I}_{\{X \in B\}}\,h(X,Y)\big] = \int_B \left(\int_{-\infty}^{\infty} h(x,y)\,f_{X,Y}(x,y)\,dy\right) dx.

For these two quantities to be equal for every Borel set B, we must have equality of the x-integrands (for almost every x), i.e.,

g(x) = \int_{-\infty}^{\infty} h(x,y)\,\frac{f_{X,Y}(x,y)}{f_X(x)}\,dy = \int_{-\infty}^{\infty} h(x,y)\,f_{Y|X}(y|x)\,dy = \int_{-\infty}^{\infty} h(x,y)\,f_Y(y)\,dy.

This proves the above expression for g(x) and hence for E[h(X,Y) | X].

If X and Y are jointly continuous and independent random vectors (with Y taking values in ℝⁿ), then their joint PDF is f_{X,Y}(x, y) = f_X(x)f_Y(y) with marginal PDFs f_X(x) and f_Y(y). The conditional PDF of Y given X = x is the marginal PDF of Y: f_{Y|X}(y|x) = f_Y(y). By basic probability theory, the conditional expectation is given by the n-dimensional integral:

E[h(X,Y) \mid X=x] = \int_{\mathbb{R}^n} h(x,y)\,f_{Y|X}(y|x)\,d^n y = \int_{\mathbb{R}^n} h(x,y)\,f_Y(y)\,d^n y = E[h(x,Y)].

A similar analysis as given above (for the case of two random variables) formally shows that this is the correct formula satisfying properties (i) and (ii) of Definition 6.12 with 𝒢 = σ(X). In this case the Borel sets B ⊂ ℝⁿ.

Although we have already covered many of the important properties of conditioning in Section 6.3.3 of Chapter 6, it is still useful to summarize them in the following theorem since they pertain to the more general theory of random variables. Many of the proofs are rather straightforward and standard in real analysis, so we don't repeat them here. We note that we have also proven some of the properties in the discrete setting in Chapter 6 and that we have already used some of the properties stated below. A small numerical illustration of some of these properties is given after the theorem.

Theorem 9.6.

Let X and Y be integrable random variables on (Ω, ℱ, ℙ) and let 𝒢 ⊂ ℱ be a sub-σ-algebra. Then, the following hold.

  1. (Linearity) For any real constants a and b:

    E[aX + bY \mid \mathcal{G}] = a\,E[X \mid \mathcal{G}] + b\,E[Y \mid \mathcal{G}].

  2. (Nested Expectation)

    E\big[E[X \mid \mathcal{G}]\big] = E[X].

  3. (Tower Property) For any sub-σ-algebra ℋ ⊂ 𝒢,

    E\big[E[X \mid \mathcal{G}] \mid \mathcal{H}\big] = E[X \mid \mathcal{H}].

  4. (Independence) If X is independent of 𝒢, then

    E[X \mid \mathcal{G}] = E[X].

  5. (Measurability) If X is 𝒢-measurable, then

    E[X \mid \mathcal{G}] = X.

  6. (Positivity) If X ≥ 0 (a.s.), then E[X | 𝒢] ≥ 0 (a.s.).
  7. (Monotone Convergence) If X_n, n ≥ 1, is a nonnegative sequence of random variables on (Ω, ℱ, ℙ) that increases (a.s.) to X, then the sequence E[X_n | 𝒢], n ≥ 1, increases (a.s.) to E[X | 𝒢].
  8. (Pulling out what is known) If Y is 𝒢-measurable and XY is integrable, then

    E[XY \mid \mathcal{G}] = Y\,E[X \mid \mathcal{G}].

  9. (Conditional Jensen's Inequality) If φ : ℝ → ℝ is a convex function and X is integrable, then

    E[\varphi(X) \mid \mathcal{G}] \ge \varphi\big(E[X \mid \mathcal{G}]\big).
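The following minimal sketch (a hypothetical four-point sample space, not from the text) illustrates properties 2 and 8 concretely, with 𝒢 generated by a two-block partition so that E[X | 𝒢] is the block-wise conditional average:

```python
import numpy as np

# Finite-space illustration of Theorem 9.6: Omega = {0, 1, 2, 3} with
# probabilities p, and G generated by the partition {0, 1} | {2, 3}.
p = np.array([0.1, 0.2, 0.3, 0.4])
X = np.array([1.0, 3.0, -2.0, 5.0])
blocks = [np.array([0, 1]), np.array([2, 3])]     # the partition generating G

def cond_exp(Z):
    """E[Z | G](omega): constant on each block, equal to the block average."""
    out = np.empty_like(Z, dtype=float)
    for b in blocks:
        out[b] = np.sum(p[b] * Z[b]) / np.sum(p[b])
    return out

EX_G = cond_exp(X)
print("E[X|G] =", EX_G)

# Property 2 (nested expectation): E[ E[X|G] ] = E[X].
print(np.sum(p * EX_G), "=", np.sum(p * X))

# Property 8 (pulling out what is known): for G-measurable Y (constant on
# each block), E[XY | G] = Y E[X | G].
Y = np.array([2.0, 2.0, -1.0, -1.0])
print(np.allclose(cond_exp(X * Y), Y * cond_exp(X)))
```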

9.5 Changing Probability Measures

Here we shall keep the discussion very succinct. Let us begin by defining a new probability measure ℙ̂ ≡ ℙ̂^{(ϱ)} by

\hat{\mathbb{P}}(A) \equiv \int_A d\hat{\mathbb{P}}(\omega) := \int_A \varrho(\omega)\,d\mathbb{P}(\omega), \qquad (9.109)

for all A ∊ ℱ, i.e., ℙ̂(A) ≡ Ê[𝕀_A] := E[ϱ𝕀_A], where ϱ is chosen to be a nonnegative (a.s.) random variable on (Ω, ℱ) having unit expectation under measure ℙ:

E[\varrho] \equiv \int_\Omega \varrho(\omega)\,d\mathbb{P}(\omega) = 1.

Note that in order for ℙ̂ to be a probability measure we necessarily have 1 = ℙ̂(Ω) = E[ϱ𝕀_Ω] = E[ϱ]. The measure ℙ̂ is also countably additive from the countable additivity property of the Lebesgue integral w.r.t. ℙ, i.e., for any countable collection of pairwise disjoint sets A_i ∊ ℱ we have, setting A ≡ ∪_i A_i in (9.109),

\hat{\mathbb{P}}\Big(\bigcup_i A_i\Big) = E\big[\varrho\,\mathbb{I}_{\cup_i A_i}\big] = E\Big[\varrho \sum_i \mathbb{I}_{A_i}\Big] = \sum_i E\big[\varrho\,\mathbb{I}_{A_i}\big] = \sum_i \hat{\mathbb{P}}(A_i).

Hence, ℙ̂ is a probability measure and (Ω, ℱ, ℙ̂) is a probability space.

The random variable ϱ corresponds to the so-called Radon–Nikodym derivative of ℙ̂ w.r.t. ℙ. Note that dℙ̂(ω) = ϱ(ω) dℙ(ω). It is customary notation to denote ϱ by dℙ̂/dℙ, i.e., dℙ̂(ω) = (dℙ̂/dℙ)(ω) dℙ(ω). The notation arises naturally when we compute the expectation of a random variable in the two different measures. The expectation of a random variable X in the original ℙ-measure is denoted by E[X] and we let Ê[X] denote the expectation in the new ℙ̂-measure. Using the definition in (9.109), the expectation of X under measure ℙ̂ equals the expectation of Xϱ ≡ X dℙ̂/dℙ under measure ℙ:

\hat{E}[X] = \int_\Omega X(\omega)\,d\hat{\mathbb{P}}(\omega) = \int_\Omega X(\omega)\varrho(\omega)\,d\mathbb{P}(\omega) = E[X\varrho] \equiv E\Big[X\,\frac{d\hat{\mathbb{P}}}{d\mathbb{P}}\Big]. \qquad (9.110)

Moreover, if ϱ is strictly positive (a.s.) and Y is integrable under measure ℙ, then its expectation under measure ℙ equals the expectation of Y/ϱ ≡ Y dℙ/dℙ̂ under measure ℙ̂:

E[Y] = \int_\Omega Y(\omega)\,d\mathbb{P}(\omega) = \int_\Omega Y(\omega)\,\frac{1}{\varrho(\omega)}\,d\hat{\mathbb{P}}(\omega) = \hat{E}\Big[\frac{Y}{\varrho}\Big] \equiv \hat{E}\Big[Y\,\frac{d\mathbb{P}}{d\hat{\mathbb{P}}}\Big]. \qquad (9.111)

Note that dℙ/dℙ̂ = (dℙ̂/dℙ)⁻¹ = 1/ϱ for any strictly positive ϱ.
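On a finite sample space, (9.109) through (9.111) can be checked directly (a sketch with hypothetical numbers; ϱ is chosen positive with E[ϱ] = 1):

```python
import numpy as np

# Sketch of (9.109)-(9.111) on a finite sample space: pick a positive
# Radon-Nikodym derivative rho with E[rho] = 1, then check the two
# change-of-measure identities for expectations.
p = np.array([0.2, 0.3, 0.5])            # original measure P on Omega = {0,1,2}
rho = np.array([1.5, 1.0, 0.8])          # positive, with E[rho] = 1
assert np.isclose(np.sum(p * rho), 1.0)  # 0.3 + 0.3 + 0.4 = 1

p_hat = p * rho                          # the new measure P^(A) = E[rho 1_A]
X = np.array([10.0, -4.0, 2.0])

print(np.sum(p_hat * X), "=", np.sum(p * rho * X))        # (9.110)
print(np.sum(p * X), "=", np.sum(p_hat * (X / rho)))      # (9.111)
```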

Remark: Here we don't state the formal Radon–Nikodym theorem. There are different versions of it in measure theory where the measures can be more general than probability measures. One version of the Radon–Nikodym theorem goes as follows. Given two finite measures μ and ν on a space (Ω, ℱ), where ν is absolutely continuous w.r.t. μ (i.e., all sets of μ-measure zero are also of ν-measure zero), there is a nonnegative ℱ-measurable function h ≡ dν/dμ such that the ν-measure of any set A ∊ ℱ is given by a Lebesgue integral w.r.t. μ: ν(A) = ∫_A h dμ. Moreover, this function is unique (μ-a.e.). For our purposes, the Radon–Nikodym theorem guarantees that, given two equivalent probability measures ℙ̂ and ℙ, there is a nonnegative random variable ϱ satisfying the above relations. Moreover, ϱ is unique (a.s.). The Radon–Nikodym theorem also applies to distribution measures of random variables and joint random variables.

We have already seen how measure changes are applied for discrete random variables. Let us briefly see how measure changes can be applied in the case of absolutely continuous random variables. Assume X ∊ ℝ has a PDF f(x) under measure ℙ and a PDF f̂(x) under measure ℙ̂. Although we can further generalize, we shall also assume that these densities are (a.e.) positive on ℝ. Then, for any b ∊ ℝ, we have the CDF of X under measure ℙ̂, F̂_X(b) ≡ μ̂_X((−∞, b]), as

\hat{F}_X(b) = \hat{E}\big[\mathbb{I}_{\{X \le b\}}\big] \equiv \int_{-\infty}^{b} \hat{f}(x)\,dx = \int_{-\infty}^{b} \frac{\hat{f}(x)}{f(x)}\,f(x)\,dx \equiv E\Big[\frac{\hat{f}(X)}{f(X)}\,\mathbb{I}_{\{X \le b\}}\Big]. \qquad (9.112)

Hence, the ratio of densities gives the Radon–Nikodym derivative for changing the distribution measure of X: dμ̂_X/dμ_X(x) = f̂(x)/f(x). This is known as a likelihood ratio. As a random variable, the Radon–Nikodym derivative is ϱ = ϱ(X) = f̂(X)/f(X). This also generalizes to the multidimensional case in the obvious manner as the ratio of the joint PDFs.

An example of the likelihood ratio in measure changes is to consider a normal random variable. Say that X ~ Norm(a, σ²), i.e., that it has mean a and variance σ² under measure ℙ. Then, defining the change of measure ℙ̂, i.e., the change of distribution measure μ_X → μ̂_X, by the Radon–Nikodym random variable

\varrho = \frac{d\hat{\mathbb{P}}}{d\mathbb{P}} = \frac{\sigma}{\hat{\sigma}} \exp\left[\frac{(X-a)^2}{2\sigma^2} - \frac{(X-\hat{a})^2}{2\hat{\sigma}^2}\right] \equiv h(X)

we have that X ~ Norm(â, σ̂²) under measure ℙ̂. This follows from the above likelihood ratio of densities. The PDF of X under measure ℙ̂ equals its PDF under measure ℙ times the likelihood ratio dμ̂_X/dμ_X(x) = h(x) = f̂(x)/f(x). Hence,

h(X) = \frac{\hat{f}(X)}{f(X)} = \frac{\frac{1}{\hat{\sigma}\sqrt{2\pi}}\,e^{-(X-\hat{a})^2/2\hat{\sigma}^2}}{\frac{1}{\sigma\sqrt{2\pi}}\,e^{-(X-a)^2/2\sigma^2}}

which gives the above result. We note that this change of measure changes both the mean and variance of a normal random variable.
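A numerical sanity check of this normal change of measure (a sketch; the parameter values are hypothetical, and σ̂ < σ is chosen so that the likelihood-ratio weights remain bounded): sampling X under ℙ and weighting by h(X) should reproduce the Norm(â, σ̂²) moments under ℙ̂.

```python
import numpy as np

# Sample X ~ Norm(a, sigma^2) under P, weight by the likelihood ratio
# h(X) = f^(X)/f(X), and recover the Norm(a^, sigma^^2) moments under P^.
rng = np.random.default_rng(seed=11)
a, sigma = 0.0, 1.5          # P-parameters (hypothetical)
a_hat, sigma_hat = 1.0, 1.0  # P^-parameters (hypothetical, sigma_hat < sigma)

n = 10**6
x = rng.normal(a, sigma, n)
h = (sigma / sigma_hat) * np.exp((x - a)**2 / (2.0 * sigma**2)
                                 - (x - a_hat)**2 / (2.0 * sigma_hat**2))

print(f"E[h(X)] ~ {np.mean(h):.4f} (exact: 1)")            # E[rho] = 1
print(f"E^[X]   ~ {np.mean(h * x):.4f} (exact: {a_hat})")  # mean under P^
var_hat = np.mean(h * x**2) - np.mean(h * x)**2
print(f"Var^(X) ~ {var_hat:.4f} (exact: {sigma_hat**2})")  # variance under P^
```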

The next (and last) result of this chapter gives us a formula for computing the expectation, under a given measure ℙ̂, of a random variable conditional on any sub-σ-algebra 𝒢 ⊂ ℱ via the corresponding conditional expectations of ϱX and ϱ under the equivalent measure ℙ. The new measure ℙ̂ is defined in (9.109) with (Radon–Nikodym derivative) random variable ϱ. The theorem is useful when considering measure changes while calculating conditional expectations involving stochastic processes such as those driven by Brownian motions. The sub-σ-algebras are part of a filtration for Brownian motion. The conditioning on the filtration simplifies even further when we are dealing with practical applications involving Markov processes.

Theorem 9.7

(General Bayes Formula). Let (Ω, ℱ, ℙ) and (Ω, ℱ, ℙ̂) be two probability spaces with ℙ ~ ℙ̂ (i.e., ℙ and ℙ̂ are equivalent probability measures). Let the random variable X be integrable w.r.t. ℙ̂ and set ϱ ≡ dℙ̂/dℙ. Then, the random variable ϱX is integrable w.r.t. ℙ and the expectation of X under measure ℙ̂ conditional on a sub-σ-algebra 𝒢 ⊂ ℱ is given by (a.s.)

\hat{E}[X \mid \mathcal{G}] = \frac{E[\varrho X \mid \mathcal{G}]}{E[\varrho \mid \mathcal{G}]}. \qquad (9.113)

Proof. All we need to verify is that the right-hand side of (9.113), Y := E[ϱX | 𝒢]/E[ϱ | 𝒢], is (almost surely) the random variable Ê[X | 𝒢], i.e., the expectation of X under ℙ̂ conditional on 𝒢. Note that Y must satisfy the two properties in Definition 6.12 with ℙ̂ as measure. Hence, we need to show: (i) Y is 𝒢-measurable; (ii) Ê[𝕀_A Y] = Ê[𝕀_A X], for every event A ∊ 𝒢. Property (i) is obviously satisfied since Y is a ratio of two 𝒢-measurable random variables and hence is 𝒢-measurable. Property (ii) is shown by first applying the change of measure ℙ̂ → ℙ for an unconditional expectation via (9.110), then making use of the tower property E[E[· | 𝒢]] = E[·] in reverse order, pulling out the 𝒢-measurable random variable 𝕀_A Y in the inner conditional expectation, cancelling the E[ϱ | 𝒢] term, re-applying the tower property, and changing back to measure ℙ̂ in the final expectation:

\hat{E}[\mathbb{I}_A Y] = E[\varrho\,\mathbb{I}_A Y] = E\big[E[\varrho\,\mathbb{I}_A Y \mid \mathcal{G}]\big] = E\big[\mathbb{I}_A Y\,E[\varrho \mid \mathcal{G}]\big] = E\big[\mathbb{I}_A\,E[\varrho X \mid \mathcal{G}]\big] = E\big[E[\varrho\,\mathbb{I}_A X \mid \mathcal{G}]\big] = E[\varrho\,\mathbb{I}_A X] = \hat{E}[\mathbb{I}_A X].

Finally, the assumption that X is integrable w.r.t. ℙ̂, i.e., Ê[|X|] < ∞, implies that ϱX is integrable w.r.t. ℙ by changing measures:

\hat{E}[|X|] = E[\varrho\,|X|] = E[|\varrho X|] < \infty.
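The general Bayes formula (9.113) can also be verified directly on a finite probability space (a sketch with hypothetical numbers, where 𝒢 is generated by a two-block partition):

```python
import numpy as np

# Finite-space check of the general Bayes formula (9.113):
# E^[X|G] = E[rho X|G] / E[rho|G], with G generated by {0,1} | {2,3}.
p = np.array([0.1, 0.4, 0.2, 0.3])       # measure P
rho = np.array([2.0, 0.5, 1.5, 1.0])     # dP^/dP, positive with E[rho] = 1
assert np.isclose(np.sum(p * rho), 1.0)
p_hat = p * rho                          # equivalent measure P^
X = np.array([1.0, -2.0, 4.0, 0.5])
blocks = [np.array([0, 1]), np.array([2, 3])]

def cond_exp(q, Z):
    """E_q[Z | G]: block-wise average of Z under the measure with weights q."""
    out = np.empty_like(Z, dtype=float)
    for b in blocks:
        out[b] = np.sum(q[b] * Z[b]) / np.sum(q[b])
    return out

lhs = cond_exp(p_hat, X)                          # E^[X | G] computed directly
rhs = cond_exp(p, rho * X) / cond_exp(p, rho)     # Bayes formula (9.113)
print(np.allclose(lhs, rhs), lhs)
```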

1The definition also extends to the case X : Ω → [0, ∞], where X can equal ∞, i.e., X = X · 𝕀_{0≤X<∞} + ∞ · 𝕀_{X=∞}, where the usual convention 0 · ∞ = 0 is adopted. If ℙ(X = ∞) = 0, then we simply take X = X · 𝕀_{0≤X<∞}. If ℙ(X = ∞) > 0, i.e., the set {ω ∊ Ω : X(ω) = ∞} has positive ℙ-measure, then E[X] = ∞.

2The Lebesgue measure m of an interval I_n = [a_n, b_n], or (a_n, b_n], or [a_n, b_n), or (a_n, b_n), is its length, m(I_n) = ℓ(I_n) := b_n − a_n, a_n ≤ b_n. The measure m of a Lebesgue-measurable set (for our purposes a Borel set) B is defined precisely as the smallest total length among all countable unions of intervals in ℝ that contain (cover) B, i.e., for any B ∊ ℬ(ℝ), m(B) := \inf\{\sum_{n=1}^{\infty} \ell(I_n) : B \subset \bigcup_{n=1}^{\infty} I_n\}.

3There also exist very special types of random variables X where F_X is not an absolutely continuous function, yet it is continuous with zero derivative F_X′(x) = 0 (a.e.), i.e., except on a set of Lebesgue measure zero. In this case X is said to be singularly continuous: the CDF is a nondecreasing continuous function with zero derivative (a.e.), and hence there does not exist a PDF f_X, i.e., (9.36) (and (9.38)) does not hold. The expectation of g(X) is still defined as a Lebesgue integral in (9.28). The so-called Cantor function on [0, 1] is a well-known textbook example of a CDF of a singularly continuous random variable that is uniformly distributed on the Cantor set C. The Cantor CDF is constant on the complement of the Cantor set, i.e., F_X′(x) ≡ 0 for x ∊ [0, 1] ∖ C. There are other known interesting properties of such random variables. However, throughout this text we will have no need for such singular cases, so that all continuous random variables are also absolutely continuous.
