Appendix A
In many situations we want information about variables that are not directly measured, but we do have information about variables that are correlated with the unmeasured ones. If this (cross-)correlation is known or can be estimated, it can be used to estimate the values of the unmeasured variables.
Consider, for instance, interest rates. Short term interest rates are quoted on a daily basis in the money markets for maturities up to, say, one year, but longer term interest rates are traded only indirectly through the bond markets. Theoretically, options depending on interest rates are priced according to a stochastic process describing the evolution in continuous time of the short term interest rate, even though this process is not directly observable. When modelling interest rate processes, the observed (or measured) variables are the bond prices.
The Wiener filter (see Madsen [2007]) is an example where the known cross-correlation is used together with the projection theorem to estimate an unmeasured time series based on a measured time series. The Kalman filter, which will be introduced in this appendix, can be viewed as an online version of the Wiener filter.
The main goal of this appendix is to present the projection theorem and to illustrate its wide range of applications. Finally, the theorem is used to formulate the (ordinary) Kalman filter. The contents of this appendix are based on Madsen [1992] and Brockwell and Davis [1991]; more information about the theory and applications of the projection theorem can be found in those references.
One of the advantages of considering the projection theorem as formulated in this appendix is that many well-known concepts from two- and three-dimensional Euclidean geometry, such as orthogonality, carry over to the more general Hilbert spaces considered in the following.
Using the projection theorem, a unified set of equations can be applied in different contexts. Hence many of the methods used in time series analysis, such as prediction, filtering and estimation, can be treated in a unified framework.
A Hilbert space is simply an inner-product space, i.e. a vector space supplied with an inner product, with an additional property of completeness. The inner product is a natural generalization of the inner (or scalar) product of two vectors in n-dimensional Euclidean space. Since many of the properties of Euclidean space carry over to the inner-product spaces, it will be helpful to keep Euclidean space in mind in all that follows.
Let us first consider a well-known inner-product space, namely the Euclidean space.
Example A.1 (Euclidean space).
The set ℝk of all real column vectors x = (x1,..., xk)′
is a real inner-product space if we define
〈x, y〉 = x′y = x1y1 + ··· + xkyk.
It is a simple matter to check that the conditions above are all satisfied.
Definition A.1 (Norm).
The norm of an element x of an inner-product space is defined to be
||x|| = √〈x, x〉.
Note that ||x|| > 0 if x ≠ 0.
In the Euclidean space ℝk the norm of the vector is simply its length.
Definition A.2 (The angle between elements).
The angle θ between two nonzero elements x and y belonging to any real inner-product space is defined by
cos θ = 〈x, y〉 / (||x|| ||y||).
In particular x and y are said to be orthogonal if and only if 〈x, y〉 = 0.
Now let us define the Hilbert space:
Definition A.3 (Hilbert space).
A Hilbert space ℋ is a vector space, equipped with an inner product, in which every Cauchy sequence (xn) converges in norm to some element x ∊ ℋ. The inner-product space is then said to be complete.
Example A.2 (Euclidean space).
The completeness of the inner-product space ℝk can be verified. Thus ℝk is a Hilbert space.
Example A.3 (The space L2(Ω, ℱ, ℙ)).
Consider a probability space (Ω, ℱ, ℙ) and the collection C of all random variables X defined on Ω and satisfying the condition E[X²] < ∞. It is rather easy to show that C is a vector space.
For any two elements X, Y ∊ C we now define the inner product
〈X, Y〉 = E[XY].
Norm convergence of a sequence (Xn) of elements of L2 to the limit X means that
||Xn − X||² = E[(Xn − X)²] → 0 as n → ∞.
Norm convergence of Xn to X in an L2 space is called mean-square convergence and is written as Xn → X in mean square (m.s.).
To complete the proof that L2 is a Hilbert space we need to establish completeness, i.e. that if ||Xn − Xm|| → 0 as m, n → ∞, then there exists X ∊ L2 such that Xn → X in norm (see Brockwell and Davis [1991]).
Let us start by considering two simple applications which illustrate the projection theorem in the two types of Hilbert spaces.
Example A.4 (Linear approximation in ℝ3).
Suppose we are given the three vectors x1 = (1, 0, 1/4)′, x2 = (0, 1, 1/4)′ and y = (1/4, 1/4, 1)′ in ℝ3.
Our problem is to find the linear combination α1x1 + α2x2 which is closest to y in the sense that S = ||y − α1x1 − α2x2||² is minimized.
One approach to this problem is to write S in the form S = (1/4 − α1)² + (1/4 − α2)² + (1 − (1/4)α1 − (1/4)α2)² and then to use calculus to minimize with respect to α1 and α2. In the alternative geometric approach we observe that the required vector α1x1 + α2x2 is the vector in the plane determined by x1 and x2 such that y − α1x1 − α2x2 is orthogonal to that plane (see the figure). The orthogonality condition may be stated as
〈y − α1x1 − α2x2, xi〉 = 0, i = 1, 2,
or equivalently
〈y, xi〉 = α1〈x1, xi〉 + α2〈x2, xi〉, i = 1, 2.
By solving these two equations for the particular values of x1, x2 and y specified, it is seen that α1 = α2 = 4/9, and hence ŷ = α1x1 + α2x2 = (4/9, 4/9, 2/9)′.
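The computation is easy to verify numerically. The following sketch (using NumPy, with the vectors as given in the example) solves the orthogonality conditions in their normal-equation form X′Xα = X′y:

```python
import numpy as np

# Columns of X are the spanning vectors x1, x2; y is the target vector.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.25, 0.25]])
y = np.array([0.25, 0.25, 1.0])

# Normal equations X'X alpha = X'y (the orthogonality conditions).
alpha = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ alpha

# The residual y - y_hat is orthogonal to both spanning vectors.
residual_ip = X.T @ (y - y_hat)
```

Solving gives α1 = α2 = 4/9 and ŷ = (4/9, 4/9, 2/9)′, with the residual orthogonal to the plane spanned by x1 and x2.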
Example A.5 (Linear approximation in L2(Ω, ℱ,P)).
Now suppose that X1, X2 and Y are random variables in L2(Ω, ℱ, ℙ). Only X1 and X2 are observed and we wish to estimate the value of Y by using the linear combination α1X1 + α2X2 which minimizes the mean square error
E[(Y − α1X1 − α2X2)²].
As in the previous example there are at least two possible approaches to the problem. The first is to write
S(α1, α2) = E[(Y − α1X1 − α2X2)²],
expand the square and then minimize with respect to α1 and α2.
The second method is to use the same geometric approach as in the previous example. Our aim is to find the element Ŷ = α1X1 + α2X2 in the set
ℳ = {α1X1 + α2X2 : α1, α2 ∊ ℝ}
that makes the mean square error as small as possible. By analogy with the previous example we might expect the minimizer Ŷ to have the property that Y − Ŷ is orthogonal to all elements of ℳ. Applying this to our present problem, we can write
〈Y − α1X1 − α2X2, Xi〉 = 0, i = 1, 2,
or, equivalently, by the linearity of the inner product,
〈Y, Xi〉 = α1〈X1, Xi〉 + α2〈X2, Xi〉, i = 1, 2. (A.17)
These are the same equations for α1 and α2 as in the previous example, although the inner product is of course defined differently in (A.17). In terms of expectations we can rewrite (A.17) in the form
E[YXi] = α1E[X1Xi] + α2E[X2Xi], i = 1, 2,
from which α1 and α2 are easily found.
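The sample analogue of these equations can be illustrated with simulated data. In the sketch below (the coefficients 1 and 2 and the dependence structure are assumed values, chosen purely for illustration) the expectations are replaced by sample means, and the two equations are solved for α1 and α2:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Simulated random variables; Y depends linearly on X1, X2 plus noise.
X1 = rng.normal(size=n)
X2 = 0.5 * X1 + rng.normal(size=n)
Y = 1.0 * X1 + 2.0 * X2 + rng.normal(size=n)

# Sample analogue of the inner product <U, V> = E[UV].
def ip(u, v):
    return np.mean(u * v)

# Normal equations E[Y Xi] = a1 E[X1 Xi] + a2 E[X2 Xi], i = 1, 2.
G = np.array([[ip(X1, X1), ip(X2, X1)],
              [ip(X1, X2), ip(X2, X2)]])
b = np.array([ip(Y, X1), ip(Y, X2)])
a1, a2 = np.linalg.solve(G, b)
```

Since the noise is independent of X1 and X2, the solution recovers (up to sampling error) the coefficients 1 and 2 used to generate Y.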
Before establishing the projection theorem for a general Hilbert space we need to introduce some new terminology.
Definition A.4 (Closed subspace).
A linear subspace ℳ of a Hilbert space ℋ is said to be a closed subspace of ℋ if ℳ contains all of its limit points (i.e. if xn ∊ ℳ and ||xn − x|| → 0 imply that x ∊ ℳ).
Definition A.5 (Orthogonal complement).
The orthogonal complement of a subset ℳ of ℋ is defined to be the set ℳ⊥ of all elements of ℋ which are orthogonal to every element of ℳ. Thus
ℳ⊥ = {x ∊ ℋ : 〈x, y〉 = 0 for all y ∊ ℳ}.
Theorem A.1.
If ℳ is any subset of a Hilbert space ℋ then ℳ⊥ is a closed subspace of ℋ.
Proof. Omitted.
Theorem A.2 (The projection theorem).
If ℳ is a closed subspace of the Hilbert space ℋ and x ∊ ℋ, then
1. there is a unique element x̂ ∊ ℳ such that
||x − x̂|| = inf{||x − y|| : y ∊ ℳ},
and
2. x̂ ∊ ℳ and ||x − x̂|| = inf{||x − y|| : y ∊ ℳ} if and only if x̂ ∊ ℳ and (x − x̂) ∊ ℳ⊥. The element x̂ is called the (orthogonal) projection of x onto ℳ.
Proof. Omitted – see Brockwell and Davis [1991].
Theorem A.3 (The projection mapping of ℋ onto ℳ).
If ℳ is a closed subspace of the Hilbert space ℋ and I is the identity mapping on ℋ, then there is a unique mapping Pℳ of ℋ onto ℳ such that I − Pℳ maps ℋ onto ℳ⊥. Pℳ is called the projection mapping of ℋ onto ℳ.
Proof. By the projection theorem, for each x ∊ ℋ there is a unique element x̂ ∊ ℳ such that x − x̂ ∊ ℳ⊥. The required mapping is therefore Pℳx = x̂.
Theorem A.4 (Properties of projection mappings).
Let ℋ be a Hilbert space and let Pℳ denote the projection mapping onto a closed subspace ℳ. Then
each x ∊ ℋ has a unique representation as a sum of an element of ℳ and an element of ℳ⊥, i.e.
x = Pℳx + (I − Pℳ)x.
Proof. Omitted – but rather obvious from a geometrical point of view.
In the following a set of equations, the so-called prediction equations, will be derived. The equations describe how to find the projection that gives the minimum mean square error (minimum MSE).
Given a Hilbert space ℋ, a closed subspace ℳ and an element x ∊ ℋ, the projection theorem shows that the element of ℳ closest to x is the unique element x̂ = Pℳx such that
〈x − x̂, y〉 = 0 for all y ∊ ℳ. (A.24)
Compare the general equation above with the special cases in the examples prior to projection theorem.
The quantity x̂ = Pℳx is frequently called the best predictor of x in the subspace ℳ.
Remark A.1.
It is helpful to visualize the projection theorem in terms of Figure A.1, which depicts the special case in which ℋ = ℝ3, ℳ is the plane determined by x1 and x2, and x̂ = Pℳx. The prediction equation (A.24) is simply the statement that x − x̂ must be orthogonal to ℳ. The projection theorem tells us that x̂ is uniquely determined by this condition for any Hilbert space ℋ and closed subspace ℳ.
The projection theorem and the prediction equations play fundamental roles in time series analysis, especially for estimation, approximation, filtering and prediction. Examples will be given.
Example A.6 (Minimum MSE linear prediction).
Let {Xt, t = 0, ±1,...} be a stationary process on (Ω, ℱ, ℙ) with mean zero and autocovariance function γ(·), and consider the problem of finding the linear combination
X̂n+1 = ϕn1Xn + ϕn2Xn−1 + ··· + ϕnnX1
which best approximates Xn+1 in the sense that E[(Xn+1 − X̂n+1)²] is minimized. This problem is easily solved with the aid of the projection theorem by taking ℋ = L2(Ω, ℱ, ℙ) and ℳ = span{X1,..., Xn}. Since minimization of E[(Xn+1 − X̂n+1)²] is identical to minimization of the squared norm ||Xn+1 − X̂n+1||², we see at once that X̂n+1 = PℳXn+1. The prediction equations are
〈Xn+1 − X̂n+1, Y〉 = 0 for all Y ∊ ℳ,
which, by the linearity of the inner product, are equal to the n equations
〈X̂n+1, Xj〉 = 〈Xn+1, Xj〉, j = 1,..., n.
Recalling the definition 〈X, Y〉 = E[XY] of the inner product in L2(Ω, ℱ, ℙ), we see that the prediction equations can be written in the form
Γnϕn = γn,
where ϕn = (ϕn1,..., ϕnn)′, γn = (γ(1),..., γ(n))′ and Γn = [γ(i − j)], i, j = 1,..., n. The projection theorem guarantees that there is at least one solution ϕn to the problem. If Γn is singular then there are infinitely many solutions, but the projection theorem guarantees that every solution will give the same (uniquely defined) predictor X̂n+1.
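As a concrete illustration, consider an AR(1) process, for which γ(h) = σ²a^|h|/(1 − a²) and the best linear predictor of Xn+1 is known to be aXn. The sketch below (the parameter values a = 0.6, σ² = 1 and n = 5 are assumed purely for illustration) builds Γn and γn and solves the prediction equations:

```python
import numpy as np

# Autocovariance of a zero-mean AR(1) process X_t = a X_{t-1} + e_t:
# gamma(h) = sigma2 * a^|h| / (1 - a^2).  Parameters assumed for illustration.
a, sigma2, n = 0.6, 1.0, 5
gamma = lambda h: sigma2 * a ** abs(h) / (1 - a ** 2)

# Prediction equations Gamma_n phi_n = gamma_n.
Gamma = np.array([[gamma(i - j) for j in range(n)] for i in range(n)])
g = np.array([gamma(h) for h in range(1, n + 1)])
phi = np.linalg.solve(Gamma, g)
```

For the AR(1) model the solution is ϕn = (a, 0,..., 0)′, i.e. only the most recent observation carries predictive weight, as expected.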
It is well known that the conditional expectation plays a central role in time series analysis, as the optimal prediction (under some mild assumptions) is found using the conditional expectation.
Consider the random variables Y and X from L2.
Definition A.6 (The conditional expectation).
The conditional expectation of X given Y = y is
E[X|Y = y] = ∫ x fX|Y=y(x) dx,
where fX|Y=y(x) is the conditional density function for X given Y = y.
Remember that E[X|Y = y] is a number, whereas E[X|Y] is a stochastic variable.
It can be shown that the operator E[·|Y] on L2 has all the properties of a projection operator; in particular, E[X|Y] is the projection of X onto the closed subspace ℳY of all functions of Y in L2.
Theorem A.5 (Best mean square predictor).
The conditional expectation E[X|Y] is the best mean square predictor of X in ℳY, i.e. the best function of Y for predicting X.
Proof. Follows from the projection theorem.
However, the determination of projections onto ℳY is usually very difficult. On the other hand it is relatively easy to calculate the projection of X onto span{1, Y} ⊆ ℳY, i.e. the linear projection
Ê[X|Y] = a + BY, (A.30)
which gives the best linear function of Y (in the mean square sense) for predicting X.
The linear projection (A.30) is a projection of X onto a subspace of ℳY. Therefore it can never have a smaller mean square error than E[X|Y]. However, it is of great importance for the following reason:
If (Y, X)′ has a multivariate normal distribution then the conditional expectation is linear, i.e.
E[X|Y] = Ê[X|Y].
Let us now consider two multivariate stochastic variables X and Y and the corresponding second order representation (first and second order moments of (X, Y)′)
E[X] = μX,  E[Y] = μY,  Var[X] = ΣXX,  Var[Y] = ΣYY,  Cov[X, Y] = ΣXY = ΣYX′.
Theorem A.6 (Linear projection in L2).
Given the second order representation for (X, Y)′, the linear projection is given by
Ê[X|Y] = μX + ΣXYΣYY⁻¹(Y − μY)
and the variance of the projection error is
Var[X − Ê[X|Y]] = ΣXX − ΣXYΣYY⁻¹ΣYX.
Furthermore
Cov[X − Ê[X|Y], Y] = 0,
i.e. the error X − Ê[X|Y] is uncorrelated with Y.
Proof. Write the projection as Ê[X|Y] = a + BY. The prediction equations give
〈X − a − BY, 1〉 = 0 and 〈X − a − BY, Y〉 = 0.
Using the fact that in the multivariate case the inner product in L2 is 〈X, Y〉 = E[XY′], we get
E[X] = a + BE[Y] and E[XY′] = aE[Y]′ + BE[YY′].
By solving these equations and using the fact that ΣXY = E[XY′] − E[X]E[Y]′ we obtain
B = ΣXYΣYY⁻¹ and a = μX − BμY.
Hence the linear projection is
Ê[X|Y] = μX + ΣXYΣYY⁻¹(Y − μY).
The variance follows immediately:
Var[X − Ê[X|Y]] = ΣXX − ΣXYΣYY⁻¹ΣYX.
The orthogonality between the error X − Ê[X|Y] and Y follows directly from the projection theorem.
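The algebra of the theorem is easy to check numerically. In the sketch below the block covariances are assumed values chosen purely for illustration; the error covariance ΣXY − BΣYY comes out as zero, as the theorem states:

```python
import numpy as np

# Assumed second order representation for (X, Y), chosen for illustration.
Sxx = np.array([[2.0, 0.3], [0.3, 1.0]])   # Var[X]
Syy = np.array([[1.5, 0.2], [0.2, 1.0]])   # Var[Y]
Sxy = np.array([[0.5, 0.1], [0.2, 0.4]])   # Cov[X, Y]

B = Sxy @ np.linalg.inv(Syy)               # coefficient of the linear projection
err_var = Sxx - B @ Sxy.T                  # Var[X - E^[X|Y]]
err_cov_Y = Sxy - B @ Syy                  # Cov[X - E^[X|Y], Y]
```

The error covariance with Y vanishes identically, and the error variance matrix is symmetric, exactly as the theorem asserts.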
Theorem A.7.
If (X, Y)′ has a normal distribution then X|Y is normally distributed with mean
E[X|Y] = μX + ΣXYΣYY⁻¹(Y − μY)
and variance
Var[X|Y] = ΣXX − ΣXYΣYY⁻¹ΣYX.
The error X − E[X|Y] and Y are stochastically independent.
Proof. Omitted.
Let us illustrate the importance of the equations above by a couple of examples.
Example A.7 (Regression).
Let us consider the regression in L2 of Y on X,
Ŷ = Ê[Y|X] = θX,
and assume that E[X] = E[Y] = 0.
Note that, compared to the discussion above, we have interchanged X and Y. In order to compare the results directly with the ordinary LS estimator for the general linear model in ℝn, we have also interchanged the roles of X and θ.
The best estimator is found from the prediction equations
〈Y − θX, X〉 = 0,
or
E[(Y − θX)X′] = 0.
Then we get
E[YX′] = θE[XX′],
or
θ = E[YX′](E[XX′])⁻¹.
Compare this result with the well-known LS estimator θ̂ = (X′X)⁻¹X′Y in ℝn.
Next we give an example where the formulation of the linear projection above is used directly. As this example is very important, it is given its own section.
As mentioned in the introduction, the Kalman filter can be used for estimating some variables, which are not directly measured, by using some measured variables, which are correlated with the unmeasured variables. In the case of the Kalman filter the correlation between the unmeasured variables X and the measured variables Y is described by a linear state space model.
Consider the linear stochastic state space model
Xt+1 = AtXt + Btut + e1,t,
Yt = CtXt + e2,t,
where Xt is an m-dimensional state vector, ut is the input vector and Yt is the measured output vector. The matrices At, Bt and Ct are known and have appropriate dimensions.
The two white noise sequences {e1,t} and {e2,t} are mutually uncorrelated with variances Σ1,t and Σ2,t, respectively.
The matrices At, Bt, Ct, Σ1,t and Σ2,t might be time varying, as indicated by the notation. However, in the rest of this example we suppress the index t, although all the given results remain valid in the time varying case.
Let us consider the problem of estimating Xt+k given the observations {Ys; s = t, t − 1,...} and input {us, s = t − 1,...}. In the case k = 0 the problem is called reconstruction or filtering. The solution to this problem is given by the linear projection theorem.
It is clear that the linear projection theorem is also valid for the conditioned stochastic variable (X, Y)′|Z. If the stochastic variables have a normal distribution we get
E[X|Y, Z] = E[X|Z] + Cov[X, Y|Z]Var[Y|Z]⁻¹(Y − E[Y|Z]),
Var[X|Y, Z] = Var[X|Z] − Cov[X, Y|Z]Var[Y|Z]⁻¹Cov[Y, X|Z].
Let us now introduce
𝒴t = (Yt, Yt−1,..., Y1)′,
which is a vector of all observations until time t. The input is assumed to be known.
Further introduce the reconstruction and one-step predictions
X̂t|t = E[Xt|𝒴t],  X̂t|t−1 = E[Xt|𝒴t−1],  Ŷt|t−1 = E[Yt|𝒴t−1] = CX̂t|t−1,
and the variances
Σ^xx_t|t−1 = Var[Xt − X̂t|t−1],
Σ^yy_t|t−1 = Var[Yt − Ŷt|t−1] = CΣ^xx_t|t−1C′ + Σ2,
Σ^xy_t|t−1 = Cov[Xt − X̂t|t−1, Yt − Ŷt|t−1] = Σ^xx_t|t−1C′.
Then we have the Kalman filter.
Theorem A.8 (Kalman filter — Optimal reconstruction).
The reconstruction with the smallest mean square error is given by
X̂t|t = X̂t|t−1 + Σ^xy_t|t−1(Σ^yy_t|t−1)⁻¹(Yt − Ŷt|t−1),
and the variance of the reconstruction error is
Σ^xx_t|t = Σ^xx_t|t−1 − Σ^xy_t|t−1(Σ^yy_t|t−1)⁻¹Σ^yx_t|t−1.
Further, the reconstruction error and the observations are orthogonal, i.e.
Cov[Xt − X̂t|t, 𝒴t] = 0.
Proof. Let X = Xt, Y = Yt and Z = 𝒴t−1 and use the linear projection theorem. See e.g. Madsen [2007] for details.
Together with the equations for making one-step predictions in the state space model, the above equations give the Kalman filter. It is readily seen that the prediction equations are
X̂t+1|t = AX̂t|t + But,
Σ^xx_t+1|t = AΣ^xx_t|tA′ + Σ1,
with initial values
X̂1|0 = E[X1] and Σ^xx_1|0 = Var[X1].
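The reconstruction and prediction equations can be combined into a complete filter. The sketch below does this for a one-dimensional state space model without input; all parameter values (A, C, Σ1, Σ2, initial values) are assumed purely for illustration:

```python
import numpy as np

# Scalar state space model: X_{t+1} = A X_t + e1_t,  Y_t = C X_t + e2_t.
A, C, S1, S2 = 0.9, 1.0, 0.5, 1.0    # illustrative parameter values

rng = np.random.default_rng(1)
T = 200
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = A * x[t - 1] + rng.normal(scale=np.sqrt(S1))
    y[t] = C * x[t] + rng.normal(scale=np.sqrt(S2))

# Kalman filter: reconstruction step followed by prediction step.
x_pred, P_pred = 0.0, 1.0            # assumed E[X_1] and Var[X_1]
x_filt = np.zeros(T)
for t in range(T):
    # Reconstruction: project X_t onto the new observation Y_t.
    S_yy = C * P_pred * C + S2                  # innovation variance
    K = P_pred * C / S_yy                       # gain = Sxy (Syy)^-1
    x_filt[t] = x_pred + K * (y[t] - C * x_pred)
    P_filt = P_pred - K * C * P_pred
    # Prediction: propagate through the system equation.
    x_pred = A * x_filt[t]
    P_pred = A * P_filt * A + S1
```

The filtered estimate uses both the system dynamics and the observations, so its mean square error is smaller than that of the raw observations alone.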
We now leave the projections in L2 and continue by considering projections in ℝn.
Previously we showed that ℝn is a Hilbert space with the inner product
〈x, y〉 = x′y.
In many statistical applications it is convenient to consider the weighted inner product
〈x, y〉 = x′Σ⁻¹y,
where Σ is a positive definite symmetric matrix.
For both definitions of the inner product we have the norm
||x|| = √〈x, x〉.
Consider a closed subspace ℳ of the Hilbert space ℝn. The following theorem enables us to compute Pℳx directly from any specified set of vectors {x1,...,xm} (m < n) spanning ℳ.
Theorem A.9.
If xi ∊ ℝn, i = 1,..., m, and ℳ = span{x1,..., xm}, then
Pℳx = Xβ,
where X is the n × m matrix whose jth column is xj and β is any solution of
X′Xβ = X′x. (A.68)
Equation (A.68) has at least one solution for β, but the prediction Xβ is the same for all solutions. There is exactly one solution of (A.68) if and only if X′X is non-singular, and in this case
β = (X′X)⁻¹X′x.
Proof. Since Pℳx ∊ ℳ, we can write
Pℳx = Xβ for some β ∊ ℝm.
The prediction equations (A.24) are in this case equivalent to
〈x − Xβ, xj〉 = 0, j = 1,..., m,
or in matrix form
X′Xβ = X′x.
The existence of at least one solution for β is guaranteed by the existence of the projection Pℳx. The fact that Xβ is the same for all solutions is guaranteed by the uniqueness of Pℳx — see the projection theorem.
Remark A.2.
If {x1,..., xm} is a linearly independent set then there must be a unique vector β such that Pℳx = Xβ. This means that (A.68) must have a unique solution, which in turn implies that X′X is non-singular and
Pℳx = X(X′X)⁻¹X′x.
The matrix X(X′X)⁻¹X′ must be the same for all linearly independent sets {x1,..., xm} spanning ℳ, since Pℳ is uniquely defined.
Remark A.3.
Given a real n × n matrix M, how can we tell whether or not there is a subspace ℳ of ℝn such that Mx = Pℳx for all x ∊ ℝn? If there is such a subspace we say that M is a projection matrix. Such matrices are characterized by the next theorem.
Theorem A.10.
The n × n matrix M is a projection matrix if and only if
(a) M is symmetric, i.e. M = M′, and
(b) M is idempotent, i.e. M² = M.
Proof. Omitted — but it is easily verified that (a) and (b) are satisfied for the matrix X(X′X)⁻¹X′.
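Both properties are easy to check numerically for the matrix X(X′X)⁻¹X′. In the sketch below X is an arbitrary matrix with (almost surely) linearly independent columns:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))             # columns span a 3-dim subspace of R^6
M = X @ np.linalg.inv(X.T @ X) @ X.T    # projection onto span of the columns

sym = np.allclose(M, M.T)               # (a) symmetry
idem = np.allclose(M @ M, M)            # (b) idempotency
```

In addition, M leaves every element of the subspace unchanged, i.e. MX = X.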