Let $X = (X_t,\ t \in \mathbb{Z})$ be a real, square-integrable process with basis space $(\Omega, \mathcal{A}, P)$, let $\mathcal{B}_n$ be the sub-σ-algebra generated by $X_{n-j}$, $j \ge 0$, and let $\mathcal{M}_n$ be the closed linear subspace of $L^2(\Omega, \mathcal{A}, P)$ generated by the same variables and the constant 1.
We wish to predict $X_{n+h}$ from the observed variables $X_1, \ldots, X_n$. The strictly positive integer h is called the horizon of the prediction.
With respect to the quadratic error, the best predictor given $X_{n-j}$, $j \ge 0$, is the conditional expectation
$$X^*_{n+h} = E\left(X_{n+h} \mid X_{n-j},\ j \ge 0\right) = E\left(X_{n+h} \mid \mathcal{B}_n\right).$$
This is the orthogonal projection in $L^2(\Omega, \mathcal{A}, P)$ of $X_{n+h}$ onto $L^2(\Omega, \mathcal{B}_n, P)$.
The best linear predictor of $X_{n+h}$ is its orthogonal projection onto $\mathcal{M}_n$. If $(X_t)$ is Gaussian, it coincides with $X^*_{n+h}$.
A statistical predictor is a known function of the data:
$$p_n = g\left(X_1, \ldots, X_n\right).$$
The prediction error is written as:
$$E\left(X_{n+h} - p_n\right)^2 = E\left(X_{n+h} - X^*_{n+h}\right)^2 + E\left(X^*_{n+h} - p_n\right)^2,$$
since $p_n \in L^2(\Omega, \mathcal{B}_n, P)$ and $X_{n+h} - X^*_{n+h}$ is orthogonal to $L^2(\Omega, \mathcal{B}_n, P)$.
The first error term being structural, the statistician must seek to minimize the “statistical error”
$$E\left(X^*_{n+h} - p_n\right)^2.$$
The linear prediction error is similarly written as the sum of a statistical error and a structural error.
One may, generally speaking, distinguish between two classes of prediction methods: empirical methods and those based on the introduction of a model. This distinction is, in fact, imprecise, as empirical methods often contain an underlying model for which they are optimal.
The simplest empirical predictor is the empirical mean:
$$\bar{X}_n = \frac{1}{n}\sum_{t=1}^{n} X_t.$$
It has good properties for a model of the form:
$$X_t = m + \varepsilon_t, \quad t \in \mathbb{Z},$$
where $m \in \mathbb{R}$ and $(\varepsilon_t)$ is a (strong) white noise.
Then $X^*_{n+h} = m$, and the prediction error is written as:
$$E\left(X_{n+h} - \bar{X}_n\right)^2 = \sigma^2\left(1 + \frac{1}{n}\right), \quad \text{where } \sigma^2 = E\varepsilon_t^2.$$
Note that this predictor is unbiased, i.e. $E\left(\bar{X}_n - X_{n+h}\right) = 0$. When $\varepsilon_t$ is Gaussian, we know that $\bar{X}_n$ is the best unbiased statistical predictor of $X_{n+h}$. $\bar{X}_n$ may be calculated recursively using the formula:
$$\bar{X}_{n+1} = \bar{X}_n + \frac{1}{n+1}\left(X_{n+1} - \bar{X}_n\right).$$
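As an illustration, here is a minimal Python sketch (not from the original text; the variable names are ours) of this recursive computation; it checks that the running update matches the direct mean.

    import numpy as np

    rng = np.random.default_rng(0)
    x = 2.0 + rng.normal(size=500)      # model X_t = m + eps_t with m = 2

    xbar = 0.0
    for n, xn in enumerate(x, start=1):
        # recursive form: xbar_n = xbar_{n-1} + (X_n - xbar_{n-1}) / n
        xbar += (xn - xbar) / n

    print(xbar, x.mean())               # the two computations agree

The recursive form is what one would use online, updating the predictor as each new observation arrives.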
Exponential smoothing, a method widely used in practice, consists of assigning to the observations weights that tend to 0 at an exponential rate:
$$\hat{X}_{n+1} = c \sum_{j=0}^{n-1} q^j X_{n-j},$$
where 0 < q < 1 and c is a normalization constant.
Usually, we choose c = 1 − q and 0.7 ≤ q ≤ 0.95.
One may determine q empirically by comparing predictions with observations. Set
$$\Delta(q) = \sum_{t=n_0+1}^{n} \left(\hat{X}_t(q) - X_t\right)^2,$$
where $\hat{X}_t(q)$ denotes the predictor associated with q and with the data $X_1, \ldots, X_{t-1}$, and $n_0$ is chosen to be large enough (e.g. $n_0 = [n/2]$), and choose the q that minimizes $\Delta(q)$.
We will see in section 15.3 that exponential smoothing is optimal for a very particular underlying model.
The predictor is calculated recursively using the relation:
$$\hat{X}_{n+1} = (1 - q) X_n + q \hat{X}_n.$$
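The following Python sketch (illustrative names; the initialization $\hat{X}_1 = X_1$ is one common convention, not prescribed by the text) implements this recursion together with the empirical selection of q by minimization of $\Delta(q)$.

    import numpy as np

    def smooth_forecasts(x, q):
        # one-step exponential smoothing: xhat[t+1] = (1 - q) * x[t] + q * xhat[t];
        # xhat[t] predicts x[t] from x[0], ..., x[t-1]
        xhat = np.empty(len(x) + 1)
        xhat[0] = x[0]                  # initialization choice
        for t in range(len(x)):
            xhat[t + 1] = (1 - q) * x[t] + q * xhat[t]
        return xhat

    def select_q(x, qs, n0=None):
        # choose q minimizing Delta(q) = sum_{t > n0} (xhat_t(q) - x_t)^2
        n0 = len(x) // 2 if n0 is None else n0
        deltas = [np.sum((smooth_forecasts(x, q)[n0:len(x)] - x[n0:]) ** 2)
                  for q in qs]
        return qs[int(np.argmin(deltas))]

    rng = np.random.default_rng(0)
    x = np.cumsum(rng.normal(size=300))             # an observed series
    print(select_q(x, np.linspace(0.7, 0.95, 26)))  # grid follows 0.7 <= q <= 0.95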
Naive predictors are defined by:
$$\hat{X}_{n+h} = X_n, \quad n \ge 1.$$
These are good predictors when the observed phenomenon varies little, or rarely. Thus, in meteorology, the prediction “the weather tomorrow will be the same as today” is correct 75% of the time.
In fact, $X_n$ is the best predictor if and only if:
$$E\left(X_{n+1} \mid X_1, \ldots, X_n\right) = X_n, \quad n \ge 1,$$
that is, if and only if $(X_t,\ t \ge 1)$ is a martingale.
Notably, this is the case for the random walk model $X_n = \varepsilon_1 + \cdots + \varepsilon_n$, $n \ge 1$.
Trend adjustment consists of fitting the trend of the observed process by a function of the form $\sum_{j=1}^{k} \alpha_j f_j(t)$, where the $f_j$ are given and the $\alpha_j$ are to be estimated.
The chosen, linearly independent $f_j$ may be power functions, exponentials or logarithms, periodic functions, etc.
One underlying model could be of the form:
$$X_t = \sum_{j=1}^{k} \alpha_j f_j(t) + \varepsilon_t, \quad t \in \mathbb{Z},$$ [15.1]
where $(\varepsilon_t)$ is a white noise.
The $\alpha_j$ can be estimated using the least-squares method, by minimizing
$$\sum_{t=1}^{n} \left(X_t - \sum_{j=1}^{k} \alpha_j f_j(t)\right)^2.$$ [15.2]
Equalities [15.1] for $t = 1, \ldots, n$ are written in matrix form:
$$X = Z\alpha + \varepsilon,$$
where X is the observation vector, ε is the noise vector with n components, α is the vector of parameters to be estimated, and Z is the n × k matrix whose element in the tth row and jth column is $f_j(t)$. Z is assumed to be of rank k.
With this notation, the minimization of [15.2] gives the unique solution:
$$\hat{\alpha} = \left(Z'Z\right)^{-1} Z'X,$$
where $Z'$ is the transpose of Z.
It may be shown that $\hat{\alpha}$ is the unbiased linear estimator1 of minimal variance (Gauss–Markov theorem; see Theorem 5.9).
Thus, the predictor obtained at the horizon h is:
$$\hat{X}_{n+h} = \sum_{j=1}^{k} \hat{\alpha}_j f_j(n+h).$$
This is the unbiased linear predictor with minimal variance.
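A minimal Python sketch of this procedure (illustrative choices of ours: $f_1 = 1$, $f_2(t) = t$; the least-squares solve uses a numerically stable routine rather than forming $(Z'Z)^{-1}$ explicitly):

    import numpy as np

    rng = np.random.default_rng(0)
    n, h = 100, 5
    t = np.arange(1, n + 1)
    x = 1.0 + 0.05 * t + rng.normal(size=n)   # model [15.1] with f1 = 1, f2(t) = t

    # design matrix Z with Z[t, j] = f_j(t)
    Z = np.column_stack([np.ones(n), t])
    alpha_hat, *_ = np.linalg.lstsq(Z, x, rcond=None)  # solves (Z'Z) a = Z'x

    # predictor at horizon h: sum_j alpha_hat_j f_j(n + h)
    x_pred = np.array([1.0, n + h]) @ alpha_hat
    print(alpha_hat, x_pred)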
We first suppose that the observed process is an ARMA(p, q) process, and we seek a linear prediction of $X_{n+h}$.
Recall that an ARMA process has the two following representations (see Chapter 14, equations [14.6] and [14.8]):
$$X_t = \varepsilon_t + \sum_{j=1}^{\infty} \psi_j \varepsilon_{t-j}$$
and
$$X_t = \sum_{j=1}^{\infty} \pi_j X_{t-j} + \varepsilon_t.$$
Since $(\varepsilon_t)$ is the innovation of the process, we deduce the best linear predictor with horizon 1:
$$\hat{X}_{n+1} = \sum_{j=1}^{\infty} \psi_j \varepsilon_{n+1-j}$$ [15.3]
and
$$\hat{X}_{n+1} = \sum_{j=1}^{\infty} \pi_j X_{n+1-j}.$$ [15.4]
Relation [15.3] is not exploitable in practice, as the $\varepsilon_{n+1-j}$ are not observed. If $(X_t)$ is an AR(p) process, we may use [15.4], replacing the non-zero $\pi_j$ with their conditional maximum likelihood estimators $\hat{\pi}_j$ (section 14.5.2), whence the predictor
$$\hat{X}_{n+1} = \sum_{j=1}^{p} \hat{\pi}_j X_{n+1-j}.$$
Note that this predictor is no longer linear, since the $\hat{\pi}_j$ are functions of $X_1, \ldots, X_n$.
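A Python sketch of this AR(p) predictor (for a simulated AR(2); we estimate the $\hat{\pi}_j$ by conditional least squares, which coincides with conditional maximum likelihood in the Gaussian case):

    import numpy as np

    rng = np.random.default_rng(0)
    # simulate an AR(2): X_t = 0.5 X_{t-1} - 0.3 X_{t-2} + eps_t
    n, p = 500, 2
    x = np.zeros(n)
    for t in range(2, n):
        x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()

    # regress X_t on (X_{t-1}, ..., X_{t-p}) to estimate (pi_1, ..., pi_p)
    Z = np.column_stack([x[p - 1 - j : n - 1 - j] for j in range(p)])
    pi_hat, *_ = np.linalg.lstsq(Z, x[p:], rcond=None)

    # one-step predictor: xhat_{n+1} = sum_j pi_hat_j X_{n+1-j}
    x_next = x[::-1][:p] @ pi_hat
    print(pi_hat, x_next)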
In other cases, the situation is more complicated, since a direct estimation of the $\pi_j$ is not available. We may, however, obtain such an estimator by considering relation [14.8] from Chapter 14:
$$1 - \sum_{j=1}^{\infty} \pi_j z^j = \frac{P(z)}{Q(z)},$$
where P and Q are replaced by their estimators $\hat{P}$ and $\hat{Q}$.
For prediction in the ARIMA model, we refer to [BOS 87a], [BRO 91] and [GOU 83].
We only examine one particular case: consider an IMA(1,1) process defined by:
$$X_t - X_{t-1} = \varepsilon_t - \theta \varepsilon_{t-1},$$
where 0 < θ < 1.
A recursive calculation then shows that
$$\hat{X}_{n+1} = (1 - \theta) \sum_{j=0}^{\infty} \theta^j X_{n-j},$$
and the best predictor is obtained by exponential smoothing with q = θ.
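A quick numerical check of this identity in Python (the infinite sum is truncated at the available data, so the two forms differ only by an initialization term that vanishes geometrically):

    import numpy as np

    rng = np.random.default_rng(0)
    theta, n = 0.8, 400
    eps = rng.normal(size=n + 1)
    x = np.cumsum(eps[1:] - theta * eps[:-1])   # X_t - X_{t-1} = eps_t - theta eps_{t-1}

    # weighted form: (1 - theta) * sum_j theta**j X_{n-j}
    w = (1 - theta) * theta ** np.arange(n)
    direct = w @ x[::-1]

    # smoothing recursion with q = theta
    xhat = x[0]
    for xt in x[1:]:
        xhat = (1 - theta) * xt + theta * xhat

    print(direct, xhat)   # essentially equal for large n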
As an example, let us examine the case of an Ornstein–Uhlenbeck process (see Example 13.1, part 3):
$$X_t = \int_{-\infty}^{t} e^{-\theta(t-u)}\, dW(u), \quad t \in \mathbb{R},$$
with θ > 0.
Then
$$E\left(X_{t+h} \mid X_s,\ s \le t\right) = e^{-\theta h} X_t, \quad h > 0.$$
Indeed, $e^{-\theta h} X_t$ is a square-integrable function of $(X_s,\ s \le t)$, and for $s \le t$:
$$E\left[\left(X_{t+h} - e^{-\theta h} X_t\right) X_s\right] = 0,$$
as Wiener processes have independent increments, $X_{t+h} - e^{-\theta h} X_t = \int_t^{t+h} e^{-\theta(t+h-u)}\, dW(u)$ is $\sigma(W_{t+k} - W_t,\ 0 \le k \le h)$-measurable and $X_s$ is $\sigma(W_v - W_u,\ -\infty < u \le v \le s)$-measurable.
Therefore, $e^{-\theta h} X_t$ is the best linear predictor of $X_{t+h}$ given $(X_s,\ s \le t)$, but since Ornstein–Uhlenbeck processes are Gaussian, it is also the best nonlinear predictor.
A straightforward calculation shows that the prediction error is written as:
$$E\left(X_{t+h} - e^{-\theta h} X_t\right)^2 = \frac{1 - e^{-2\theta h}}{2\theta}.$$
For small h, it is therefore of order h, and as h becomes infinite, it tends to $1/(2\theta)$, the variance of $X_t$.
Estimating θ, one obtains the statistical predictor $e^{-\hat{\theta} h} X_t$.
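The following Python sketch simulates the process on a grid using its exact Gaussian transition (a standard discretization for $dX_t = -\theta X_t\, dt + dW_t$, chosen by us for this illustration) and compares the empirical prediction error with $(1 - e^{-2\theta h})/(2\theta)$:

    import numpy as np

    rng = np.random.default_rng(0)
    theta, dt, h, n = 0.5, 0.01, 1.0, 200_000
    a = np.exp(-theta * dt)
    s = np.sqrt((1 - a ** 2) / (2 * theta))   # exact transition std over one step

    x = np.empty(n)
    x[0] = rng.normal(scale=np.sqrt(1 / (2 * theta)))   # stationary start
    for t in range(1, n):
        x[t] = a * x[t - 1] + s * rng.normal()

    k = int(h / dt)                            # horizon h in grid steps
    mse = np.mean((x[k:] - np.exp(-theta * h) * x[:-k]) ** 2)
    print(mse, (1 - np.exp(-2 * theta * h)) / (2 * theta))   # close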
EXERCISE 15.1.– Let $(\varepsilon_t,\ t \in \mathbb{Z})$ be a strong white noise and Y be a real random variable such that $v^2 = EY^2 \in\ ]0, +\infty[$ and $EY = 0$. We assume that Y and the process $(\varepsilon_t)$ are independent, and we set
$$X_t = Y + \varepsilon_t, \quad t \in \mathbb{Z}.$$ [15.5]
1) Show that $(X_t)$ is a strictly stationary process. Calculate its autocovariance. Does it have a spectral density?
2) Show that $\frac{1}{n}\sum_{t=1}^{n} X_t$ converges in mean square when n tends to infinity and determine its limit. Do the same for the sequences
3) Show that $P_{\mathcal{M}_{t-s}} X_t = Y$ ($s \ge 1$), where $\mathcal{M}_{t-s}$ is the closed vector space generated by $(X_u,\ u \le t - s)$.
4) Show that [15.5] is the Wold decomposition of $(X_t)$.
5) We now seek to predict $X_{n+1}$, the measure of error being the quadratic error.
i) What is the best linear prediction of $X_{n+1}$ based on $(X_t,\ t \le n)$?
ii) What is the best linear prediction of $X_{n+1}$ based on $(X_t,\ 1 \le t \le n)$?
iii) Calculate the prediction errors associated with the predictors obtained in i) and ii). Study their asymptotic behavior.
iv) A statistician observes $X_1, \ldots, X_n$. Which predictor of $X_{n+1}$ might he/she choose?
6) Setting $\hat{Y}_n = \frac{1}{n}\sum_{t=1}^{n} X_t$, study the asymptotic behavior of $\hat{Y}_n - Y$. Construct an estimator and a confidence interval for Y.
EXERCISE 15.2.– Let $X = (X_t,\ t \in \mathbb{R})$ be a measurable process with values in $\mathbb{R}^d$. We will suppose that the density $f_{s,t}$ of $(X_s, X_t)$ exists for every pair $(s, t)$ such that $s \ne t$ and that the density f of $X_t$ does not depend on t. We set:
$$g_{s,t} = f_{s,t} - f \otimes f, \quad s \ne t.$$
We denote by $\|\cdot\|_p$ the usual norm on $L^p(\lambda)$, where λ is the Lebesgue measure on the underlying Euclidean space, and we make the hypothesis:
$$\sup_{T>0} \frac{1}{T} \int_0^T\!\! \int_0^T \left\|g_{s,t}\right\|_p\, ds\, dt < +\infty,$$
where $p \in [1, +\infty[$.
Now, to estimate f from $(X_t,\ 0 \le t \le T)$, we construct the kernel estimator (a sketch of its computation is given after this exercise):
$$f_T(x) = \frac{1}{T h_T^d} \int_0^T K\!\left(\frac{x - X_t}{h_T}\right) dt, \quad x \in \mathbb{R}^d,$$
where the kernel K and the bandwidth $h_T > 0$ are chosen by the statistician. In particular, we will choose K such that $\|K\|_q < +\infty$, with $(1/p) + (1/q) = 1$.
1) Show that the variance of $f_T(x)$ satisfies:
$$\operatorname{Var} f_T(x) \le \frac{C_p}{T h_T^{2d/p}},$$
specifying the constant $C_p$.
2) We now suppose that f is of class $C^2$, and that it and its (partial) derivatives are bounded. Evaluate the asymptotic bias of $f_T(x)$ when T → +∞.
3) How should $h_T$ be chosen to optimize the asymptotic quadratic error of $f_T(x)$?
4) Comment on the results obtained when p = 2 or p = ∞.
5) In the following, X is a one-dimensional, stationary, Gaussian process. Study the condition relative to X using the autocorrelation $\rho(u) = \operatorname{Corr}(X_0, X_u)$, $u \in \mathbb{R}$.
6) Use $f_T$ to construct estimators of $E(X_0)$ and $V(X_0)$. Show that these estimators are convergent. Study their asymptotic quadratic errors. Comment on the results.
7) We wish to predict $X_{T+h}$ (h > 0) from $(X_t,\ 0 \le t \le T)$. For this, we construct a kernel regression estimator. Study its asymptotic quadratic error, making some convenient hypotheses similar to the one made above on $g_{s,t}$.
Deduce a predictor and study the behavior of the statistical error of the prediction when T → +∞ (h fixed).
8) We now suppose that X is an Ornstein–Uhlenbeck process with parameter θ > 0. Determine $E(X_{T+h} \mid X_T)$. Compare with the parametric predictor obtained by replacing θ by the maximum likelihood estimator θ* in $E(X_{T+h} \mid X_T)$.
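To illustrate the kernel estimator $f_T$ of Exercise 15.2, here is a discretized Python sketch (our assumptions: d = 1, a Gaussian kernel, and an Ornstein–Uhlenbeck path with θ = 1, whose stationary density is N(0, 1/(2θ)); the time integral is approximated by a mean over the simulated grid):

    import numpy as np

    def f_T(x, path, h):
        # discretized kernel estimator: (1/(T h)) * integral of K((x - X_t)/h) dt,
        # the integral being approximated by a mean over the grid points
        K = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)   # Gaussian kernel
        return np.mean(K((x - path) / h)) / h

    rng = np.random.default_rng(0)
    dt, n, theta = 0.01, 100_000, 1.0
    a = np.exp(-theta * dt)
    s = np.sqrt((1 - a ** 2) / (2 * theta))
    path = np.empty(n)
    path[0] = rng.normal(scale=np.sqrt(1 / (2 * theta)))
    for t in range(1, n):
        path[t] = a * path[t - 1] + s * rng.normal()

    hT = (n * dt) ** -0.2                           # one standard bandwidth choice
    print(f_T(0.0, path, hT), 1 / np.sqrt(np.pi))   # estimate vs true f(0)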
EXERCISE 15.3. Information inequality for prediction.– Reconsider the information inequality (Chapter 5), where g(θ) is replaced by g(X, θ) and the estimator is replaced by an unbiased predictor p(X), i.e. a predictor such that
$$E_\theta\left(p(X) - g(X, \theta)\right) = 0, \quad \theta \in \Theta.$$
Show, giving the necessary regularity conditions, that we obtain:
$$E_\theta\left(p(X) - g(X, \theta)\right)^2 \ge \frac{\left[E_\theta\left(\frac{\partial g}{\partial \theta}(X, \theta)\right)\right]^2}{I(\theta)},$$
where I(θ) is the Fisher information.
EXERCISE 15.4.– Given a Poisson process $(N_t,\ t \ge 0)$ with intensity θ, we observe $X = N_T$ and we wish to predict $N_{T+h}$ (h > 0), which is equivalent to the prediction of $E_\theta(N_{T+h} \mid N_T)$.
1) Determine $E_\theta(N_{T+h} \mid N_T)$.
2) Show that $p(N_T) = \frac{T+h}{T} N_T$ is an unbiased predictor (see Exercise 15.3).
3) Show that $p(N_T)$ is efficient (i.e. it reaches the bound obtained in Exercise 15.3).
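As a numerical companion to question 2, a short Monte Carlo check in Python (an illustration, not a proof; it relies on the independence of Poisson increments):

    import numpy as np

    rng = np.random.default_rng(0)
    theta, T, h, n = 2.0, 50.0, 5.0, 100_000

    N_T = rng.poisson(theta * T, size=n)          # N_T ~ Poisson(theta * T)
    N_Th = N_T + rng.poisson(theta * h, size=n)   # independent increment on (T, T+h]

    p = (T + h) / T * N_T                         # the predictor of question 2
    print(np.mean(p - N_Th))                      # ~ 0: E(p(N_T) - N_{T+h}) = 0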
1 With respect to the data.