A statistical model is said to be non-parametric if Θ is “vast”. When Θ is a vector space or a convex set, “vast” generally means “of infinite dimension”, otherwise the distinction between parametric and non-parametric models is not so clear.
EXAMPLE 8.1.–
1) A Gaussian or exponential model is parametric.
2) Let P0 be the set of probabilities on ℝ dominated by the Lebesgue measure λ. The model is non-parametric (we may set θ = dP/dλ, and Θ is a convex set of infinite dimension in L1(λ)).
3) Let P1 be the set of probabilities on ℝ which have a unique median. The model is non-parametric.
Non-parametric methods are interesting for three principal reasons:
1) They avoid errors due to the choice of a specific but often erroneous parametric model.
2) They guide the user in the choice of a parametric model.
3) In certain cases, they provide initial estimators for the parameters of a parametric model from which we may construct more precise estimators by successive approximations.
The theory of robustness is the study of decision rules whose efficiency is resistant to small deformations of a statistical model. There are therefore analogies between non-parametric and robust methods.
The empirical measure μn = (1/n)(δX1 + … + δXn), based on the sample (X1,…, Xn), allows us to construct numerous non-parametric estimators which have good asymptotic properties. We refer to Chapter 4 (section 4.1) for details.
When the model is dominated by the Lebesgue measure (see Example 8.1(2)), μn is not a strict estimator of the distribution. We are therefore led to regularize it by “distributing” the masses 1/n situated at the points Xi; a general method consists of regularizing by convolution: we are given a bounded probability density K such that lim|y|→∞ |y|K(y) = 0, and a positive sequence (hn) which tends to 0, and we set:

μn* = μn * Khn, where Khn(·) = (1/hn) K(·/hn).

Then μn* has the density:

fn(x) = (1/(nhn)) Σi=1,…,n K((x − Xi)/hn), x ∈ ℝ,

where fn is an estimator of f = dP/dλ whose convergence will be studied.
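As an illustration, the convolution estimator fn can be sketched in a few lines of Python (a minimal sketch: the Gaussian kernel, the sample values and the bandwidth hn = n^(−1/5) are illustrative choices, not taken from the text):

```python
import math

def gaussian_kernel(u):
    """Standard normal density: a bounded K with |y| K(y) -> 0 as |y| -> oo."""
    return math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)

def kernel_density(x, sample, h, K=gaussian_kernel):
    """Convolution estimator f_n(x) = (1/(n h)) * sum_i K((x - X_i)/h)."""
    n = len(sample)
    return sum(K((x - xi) / h) for xi in sample) / (n * h)

# Hypothetical sample; h_n = n^(-1/5) is a common illustrative bandwidth choice.
sample = [0.1, -0.4, 0.3, 1.2, -0.9, 0.0, 0.7, -0.2]
h = len(sample) ** (-0.2)
f_hat = kernel_density(0.0, sample, h)
```

Since K is a probability density, fn is itself a probability density for every n, which is what the regularization was designed to achieve.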
EXAMPLE 8.2.–
1) If K = (1/2) 1[−1,1], we obtain the natural estimator:

fn(x) = (Fn(x + hn) − Fn(x − hn)) / (2hn),

where Fn denotes the empirical distribution function.
2) If K is the standard Gaussian density, the obtained estimator is a mixture of Gaussian densities:

fn(x) = (1/n) Σi=1,…,n (1/(hn√(2π))) exp(−(x − Xi)²/(2hn²)).
The following results are due to Parzen [PAR 62].
LEMMA 8.1.– Let H be a real, bounded, λ-integrable function such that:

lim|y|→∞ |y| H(y) = 0.

We set:

gn(x) = (1/hn) ∫ H((x − y)/hn) g(y) dy,

where g is λ-integrable and (hn) → 0+. Then, at every point x where g is continuous:

gn(x) → g(x) ∫ H(y) dy.
PROOF.– Since ∫ H(y) dy = ∫ (1/hn) H(y/hn) dy, we have:

Δn := gn(x) − g(x) ∫ H(y) dy = (1/hn) ∫ H((x − y)/hn) [g(y) − g(x)] dy.

Then, for all δ > 0, we may split this integral over {|x − y| ≤ δ} and {|x − y| > δ}: the first part is small by continuity of g at x, and the second tends to 0 by the condition |y|H(y) → 0. We deduce that, for all ε > 0, |Δn| < ε for well-chosen δ and for large enough n. □
THEOREM 8.1.– If nhn → +∞, then fn(x) → f(x) in quadratic mean at every continuity point x of f.
PROOF.– K and K2 verify the conditions on the function H from Lemma 8.1. Consequently:

Efn(x) = (1/hn) ∫ K((x − y)/hn) f(y) dy → f(x) ∫ K(y) dy = f(x)

and

(1/hn) E[K²((x − X1)/hn)] → f(x) ∫ K²(y) dy.

From this, we deduce:

Var fn(x) = (1/(nhn²)) Var[K((x − X1)/hn)] ≤ (1/(nhn)) · (1/hn) E[K²((x − X1)/hn)] → 0,

therefore E(fn(x) − f(x))² = Var fn(x) + (Efn(x) − f(x))² → 0, and fn(x) → f(x) in quadratic mean. □
REMARK 8.1.–
1) It may be shown that the convergence also holds almost surely, under additional conditions on (hn).
2) Under stronger conditions, the chief among which is the existence of f(r), we have:

E(fn(x) − f(x))² = O(n^(−2r/(2r+1)))

for a suitable choice of hn.
(Xi, Yi), 1 ≤ i ≤ n, denoting a two-dimensional sample of (X, Y) such that the regression E(Y | X) is defined, we seek to estimate a specified version y = r(x) of this regression of Y on X.
Considerations analogous to those in the previous section lead us to construct a non-parametric estimator:

rn(x) = Σi=1,…,n Yi K((x − Xi)/hn) / Σi=1,…,n K((x − Xi)/hn)

(with the convention 0/0 = 0).
Under regularity conditions, it may be shown that rn(x) → r(x) in quadratic mean when nhn → +∞. The use of rn is of interest each time that r is not an affine function.
Application to prediction: If we observe X1,…, Xn+1 and Y1,…, Yn, then rn(Xn+1) is a predictor (or a prediction) of Yn+1.
EXAMPLE 8.3.–
1) Xj is the mean air pressure on Day j, and Yj is the amount of rainfall on Day j + 1: rn(Xn+1) is a predictor of the rainfall on Day n + 2.
2) X1,…, Xn+1 are the levels of cholesterol observed in the blood of n + 1 patients, and Y1,…, Yn are the levels of calcium observed in the blood of the first n patients: rn(Xn+1) is a prediction of the calcium level for the (n + 1)th patient.
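The kernel regression estimator rn and its use as a predictor can be sketched as follows (a minimal sketch: the data, the Gaussian weight function and the bandwidth are illustrative assumptions, not taken from the text):

```python
import math

def nw_regression(x, xs, ys, h):
    """Kernel regression estimate r_n(x): a weighted average of the Y_i
    with weights K((x - X_i)/h), here with a Gaussian-shaped kernel."""
    K = lambda u: math.exp(-u * u / 2.0)  # Gaussian kernel up to a constant
    weights = [K((x - xi) / h) for xi in xs]
    return sum(w * yi for w, yi in zip(weights, ys)) / sum(weights)

# Hypothetical data: Y depends on X through a non-affine function (about x^2),
# the case where this estimator is of real interest.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0.0, 0.25, 1.0, 2.3, 4.0, 6.2, 9.0]
pred = nw_regression(1.0, xs, ys, h=0.5)       # r_n at an observed design point
pred2 = nw_regression(2.0, xs, ys, h=0.5)      # prediction at a new point x = 2
```

In the prediction setting of the text, one would evaluate the fitted function at the new observation, i.e. compute rn(Xn+1).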
Let μ be a probability on a measurable space (E, ℰ); we seek to test H0 = {μ} against an alternative that will be specified later.
For this, we take {A1,…, Ak}, a measurable partition of E such that pj = μ(Aj) > 0, j = 1,…, k.
The construction of this test introduces the kernel of the space generated by the indicators 1A1,…, 1Ak. The following lemma gives the definition and the properties of this kernel:
LEMMA 8.2.– Let (E, ℰ, m) be a measure space (where m is σ-finite), and e(g1,…, gk) be the vector space generated by the functions gj that belong to L2(m). Finally, let h1,…, hk′ (k′ ≤ k) be an orthonormal basis of e(g1,…, gk). The function K, defined by:

K(x, y) = Σj=1,…,k′ hj(x) hj(y), (x, y) ∈ E × E,

is independent of the chosen basis. K is called the kernel of e(g1,…, gk).
PROOF.– Let g ∈ e(g1,…, gk). We have:

∫ K(x, y) g(y) m(dy) = Σj ⟨g, hj⟩ hj(x) = g(x).

Now let K′ be a second kernel associated with any orthonormal basis. As K(x, ·) and K′(z, ·) are in e(g1,…, gk), we have:

K′(z, x) = ∫ K(x, y) K′(z, y) m(dy) = K(x, z),

and since K and K′ are symmetric, we have K′ = K. □
Then let h1,…, hk be an orthonormal basis of e(1A1,…, 1Ak) in L2(μ), with h1 ≡ 1, and K be the kernel of this space. We set:

Tn = (1/n) Σi,i′=1,…,n K(Xi, Xi′),

where X1,…, Xn is a sample of the distribution μ; then EμTn = n + k − 1.
Moreover, let us consider the k-dimensional random vector:

Un = ( (1/√n) (Σi=1,…,n hj(Xi) − nδ1j) )j=1,…,k.

From the central limit theorem in ℝk, we have:

Un →d Y,

where Y = (0, ξ2,…, ξk), with ξ2,…, ξk independent and with distribution N(0, 1).
Since convergence in distribution is conserved by continuous transformations, we have:

‖Un‖² →d ‖Y‖² = ξ2² + … + ξk²;

therefore, ‖Un‖² converges in distribution to a χ2 with k − 1 degrees of freedom.
However,

‖Un‖² = (1/n) Σi,i′ K(Xi, Xi′) − n = Tn − n,

therefore

Tn − n →d χ2(k − 1).

Yet, following from the lemma, K may be computed with the orthonormal basis (1Aj/√pj)j=1,…,k:

K(x, y) = Σj=1,…,k 1Aj(x) 1Aj(y)/pj,

hence, writing Nj = #{i ≤ n: Xi ∈ Aj},

Tn = (1/n) Σj=1,…,k Nj²/pj,

and finally

Tn − n = Σj=1,…,k (Nj − npj)²/(npj).
We may construct a test based on the statistic Dn = Tn − n = Σj (Nj − npj)²/(npj), with critical region {Dn > c}. This is the χ2 test.
Since the test statistic depends only on the numbers of observations falling in the sets Aj, the alternative, H1, will be the set of distributions such that P(X1 ∈ Aj) ≠ pj for at least one value of j.
If c is such that P(χ2(k − 1) > c) = α, we deduce that the obtained test is consistent and of asymptotic size α, since the test statistic tends to +∞ almost surely when ν ∈ H1.
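A numerical sketch of the χ2 test (the die-throwing counts are hypothetical, and 11.07 is approximately the 0.95 quantile of χ2(5)):

```python
def chi_square_statistic(counts, probs):
    """D_n = sum_j (N_j - n p_j)^2 / (n p_j)."""
    n = sum(counts)
    return sum((nj - n * pj) ** 2 / (n * pj) for nj, pj in zip(counts, probs))

# Hypothetical data: n = 120 throws of a die, tested against p_j = 1/6.
counts = [18, 22, 16, 25, 20, 19]
probs = [1.0 / 6] * 6
d_n = chi_square_statistic(counts, probs)
reject = d_n > 11.07     # critical region {D_n > c}, asymptotic size 0.05
```

Here d_n is about 2.5, well below the critical value, so these counts give no evidence against the uniform hypothesis.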
REMARK 8.2.– The previous test only allows us to verify that P(X1 ∈ Aj) = pj for all j. To have a more precise test, we must vary k as a function of n.
REMARK 8.3.– In practice, the problem is posed in a more complicated manner: the pj are replaced by pj(θ), where θ is a parameter with values in an open subset of ℝd. The test statistic is then of the form:

Σj=1,…,k (Nj − npj(θ̂n))² / (npj(θ̂n)),

where k > d − 1, and θ̂n is the maximum likelihood estimator of θ. Then, under regularity conditions, it may be shown that this statistic converges in distribution to a χ2 with k − 1 − d degrees of freedom.
Recall: If F0 is a continuous distribution function and if:

Dn = supx |Fn(x) − F0(x)|,

where Fn denotes the empirical distribution function associated with a sample of size n, then:

√n Dn →d K,

where the distribution function of K is

H(y) = 1 − 2 Σj≥1 (−1)^(j−1) e^(−2j²y²), y > 0.
We thus have a test with critical region {√n Dn > wn}: the Kolmogorov–Smirnov test.
If wn = w with P(K > w) = α, we have a test of asymptotic size α for testing F = F0 against F ≠ F0. For F ≠ F0, we have Dn → supx |F(x) − F0(x)| > 0 almost surely, and consequently √n Dn → +∞: the test is consistent.
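A sketch of the computation of Dn and of the decision rule (the sample is hypothetical, tested against the uniform distribution on [0, 1]; 1.36 is approximately the value w with P(K > w) = 0.05):

```python
import math

def ks_statistic(sample, F0):
    """sup_x |F_n(x) - F0(x)|, computed at the jump points of F_n,
    where F_n jumps from i/n to (i+1)/n at the (i+1)-th order statistic."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        d = max(d, abs((i + 1) / n - F0(x)), abs(i / n - F0(x)))
    return d

# Hypothetical test of uniformity on [0, 1]: F0(x) = x.
sample = [0.05, 0.21, 0.34, 0.48, 0.52, 0.66, 0.71, 0.89]
d_n = ks_statistic(sample, lambda x: x)
stat = math.sqrt(len(sample)) * d_n   # to be compared with the quantiles of K
```

Since the supremum of |Fn − F0| over a piecewise-constant Fn is attained at a jump, it suffices to scan the order statistics, as done above.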
COMMENT 8.1.– This test uses more information than the χ2 test: it is often more precise.
The Cramer–von Mises test uses the statistic:

Δn = n ∫ (Fn(x) − F0(x))² dF0(x).
It may be shown that Δn →d C, where the distribution function of C is that of an “infinite χ2” distribution, which is written as

C = Σj≥1 Zj² / (j²π²),

where the Zj² are independent χ2 distributions with one degree of freedom.
From this, we have the convergent test of asymptotic size α and critical region Δn > c, with P(C > c) = α.
This test is more robust than the Kolmogorov–Smirnov test: it is more resistant to deformation of a statistical model.
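For computation, the Cramer–von Mises statistic admits the classical closed form 1/(12n) + Σi ((2i − 1)/(2n) − F0(X(i)))², a standard identity not derived in the text; a sketch (on the same hypothetical uniform-fit example as above):

```python
def cramer_von_mises(sample, F0):
    """C_n = 1/(12n) + sum_i ((2i - 1)/(2n) - F0(X_(i)))^2,
    the usual computing form of n * integral (F_n - F0)^2 dF0."""
    xs = sorted(sample)
    n = len(xs)
    return 1.0 / (12 * n) + sum(
        ((2 * i - 1) / (2.0 * n) - F0(x)) ** 2
        for i, x in enumerate(xs, start=1)
    )

# Hypothetical sample tested against F0(x) = x on [0, 1].
sample = [0.05, 0.21, 0.34, 0.48, 0.52, 0.66, 0.71, 0.89]
c_n = cramer_von_mises(sample, lambda x: x)
```

Unlike the Kolmogorov–Smirnov statistic, which retains only the largest discrepancy, this statistic averages the squared discrepancies over the whole range, which is one way to understand its greater resistance to local deformations.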
Tests based on the “ranks” of the observations are easy to put into practice and possess good asymptotic properties. Here we give some information about the Wilcoxon test.
Let X1,…, Xn and Y1,…, Ym be two independent samples of real random variables with respective densities f and g. We wish to test H0: f = g against H1: f ≠ g, and for this, we set:

U = Σi=1,…,n Σj=1,…,m 1{Xi ≤ Yj}.
We have E(U) = nmP(X1 ≤ Y1) and, if H0 is true, then:

E(U) = nm/2.
Furthermore, a simple calculation shows that:

VarH0(U) = nm(n + m + 1)/12,

from which we have the Wilcoxon test, with critical region

{|U − nm/2| > c}.
It may be established that U is asymptotically Gaussian and that this test is consistent for g such that:

P(X1 ≤ Y1) ≠ 1/2.
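A sketch of the statistic U and of its moments under H0 (the two samples are hypothetical):

```python
def wilcoxon_u(xs, ys):
    """U = number of pairs (i, j) with X_i <= Y_j."""
    return sum(1 for x in xs for y in ys if x <= y)

# Hypothetical samples; under H0 (f = g), E(U) = nm/2.
xs = [1.2, 0.4, 2.5, 0.9]
ys = [1.5, 2.0, 0.7]
u = wilcoxon_u(xs, ys)
n, m = len(xs), len(ys)
mean_h0 = n * m / 2.0                   # E(U) under H0
var_h0 = n * m * (n + m + 1) / 12.0     # Var(U) under H0
```

In practice one compares (U − nm/2)/√(nm(n + m + 1)/12) with Gaussian quantiles, using the asymptotic normality of U.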
The study of robust methods is quite delicate. We will simply give two examples and indicate the general definition1.
We wish to test θ = 0 against θ > 0 in the translation model Xi = θ + εi, i = 1,…, n, where the εi are i.i.d. with distribution P0 ∈ P0.
If P0 is the set of centered Gaussian distributions, Student’s t-test is uniformly most powerful (UMP) unbiased, and its critical region is of the form {√n X̄n/sn > c}.
Now, if P0 is the set of symmetric distributions with densities, we may use the Wilcoxon test for one sample; this is the test with critical region:

{ Σi Ri 1{Xi > 0} > c },

where Ri is the rank of |Xi| among the |Xj| (in other words, Ri = ri if ri is the number of |Xj|’s less than or equal to |Xi|).
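The one-sample statistic can be sketched as follows (ranks are computed assuming no ties among the |Xj|; the sample is hypothetical):

```python
def wilcoxon_signed_rank(xs):
    """V = sum of the ranks R_i of |X_i| among the |X_j|,
    taken over the indices i with X_i > 0 (assumes no ties)."""
    abs_sorted = sorted(abs(x) for x in xs)
    rank = {v: i + 1 for i, v in enumerate(abs_sorted)}
    return sum(rank[abs(x)] for x in xs if x > 0)

# Hypothetical sample that looks shifted to the right of 0.
xs = [0.8, -0.3, 1.1, 0.5, -0.2, 0.9]
v = wilcoxon_signed_rank(xs)   # under H0 (symmetry about 0), E(V) = n(n+1)/4
```

Large values of V indicate that the large absolute observations are mostly positive, which is exactly the evidence for θ > 0 that the critical region captures.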
To determine the asymptotic relative efficiency eV/T of the two tests of size α ∈ ]0, 1[, we denote by βn the power of T at the point θ for a sample of size n, and by νn the size2 of the sample for which V is of power βn at the point θ.
Then:

eV/T = 12σ² ( ∫ f²(x) dx )²,

where σ2 is the variance of P and f is its density.
It may be shown that eV/T varies from 0.864 (for P well chosen and with compact support) to +∞ (for σ2 = +∞ or f2 non-integrable), passing by the value 3/π ≈ 0.955 in the Gaussian case: V is much more resistant than T to deformations of the model, i.e. it is a more robust test than T.
Given a contamination model (in which a proportion of the observations comes from a perturbing distribution), we determine the asymptotic efficiency3 of an estimator T = (Tn) of θ by the formula:

e(T) = limn→∞ (In En)^(−1),

where In is the Fisher information on θ and En is the quadratic error of the estimator Tn.
For the empirical mean X̄n, the following table is obtained (independent of θ):
We see that the efficiency of X̄n decreases rapidly when the contamination increases.
Now, with [a] denoting the whole part of the number a and X(1),…, X(n) an ordered sample, we set:

X̄n,α = (1/(n − 2[nα])) Σi=[nα]+1,…,n−[nα] X(i),

where α is given in ]0, 1/2[. This estimator of θ is called the α-truncated mean (it is obtained by eliminating the smallest [nα] and the largest [nα] observations).
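A sketch of the α-truncated mean, contrasted with the empirical mean on a hypothetical sample containing one gross error:

```python
def trimmed_mean(sample, alpha):
    """alpha-truncated mean: drop the [n*alpha] smallest and [n*alpha]
    largest order statistics, then average the remaining ones."""
    xs = sorted(sample)
    n = len(xs)
    k = int(n * alpha)          # [n*alpha], the whole part
    kept = xs[k: n - k]
    return sum(kept) / len(kept)

# Hypothetical sample centered near 0, contaminated by one gross error.
sample = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 50.0]
plain_mean = sum(sample) / len(sample)       # dragged away by the outlier
robust_mean = trimmed_mean(sample, alpha=0.15)
```

A single aberrant value moves the empirical mean far from the center of the data, while the truncated mean discards it along with the extreme order statistics.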
For α = 0.03, we obtain the following table:
which shows that the α-truncated mean X̄n,α is more robust than the empirical mean X̄n.
A general definition of robustness was proposed by Hampel [HAM 71].
Let (Tn) be an estimator of θ associated with the model (P0,θ). We say that it is robust in P0,θ if, for all ε > 0, there exists δ > 0 such that, for all n,

ρ(Q, P0,θ) < δ implies ρ(LQ(Tn), LP0,θ(Tn)) < ε,

where LQ(Tn) denotes the distribution of Tn under Q, and ρ is the Prokhorov distance, defined by:

ρ(P, Q) = inf{ε > 0: P(A) ≤ Q(Aε) + ε for every Borel set A},

with Aε = {x: d(x, A) < ε}.
Note that the Prokhorov distance shows the deformation of the statistical model due to rounding and gross errors.
EXERCISE 8.1.– For every distribution function F, we define the generalized inverse:

F−1(u) = inf{x: F(x) ≥ u}, u ∈ ]0, 1[,

with the convention inf ∅ = +∞.
Let (Xi)i≥1 be an independent and identically distributed sequence. We write Fn for the empirical distribution function. With regard to the consistency and the normality of the empirical quantiles Fn−1(u), we have:
1) If #{x: F(x) = u} ≤ 1, then Fn−1(u) converges almost surely to F−1(u).
2) Let 0 < u1 < … < uk < 1. We suppose that the function F is differentiable at the points F−1(u1),…, F−1(uk), with a strictly positive derivative at these points. Then:

√n (Fn−1(u1) − F−1(u1), …, Fn−1(uk) − F−1(uk)) →d N(0, C),

where the matrix C is defined by Ci,j = (min(ui, uj) − ui uj) / (F′(F−1(ui)) F′(F−1(uj))).
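The generalized inverse of Fn reduces to an order statistic, Fn−1(u) = X(⌈nu⌉), which gives a direct sketch of the empirical quantiles (the sample is hypothetical):

```python
import math

def empirical_quantile(sample, u):
    """Generalized inverse of F_n: F_n^{-1}(u) = inf{x : F_n(x) >= u}.
    Since F_n(X_(k)) = k/n, this is the order statistic X_(ceil(n u))."""
    xs = sorted(sample)
    k = math.ceil(len(xs) * u)   # F_n first reaches level u at the k-th order statistic
    return xs[k - 1]

# Hypothetical sample of size 10.
sample = [3.0, 1.0, 4.0, 1.5, 5.0, 9.0, 2.6, 5.3, 5.0, 8.9]
median_hat = empirical_quantile(sample, 0.5)
```
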
Let (Zi)i≥1 be an independent and identically distributed sequence. We suppose that Z1 has a known, symmetric, and strictly positive density f. We observe Xi = λZi + θ for i = 1,…, n, λ being strictly positive. We write FZ and FX for the respective distribution functions of Z1 and X1.
1) Show that FX−1(u) = θ + λFZ−1(u) for every u ∈ ]0, 1[.
2) Let u ∈ ]0, 1/2[. Express θ and λ as functions of FX−1(1/2), FX−1(u) and FX−1(1 − u).
3) Give two strongly consistent estimators, θ̂n and λ̂n, of θ and λ, based on the empirical quantiles of the observations (Xi)1≤i≤n.
4) We further suppose that the density f is continuous. Determine the asymptotic behavior of:

√n (θ̂n − θ) and √n (λ̂n − λ).
EXERCISE 8.2.– Let X be a random variable with distribution function F. Let (X1,…, Xn) be an i.i.d. sample of Fθ(x) = F(x − θ). We are interested in the estimation of θ when F is the distribution function of a symmetric random variable of variance σ2.
1) i) Show that EθX1 = θ and that θ = argmintEθ(X1 − t)2.
ii) Show that the empirical mean X̄n satisfies:

√n (X̄n − θ) →d N(0, σ2).

2) i) Show that if F is continuous and strictly increasing, then Fθ−1(1/2) = θ and θ = argmintEθ|X1 − t|.
Hint: Use the equation:
and Fubini’s theorem to conveniently rewrite Eθ|X − t|.
3) Supposing F is the distribution function of the normal distribution N(0, σ2), compare the variances of the limits in distribution of the estimators of θ constructed from the mean and the empirical median.
4) The same as above when F has the density (1/2) e^(−|x|).
EXERCISE 8.3.– A statistician observes n independent and identically distributed random variables with distribution , and wishes to estimate θ > 0. He proposes the following three estimators:
where is the (generalized) inverse of the empirical distribution function.
1) Explain the ideas leading to the proposition of each estimator.
2) Give the limit in distribution of each estimator, centered at θ and normalized by sequences ai,n chosen such that we obtain non-degenerate limits in distribution.
3) Which estimator do you prefer?
EXERCISE 8.4.– Let (Xn)n≥1 be a sequence of independent and identically distributed random variables.
1) What is the distribution of 1{X1 ≤ x}? From this, deduce that of Fn(x), the empirical distribution function at a fixed x. Show that, for all x, limn→∞ Fn(x) = F(x) a.s., where F is the distribution function of X1.
2) Let us suppose F to be continuous. Let ε > 0 be such that N = 1/ε is an integer.
i) Show that there exists a sequence z0 = −∞ < z1 < … < zN−1 < zN = +∞ (depending on ε) such that F(zk) = k/N, k = 0,…, N.
ii) Show that, for every element x of [zk, zk+1], Fn(x) − F(x) ≤ Fn(zk+1) − F(zk+1) + ε and Fn(x) − F(x) ≥ Fn(zk) − F(zk) − ε.
iii) Deduce that supx |Fn(x) − F(x)| → 0 a.s.
EXERCISE 8.5.– Consider n real i.i.d. random variables X1,…, Xn, following the Cauchy distribution with density:

f(x) = 1 / (π(1 + (x − m)²)), x ∈ ℝ.
1) Take the empirical mean X̄n as an estimator of m.
i) What is the distribution of X̄n?
ii) Study the convergence of X̄n in quadratic mean, probability, and distribution.
iii) What do you think of this estimator?
2) We now arrange the data in increasing value and we write X(1) < X(2) < … < X(n) for the obtained values. To estimate m, we set:

m̂n = X(⌈n/2⌉).

i) Show that m̂n = Fn−1(1/2), where Fn is the empirical distribution function.
ii) Show that, for the considered distribution, P(X1 < m) = 1/2.
iii) Using the previous exercise, show that m̂n → m a.s. Comment on this result.
Hint: You may show that Fn−1(1/2) → F−1(1/2) = m a.s., where F is the distribution function of X1.
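A small simulation sketch of this exercise (Cauchy variates generated by inversion; the location m = 2 and the seed are arbitrary choices): the empirical median stays close to m, although the empirical mean of a Cauchy sample does not converge.

```python
import math
import random

random.seed(0)
m = 2.0   # hypothetical location parameter

# Cauchy variates by inversion: F^{-1}(u) = m + tan(pi * (u - 1/2)).
sample = [m + math.tan(math.pi * (random.random() - 0.5)) for _ in range(999)]

def empirical_median(xs):
    """Middle order statistic (n odd): a consistent estimator of m here,
    whereas the empirical mean stays Cauchy-distributed for every n."""
    return sorted(xs)[len(xs) // 2]

med = empirical_median(sample)
```
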
EXERCISE 8.6. (Non-parametric regression estimation).– Let (Xn, Yn), n ≥ 1, be a sequence of independent random variables, with values in ℝ2, of the same distribution, with the continuous, strictly positive density f(x, y). We suppose that Yn is integrable and we wish to estimate the regression of Y1 on X1, that is the function:

r(x) = E(Y1 | X1 = x), x ∈ ℝ.

r may then be written in the form:

r(x) = ∫ y f(x, y) dy / ∫ f(x, y) dy.

To estimate it from a sample of size n, we set:

rn(x) = Σi=1,…,n Yi K((x − Xi)/hn) / Σi=1,…,n K((x − Xi)/hn),

where K is a continuous, symmetric, bounded, and strictly positive density and where hn → 0 and nhn → ∞ when n → ∞.
1) We suppose that . Show that:
2) Use the results obtained in the estimation of the density to deduce that:
3) Establish the decomposition:
where x is omitted.
4) We suppose that φn is bounded. Show that:
EXERCISE 8.7.– In this exercise, a lower bound for the Fisher information associated with the translation model X = μ + ε is sought, where the unknown parameter is μ ∈ ℝ. We suppose that E(ε) = 0, E(ε2) = σ2 (known), and that ε has a density f, which is assumed to be strictly positive and continuously differentiable on ℝ.
1) Recall the expression of the Fisher information I associated with μ, which we will assume to be finite in the following.
2) Using a very simple unbiased estimator while we make use of only one observation, show that I ≥ 1/σ2. Deduce that if we use n independent observations with the same distribution as X, and if there exists an unbiased estimator that attains the Cramer–Rao bound 1/(nI), then its variance is at most σ2/n.
3) In this question, we wish to determine the densities f which reach the lower bound for I.
i) Show that ∫ xf′(x − μ)dx = −1 and that:

∫ f′(x − μ) dx = 0.

Deduce that E(εf′(ε)/f(ε)) = −1.
ii) Using the conditions for equality in the Cauchy–Schwarz inequality, show that I = 1/σ2 if and only if ε is of distribution N(0, σ2).
iii) Supposing that we make use of n i.i.d. observations with the same distribution as X, and that there exists an unbiased estimator that attains the Fisher limit 1/(nI), show that if ε is not of distribution N(0, σ2), then this estimator has a quadratic loss which is strictly less than that of the empirical mean X̄n.
EXERCISE 8.8.–Let Xi, i = 1,…, n (n ≥ 2), be i.i.d. with a distribution of density:
1) Give the joint density fn(x1,…, xn; θ) of the observations. From this, deduce the maximum likelihood estimator of θ, giving its distribution. Construct, using this estimator, an unbiased estimator, and calculate its variance.
2) Compare this with the results obtained in the previous exercise.
EXERCISE 8.9.– A type of mouse is afflicted by an illness M with a rate of 20%. We wish to know if the absorption of a certain product increases this rate. Of 100 mice having absorbed the product, 27 are afflicted by M.
1) Carry out a classical test of size α = 0.05.
2) Carry out a χ2 goodness-of-fit test with size α = 0.05.
3) Compare the obtained results.
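A possible numerical sketch of these two tests (note that the χ2 statistic here equals the square of the classical statistic, so the χ2 test behaves as a two-sided test, which explains why the conclusions can differ):

```python
import math

# Exercise 8.9 data: n = 100 mice, 27 afflicted, H0: p = 0.2, H1: p > 0.2.
n, observed, p0 = 100, 27, 0.2

# 1) Classical one-sided test based on the CLT (1.645 ~ 0.95 Gaussian quantile):
z = (observed - n * p0) / math.sqrt(n * p0 * (1 - p0))
reject_classical = z > 1.645

# 2) Chi-square goodness-of-fit test on the partition {afflicted, not afflicted}
#    (3.84 ~ 0.95 quantile of chi2(1)):
d_n = ((observed - n * p0) ** 2 / (n * p0)
       + ((n - observed) - n * (1 - p0)) ** 2 / (n * (1 - p0)))
reject_chi2 = d_n > 3.84

# Here d_n = z^2, yet the one-sided test rejects while the chi2 test does not.
```
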
EXERCISE 8.10. (Estimation by explosion).– Let (Xn, n ≥ 1) and (Yn, n ≥ 1) be two sequences of real random variables defined on the probability space (Ω, A, P) such that:

Yn = g(Xn), n ≥ 1,

where g is an unknown function, which is continuous on ℝ.
Furthermore, let K be a real, continuous, strictly positive density, which verifies lim|x|→∞ x2K(x) = 0. We set:

fn(x, y) = (1/(n hn²)) Σi=1,…,n K((x − Xi)/hn) K((y − Yi)/hn), (x, y) ∈ ℝ2,

where hn > 0 verifies limn→∞ hn = 0.
We wish to estimate g from the observations (Xi, Yi), 1 ≤ i ≤ n, using “the explosion” of fn.
1) Establish the following preliminary results:
i) K is bounded.
ii) There exist some α and β > 0 such that K(u) ≥ β for |u| ≤ α.
iii) If y ≠ g(x) and if ε ∈ ]0, 1/2|y − g(x)|[, there exists h > 0 such that:
2) Show that, for fixed x,

fn(x, y) → 0 for every y ≠ g(x)

when n → ∞. You may use (i) and (iii).
3) Supposing that g is Lipschitzian, of order k at the point x (i.e. that |g(x′) − g(x)| ≤ k|x′ − x| for every x′), establish the lower bound:
4) We further suppose that the real random variables Xn are i.i.d. and of continuous and strictly positive density f. Show that:
and that, if ,
5) An estimator gn of g is defined by setting:
Show that gn (x) is -measurable, and that
on the condition that .
1 For a complete exposition of robustness, the reader may consult [HUB 09].
2 ±1.
3 If it exists.