Chapter 1

Introduction to Mathematical Statistics

1.1. Generalities

We may define statistics as the set of methods that allow us, from the observation of a random phenomenon, to obtain information about the probability distribution associated with this phenomenon.

It is important to note that the random character that we attribute to the considered phenomenon is often merely a way to translate our ignorance of the laws that govern it. Hence a preliminary study, taking into account only the observations made, proves interesting; this is the aim of data analysis.

Data analysis is the set of techniques of statistical description whose purpose is the treatment of data without any probabilistic hypotheses. It aims, in particular, to reveal the dominant parameters among those upon which the observation depends.

Descriptive statistics also treats observations without formulating any prior hypotheses, but we may consider these hypotheses to be underlying, since it essentially consists of studying the empirical probabilistic characteristics of observations (histograms, empirical moments, etc.).

However, we may include data analysis in descriptive statistics by defining it as the set of methods that allow the ordering of data and their presentation in a usable form.

Furthermore, simulation is an additional technique often used in statistics: it consists of carrying out fictitious experiments in such a way as to make visible the expressions of chance in the evolution of a phenomenon. A simple, important simulation problem is that of the generation of random numbers, that is, the generation of a sequence of numbers x1, …, xn, which may be considered as the realizations of random variables X1, …, Xn, the distribution of (X1, …, Xn) being given.
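As a small illustration of this simulation problem, here is a minimal sketch in Python of the inverse transform method: uniform random numbers on ]0, 1[ are mapped through the inverse of a distribution function to produce realizations x1, …, xn with a prescribed distribution. The choice of the exponential distribution and the function name are purely illustrative assumptions, not part of the text.

```python
import math
import random

def inverse_transform_exponential(lam, n):
    """Generate n realizations of an exponential(lam) random variable by
    applying the inverse distribution function F^{-1}(u) = -ln(1 - u)/lam
    to uniform random numbers on ]0, 1[."""
    return [-math.log(1.0 - random.random()) / lam for _ in range(n)]

# x1, ..., xn may be viewed as realizations of i.i.d. random variables
# X1, ..., Xn with the prescribed (here exponential) distribution.
sample = inverse_transform_exponential(lam=2.0, n=5)
print(sample)
```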

The techniques indicated above fall essentially within applied statistics and we will only discuss them sporadically in this book. We will, in general, focus on the framework of mathematical statistics, that is, theoretical statistics based on measure theory and, in part, on decision theory.

1.2. Examples of statistics problems

We will first give some examples of statistics problems, emphasizing the construction of the associated probability model.

1.2.1. Quality control

A manufacturer receives a batch of objects containing an unknown proportion of defective objects. Supposing the number of objects in the batch to be large, an audit may be carried out only by taking a sample of objects from the batch. Given the number of defective objects in the sample, the manufacturer will accept or reject the batch. We may associate several probability models with this problem:

1) Let E be the batch of objects, Ω the set of subsets of E with r elements endowed with the uniform distribution, and X the random variable “number of defective objects among the r objects drawn”. We know that X follows the hypergeometric distribution $\mathcal{H}(n, n_1, r)$, where n = card E and n1 is the number of defective objects:

$$P(X = k) = \frac{\binom{n_1}{k}\binom{n - n_1}{r - k}}{\binom{n}{r}}, \qquad \max(0,\, r - (n - n_1)) \le k \le \min(r, n_1).$$

2) If n and n1 are large relative to r, we may use the binomial approximation and suppose that X follows the binomial distribution $\mathcal{B}(r, p)$. This comes from the fact that, when n → ∞ and n1/n → p > 0,

$$\frac{\binom{n_1}{k}\binom{n - n_1}{r - k}}{\binom{n}{r}} \longrightarrow \binom{r}{k}\, p^k (1 - p)^{r - k}, \qquad 0 \le k \le r.$$

3) If r is large and p is small, we may suppose that X follows the Poisson distribution $\mathcal{P}(\lambda)$ where λ = rp. This is because

$$\binom{r}{k}\, p^k (1 - p)^{r - k} \longrightarrow e^{-\lambda}\, \frac{\lambda^k}{k!}, \qquad k = 0, 1, 2, \dots$$

when r → ∞ and rp → λ > 0.
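The three models above can be compared numerically. The following sketch assumes the availability of the scipy.stats library and uses purely illustrative values of n, n1, and r; it evaluates the exact hypergeometric probabilities together with their binomial and Poisson approximations.

```python
from scipy import stats

# Batch of n objects with n1 defective; a sample of r objects is drawn.
n, n1, r = 10_000, 300, 50
p = n1 / n            # proportion of defective objects
lam = r * p           # Poisson parameter lambda = rp

for k in range(6):
    hyper = stats.hypergeom.pmf(k, n, n1, r)   # exact hypergeometric probability
    binom = stats.binom.pmf(k, r, p)           # binomial approximation (n large)
    poiss = stats.poisson.pmf(k, lam)          # Poisson approximation (r large, p small)
    print(f"k={k}: hypergeometric={hyper:.4f}  binomial={binom:.4f}  Poisson={poiss:.4f}")
```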

Since n1 is unknown, so too are the parameters of the above distributions. We must therefore consider the triplet $\left(\{0, 1, \dots, r\},\ \mathcal{P}(\{0, 1, \dots, r\}),\ \mathcal{H}\right)$, where $\mathcal{H}$ designates the set of hypergeometric distributions with parameters (n, n1, r), where n and r are fixed, and $n_1 \in \{0, 1, \dots, n\}$.

We set p = n1/n and fix a proportion p0 of defective objects beyond which the batch will be refused. It is therefore necessary to determine, in light of the r objects drawn, whether p > p0 or p ≤ p0, which will allow us to accept or reject the batch. This is a testing problem (we “test” the quality of the batch).

The choice of a decision criterion is then based on the fact that we may commit two types of error: accepting a bad batch or rejecting a good one. We therefore aim to make the probabilities of these two errors as small as possible.
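To make these two errors concrete, here is a hedged sketch under the binomial approximation: the batch is accepted when at most c defective objects appear in the sample. The acceptance number c and the proportions used below are illustrative assumptions, not values from the text.

```python
from scipy import stats

r, p0, c = 50, 0.05, 4   # sample size, quality threshold, acceptance number (illustrative)

# Decision rule: accept the batch if at most c defective objects appear in the sample.
def prob_accept(p):
    """Probability of accepting the batch when the true proportion of
    defectives is p, under the binomial approximation B(r, p)."""
    return stats.binom.cdf(c, r, p)

# First type of error: rejecting a good batch (p <= p0).
print("P(reject | p = 0.03) =", 1 - prob_accept(0.03))
# Second type of error: accepting a bad batch (p > p0).
print("P(accept | p = 0.10) =", prob_accept(0.10))
```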

1.2.2. Measurement errors

A physicist measures a real value a certain number of times. The values found are not exact due to measurement errors. The problem is then to decide which value to accept for the measured quantity.

To construct the associated probability model, we generally make the following hypothesis: the measurement errors have extremely varied causes (lack of precision or unreliable instruments, misreading by the experimenter, etc.); we may admit as a first approximation that these “causes” are independent of each other: the error is therefore the sum of a large number of small, independent errors.

The central limit theorem allows us to assert that this error follows (approximately) a normal distribution. Moreover, we may often, for reasons of symmetry, suppose that the measurements carried out have an expectation (mean value) equal to the quantity considered.

We may therefore associate with the n independent observations of this quantity the triplet $\left(\mathbb{R}^n,\ \mathcal{B}_{\mathbb{R}^n},\ \left\{\mathcal{N}(m, \sigma^2)^{\otimes n} : m \in I,\ \sigma \in J\right\}\right)$, where I and J are intervals of $\mathbb{R}$ and $\mathbb{R}_+^*$, respectively. We must therefore determine m in as precise a way as possible: this is a problem of estimation.
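As a small illustration of this estimation problem, the following sketch simulates n measurements affected by normal errors and estimates m by the empirical mean; the true value, the standard deviation of the errors, and the sample size are illustrative assumptions.

```python
import random
import statistics

# Hypothetical true value, error standard deviation, and number of measurements.
m_true, sigma, n = 9.81, 0.05, 100

# Each measurement is the true value plus a normally distributed error,
# as suggested by the central limit theorem.
measurements = [random.gauss(m_true, sigma) for _ in range(n)]

# A natural estimator of m is the empirical mean of the observations.
m_hat = statistics.mean(measurements)
print(f"estimate of m: {m_hat:.4f}  (true value {m_true})")
```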

1.2.3. Filtering

An economist observes the evolution of the price of a certain product in the time interval [t1, t2]; he seeks to predict the price of this product at the time t3 (>t2).

This random phenomenon may be modeled in the following way: we consider a family (ξt, t ≥ t1) of real random variables, where ξt represents the price of the product at time t. It is therefore a question of predicting, in light of the realizations of the random variables ξt, t1 ≤ t ≤ t2, the best possible value of ξt3.

If the distributions of the random variables ξt, or the correlations that exist between them, are not fully known, this problem of prediction falls within the domain of statistics.

The problem of interpolation is of an analogous nature: it concerns the determination of the best possible value of ξt0 given the ξt, t ∈ [t1, t2] ∪ [t3, t4], with t2 < t0 < t3.

Prediction and interpolation are two particular cases of general filtering problems, or in other words problems of the estimation of an unobserved random variable Y from an observed random variable X1.
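As a small illustration of estimating an unobserved Y from an observed X, the sketch below assumes that (X, Y) is a Gaussian pair with known moments (the numerical values are purely illustrative); in that case the best mean-square predictor of Y given X is an affine function of the observation.

```python
def best_linear_predictor(mu_x, mu_y, var_x, cov_xy):
    """Return the map x -> E[Y] + (Cov(X, Y) / Var(X)) * (x - E[X]),
    which is the best mean-square predictor of Y from X when (X, Y) is
    Gaussian (and the best linear predictor in general)."""
    slope = cov_xy / var_x
    return lambda x: mu_y + slope * (x - mu_x)

# Illustrative moments: price observed at t2 (X) and price to predict at t3 (Y).
predict = best_linear_predictor(mu_x=100.0, mu_y=102.0, var_x=4.0, cov_xy=3.0)
print(predict(103.0))   # predicted price at t3 given an observed price of 103 at t2
```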

1.2.4. Confidence intervals

We consider a random experiment with two outcomes (written 0 and 1) that we independently repeat n times. We seek to estimate the distribution of this experiment (i.e. the associated probability Pp on {0, 1}); for this, it is sufficient to estimate p = Pp({1}).

The associated model is written $\left(\{0, 1\}^n,\ \mathcal{P}(\{0, 1\}^n),\ \left\{P_p^{\otimes n} : p \in\, ]0, 1[\right\}\right)$, and a natural way to estimate p is to use Nn(ω)/n, where Nn(ω) designates the number of 1s contained in ω = (i1, …, in), each ij being equal to 0 or 1.

To determine the precision of this estimate, we may evaluate

$$P_p^{\otimes n}\!\left(\left|\frac{N_n}{n} - p\right| \le \varepsilon\right) = 1 - \alpha,$$

and we then say that, with a confidence of 1 − α, Nn/n is an estimator of p to within ε, or equivalently that p belongs to the confidence interval [Nn/n − ε, Nn/n + ε] with a confidence level 1 − α.

For the calculation of α, we may, when n is large, use the normal approximation of the binomial distribution $\mathcal{B}(n, p)$ followed by Nn, for which we write

$$P_p^{\otimes n}\!\left(\left|\frac{N_n}{n} - p\right| \le \varepsilon\right) \simeq \frac{2}{\sqrt{2\pi}} \int_0^{\varepsilon\sqrt{n/(p(1-p))}} e^{-t^2/2}\, dt,$$

an approximation valid for p ∈ ]0, 1[.
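As an illustration, the following sketch computes Nn/n and the corresponding interval using the normal approximation; the data, the confidence level 1 − α = 0.95 with its standard normal quantile 1.96, and the replacement of p(1 − p) by its empirical counterpart are illustrative assumptions.

```python
import math

# Observed sequence of 0s and 1s (illustrative data).
omega = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 20
n = len(omega)
p_hat = sum(omega) / n          # N_n / n

# Normal approximation: |N_n/n - p| <= epsilon with confidence 1 - alpha,
# where epsilon = z * sqrt(p(1-p)/n); p(1-p) is estimated by p_hat(1 - p_hat).
z = 1.96                        # standard normal quantile for 1 - alpha = 0.95
eps = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"N_n/n = {p_hat:.3f}, 95% confidence interval: "
      f"[{p_hat - eps:.3f}, {p_hat + eps:.3f}]")
```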

1.2.5. Homogeneity testing

A doctor wants to test a medication; for this, he chooses a first group of patients to whom the medication is administered and a second group of patients who receive a placebo.

Let Xi be the random variable associated with the ith patient of the first group, which conveys the result obtained: cure, improvement, aggravation, stationary state, etc. We define in a similar way the variable Yj associated with the jth patient of the second group.

The testing problem may then be formulated in the following way: let P1 be the distribution of the Xi and P2 be the distribution of the Yj; do we have P1 = P2? This is a homogeneity test.
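A classical way to carry out such a homogeneity test is a chi-square test on the table of observed outcomes in the two groups. The sketch below assumes the availability of the scipy.stats library and uses purely illustrative counts.

```python
from scipy import stats

# Counts of outcomes (cure, improvement, aggravation, stationary) in each group;
# the numbers are purely illustrative.
medication = [30, 25, 5, 10]
placebo    = [18, 22, 12, 18]

# Chi-square homogeneity test of H0: P1 = P2 against H1: P1 != P2.
chi2, p_value, dof, expected = stats.chi2_contingency([medication, placebo])
print(f"chi2 = {chi2:.2f}, degrees of freedom = {dof}, p-value = {p_value:.4f}")
```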


1 A general solution to filtering problems in the case where the distribution of the pair (X, Y) is known is given in Chapter 3.
