Handbook of Big Data
First, the algorithm is trained $J$ times, using only data in $T_j$ each time. Then, for each $j$, the performance of $\hat{\eta}_{T_j}$ is evaluated on all the units $i$ in the left-out portion of the sample, $V_j$. Lastly, the performance measures $L(Z_i, \hat{\eta}_{T_j})$ are averaged across units in $V_j$, and across sample splits $j$, giving $\hat{R}(\hat{\eta})$.
Intuitively, $\hat{R}(\hat{\eta})$ is a good measure of the risk of $\hat{\eta}$ because, unlike the empirical risk, it incorporates the randomness in $\hat{\eta}$ by considering a sample $\{\hat{\eta}_{T_j} : j = 1, \dots, J\}$ instead of a fixed $\hat{\eta}$ trained on the whole sample. In addition, the performance of the estimator is assessed on data outside the training set, therefore providing an honest assessment of predictive power.
13.3.1 Cross-Validation with Correlated Units
In various applications, the data may not be assumed independently distributed. Such is the case, for example, in longitudinal studies in the medical sciences in which the objective is to predict the health status of a patient at a given time point, conditional on their health status and covariates at the previous time point. Specifically, consider an observation unit (e.g., a patient) $i$, with $t = 1, \dots, m_i$ measurements for each unit (e.g., index $t$ may represent geographical locations, measurement times, etc.). Denote by $X_i = (X_{i1}, X_{i2}, \dots, X_{im_i})$ the covariate vector and by $Y_i = (Y_{i1}, Y_{i2}, \dots, Y_{im_i})$ the outcome vector for unit $i$. Each $X_{it}$ may be a covariate vector itself, containing observations recorded at time $t$ but also at previous times, for example at the most recent past $t - 1$. In this case, the correct assumption is that $Z_i = (X_i, Y_i) : i = 1, \dots, n$ are an i.i.d. sample of a random variable $Z \sim P_0$. The predictive function, given by $\eta_0(x, t) = E(Y_t \mid X_t = x)$, may be estimated using the same type of predictive algorithms discussed above, adding to the explanatory variables a time variable containing the index $t$.
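The data restructuring described above can be sketched as follows. The toy data and variable names are illustrative assumptions; any standard learner could then be fit to the resulting design matrix.

```python
# Sketch: turning longitudinal data (X_it, Y_it) into a flat regression
# dataset whose rows carry the time index t as an extra explanatory
# variable, so that eta_0(x, t) = E(Y_t | X_t = x) can be fit with any
# standard learner. Toy data; names are illustrative.
import numpy as np

# unit i -> list of m_i covariate vectors X_it and outcomes Y_it
X_long = {0: [[0.1], [0.2], [0.3]], 1: [[1.0], [1.1]]}
Y_long = {0: [1.0, 1.2, 1.4], 1: [2.0, 2.1]}

rows, targets, unit_ids = [], [], []
for i in X_long:
    for t, (x_it, y_it) in enumerate(zip(X_long[i], Y_long[i]), start=1):
        rows.append(x_it + [t])   # append the index t to the covariates
        targets.append(y_it)
        unit_ids.append(i)        # kept for unit-level cross-validation

X_design = np.array(rows)  # columns: original covariates, then t
Y_vec = np.array(targets)
```

Keeping the `unit_ids` column is what makes unit-level cross-validation, discussed next, possible after the data have been flattened.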
However, note that these data are independent across the index $i$, but not across the index $t$. As a result, the optimality properties of cross-validation presented in Section 13.6 will only hold if cross-validation is performed on the index set $\{i : i = 1, \dots, n\}$. This is an important clarification to prevent cross-validation users from naïvely cross-validating on the index set $\{(i, t) : i = 1, \dots, n;\ t = 1, \dots, m_i\}$; that is, once a unit $i$ is chosen to belong to a validation dataset $V_j$, all its observations $(i, t) : t = 1, \dots, m_i$ must also be in $V_j$.
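Unit-level splitting can be sketched as follows: the folds partition the units, and every row $(i, t)$ follows its unit $i$ into that unit's fold. The data below are illustrative (in practice, scikit-learn's `GroupKFold` implements the same idea).

```python
# Sketch of cross-validation on the unit index i rather than on (i, t):
# units are partitioned into J folds, and all rows of a unit go to that
# unit's fold, so each V_j contains whole units only.
import numpy as np

rng = np.random.default_rng(1)
unit_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])  # row -> unit i
J = 2

units = rng.permutation(np.unique(unit_ids))
unit_folds = np.array_split(units, J)  # partition of {1,...,n}, not of rows

for fold_units in unit_folds:
    in_V = np.isin(unit_ids, fold_units)          # rows (i,t) with i in fold
    V_j, T_j = np.where(in_V)[0], np.where(~in_V)[0]
    # no unit contributes rows to both V_j and T_j:
    assert not set(unit_ids[V_j]) & set(unit_ids[T_j])
```

Splitting on the rows $(i, t)$ directly would place correlated observations from the same unit in both $T_j$ and $V_j$, biasing the risk estimate optimistically.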
13.4 Discrete Model Selection
An important step of the model selection approach outlined in the introduction of the chapter involves proposing a collection of estimation algorithms $\mathcal{L} = \{\hat{\eta}_k : k = 1, \dots, K_n\}$.
Some of these algorithms may be based on a subject-matter expert’s knowledge (e.g.,
previous studies suggesting particular functional forms, knowledge of the physical nature of
a phenomenon, etc.), some may be flexible data-adaptive methods (of which the statistical
and machine learning literature have developed a large variety involving varying degrees
of flexibility and computational complexity), and some other algorithms may represent
a researcher’s favorite prediction tool, or simply standard practice. For example, in a
regression problem, it may be known to a subject-matter expert that the relation between a
given predictor and the outcome is logarithmic. In that case, a parametric model including
a logarithmic term may be included in the library.
Our aim is to construct a candidate selector based on predictive power. This aim may be achieved by defining the estimator as the candidate in the library with the smallest prediction error estimate $\hat{R}(\hat{\eta}_k)$. That is, the selector is defined as
$$\hat{k} = \operatorname*{arg\,min}_k \hat{R}(\hat{\eta}_k) \qquad (13.3)$$