First, the algorithm is trained $J$ times, using only data in $T_j$ each time. Then, for each $j$, the performance of $\hat{\eta}_{T_j}$ is evaluated on all the units $i$ in the left-out portion of the sample, $V_j$. Lastly, the performance measures $L(Z_i, \hat{\eta}_{T_j})$ are averaged across units in $V_j$, and across sample splits $j$, giving $\hat{R}(\hat{\eta})$.
Intuitively, $\hat{R}(\hat{\eta})$ is a good measure of the risk of $\hat{\eta}$ because, unlike the empirical risk, it incorporates the randomness in $\hat{\eta}$ by considering a sample $\hat{\eta}_{T_j} : j = 1, \ldots, J$, instead of a fixed $\hat{\eta}$ trained on the full sample. In addition, the performance of the estimator is assessed on data outside the training set, therefore providing an honest assessment of predictive power.
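To make the procedure concrete, a minimal R sketch of this computation follows; the names cv_risk, train_fn, and dat are ours (not from the chapter), squared-error loss is assumed, and train_fn is any fitting function whose result has a predict method.

    # J-fold cross-validated risk, assuming a data frame `dat` with outcome
    # column y and a fitting function `train_fn` returning a predict()-able fit.
    cv_risk <- function(dat, train_fn, J = 10) {
      fold <- sample(rep(seq_len(J), length.out = nrow(dat)))   # random splits
      risks <- sapply(seq_len(J), function(j) {
        fit  <- train_fn(dat[fold != j, ])                 # train on T_j
        pred <- predict(fit, newdata = dat[fold == j, ])   # predict on V_j
        mean((dat$y[fold == j] - pred)^2)                  # average loss over V_j
      })
      mean(risks)                                          # average across splits j
    }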
13.3.1 Cross-Validation with Correlated Units
In various applications, the data may not be assumed independently distributed. Such is the case, for example, in longitudinal studies in the medical sciences in which the objective is to predict the health status of a patient at a given time point, conditional on his health status and covariates at the previous time point. Specifically, consider an observation unit (e.g., a patient) $i$, with $t = 1, \ldots, m_i$ measurements for each unit (e.g., index $t$ may represent geographical locations, measurement times, etc.). Denote by $X_i = (X_{i1}, X_{i2}, \ldots, X_{im_i})$ the covariate vector and by $Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{im_i})$ the outcome vector for unit $i$. Each $X_{it}$ may be a covariate vector itself, containing observations recorded at time $t$ but also at previous times, for example at the most recent past $t - 1$. In this case, the correct assumption is that $Z_i = (X_i, Y_i) : i = 1, \ldots, n$ are an i.i.d. sample of a random variable $Z \sim P_0$. The predictive function, given by $\eta_0(x, t) = E(Y_t \mid X_t = x)$, may be estimated using the same type of predictive algorithms discussed above, adding to the explanatory variables a time variable containing the index $t$.
However, note that these data are independent across the index $i$, but not across the index $t$. As a result, the optimality properties of cross-validation presented in Section 13.6 will only hold if cross-validation is performed on the index set $\{i : i = 1, \ldots, n\}$. This is an important clarification to prevent cross-validation users from naïvely cross-validating on the index set $\{(i, t) : i = 1, \ldots, n;\ t = 1, \ldots, m_i\}$. Instead, once a unit $i$ is chosen to belong to a validation dataset $V_j$, all its observations $(i, t) : t = 1, \ldots, m_i$ must also be in $V_j$.
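A minimal sketch of this unit-level fold assignment follows (the object names are ours); the SuperLearner package in R provides the same behavior through its id argument, which keeps all rows sharing an identifier in the same validation fold.

    # Assign folds at the unit level so that all measurements (i, t) of a
    # unit i land in the same validation fold.
    set.seed(1)
    n <- 100                                   # independent units i
    m <- sample(2:5, n, replace = TRUE)        # m_i measurements per unit
    dat <- data.frame(id = rep(seq_len(n), times = m))    # one row per (i, t)
    J <- 10
    unit_fold <- sample(rep(seq_len(J), length.out = n))  # one fold label per unit
    dat$fold <- unit_fold[dat$id]              # rows of unit i inherit i's fold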
13.4 Discrete Model Selection
An important step of the model selection approach outlined in the introduction of the chapter involves proposing a collection of estimation algorithms $L = \{\hat{\eta}_k : k = 1, \ldots, K_n\}$.
Some of these algorithms may be based on a subject-matter expert’s knowledge (e.g.,
previous studies suggesting particular functional forms, knowledge of the physical nature of
a phenomenon, etc.), some may be flexible data-adaptive methods (of which the statistical
and machine learning literature have developed a large variety involving varying degrees
of flexibility and computational complexity), and some other algorithms may represent
a researcher’s favorite prediction tool, or simply standard practice. For example, in a
regression problem, it may be known to a subject-matter expert that the relation between a
given predictor and the outcome is logarithmic. In that case, a parametric model including
a logarithmic term may be included in the library.
Our aim is to construct a candidate selector based on predictive power. This aim may be achieved by defining the estimator as the candidate in the library with the smallest prediction error estimate $\hat{R}(\hat{\eta}_k)$. That is, the selector is defined as

$$\hat{k} = \operatorname*{arg\,min}_{k} \hat{R}(\hat{\eta}_k) \qquad (13.3)$$
which is referred to as the discrete super learner in [5]. Its corresponding predictive function is given by $\hat{\eta}_{\mathrm{DSL}} = \hat{\eta}_{\hat{k}}$, that is, the predictive function of the algorithm with the smallest prediction risk.
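In code, the selection step is trivial; the sketch below assumes a named vector cv_risk of cross-validated risk estimates and a list candidate_fits of trained candidates (both names are ours).

    # Discrete super learner: pick the candidate with the smallest
    # cross-validated risk (Equation 13.3).
    k_hat   <- which.min(cv_risk)
    eta_dsl <- candidate_fits[[k_hat]]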
For example, consider the problem described in Section 13.2, for which we propose the following library of candidate estimators (we give only brief and intuitive descriptions of each algorithm; complete discussions may be found in most modern statistical learning textbooks).
Logistic regression with main terms (generalized linear model [GLM]). This prediction method represents standard practice in fields such as epidemiology. The main idea is to estimate the coefficients $\beta_j$ in the parametrization

$$\operatorname{logit} \eta_0(x) = \beta_0 + \sum_{j=1}^{41} \beta_j x_j$$

where $\operatorname{logit}(p) = \log(p/(1-p))$.
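A one-line sketch, assuming a data frame chem with binary outcome y and the 41 predictors as its remaining columns (our placeholder objects, not the chapter's code):

    # Main-terms logistic regression on all predictors.
    fit_glm <- glm(y ~ ., family = binomial(), data = chem)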
Generalized additive model (GAM). The main idea of a GAM is to extend the GLM by considering a model where the predictors enter the equation nonlinearly. That is, consider the following model:

$$\operatorname{logit} \eta_0(x) = \beta_0 + \sum_{j=1}^{41} \beta_j f_j(x_j)$$

where the functions $f_j$ are assumed smooth but otherwise unspecified and are typically estimated using smoothing techniques such as cubic splines or local polynomial regression.
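A sketch using the mgcv package, one of several GAM implementations (the smooth terms shown are illustrative; x1, x2, x3 stand for predictor columns of the placeholder data frame chem):

    # GAM with unspecified smooth functions f_j estimated by penalized splines.
    library(mgcv)
    fit_gam <- gam(y ~ s(x1) + s(x2) + s(x3), family = binomial(), data = chem)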
SVM (support vector machine). In its simplest (linear) form, an SVM is a classifier that separates the $X$ space using a hyperplane, such that the points in each of the two categories $\{0, 1\}$ of $Y$ are maximally separated. The intuition behind the prediction method is best explained with a figure; a toy example for two-dimensional $X$ and binary $Y$ with perfect separation is presented in Figure 13.2.
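A sketch using the e1071 package (our choice of implementation):

    # Linear SVM classifier; probability = TRUE requests class probabilities.
    library(e1071)
    fit_svm <- svm(factor(y) ~ ., data = chem, kernel = "linear",
                   probability = TRUE)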
FIGURE 13.2
SVM illustration ($x_1$ versus $x_2$). The shape of the points defines the categories; the solid and dotted lines represent the SVM maximum separation margin. The dashed line provides suboptimal separation.

RF (random forest). This is a prediction algorithm that runs many classification trees. To obtain the probability $\hat{\eta}(x)$, the vector $x$ is run down each of the trees in the forest, and the classification results are averaged. Each tree is grown by running a classification tree on $q \ll p$ randomly selected variables on a bootstrap sample from the original data. Each tree is grown to the largest extent possible.
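A sketch using the randomForest package:

    # Random forest of classification trees (500 trees is the package default).
    library(randomForest)
    fit_rf <- randomForest(factor(y) ~ ., data = chem, ntree = 500)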
Multivariate adaptive regression splines (MARS). The predictor is built as

$$\eta_0(x) = \beta_0 + \sum_{s=1}^{S} \beta_s B_s(x)$$

where each $B_s$ is a basis function and $S$ is a tuning parameter. The basis functions take the form of hinge functions $\max(0, x_j - c_j)$ or $\max(0, c_j - x_j)$, where $c_j$ is a tuning parameter called a knot. Basis functions may also take the form of the product of two or more hinge functions. The MARS predictor is built in a forward–backward fashion, by adding and deleting terms to minimize the sum of squared errors.
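A sketch using the earth package, which implements MARS; degree = 2 permits products of two hinge functions:

    # MARS with a logistic link for the binary outcome.
    library(earth)
    fit_mars <- earth(y ~ ., data = chem, degree = 2,
                      glm = list(family = binomial))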
Mean. For comparison, we add the most naïve prediction algorithm: $\hat{\eta}(x) = \bar{Y}$.
The cross-validated risk $\hat{R}(\hat{\eta}_k)$ of each of these algorithms applied to our illustrative example is presented in Table 13.1, and may be computed using the SuperLearner package in R (code available in Appendix A), which uses the mean squared-error loss function. These estimated risks indicate that the best predictor for this dataset, among the candidates studied, is the RF.
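A condensed sketch of that computation (the complete code is in Appendix A; Y and X stand in for the outcome vector and predictor data frame of the chemical data):

    # Fit the whole candidate library and obtain cross-validated risks.
    library(SuperLearner)
    sl_library <- c("SL.glm", "SL.gam", "SL.svm", "SL.randomForest",
                    "SL.earth", "SL.mean")    # SL.earth is the MARS wrapper
    fit <- SuperLearner(Y = Y, X = X, family = binomial(),
                        SL.library = sl_library, cvControl = list(V = 10))
    fit$cvRisk    # one risk estimate per candidate algorithm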
TABLE 13.1
Cross-validated risk for each algorithm applied to classification of chemicals.

       GLM     GAM     SVM     RF      MARS    Mean
Risk   0.1003  0.0985  0.1021  0.0933  0.1154  0.224

FIGURE 13.3
ROC curve and AUC for each prediction method, plotting true positive rate against false positive rate. (a) Complete sample prediction (AUC: GLM 0.945, GAM 0.955, SVM 0.932, RF 1, MARS 0.97, Mean 0.5). (b) Cross-validated prediction (AUC: GLM 0.924, GAM 0.924, SVM 0.919, RF 0.934, MARS 0.894, Mean 0.409).

Figure 13.3 shows the receiver operating characteristic (ROC) curves computed with cross-validated and non-cross-validated predictions. The area under the curve (AUC) is also
displayed as a performance measure: an area of 1 indicates perfect prediction, whereas an
area of 0.5 indicates that the predictor is not better than a coin toss. This graph illustrates
the importance of using cross-validation. Consider first the ROC curves in Figure 13.3a.
In this graph, the RF seems to perfectly fit the data, raising suspicion of overfitting and poor generalization power. Meanwhile, MARS seems to be the best-performing
algorithm. Figure 13.3b shows the ROC computed with cross-validated predictions. Note
that, according to Figure 13.3b, the conclusions obtained from Figure 13.3a are wrong. The
RF has the best generalization power, whereas the generalization power of the MARS is the
poorest among all the algorithms considered.
13.5 Model Stacking and Super Learning
In the previous section, we described an algorithm that may be used to select a unique estimator from a discrete library of candidate algorithms. In this section, we improve on discrete model selection by using an ensemble model that combines the predictions of all the models in the discrete candidate library. The main idea behind the super learner (or, more generally, a stacked predictor) is to use a parametrization $\hat{\eta}_\alpha$ to combine the predictors $\hat{\eta}_k$. Examples of this parametrization include

$$\hat{\eta}_\alpha(z) = \sum_{k=1}^{K} \alpha_k \hat{\eta}_k(z)$$

and

$$\hat{\eta}_\alpha(z) = \operatorname{expit}\left( \sum_{k=1}^{K} \alpha_k \operatorname{logit}(\hat{\eta}_k(z)) \right)$$

where expit is the inverse of the logit function. The first parametrization is often referred to as a linear model, whereas the second one is referred to as a logistic model, and is often used for binary regression. In the linear model, one may decide to restrict the values $\alpha_k$ to the set of positive weights summing up to one. This constraint on the $\alpha$ parameters would be particularly relevant for estimation of a conditional density function, because it would guarantee that the final estimate is a proper density function.
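As a sketch of the two parametrizations (preds, a matrix with one column of predictions per candidate, and the weight vector alpha are our notation):

    # Linear and logistic combinations of the candidate predictions.
    logit <- function(p) log(p / (1 - p))
    expit <- function(x) 1 / (1 + exp(-x))
    eta_linear   <- as.vector(preds %*% alpha)
    eta_logistic <- expit(as.vector(logit(preds) %*% alpha))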
For simplicity, we focus on parametric models for the ensemble model. In principle,
however, any prediction method could be used to combine the predictors in the discrete
library.
In this formulation, the optimal selection problem may be equivalently framed as selection of the optimal weights $\alpha$ in a library of candidate estimators given by $L = \{\hat{\eta}_\alpha(z) : \alpha\}$. It can be seen that the library used for the discrete super learner of the previous section is a subset of this library in which $\alpha_k \in \{0, 1\}$. Selection of a candidate estimator in this library is carried out in a similar fashion to discrete model selection, but the optimization is performed over an infinite set as opposed to a finite one. The super learning predictive function is thus defined as $\hat{\eta}_{\mathrm{SL}} = \hat{\eta}_{\hat{\alpha}}$, where the selector equals

$$\hat{\alpha} = \operatorname*{arg\,min}_{\alpha} \hat{R}(\hat{\eta}_\alpha) \qquad (13.4)$$
The optimization in Equation 13.4 is often a convex optimization problem that can be readily solved using standard software. In particular, computing $\hat{\alpha}$ involves only one additional step compared to the discrete selection algorithm of the previous section. As an example, consider the regression problem with a continuous outcome and note that

$$L(Z_i, \hat{\eta}_{\alpha, T_j}) = \left( Y_i - \sum_{k=1}^{K_n} \alpha_k \, \hat{\eta}_{k, T_j}(X_i) \right)^2 = \left( Y_i - \sum_{k=1}^{K_n} \alpha_k H_{k,i} \right)^2$$

where we have introduced a set of variables $H_k : k = 1, \ldots, K_n$, computed, for unit $i$, as $H_{k,i} = \hat{\eta}_{k, T_j}(X_i)$, with $j$ the sample split for which $i \in V_j$. The cross-validated risk is equal to

$$\hat{R}(\hat{\eta}_\alpha) = \frac{1}{J} \sum_{j=1}^{J} \frac{1}{|V_j|} \sum_{i \in V_j} L(Z_i, \hat{\eta}_{\alpha, T_j}) = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \sum_{k=1}^{K_n} \alpha_k H_{k,i} \right)^2$$
Thus, computation of $\hat{\alpha}$ in Equation 13.4 is reduced to computation of the ordinary least-squares estimator of a regression of $Y$ on the predictors $H_k : k = 1, \ldots, K_n$. Note that the variables $H_k$ must also be computed to evaluate the discrete super learner of the previous section. As a result, computation of the super learner involves running only one additional regression, compared to the discrete super learner. This optimization problem under a constraint on the parameter space, such as restricting the values $\alpha_k$ to the set of positive weights summing up to one, is implemented in the SuperLearner package in R. In our chemical classification example, the weights of each algorithm in the super learner are given in Table 13.2.
Note that the RF gets more than half the weight in the predictor, in agreement with
the results from the discrete super learner of the previous section.
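A sketch of this constrained weight computation follows; the matrix H of cross-validated predictions uses the notation above, and the use of the nnls package mirrors (but is not copied from) the default method of the SuperLearner package.

    # Minimize ||Y - H alpha||^2 subject to alpha >= 0, then normalize the
    # weights so that they sum to one.
    library(nnls)
    sl_weights <- function(H, Y) {
      fit   <- nnls(as.matrix(H), Y)
      alpha <- fit$x
      alpha / sum(alpha)
    }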
13.5.1 Assessing the Prediction Error of a Stacked Predictor
As with any prediction method, the considerations presented in Section 13.3 about overfitting apply to a stacked predictor such as the super learner. Consequently, cross-validation
provides the correct tool to assess the performance of the super learner. The cross-validated
prediction error is thus given by

$$\hat{R}(\hat{\eta}_{\mathrm{SL}}) = \frac{1}{J} \sum_{j=1}^{J} \frac{1}{|V_j|} \sum_{i \in V_j} L(Z_i, \hat{\eta}_{\mathrm{SL}, T_j})$$

where $\hat{\eta}_{\mathrm{SL}, T_j}$ is the super learner trained using only data in $T_j$. Note that training $\hat{\eta}_{\mathrm{SL}, T_j}$ involves further splitting of $T_j$ into validation and training subsamples, which results in larger computation times for the risk of the super learner. In our chemical classification data, we obtain the results presented in Table 13.3, using the function CV.SuperLearner from the package SuperLearner (code in Appendix A).
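A condensed sketch of that call (Y, X, and sl_library are the placeholder objects from the earlier sketch):

    # Outer cross-validation layer around the super learner itself.
    cv_fit <- CV.SuperLearner(Y = Y, X = X, family = binomial(),
                              SL.library = sl_library,
                              cvControl = list(V = 10))   # outer splits
    summary(cv_fit)   # cross-validated risks of the super learner and candidates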
TABLE 13.2
Weights of each prediction algorithm in the chemical classification data.

         GLM     GAM     SVM     RF      MARS    Mean
Weight   0.0000  0.2641  0.0109  0.5514  0.1736  0.0000