First, the algorithm is trained $J$ times, using only data in $T_j$ each time. Then, for each $j$, the performance of $\hat{\eta}_{T_j}$ is evaluated on all the units $i$ in the left-out portion of the sample, $V_j$. Lastly, the performance measures $L(Z_i, \hat{\eta}_{T_j})$ are averaged across units in $V_j$, and across sample splits $j$, giving $\hat{R}(\hat{\eta})$.
Intuitively, $\hat{R}(\hat{\eta})$ is a good measure of the risk of $\hat{\eta}$ because, unlike the empirical risk, it incorporates the randomness in $\hat{\eta}$ by considering a sample $\hat{\eta}_{T_j} : j = 1, \ldots, J$, instead of a fixed $\hat{\eta}$ trained on the full sample. In addition, the performance of the estimator is assessed on data outside the training set, therefore providing an honest assessment of predictive power.
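To make the procedure concrete, a minimal R sketch of this computation follows; the names cv_risk, train_fn, and dat are ours (not from the chapter), squared-error loss is assumed, and train_fn is any fitting function whose result has a predict method.

    # J-fold cross-validated risk, assuming a data frame `dat` with outcome
    # column y and a fitting function `train_fn` returning a predict()-able fit.
    cv_risk <- function(dat, train_fn, J = 10) {
      fold <- sample(rep(seq_len(J), length.out = nrow(dat)))   # random splits
      risks <- sapply(seq_len(J), function(j) {
        fit  <- train_fn(dat[fold != j, ])                 # train on T_j
        pred <- predict(fit, newdata = dat[fold == j, ])   # predict on V_j
        mean((dat$y[fold == j] - pred)^2)                  # average loss over V_j
      })
      mean(risks)                                          # average across splits j
    }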
13.3.1 Cross-Validation with Correlated Units
In various applications, the data may not be assumed independently distributed. Such is the case, for example, in longitudinal studies in the medical sciences in which the objective is to predict the health status of a patient at a given time point, conditional on his health status and covariates at the previous time point. Specifically, consider an observation unit (e.g., a patient) $i$, with $t = 1, \ldots, m_i$ measurements for each unit (e.g., index $t$ may represent geographical locations, measurement times, etc.). Denote by $X_i = (X_{i1}, X_{i2}, \ldots, X_{im_i})$ the covariate vector and by $Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{im_i})$ the outcome vector for unit $i$. Each $X_{it}$ may be a covariate vector itself, containing observations recorded at time $t$ but also at previous times, for example at the most recent past $t - 1$. In this case, the correct assumption is that $Z_i = (X_i, Y_i) : i = 1, \ldots, n$ are an i.i.d. sample of a random variable $Z \sim P_0$. The predictive function, given by $\eta_0(x, t) = E(Y_t \mid X_t = x)$, may be estimated using the same type of predictive algorithms discussed above, adding to the explanatory variables a time variable containing the index $t$.
However, note that these data are independent across the index $i$, but not across the index $t$. As a result, the optimality properties of cross-validation presented in Section 13.6 will only hold if cross-validation is performed on the index set $\{i : i = 1, \ldots, n\}$. This is an important clarification to prevent cross-validation users from naïvely cross-validating on the index set $\{(i, t) : i = 1, \ldots, n;\ t = 1, \ldots, m_i\}$. Instead, once a unit $i$ is chosen to belong to a validation dataset $V_j$, all its observations $(i, t) : t = 1, \ldots, m_i$ must also be in $V_j$.
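A minimal sketch of this unit-level fold assignment follows (the object names are ours); the SuperLearner package in R provides the same behavior through its id argument, which keeps all rows sharing an identifier in the same validation fold.

    # Assign folds at the unit level so that all measurements (i, t) of a
    # unit i land in the same validation fold.
    set.seed(1)
    n <- 100                                   # independent units i
    m <- sample(2:5, n, replace = TRUE)        # m_i measurements per unit
    dat <- data.frame(id = rep(seq_len(n), times = m))    # one row per (i, t)
    J <- 10
    unit_fold <- sample(rep(seq_len(J), length.out = n))  # one fold label per unit
    dat$fold <- unit_fold[dat$id]              # rows of unit i inherit i's fold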
13.4 Discrete Model Selection
An important step of the model selection approach outlined in the introduction of the chapter involves proposing a collection of estimation algorithms $L = \{\hat{\eta}_k : k = 1, \ldots, K_n\}$.
Some of these algorithms may be based on a subject-matter expert’s knowledge (e.g.,
previous studies suggesting particular functional forms, knowledge of the physical nature of
a phenomenon, etc.), some may be flexible data-adaptive methods (of which the statistical
and machine learning literature have developed a large variety involving varying degrees
of flexibility and computational complexity), and some other algorithms may represent
a researcher’s favorite prediction tool, or simply standard practice. For example, in a
regression problem, it may be known to a subject-matter expert that the relation between a
given predictor and the outcome is logarithmic. In that case, a parametric model including
a logarithmic term may be included in the library.
Our aim is to construct a candidate selector based on predictive power. This aim may be achieved by defining the estimator as the candidate in the library with the smallest prediction error estimate $\hat{R}(\hat{\eta}_k)$. That is, the selector is defined as

$$\hat{k} = \operatorname*{arg\,min}_{k} \hat{R}(\hat{\eta}_k) \qquad (13.3)$$
which is referred to as the discrete super learner in [5]. Its corresponding predictive function is given by $\hat{\eta}_{\mathrm{DSL}} = \hat{\eta}_{\hat{k}}$, that is, the predictive function of the algorithm with the smallest prediction risk.
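In code, the selection step is trivial; the sketch below assumes a named vector cv_risk of cross-validated risk estimates and a list candidate_fits of trained candidates (both names are ours).

    # Discrete super learner: pick the candidate with the smallest
    # cross-validated risk (Equation 13.3).
    k_hat   <- which.min(cv_risk)
    eta_dsl <- candidate_fits[[k_hat]]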
For example, consider the problem described in Section 13.2, for which we propose the following library of candidate estimators (we give only brief and intuitive descriptions of each algorithm; complete discussions may be found in most modern statistical learning textbooks).
Logistic regression with main terms (generalized linear model [GLM]). This prediction method represents standard practice in fields such as epidemiology. The main idea is to estimate the coefficients $\beta_j$ in the parametrization

$$\operatorname{logit} \eta_0(x) = \beta_0 + \sum_{j=1}^{41} \beta_j x_j$$

where $\operatorname{logit}(p) = \log(p/(1-p))$.
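A one-line sketch, assuming a data frame chem with binary outcome y and the 41 predictors as its remaining columns (our placeholder objects, not the chapter's code):

    # Main-terms logistic regression on all predictors.
    fit_glm <- glm(y ~ ., family = binomial(), data = chem)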
Generalized additive model (GAM). The main idea of a GAM is to extend the GLM by considering a model where the predictors enter the equation nonlinearly. That is, consider the following model:

$$\operatorname{logit} \eta_0(x) = \beta_0 + \sum_{j=1}^{41} \beta_j f_j(x_j)$$

where the functions $f_j$ are assumed smooth but otherwise unspecified and are typically estimated using smoothing techniques such as cubic splines or local polynomial regression.
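A sketch using the mgcv package, one of several GAM implementations (the smooth terms shown are illustrative; x1, x2, x3 stand for predictor columns of the placeholder data frame chem):

    # GAM with unspecified smooth functions f_j estimated by penalized splines.
    library(mgcv)
    fit_gam <- gam(y ~ s(x1) + s(x2) + s(x3), family = binomial(), data = chem)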
SVM (support vector machine). In its simplest (linear) form, an SVM is a classifier that separates the $X$ space using a hyperplane, such that the points in each of the two categories $\{0, 1\}$ of $Y$ are maximally separated. The intuition behind the prediction method is best explained with a figure; a toy example for two-dimensional $X$ and binary $Y$ with perfect separation is presented in Figure 13.2.
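A sketch using the e1071 package (our choice of implementation):

    # Linear SVM classifier; probability = TRUE requests class probabilities.
    library(e1071)
    fit_svm <- svm(factor(y) ~ ., data = chem, kernel = "linear",
                   probability = TRUE)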
FIGURE 13.2
SVM illustration ($x_1$ versus $x_2$). The shape of the points defines the categories; the solid and dotted lines represent the SVM maximum separation margin. The dashed line provides suboptimal separation.

RF (random forest). This is a prediction algorithm that runs many classification trees. To obtain the probability $\hat{\eta}(x)$, the vector $x$ is run down each of the trees in the forest, and the classification results are averaged. Each tree is grown by running a classification tree on $q \ll p$ randomly selected variables on a bootstrap sample from the original data. Each tree is grown to the largest extent possible.
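A sketch using the randomForest package:

    # Random forest of classification trees (500 trees is the package default).
    library(randomForest)
    fit_rf <- randomForest(factor(y) ~ ., data = chem, ntree = 500)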
Multivariate adaptive regression splines (MARS). The predictor is built as

$$\eta_0(x) = \beta_0 + \sum_{s=1}^{S} \beta_s B_s(x)$$

where each $B_s$ is a basis function and $S$ is a tuning parameter. The basis functions take the form of hinge functions $\max(0, x_j - c_j)$ or $\max(0, c_j - x_j)$, where $c_j$ is a tuning parameter called a knot. Basis functions may also take the form of the product of two or more hinge functions. The MARS predictor is built in a forward–backward fashion, by adding and deleting terms to minimize the sum of squared errors.
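A sketch using the earth package, which implements MARS; degree = 2 permits products of two hinge functions:

    # MARS with a logistic link for the binary outcome.
    library(earth)
    fit_mars <- earth(y ~ ., data = chem, degree = 2,
                      glm = list(family = binomial))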
Mean. For comparison, we add the most naïve prediction algorithm: $\hat{\eta}(x) = \bar{Y}$.
The cross-validated risk $\hat{R}(\hat{\eta}_k)$ of each of these algorithms applied to our illustrative example is presented in Table 13.1, and may be computed using the SuperLearner package in R (code available in Appendix A), which uses the mean squared-error loss function. These estimated risks indicate that the best predictor for this dataset, among the candidates studied, is the RF.
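A condensed sketch of that computation (the complete code is in Appendix A; Y and X stand in for the outcome vector and predictor data frame of the chemical data):

    # Fit the whole candidate library and obtain cross-validated risks.
    library(SuperLearner)
    sl_library <- c("SL.glm", "SL.gam", "SL.svm", "SL.randomForest",
                    "SL.earth", "SL.mean")    # SL.earth is the MARS wrapper
    fit <- SuperLearner(Y = Y, X = X, family = binomial(),
                        SL.library = sl_library, cvControl = list(V = 10))
    fit$cvRisk    # one risk estimate per candidate algorithm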
TABLE 13.1
Cross-validated risk for each algorithm applied to classification of chemicals.

       GLM     GAM     SVM     RF      MARS    Mean
Risk   0.1003  0.0985  0.1021  0.0933  0.1154  0.224

FIGURE 13.3
ROC curve and AUC for each prediction method, plotting true positive rate against false positive rate. (a) Complete sample prediction (AUC: GLM 0.945, GAM 0.955, SVM 0.932, RF 1, MARS 0.97, Mean 0.5). (b) Cross-validated prediction (AUC: GLM 0.924, GAM 0.924, SVM 0.919, RF 0.934, MARS 0.894, Mean 0.409).

Figure 13.3 shows the receiver operating characteristic (ROC) curves computed with cross-validated and non-cross-validated predictions. The area under the curve (AUC) is also
displayed as a performance measure: an area of 1 indicates perfect prediction, whereas an
area of 0.5 indicates that the predictor is not better than a coin toss. This graph illustrates
the importance of using cross-validation. Consider first the ROC curves in Figure 13.3a.
In this graph, the RF seems to perfectly fit the data, raising suspicion of overfitting and poor generalization power. Meanwhile, MARS seems to be the best-performing
algorithm. Figure 13.3b shows the ROC computed with cross-validated predictions. Note
that, according to Figure 13.3b, the conclusions obtained from Figure 13.3a are wrong. The
RF has the best generalization power, whereas the generalization power of the MARS is the
poorest among all the algorithms considered.
13.5 Model Stacking and Super Learning
In the previous section, we described an algorithm that may be used to select a unique estimator from a discrete library of candidate algorithms. In this section, we improve on discrete model selection by using an ensemble model that combines the predictions of all the models in the discrete candidate library. The main idea behind the super learner (or, more generally, a stacked predictor) is to use a parametrization $\hat{\eta}_\alpha$ to combine the predictors $\hat{\eta}_k$. Examples of this parametrization include

$$\hat{\eta}_\alpha(z) = \sum_{k=1}^{K} \alpha_k \hat{\eta}_k(z)$$

and

$$\hat{\eta}_\alpha(z) = \operatorname{expit}\left( \sum_{k=1}^{K} \alpha_k \operatorname{logit}(\hat{\eta}_k(z)) \right)$$

where expit is the inverse of the logit function. The first parametrization is often referred to as a linear model, whereas the second one is referred to as a logistic model, and is often used for binary regression. In the linear model, one may decide to restrict the values $\alpha_k$ to the set of positive weights summing up to one. This constraint on the $\alpha$ parameters would be particularly relevant for estimation of a conditional density function, because it would guarantee that the final estimate is a proper density function.
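As a sketch of the two parametrizations (preds, a matrix with one column of predictions per candidate, and the weight vector alpha are our notation):

    # Linear and logistic combinations of the candidate predictions.
    logit <- function(p) log(p / (1 - p))
    expit <- function(x) 1 / (1 + exp(-x))
    eta_linear   <- as.vector(preds %*% alpha)
    eta_logistic <- expit(as.vector(logit(preds) %*% alpha))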
For simplicity, we focus on parametric models for the ensemble model. In principle,
however, any prediction method could be used to combine the predictors in the discrete
library.
In this formulation, the optimal selection problem may be equivalently framed as selection of the optimal weights $\alpha$ in a library of candidate estimators given by $L = \{\hat{\eta}_\alpha(z) : \alpha\}$. It can be seen that the library used for the discrete super learner of the previous section is a subset of this library in which $\alpha_k \in \{0, 1\}$. Selection of a candidate estimator in this library is carried out in a similar fashion to discrete model selection, but the optimization is performed over an infinite set as opposed to a finite one. The super learning predictive function is thus defined as $\hat{\eta}_{\mathrm{SL}} = \hat{\eta}_{\hat{\alpha}}$, where the selector equals

$$\hat{\alpha} = \operatorname*{arg\,min}_{\alpha} \hat{R}(\hat{\eta}_\alpha) \qquad (13.4)$$
The optimization in Equation 13.4 is often a convex optimization problem that can be readily solved using standard software. In particular, computing $\hat{\alpha}$ involves only one additional step compared to the discrete selection algorithm of the previous section. As an example, consider the regression problem with a continuous outcome and note that

$$L(Z_i, \hat{\eta}_{\alpha, T_j}) = \left( Y_i - \sum_{k=1}^{K_n} \alpha_k \, \hat{\eta}_{k, T_j}(X_i) \right)^2 = \left( Y_i - \sum_{k=1}^{K_n} \alpha_k H_{k,i} \right)^2$$

where we have introduced a set of variables $H_k : k = 1, \ldots, K_n$, computed, for unit $i$, as $H_{k,i} = \hat{\eta}_{k, T_j}(X_i)$, with $j$ the sample split for which $i \in V_j$. The cross-validated risk is equal to

$$\hat{R}(\hat{\eta}_\alpha) = \frac{1}{J} \sum_{j=1}^{J} \frac{1}{|V_j|} \sum_{i \in V_j} L(Z_i, \hat{\eta}_{\alpha, T_j}) = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \sum_{k=1}^{K_n} \alpha_k H_{k,i} \right)^2$$
Thus, computation of $\hat{\alpha}$ in Equation 13.4 is reduced to computation of the ordinary least-squares estimator of a regression of $Y$ on the predictors $H_k : k = 1, \ldots, K_n$. Note that the variables $H_k$ must also be computed to evaluate the discrete super learner of the previous section. As a result, computation of the super learner involves running only one additional regression, compared to the discrete super learner. This optimization problem under a constraint on the parameter space, such as restricting the values $\alpha_k$ to the set of positive weights summing up to one, is implemented in the SuperLearner package in R. In our chemical classification example, the weights of each algorithm in the super learner are given in Table 13.2.
Note that the RF gets more than half the weight in the predictor, in agreement with
the results from the discrete super learner of the previous section.
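A sketch of this constrained weight computation follows; the matrix H of cross-validated predictions uses the notation above, and the use of the nnls package mirrors (but is not copied from) the default method of the SuperLearner package.

    # Minimize ||Y - H alpha||^2 subject to alpha >= 0, then normalize the
    # weights so that they sum to one.
    library(nnls)
    sl_weights <- function(H, Y) {
      fit   <- nnls(as.matrix(H), Y)
      alpha <- fit$x
      alpha / sum(alpha)
    }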
13.5.1 Assessing the Prediction Error of a Stacked Predictor
As with any prediction method, the considerations presented in Section 13.3 about overfitting apply to a stacked predictor such as the super learner. Consequently, cross-validation
provides the correct tool to assess the performance of the super learner. The cross-validated
prediction error is thus given by

$$\hat{R}(\hat{\eta}_{\mathrm{SL}}) = \frac{1}{J} \sum_{j=1}^{J} \frac{1}{|V_j|} \sum_{i \in V_j} L(Z_i, \hat{\eta}_{\mathrm{SL}, T_j})$$

where $\hat{\eta}_{\mathrm{SL}, T_j}$ is the super learner trained using only data in $T_j$. Note that training $\hat{\eta}_{\mathrm{SL}, T_j}$ involves further splitting of $T_j$ into validation and training subsamples, which results in larger computation times for the risk of the super learner. In our chemical classification data, we obtain the results presented in Table 13.3, using the function CV.SuperLearner from the package SuperLearner (code in Appendix A).
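A condensed sketch of that call (Y, X, and sl_library are the placeholder objects from the earlier sketch):

    # Outer cross-validation layer around the super learner itself.
    cv_fit <- CV.SuperLearner(Y = Y, X = X, family = binomial(),
                              SL.library = sl_library,
                              cvControl = list(V = 10))   # outer splits
    summary(cv_fit)   # cross-validated risks of the super learner and candidates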
TABLE 13.2
Weights of each prediction algorithm in the chemical classification data.

         GLM     GAM     SVM     RF      MARS    Mean
Weight   0.0000  0.2641  0.0109  0.5514  0.1736  0.0000