416 Handbook of Big Data
22.3 Road Map for Estimation and Inference
The targeted learning framework provides a template for translating variable importance
research questions into statistical questions, developing and applying estimators, and
assessing uncertainty in the effect measures. We use the motivating examples from the
earlier discussed work in quantitative trait loci mapping [63,64,66] to illustrate this road
map for estimation and inference, following the estimation presented in [66].
Variable Importance Measures
In this chapter, we focus on a TMLE of the variable importance measure described in
Section 22.3 under a semiparametric regression model. This is a flexible definition that
can handle both continuous and binary variables in the list. While we use quantitative trait loci
mapping to illustrate the methodology concretely, the applications of these methods are
vast. As discussed earlier, the list of variables could involve clinical or epidemiological
data [6,45], and these tools also have important implications for testing for possible effect
modification (e.g., an intervention modified by the variables in the list) and in controlled
randomized trial data [61].
22.3.1 Defining the Research Question
The first step is to define the research question, which includes accurately specifying your
data, model, and target parameters. Recall that we are interested in understanding which
quantitative trait loci underlie a particular phenotypic trait value. Quantitative trait loci
mapping for experimental organisms typically involves crossing two inbred lines that have
substantial differences in a trait. The trait is then scored in the segregating progeny.
Markers along the genome are genotyped in the segregating progeny, and associations
between the trait and the quantitative trait loci are evaluated. The positions and effect
sizes of quantitative trait loci are of primary interest. Typical segregating designs include the
backcross design, the intercross (F2) design, and the double haploid (DH) design. A backcross
is produced by crossing the first generation (F1) back to one of its parental strains, and
there are two possible genotypes, Aa and aa, at any locus. For ease of presentation, as the
authors do in the original work [66], we focus most heavily on the backcross design to demonstrate our
method. All the derivations can be readily extended to F2 and other types of experimental
crosses.
22.3.1.1 Data
The observed data are given as n i.i.d. realizations of
O_i = (Y_i, M_i) ~ P_0,   i = 1, ..., n
Here, Y is the phenotypic trait value and M is a vector of the marker genotypic values, with
i indexing the ith subject and the subscript 0 indicating that P_0 is the true distribution
of the data. The true probability distribution P_0 is contained within the set of possible
probability distributions that make up the statistical model M.
We introduce the notation A to represent the genotypic value of the quantitative trait
loci currently under consideration. A is observed when it lies on a marker, although it
can also lie between markers, where it will be unobserved. When A is unobserved, it is
imputed using the expected value returned from a multinomial distribution computed from
Targeted Learning for Variable Importance 417
the locations and genotypes of the flanking markers. This is also the approach
used in Haley–Knott regression [11]. In this case, the effect is therefore only an estimate of
the effect of the imputed A at these locations.

Defining the Research Question
Data: n i.i.d. observations of O ~ P_0.
Model: The statistical model M is the set of possible probability distributions of O, with the true P_0 in M. The model is the statistical model augmented with possible causal assumptions.
Target Parameters: The parameters Ψ(P_0) are features of P_0; Ψ maps the probability distribution P_0 into the target parameters.
22.3.1.2 Model and Parameter
We use a semiparametric model that assumes that the phenotypic trait changes linearly
with the quantitative trait loci. This regression model for the effect of A at a value A = a
relative to A = 0, adjusted for the set of other markers, denoted M^-, is

E_0(Y | A = a, M^-) − E_0(Y | A = 0, M^-) = β_0 a   (22.1)
Our target parameter is therefore β_0, which is also equivalent to the average marginal effect
given by averaging this conditional effect over the distribution of M^-. The target parameter
is defined formally as a mapping Ψ : M → R that maps the probability distribution of the
data into the (finite-dimensional) feature of interest Ψ(P_0) = β_0. Additional discussion of
this parameter can be found in earlier literature [49,63].
For our application, the parameter measures the difference in the phenotypic trait
outcome Y when A shifts from homozygote to heterozygote. This follows from
the coding: aa (homozygote) is given the value 0 and Aa (heterozygote) is set to 1 in a
backcross population; for an F2 population, the coding is (AA, Aa, aa) = (1, 0, −1).
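Under this coding, the contrast in Equation 22.1 can be checked on simulated data. The following is a minimal sketch with made-up values (the true β_0, the function f, and the marker distribution are all hypothetical), illustrating that the difference in conditional means recovers β_0 in a backcross-style setting where A happens to be independent of the other marker:

```python
import random
import statistics

# Minimal simulation (hypothetical values throughout): in a backcross,
# A is coded 0 (aa) or 1 (Aa), and the semiparametric model states
#   E(Y | A = a, M) - E(Y | A = 0, M) = beta_0 * a.
random.seed(1)
beta_0 = 2.0  # true effect, assumed for the simulation

def f(m):
    # arbitrary contribution of the other marker (left unspecified by the model)
    return 0.5 * m - 0.3 * m ** 2

ys = {0: [], 1: []}
for _ in range(50_000):
    m = random.choice([0, 1])   # a single "other" marker, independent of A here
    a = random.choice([0, 1])   # backcross genotype: aa -> 0, Aa -> 1
    y = beta_0 * a + f(m) + random.gauss(0, 1)
    ys[a].append(y)

# With M independent of A, the crude difference in means already
# recovers beta_0 (there is no confounding to adjust for).
diff = statistics.fmean(ys[1]) - statistics.fmean(ys[0])
print(round(diff, 2))
```

When the markers do confound A, this crude contrast is biased, which is exactly what the targeted estimation procedure in Section 22.3.2 addresses.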
The linearity assumption discussed above can be seen explicitly in Equation 22.1. It
is important to stress that only the effect of our genotypic value A on the mean of the
quantitative trait outcome Y is modeled using a parametric form in the semiparametric
model. We do not impose any distributional assumptions on the data, and we do not make
assumptions about the functional form of the functions f(M^-) of M^-. We do additionally
assume that A is not a perfect surrogate of M^-, in order for the parameter to
be well defined and estimable. Finally, we make the positivity assumption 0 < P_0(A = a | M^-) < 1.
The model given in Equation 22.1 is general and may be specified in alternative
ways, depending on the target parameter of interest. To include effect modification by
markers V_j, we would write

a Σ_{j=1}^J β_j V_j

in place of β_0 a.
22.3.1.3 Causal Assumptions
We do not discuss causal assumptions in detail here for brevity and also given that, in many
variable importance settings, these causal assumptions will be violated. In particular, the
no unmeasured confounding assumption (also referred to as the randomization assumption)
will frequently not hold. Researchers may also not be interested in drawing causal inferences
in variable importance settings. However, a case could be made that the exercise of walking
through the causal assumptions and articulating the role of endogenous and exogenous
variables in nonparametric structural equations is still useful for a full description of
the research question. One could then decide not to augment the statistical model with
additional untestable causal assumptions. For a thorough treatment of causal assumptions,
nonparametric structural equation models, and directed acyclic graphs, we refer to other
literature [23,61].
22.3.2 Estimation
The TMLE procedure builds on the foundation established by maximum likelihood
estimation and proceeds in two steps. In the first step, we obtain an ensemble machine
learning-based estimator of the data-generating distribution. Super learning is appropriate
for this task [32,58,61]. It allows the user to consider multiple algorithms, without the need
to select the best algorithm a priori. The super learner returns the best weighted combination
of the algorithms considered, selected based on a chosen loss function. Cross-validation is
employed to protect against overfitting. The second stage of TMLE fluctuates the initial
super learner-based estimator in a submodel focusing on the optimal bias–variance trade-off
for the target parameter. This second step can also be thought of as a bias reduction step. We
must reduce the bias remaining in the initial estimator for the target parameter, since it was
fitted based on a bias–variance trade-off for the data-generating distribution, not the target
parameter.
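The first stage can be illustrated with a toy sketch. This is not the actual SuperLearner software; it is a minimal, hypothetical example with two candidate algorithms (an intercept-only fit and a simple linear fit) in which 5-fold cross-validation selects the convex combination minimizing squared-error loss:

```python
import random
import statistics

# Toy super learner sketch: pick the convex combination of two candidate
# learners that minimizes cross-validated squared-error loss.
random.seed(2)
xs = [random.uniform(0, 1) for _ in range(400)]
ys = [2.0 * x + random.gauss(0, 0.3) for x in xs]  # hypothetical data

def fit_mean(x_tr, y_tr):
    m = statistics.fmean(y_tr)
    return lambda x: m                     # learner 1: intercept only

def fit_linear(x_tr, y_tr):
    xb, yb = statistics.fmean(x_tr), statistics.fmean(y_tr)
    sxy = sum((x - xb) * (y - yb) for x, y in zip(x_tr, y_tr))
    sxx = sum((x - xb) ** 2 for x in x_tr)
    b = sxy / sxx
    return lambda x: (yb - b * xb) + b * x  # learner 2: simple linear fit

def cv_risk(alpha, k=5):
    """Cross-validated risk of alpha*mean-fit + (1-alpha)*linear-fit."""
    n, risk = len(xs), 0.0
    for fold in range(k):
        test = set(range(fold, n, k))
        x_tr = [xs[i] for i in range(n) if i not in test]
        y_tr = [ys[i] for i in range(n) if i not in test]
        f1, f2 = fit_mean(x_tr, y_tr), fit_linear(x_tr, y_tr)
        for i in test:
            pred = alpha * f1(xs[i]) + (1 - alpha) * f2(xs[i])
            risk += (ys[i] - pred) ** 2
    return risk / n

# Grid search over the ensemble weight; cross-validation guards
# against rewarding an overfit candidate.
best_alpha = min((a / 10 for a in range(11)), key=cv_risk)
print(best_alpha)  # the linear learner should receive most of the weight here
```

A real super learner library would consider many algorithms and solve for the optimal weight vector rather than searching a one-dimensional grid, but the cross-validated selection logic is the same.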
The procedure can also be understood intuitively in the context of our motivating quantitative
trait loci example. In stage one, the conditional expectation of the phenotypic
trait value Y given the vector M is not targeted toward our parameter of interest;
its bias–variance trade-off is for the overall density. Stage two incorporates the conditional
expectation of the genotypic value A of the quantitative trait loci currently being considered to
shrink the bias of the conditional expectation of Y, our initial estimate. We now also introduce
a subset of M^-, denoted W, for each A. The vector W contains the subset of markers
that are potential confounders for the effect of genotypic value A on phenotypic trait Y.
To define our TMLE concretely for this problem, we must begin by calculating the
pathwise derivative of our parameter Ψ(P) = β at P and its corresponding canonical
gradient (efficient influence curve) D(P, O):

D(P, O) = (1 / σ²(A, W)) h(A, W) (Y − Q(A, W))   (22.2)

where

h(A, W) = (d/dβ) m(A | β) − E[(d/dβ) m(A | β) / σ²(A, W) | W] / E[1 / σ²(A, W) | W]   (22.3)

and σ²(A, W) is the conditional variance of Y given A and W.
Estimation
Super Learner: For each target parameter, obtain an initial estimate of the relevant part Q_0 of P_0 using super learning.
TMLE: The second stage of TMLE updates the initial fit in a step targeted to make an optimal bias–variance trade-off for the parameter under consideration, now denoted Ψ(Q_0). The procedure is repeated for all target parameters.
The TMLE requires choosing a loss function L(O, Q) for a candidate function Q applied
to an observation O and then specifying a submodel {Q(ε) : ε} ⊂ M to fluctuate the initial
estimator. Here, we use the squared-error loss function:

L(O, Q) = (Y − Q(A, W))² / σ²(A, W)

The submodel {Q(ε) : ε} ⊂ M through Q at ε = 0 is selected such that the linear span of
(d/dε) L(Q(ε)) at ε = 0 includes the efficient influence curve in Equation 22.2. The specific
steps of the TMLE algorithm for the target parameter β_0 are enumerated below.
22.3.2.1 TMLE Algorithm for Quantitative Trait Loci Mapping
Estimating E_0(Y | A, M^-) = Q_0(A, M^-). Generate a super learner-based initial estimator
that respects the semiparametric model in Equation 22.1 and also takes the form

Q^0_n = β^0_n A + f_n(M^-)

We introduce the subscript n to denote estimators and estimates.
Estimating E_0(A | W) = g_0(W). Recall that we introduced a subset W of M^- for each
A. Thus, M^- is replaced with W, and we can refer to the function g_0(W) = E_0(A | W)
as a marker confounding mechanism. For the applications considered here, as in Wang et
al. [66], the set of markers W are those that lie on the same chromosome as A.

However, the choice of E_0(A | W) is, in general, still a complicated one. The selection
of flanking markers to include in the marker confounding mechanism can be further
simplified to including only two flanking markers, which may capture a good portion of
the confounding. But there remains the issue of how far from A these two flanking
markers should be selected. Markers that are too close to A may be too predictive of A,
thus failing to isolate the contribution of A when estimating β_0. On the other hand, if the
selected markers are too far from A, they may not contribute to reducing
bias for the target parameter of interest. Collaborative TMLE, discussed briefly in our
literature review, may also be employed to data-adaptively select the most appropriate
adjustment set. We leave further discussion of this issue to other literature [49,63].
Determine parametric working model to fluctuate the initial estimator. The targeted
step uses an estimate g_n(W) of g_0(W) to correct the bias remaining in the initial estimator.
This involves defining a so-called clever covariate in a parametric working model coding
fluctuations of our initial estimator Q^0_n. For our parameter β_0, the clever covariate is
given by

h(A, W) = A − g_n(W)

that is, the residual with respect to g_n(W), under a condition we describe below.
The clever covariate h(A, W) was defined earlier in Equation 22.3 and derived from
the efficient influence curve in Equation 22.2. When σ²(A, W) is a function of W only,
it drops out of the efficient influence curve. We choose to estimate σ²(A, W) with the
constant 1, which gives us the simplified clever covariate h(A, W) = A − g_n(W) as above.
The estimation of the nuisance parameter σ²(A, W) does not impact the consistency
properties of the TMLE, but the TMLE will only be efficient if, in addition to Q_0
and g_0 being estimated consistently, σ²(A, W) is in fact a function of W only [49].
Update Q^0_n. The regression of Y on h(A, W) can be reformulated as

Y* ~ h(A, W)

where

Y* = Y − Q^0_n(A, M^-)

The estimate of the regression coefficient is denoted ε_n. Our initial estimate β^0_n is updated
with ε_n:

β^1_n = β^0_n + ε_n

Convergence of the algorithm for this target parameter occurs in one step. Since our TMLE
is double robust, we have the following properties for this estimator of β_0: this TMLE is
(1) consistent when either Q^0_n or g_n(W) is consistent and (2) efficient when both Q^0_n and
g_n(W) are consistent (and σ²(A, W) is a function of W only).
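The one-step update can be sketched end to end on simulated data. Everything numeric below is hypothetical, and a plain least-squares fit stands in for the super learner; the point is to show the double robustness claim in action: the initial estimator deliberately omits the confounding marker W, so β^0_n is biased, and the targeted step using g_n(W) removes that bias.

```python
import random

# Sketch of the one-step TMLE update on simulated backcross-style data.
# All numeric values are hypothetical; OLS stands in for the super learner.
random.seed(3)
beta_true = 1.0
n = 50_000

W, A, Y = [], [], []
for _ in range(n):
    w = random.choice([0, 1])                        # confounding marker
    a = 1 if random.random() < 0.2 + 0.6 * w else 0  # P(A=1 | W) depends on W
    y = beta_true * a + 1.5 * w + random.gauss(0, 1)
    W.append(w); A.append(a); Y.append(y)

# Stage 1: initial fit of Y on A alone (omitting W => confounded, biased up).
a_bar = sum(A) / n
y_bar = sum(Y) / n
beta0_n = (sum((a - a_bar) * y for a, y in zip(A, Y))
           / sum((a - a_bar) ** 2 for a in A))
alpha0_n = y_bar - beta0_n * a_bar
Q0 = [alpha0_n + beta0_n * a for a in A]             # initial fit Q0_n(A)

# Marker confounding mechanism g_n(W) = E_n(A | W), fit empirically.
g = {}
for w in (0, 1):
    grp = [a for a, wi in zip(A, W) if wi == w]
    g[w] = sum(grp) / len(grp)
h = [a - g[w] for a, w in zip(A, W)]                 # clever covariate A - g_n(W)

# Stage 2: regress the residual Y - Q0_n on h (no intercept) to get eps_n.
eps_n = (sum(hi * (y - q) for hi, y, q in zip(h, Y, Q0))
         / sum(hi * hi for hi in h))
beta1_n = beta0_n + eps_n

print(round(beta0_n, 2), round(beta1_n, 2))          # biased initial vs. targeted
```

Because the clever covariate has mean zero within each W stratum, a single update suffices here, consistent with the one-step convergence noted above: beta1_n is close to the true value even though the initial fit ignored W.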
Implementation Summary
The TMLE of the target parameter β_0, defined in Equation 22.1, requires an initial
fit of E_0(Y | M). Our best fit of E_0(Y | M) will be based on minimizing the chosen
loss function. This initial estimator yields a fit of E_0(Y | A = 0, M^-), which we can map
to a first-stage estimator of β_0 in our semiparametric model.
We now complete the second-stage targeted updating step. This single update
(convergence is achieved in one step) is completed by fitting a coefficient in front
of an estimate of A − E_0(A | W) with univariate regression, using the initial estimator
of E_0(Y | A, M^-) as an offset. We can show that the TMLE of β_0 is β^0_n + ε_n.
22.3.3 Inference
The variance σ²_n for each variable importance measure β^1_n can be calculated using influence-curve-based
methods [66], with the variance σ²_n given by

σ²_n = Σ_i (Y_i − Q^1_n(A_i, M^-_i))² h(A_i, W_i)² / (Σ_i A_i h(A_i, W_i))²
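The variance formula can be transcribed directly into code. The sketch below defines a small helper (the name `ic_variance` is hypothetical) and evaluates it on tiny made-up inputs whose result can be checked by hand:

```python
# Direct transcription of the influence-curve-based variance formula above:
#   sigma^2_n = sum_i r_i^2 h_i^2 / (sum_i A_i h_i)^2, with r_i = Y_i - Q1_i.
def ic_variance(Y, Q1, A, h):
    num = sum((y - q) ** 2 * hi ** 2 for y, q, hi in zip(Y, Q1, h))
    den = sum(a * hi for a, hi in zip(A, h)) ** 2
    return num / den

# Hand-checkable toy inputs (hypothetical, not real QTL data):
Y  = [1.0, 2.0, 0.0, 1.5]   # outcomes
Q1 = [0.8, 1.9, 0.2, 1.4]   # updated fit Q1_n at each observation
A  = [1, 1, 0, 0]           # genotypic values
h  = [0.5, 0.5, -0.5, -0.5] # clever covariate A - g_n(W)
var_n = ic_variance(Y, Q1, A, h)
# An approximate 95% confidence interval for beta_0 is then
# beta1_n +/- 1.96 * sqrt(var_n).
```

Note that the formula already returns the variance of the estimator itself (the squared sum in the denominator supplies the 1/n scaling discussed below), so no further division by n is needed.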
A detailed discussion of multiple hypothesis testing and inference for variable importance
measures is presented in [7]. The authors in the corresponding quantitative trait loci work
[66] adjusted for multiple testing using the false discovery rate and interpreted each variable
importance measure as a W -adjusted effect estimate.
In general, variance estimates for TMLE rely on δ-method conditions [61,62], and, as
such, the asymptotic normal limit distribution of the estimator is characterized by its
influence curve. The estimator β^1_n of our target parameter is asymptotically linear; therefore,
it behaves as an empirical mean, with bias converging to 0 in sample size faster than a rate
of 1/√n, and it is approximately normally distributed (for sample size n reasonably large).
The variance of the estimator is thus well approximated by the variance of the influence
curve divided by n. One can also use the covariance in variable importance questions with
a multivariate vector of parameters, where the covariance matrix of the estimator vector
is well approximated by the covariance matrix of the corresponding multivariate influence
curve divided by n [61].