416 Handbook of Big Data
22.3 Road Map for Estimation and Inference
The targeted learning framework provides a template for translating variable importance
research questions into statistical questions, developing and applying estimators, and
assessing uncertainty in the effect measures. We use the motivating examples from the
earlier discussed work in quantitative trait loci mapping [63,64,66] to illustrate this road
map for estimation and inference, following the estimation presented in [66].
Variable Importance Measures
In this chapter, we focus on a TMLE of the variable importance measure described in
Section 22.3 under a semiparametric regression model. This is a flexible definition that
can handle both continuous and binary variables in the list. While we use quantitative trait loci
mapping to illustrate the methodology concretely, the applications of these methods are
vast. As discussed earlier, the list of variables could involve clinical or epidemiological
data [6,45], and these tools also have important implications for testing for possible effect
modification (e.g., an intervention modified by the variables in the list) and in controlled
randomized trial data [61].
22.3.1 Defining the Research Question
The first step is to define the research question, which includes accurately specifying your
data, model, and target parameters. Recall that we are interested in understanding which
quantitative trait loci underlie a particular phenotypic trait value. Quantitative trait loci
mapping for experimental organisms typically involves crossing two inbred lines that have
substantial differences in a trait. The trait is then scored in the segregating progeny.
Markers along the genome are genotyped in the segregating progeny, and associations
between the trait and the quantitative trait loci are evaluated. The positions and effect
sizes of quantitative trait loci are of primary interest. Typical segregating designs include the
backcross design, the intercross (F2) design, and the double haploid (DH) design. A backcross
is produced by crossing the first generation (F1) back to one of its parental strains, and
there are two possible genotypes, Aa and aa, at any locus. For ease of presentation, as the
authors do in the original work [66], we focus most heavily on the backcross design to demonstrate our
method. All the derivations can be readily extended to F2 and other types of experimental
crosses.
22.3.1.1 Data
The observed data are given as n i.i.d. realizations of
O_i = (Y_i, M_i) ~ P_0,   i = 1, ..., n
Here, Y is the phenotypic trait value and M is a vector of the marker genotypic values, with
i indexing the ith subject and the subscript 0 indicating that P_0 is the true distribution
of the data. The true probability distribution P_0 is contained within the set of possible
probability distributions that make up the statistical model M.
We introduce the notation A to represent the genotypic value of the quantitative trait
loci currently under consideration. A is observed when it lies on a marker, although it
can also lie between markers, where it will be unobserved. When A is unobserved, it is
imputed using the expected value returned from a multinomial distribution computed from
Targeted Learning for Variable Importance 417
the locations and genotypes of the flanking markers. This is also the approach
used in Haley–Knott regression [11]. In this case, the effect is therefore only an estimate of
the effect of the imputed A at these locations.

Defining the Research Question
Data: n i.i.d. observations of O ~ P_0.
Model: The statistical model M is the set of possible probability distributions of O, with the true P_0 in M. The model is the statistical model augmented with possible causal assumptions.
Target Parameters: The parameters Ψ(P_0) are features of P_0; Ψ maps the probability distribution P_0 into the target parameters.
22.3.1.2 Model and Parameter
We use a semiparametric model that assumes that the phenotypic trait changes linearly
with the quantitative trait loci. This regression model for the effect of A at a value A = a
relative to A = 0, adjusted for the set of other markers, denoted M^-, is

E_0(Y | A = a, M^-) − E_0(Y | A = 0, M^-) = β_0 a   (22.1)
Our target parameter is therefore β_0, which is also equivalent to the average marginal effect
given by averaging this conditional effect over the distribution of M^-. The target parameter
is defined formally as a mapping Ψ : M → R that maps the probability distribution of the
data into the (finite-dimensional) feature of interest Ψ(P_0) = β_0. Additional discussion of
this parameter can be found in earlier literature [49,63].
For our application, the parameter measures the difference in the phenotypic trait
outcome Y when A shifts from homozygote to heterozygote. This follows from
the coding: aa (homozygote) is given the value 0 and Aa (heterozygote) is set to 1 in a
backcross population; for an F2 population, the coding is (AA, Aa, aa) = (1, 0, −1).
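Under this coding, the contrast in Equation 22.1 can be checked on simulated data. The following is a minimal sketch with made-up values (the true β_0, the function f, and the marker distribution are all hypothetical), illustrating that the difference in conditional means recovers β_0 in a backcross-style setting where A happens to be independent of the other marker:

```python
import random
import statistics

# Minimal simulation (hypothetical values throughout): in a backcross,
# A is coded 0 (aa) or 1 (Aa), and the semiparametric model states
#   E(Y | A = a, M) - E(Y | A = 0, M) = beta_0 * a.
random.seed(1)
beta_0 = 2.0  # true effect, assumed for the simulation

def f(m):
    # arbitrary contribution of the other marker (left unspecified by the model)
    return 0.5 * m - 0.3 * m ** 2

ys = {0: [], 1: []}
for _ in range(50_000):
    m = random.choice([0, 1])   # a single "other" marker, independent of A here
    a = random.choice([0, 1])   # backcross genotype: aa -> 0, Aa -> 1
    y = beta_0 * a + f(m) + random.gauss(0, 1)
    ys[a].append(y)

# With M independent of A, the crude difference in means already
# recovers beta_0 (there is no confounding to adjust for).
diff = statistics.fmean(ys[1]) - statistics.fmean(ys[0])
print(round(diff, 2))
```

When the markers do confound A, this crude contrast is biased, which is exactly what the targeted estimation procedure in Section 22.3.2 addresses.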
The linearity assumption discussed above can be seen explicitly in Equation 22.1. It
is important to stress that only the effect of our genotypic value A on the mean of the
quantitative trait outcome Y is modeled using a parametric form in the semiparametric
model. We do not impose any distributional assumptions on the data, and we do not make
assumptions about the functional form of the functions f(M^-) of M^-. We do additionally
assume that A is not a perfect surrogate of M^-, in order for the parameter to
be well defined and estimable. Finally, we make the positivity assumption 0 < P_0(A = a | M^-) < 1.
The model given in Equation 22.1 is general and may be specified in alternative
ways, depending on the target parameter of interest. To include effect modification by
markers V_j, we would write

a Σ_{j=1}^J β_j V_j

in place of β_0 a.
22.3.1.3 Causal Assumptions
We do not discuss causal assumptions in detail here for brevity and also given that, in many
variable importance settings, these causal assumptions will be violated. In particular, the
no unmeasured confounding assumption (also referred to as the randomization assumption)
will frequently not hold. Researchers may also not be interested in drawing causal inferences
in variable importance settings. However, a case could be made that the exercise of walking
through the causal assumptions and articulating the role of endogenous and exogenous
variables in nonparametric structural equations is still useful for a full description of
the research question. One could then decide not to augment the statistical model with
additional untestable causal assumptions. For a thorough treatment of causal assumptions,
nonparametric structural equation models, and directed acyclic graphs, we refer to other
literature [23,61].
22.3.2 Estimation
The TMLE procedure builds on the foundation established by maximum likelihood
estimation and proceeds in two steps. In the first step, we obtain an ensemble machine
learning-based estimator of the data-generating distribution. Super learning is appropriate
for this task [32,58,61]. It allows the user to consider multiple algorithms, without the need
to select the best algorithm a priori. The super learner returns the best weighted combination
of the algorithms considered, selected based on a chosen loss function. Cross-validation is
employed to protect against overfitting. The second stage of TMLE fluctuates the initial
super learner-based estimator in a submodel focusing on the optimal bias–variance trade-off
for the target parameter. This second step can also be thought of as a bias reduction step. We
must reduce the bias remaining in the initial estimator for the target parameter, since it was
fitted based on a bias–variance trade-off for the data-generating distribution, not the target
parameter.
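The first stage can be illustrated with a toy sketch. This is not the actual SuperLearner software; it is a minimal, hypothetical example with two candidate algorithms (an intercept-only fit and a simple linear fit) in which 5-fold cross-validation selects the convex combination minimizing squared-error loss:

```python
import random
import statistics

# Toy super learner sketch: pick the convex combination of two candidate
# learners that minimizes cross-validated squared-error loss.
random.seed(2)
xs = [random.uniform(0, 1) for _ in range(400)]
ys = [2.0 * x + random.gauss(0, 0.3) for x in xs]  # hypothetical data

def fit_mean(x_tr, y_tr):
    m = statistics.fmean(y_tr)
    return lambda x: m                     # learner 1: intercept only

def fit_linear(x_tr, y_tr):
    xb, yb = statistics.fmean(x_tr), statistics.fmean(y_tr)
    sxy = sum((x - xb) * (y - yb) for x, y in zip(x_tr, y_tr))
    sxx = sum((x - xb) ** 2 for x in x_tr)
    b = sxy / sxx
    return lambda x: (yb - b * xb) + b * x  # learner 2: simple linear fit

def cv_risk(alpha, k=5):
    """Cross-validated risk of alpha*mean-fit + (1-alpha)*linear-fit."""
    n, risk = len(xs), 0.0
    for fold in range(k):
        test = set(range(fold, n, k))
        x_tr = [xs[i] for i in range(n) if i not in test]
        y_tr = [ys[i] for i in range(n) if i not in test]
        f1, f2 = fit_mean(x_tr, y_tr), fit_linear(x_tr, y_tr)
        for i in test:
            pred = alpha * f1(xs[i]) + (1 - alpha) * f2(xs[i])
            risk += (ys[i] - pred) ** 2
    return risk / n

# Grid search over the ensemble weight; cross-validation guards
# against rewarding an overfit candidate.
best_alpha = min((a / 10 for a in range(11)), key=cv_risk)
print(best_alpha)  # the linear learner should receive most of the weight here
```

A real super learner library would consider many algorithms and solve for the optimal weight vector rather than searching a one-dimensional grid, but the cross-validated selection logic is the same.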
The procedure can also be understood intuitively in the context of our motivating quantitative
trait loci example. In stage one, the conditional expectation of the phenotypic
trait value Y given the vector M is not targeted toward our parameter of interest;
its bias–variance trade-off is for the overall density. Stage two incorporates the conditional
expectation of the genotypic value A of the quantitative trait loci currently being considered to
shrink the bias of the conditional expectation of Y, our initial estimate. We now also introduce
a subset of M^-, denoted W, for each A. The vector W contains the subset of markers
that are potential confounders for the effect of genotypic value A on phenotypic trait Y.
To define our TMLE concretely for this problem, we must begin by calculating the
pathwise derivative of our parameter Ψ(P) = β at P and its corresponding canonical
gradient (efficient influence curve) D(P, O):

D(P, O) = (1 / σ²(A, W)) h(A, W) (Y − Q(A, W))   (22.2)

where

h(A, W) = (d/dβ) m(A | β) − E[(d/dβ) m(A | β) / σ²(A, W) | W] / E[1 / σ²(A, W) | W]   (22.3)

and σ²(A, W) is the conditional variance of Y given A and W.
Estimation
Super Learner: For each target parameter, obtain an initial estimate of the relevant part Q_0 of P_0 using super learning.
TMLE: The second stage of TMLE updates the initial fit in a step targeted to make an optimal bias–variance trade-off for the parameter under consideration, now denoted Ψ(Q_0). The procedure is repeated for all target parameters.
The TMLE requires choosing a loss function L(O, Q) for a candidate function Q applied
to an observation O and then specifying a submodel {Q(ε) : ε} ⊂ M to fluctuate the initial
estimator. Here, we use the squared-error loss function:

L(O, Q) = (Y − Q(A, W))² / σ²(A, W)

The submodel {Q(ε) : ε} ⊂ M through Q at ε = 0 is selected such that the linear span of
(d/dε) L(Q(ε)) at ε = 0 includes the efficient influence curve in Equation 22.2. The specific
steps of the TMLE algorithm for the target parameter β_0 are enumerated below.
22.3.2.1 TMLE Algorithm for Quantitative Trait Loci Mapping
Estimating E_0(Y | A, M^-) = Q_0(A, M^-). Generate a super learner-based initial estimator
that respects the semiparametric model in Equation 22.1 and also takes the form

Q^0_n = β^0_n A + f_n(M^-)

We introduce the subscript n to denote estimators and estimates.
Estimating E_0(A | W) = g_0(W). Recall that we introduced a subset W of M^- for each
A. Thus, M^- is replaced with W, and we can refer to the function g_0(W) = E_0(A | W)
as a marker confounding mechanism. For the applications considered here, as in Wang et
al. [66], the set of markers W are those that lie on the same chromosome as A.

However, the choice of E_0(A | W) is, in general, still a complicated one. The selection
of flanking markers to include in the marker confounding mechanism can be further
simplified to including only two flanking markers, which may capture a good portion of
the confounding. But there remains the issue of how far from A these two flanking
markers should be selected. Markers that are too close to A may be too predictive of A,
thus failing to isolate the contribution of A when estimating β_0. On the other hand, if the
selected markers are too far from A, they may not contribute to reducing
bias for the target parameter of interest. Collaborative TMLE, discussed briefly in our
literature review, may also be employed to data-adaptively select the most appropriate
adjustment set. We leave further discussion of this issue to other literature [49,63].
Determine parametric working model to fluctuate the initial estimator. The targeted
step uses an estimate g_n(W) of g_0(W) to correct the bias remaining in the initial estimator.
This involves defining a so-called clever covariate in a parametric working model coding
fluctuations of our initial estimator Q^0_n. For our parameter β_0, the clever covariate is
given by

h(A, W) = A − g_n(W)

that is, the residual with respect to g_n(W), under a condition we describe below.
The clever covariate h(A, W) was defined earlier in Equation 22.3 and derived from
the efficient influence curve in Equation 22.2. When σ²(A, W) is a function of W only,
it drops out of the efficient influence curve. We choose to estimate σ²(A, W) with the
constant 1, which gives us the simplified clever covariate h(A, W) = A − g_n(W) as above.
The estimation of the nuisance parameter σ²(A, W) does not impact the consistency
properties of the TMLE, but the TMLE will only be efficient if, in addition to Q_0
and g_0 being estimated consistently, σ²(A, W) is in fact a function of W only [49].
Update Q^0_n. The regression of Y on h(A, W) can be reformulated as

Y* ~ h(A, W)

where

Y* = Y − Q^0_n(A, M^-)

The estimate of the regression coefficient is denoted ε_n. Our initial estimate β^0_n is updated
with ε_n:

β^1_n = β^0_n + ε_n

Convergence of the algorithm for this target parameter occurs in one step. Since our TMLE
is double robust, we have the following properties for this estimator of β_0: this TMLE is
(1) consistent when either Q^0_n or g_n(W) is consistent and (2) efficient when both Q^0_n and
g_n(W) are consistent (and σ²(A, W) is a function of W only).
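The one-step update can be sketched end to end on simulated data. Everything numeric below is hypothetical, and a plain least-squares fit stands in for the super learner; the point is to show the double robustness claim in action: the initial estimator deliberately omits the confounding marker W, so β^0_n is biased, and the targeted step using g_n(W) removes that bias.

```python
import random

# Sketch of the one-step TMLE update on simulated backcross-style data.
# All numeric values are hypothetical; OLS stands in for the super learner.
random.seed(3)
beta_true = 1.0
n = 50_000

W, A, Y = [], [], []
for _ in range(n):
    w = random.choice([0, 1])                        # confounding marker
    a = 1 if random.random() < 0.2 + 0.6 * w else 0  # P(A=1 | W) depends on W
    y = beta_true * a + 1.5 * w + random.gauss(0, 1)
    W.append(w); A.append(a); Y.append(y)

# Stage 1: initial fit of Y on A alone (omitting W => confounded, biased up).
a_bar = sum(A) / n
y_bar = sum(Y) / n
beta0_n = (sum((a - a_bar) * y for a, y in zip(A, Y))
           / sum((a - a_bar) ** 2 for a in A))
alpha0_n = y_bar - beta0_n * a_bar
Q0 = [alpha0_n + beta0_n * a for a in A]             # initial fit Q0_n(A)

# Marker confounding mechanism g_n(W) = E_n(A | W), fit empirically.
g = {}
for w in (0, 1):
    grp = [a for a, wi in zip(A, W) if wi == w]
    g[w] = sum(grp) / len(grp)
h = [a - g[w] for a, w in zip(A, W)]                 # clever covariate A - g_n(W)

# Stage 2: regress the residual Y - Q0_n on h (no intercept) to get eps_n.
eps_n = (sum(hi * (y - q) for hi, y, q in zip(h, Y, Q0))
         / sum(hi * hi for hi in h))
beta1_n = beta0_n + eps_n

print(round(beta0_n, 2), round(beta1_n, 2))          # biased initial vs. targeted
```

Because the clever covariate has mean zero within each W stratum, a single update suffices here, consistent with the one-step convergence noted above: beta1_n is close to the true value even though the initial fit ignored W.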
Implementation Summary
The TMLE of the target parameter β_0, defined in Equation 22.1, requires an initial
fit of E_0(Y | M). Our best fit of E_0(Y | M) will be based on minimizing the chosen
loss function. This initial estimator yields a fit of E_0(Y | A = 0, M^-), which we can map
to a first-stage estimator of β_0 in our semiparametric model.
We now complete the second-stage targeted updating step. This single update
(convergence is achieved in one step) is completed by fitting a coefficient in front
of an estimate of A − E_0(A | W) with univariate regression, using the initial estimator
of E_0(Y | A, M^-) as an offset. We can show that the TMLE of β_0 is β^0_n + ε_n.
22.3.3 Inference
The variance σ²_n for each variable importance measure β^1_n can be calculated using influence-curve-based
methods [66], with the variance σ²_n given by

σ²_n = Σ_i (Y_i − Q^1_n(A_i, M^-_i))² h(A_i, W_i)² / (Σ_i A_i h(A_i, W_i))²
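The variance formula can be transcribed directly into code. The sketch below defines a small helper (the name `ic_variance` is hypothetical) and evaluates it on tiny made-up inputs whose result can be checked by hand:

```python
# Direct transcription of the influence-curve-based variance formula above:
#   sigma^2_n = sum_i r_i^2 h_i^2 / (sum_i A_i h_i)^2, with r_i = Y_i - Q1_i.
def ic_variance(Y, Q1, A, h):
    num = sum((y - q) ** 2 * hi ** 2 for y, q, hi in zip(Y, Q1, h))
    den = sum(a * hi for a, hi in zip(A, h)) ** 2
    return num / den

# Hand-checkable toy inputs (hypothetical, not real QTL data):
Y  = [1.0, 2.0, 0.0, 1.5]   # outcomes
Q1 = [0.8, 1.9, 0.2, 1.4]   # updated fit Q1_n at each observation
A  = [1, 1, 0, 0]           # genotypic values
h  = [0.5, 0.5, -0.5, -0.5] # clever covariate A - g_n(W)
var_n = ic_variance(Y, Q1, A, h)
# An approximate 95% confidence interval for beta_0 is then
# beta1_n +/- 1.96 * sqrt(var_n).
```

Note that the formula already returns the variance of the estimator itself (the squared sum in the denominator supplies the 1/n scaling discussed below), so no further division by n is needed.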
A detailed discussion of multiple hypothesis testing and inference for variable importance
measures is presented in [7]. The authors in the corresponding quantitative trait loci work
[66] adjusted for multiple testing using the false discovery rate and interpreted each variable
importance measure as a W -adjusted effect estimate.
In general, variance estimates for TMLE rely on δ-method conditions [61,62], and, as
such, the asymptotic normal limit distribution of the estimator is characterized by its
influence curve. The estimator β^1_n of our target parameter is asymptotically linear; therefore,
it behaves as an empirical mean, with bias converging to 0 in sample size faster than a rate
of 1/√n, and it is approximately normally distributed (for sample size n reasonably large).
The variance of the estimator is thus well approximated by the variance of the influence
curve divided by n. One can also use the covariance in variable importance questions with
a multivariate vector of parameters, where the covariance matrix of the estimator vector
is well approximated by the covariance matrix of the corresponding multivariate influence
curve divided by n [61].