23
Online Estimation of the Average Treatment Effect
Sam Lendle
CONTENTS
23.1 Introduction .................................................................... 429
23.2 Preliminaries ................................................................... 430
23.2.1 Causal Parameter and Assumptions .................................. 430
23.2.2 Statistical Model ....................................................... 430
23.3 Batch Estimation of the ATE .................................................. 431
23.4 Online One-Step Estimation of the ATE ...................................... 432
23.5 Example and Simulation ....................................................... 433
23.5.1 Data-Generating Distribution ......................................... 433
23.5.2 SGD for Logistic Regression ........................................... 434
23.5.3 Results ................................................................. 434
23.6 Discussion ...................................................................... 436
References ............................................................................. 437
23.1 Introduction
Drawing causal inferences from observational data requires making strong assumptions
about the causal process from which the data are generated, followed by a statistical analysis
of the observational dataset. Though we must make causal assumptions, we often know
little about the data-generating distribution. This means we generally cannot make strong
statistical assumptions, so we estimate a statistical parameter in a nonparametric or semiparametric statistical model. Semiparametric efficient estimators, that is, estimators that
achieve the minimum asymptotic variance bound, such as augmented inverse probability of
treatment weighted (A-IPTW) estimators [11] and targeted minimum loss-based estimators
(TMLE) [15,18], have been developed for a variety of statistical parameters with applications
in causal inference.
Typically, the computational efficiency and scalability of these estimators are not taken
into account. Borrowing language from the large-scale machine learning literature, we call
these batch estimators, because they process the entire dataset at one time. They rely
on estimation of one or more parts of the data-generating distribution. With traditional
statistical methods, estimating each of these parts may require many passes through the
data, which can quickly become impractical as the size of datasets grows.
In this chapter, we demonstrate an online method for estimating the average treatment
effect (ATE) that is doubly robust and statistically efficient with only a single pass
through the dataset. In Section 23.2, we introduce the observed data structure, the
causal parameter, causal assumptions required for identification of the parameter, and the
429
430 Handbook of Big Data
statistical parameter. In Section 23.3, we review a batch method for estimating the ATE.
In Section 23.4, we describe an online approach to estimating the ATE. In Section 23.5, we
demonstrate the performance of the online estimator in simulations. We conclude with a
discussion of extensions and future work in Section 23.6.
23.2 Preliminaries
23.2.1 Causal Parameter and Assumptions
We define the ATE using the counterfactual framework [14]. For a single observation, let Y^a be the counterfactual value of some outcome had exposure A been set to level a for a ∈ {0, 1}. These values are called counterfactual because we can only observe a sample's outcome under the treatment that it actually received. Because at least one of Y^1 or Y^0 is not observed, we can never calculate for a given observation the value Y^1 − Y^0, which can be interpreted as the treatment effect for that observation. Under some conditions, however, we can estimate the average of this quantity, E_0(Y^1 − Y^0). This is known as the ATE, where E_0 denotes expectation with respect to the true distribution of the counterfactual random variables.
Before attempting to estimate the ATE, we first consider the structure of our observed dataset. Let O = (W, A, Y) be an observed sample, where W is a vector of covariates measured before A, a binary exposure or treatment, and Y is the observed outcome. We make the counterfactual consistency assumption that the observed outcome Y is equal to the counterfactual under the observed treatment A. That is, we assume Y = Y^A.
The ATE is a function of the distribution of the counterfactuals Y^1 and Y^0. To estimate the ATE, we must be able to write it as a function of the distribution of the observed data. When we can do this, the ATE is said to be identifiable. To do this, we need to make some assumptions. The first is the randomization assumption, where we assume A ⊥ (Y^1, Y^0) | W. That is, we assume that if there are any common causes of the exposure A and the outcome Y, they are measured and included in W. This is sometimes called the no unmeasured confounders assumption. It is an untestable assumption, because it is not possible to check whether it holds using the observed data. Making the randomization assumption requires expert domain knowledge and careful study design. We also make the experimental treatment assignment assumption or positivity assumption: that for any value of W, there is some possibility of either treatment being assigned. Formally, we assume 0 < P_0(A = 1 | W) < 1 almost everywhere. Under these assumptions, we can write the ATE as [12]

E_0(Y^1 − Y^0) = E_0[E_0(Y | A = 1, W) − E_0(Y | A = 0, W)]
23.2.2 Statistical Model
Now that we have posed some causal assumptions that allow us to write the ATE as a
parameter of the distribution of the observed data, we need to specify a statistical model
and target parameter. A statistical model M is a set of possible probability distributions
of the observed data.
Suppose that we observe a dataset of n independent and identically distributed observations, O_1, ..., O_n, with distribution P_0. For a distribution P ∈ M, let p = dP/dμ
be the density of O with respect to some dominating measure μ. We can factorize the
density as
p(o) = Q_W(w) G(a | w) Q_Y(y | a, w)

where:
Q_W is the marginal density of W
G is the conditional probability that A = a, given W
Q_Y is the conditional density of Y, given A and W

Let Q = (Q_Y, Q_W) and Q̄(a, w) = E_{Q_Y}(Y | A = a, W = w). We can parameterize the model as M = {P : Q ∈ Q, G ∈ G}.
The randomization assumption puts no restriction on the distribution of the observed data, and the positivity assumption only requires that G(1 | W) be bounded away from 0 and 1. To ensure the true distribution P_0 is in M, we make no additional assumptions on Q, so Q is nonparametric. In some cases, we may know something about the treatment mechanism G. For instance, we may know that the probability of treatment only depends on a subset of covariates. In that case, we put some restriction on the set G in addition to assuming 0 < G(1 | W) < 1. In general, our model M is semiparametric.
We define the parameter mapping Ψ : M → R as

Ψ(P) = E_P[E_P(Y | A = 1, W) − E_P(Y | A = 0, W)]

where E_P denotes expectation with respect to the distribution P. Let ψ(P) denote the parameter mapping applied to distribution P. The target parameter we wish to estimate is ψ_0 = Ψ(P_0), the parameter mapping applied to the true distribution. We note that Ψ(P) only depends on P through Q, so, recognizing the abuse of notation, we sometimes write Ψ(Q). Throughout, we will use subscript n to denote that a quantity is an estimate based on n observations, and subscript 0 to denote the truth. For example, Q̄_n is an estimate of Q̄_0, defined as E_0(Y | A = a, W = w), where E_0 is expectation with respect to the true distribution P_0.
23.3 Batch Estimation of the ATE
An asymptotically linear estimator is one that can be written as the empirical mean of a mean-zero function, called an influence curve, plus a small remainder. An efficient estimator is
an estimator that achieves the minimum asymptotic variance among the class of regular
estimators. In particular, an efficient estimator is asymptotically linear with the influence
curve equal to the efficient influence curve, which depends on the particular parameter and
model. The asymptotic variance of an efficient estimator is the variance of the efficient
influence curve [2]. For our model M and parameter mapping Ψ, the efficient influence curve is given by

D*(P)(O) = [(2A − 1)/G(A | W)] (Y − Q̄(A, W)) + Q̄(1, W) − Q̄(0, W) − Ψ(Q).

To denote the dependence of D* on P through Q and G, we sometimes also write D*(Q, G).
There are many ways to construct an efficient estimator, for example TMLE or A-IPTW.
We now review the A-IPTW estimator in the batch setting. An A-IPTW estimate is calculated as

ψ_n = (1/n) Σ_{i=1}^n { [(2A_i − 1)/G_n(A_i | W_i)] (Y_i − Q̄_n(A_i, W_i)) + Q̄_n(1, W_i) − Q̄_n(0, W_i) }
where Q̄_n and G_n are estimates of Q̄_0 and G_0, respectively. The A-IPTW estimator treats D*(P) as an estimating function in ψ with nuisance parameters Q and G, and solves for ψ. The A-IPTW is also a one-step estimator, which starts with a plug-in estimator for ψ_0 and takes a step in the direction of the efficient influence curve. That is,

ψ_n = Ψ(Q_n) + (1/n) Σ_{i=1}^n D*(Q_n, G_n)(O_i)

where

Ψ(Q_n) = (1/n) Σ_{i=1}^n Q̄_n(1, W_i) − Q̄_n(0, W_i).

Under regularity conditions, the A-IPTW estimator is efficient if both Q̄_n and G_n are consistent for Q̄_0 and G_0, respectively. Additionally, the A-IPTW estimator is doubly robust, meaning that if either of Q̄_n or G_n is consistent, then ψ_n is consistent.
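Once the nuisance fits are in hand, the batch A-IPTW estimate is a plain average over the dataset. A minimal NumPy sketch, assuming the fitted values of Q̄_n and G_n have already been evaluated at every observation (the function name and argument layout are illustrative, not from the chapter):

```python
import numpy as np

def aiptw_ate(Y, A, g_n, Qbar1, Qbar0):
    """Batch A-IPTW estimate of the ATE.

    Y, A          : outcome and binary treatment, arrays of length n
    g_n           : estimated P(A = 1 | W_i) for each observation
    Qbar1, Qbar0  : estimated E(Y | A = 1, W_i) and E(Y | A = 0, W_i)
    All inputs are assumed precomputed from fitted nuisance models.
    """
    QA = np.where(A == 1, Qbar1, Qbar0)     # Qbar_n(A_i, W_i)
    gA = np.where(A == 1, g_n, 1.0 - g_n)   # G_n(A_i | W_i)
    # (2A - 1)/G_n(A|W) * (Y - Qbar_n(A,W)) + Qbar_n(1,W) - Qbar_n(0,W)
    return np.mean((2 * A - 1) / gA * (Y - QA) + Qbar1 - Qbar0)
```

Note that when the residuals Y − Q̄_n(A, W) are all zero, the estimate reduces to the plug-in mean of Q̄_n(1, W) − Q̄_n(0, W), which matches the one-step interpretation above.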
23.4 Online One-Step Estimation of the ATE
The batch A-IPTW in Section 23.3 has some nice statistical properties. In particular, it is
efficient and doubly robust. Our goal now is to construct an estimator of ψ
0
that has these
same properties, but we only want to make a single pass through the dataset. Additionally,
we only want to process a relatively small number of observations, a minibatch, at one time.
Let 0 = n_0 < n_1 < ··· < n_K = n. Here n = n_K represents the total sample size, and n_j is the sample size accumulated up to minibatch j. Suppose n_j − n_{j−1} is bounded for all j, and for simplicity let n_j − n_{j−1} = m be constant. Let O_{n_i:n_j} denote the observations O_{n_i+1}, O_{n_i+2}, ..., O_{n_j} for i < j.
In Section 23.3, we computed an estimate of ψ_0 with estimates Q̄_n and G_n, which were fit on the full dataset. Now suppose we have estimates Q̄_{n_{j−1}} and G_{n_{j−1}} for Q̄_0 and G_0, respectively, that are based on observations O_{n_0:n_{j−1}}. We will return to the problem of computing Q̄_{n_{j−1}} and G_{n_{j−1}} later. Using those estimates of Q̄_0 and G_0, we compute a new estimate of ψ_0 on the next minibatch as

ψ_{n_{j−1}:n_j} = (1/m) Σ_{i=n_{j−1}+1}^{n_j} { [(2A_i − 1)/G_{n_{j−1}}(A_i | W_i)] (Y_i − Q̄_{n_{j−1}}(A_i, W_i)) + Q̄_{n_{j−1}}(1, W_i) − Q̄_{n_{j−1}}(0, W_i) }.
That is, ψ_{n_{j−1}:n_j} is a one-step estimator computed on minibatch j using initial estimates of Q̄_0 and G_0 from the previous minibatches. We compute the final estimate of ψ_0 by taking the mean of the estimates ψ_{n_{j−1}:n_j} from each minibatch. Let

ψ_{n_K} = (1/K) Σ_{j=1}^K ψ_{n_{j−1}:n_j},

and call this procedure the online one-step (OLOS) estimator.
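The OLOS procedure is a single loop over minibatches: score each minibatch with the current nuisance fits, then fold the minibatch into those fits. A hedged Python sketch, where `predict_Qbar`, `predict_g`, and `update_models` are hypothetical hooks standing in for whatever online regression procedure (e.g., SGD) is used; none of these names come from the chapter:

```python
import numpy as np

def olos_ate(minibatches, predict_Qbar, predict_g, update_models):
    """Online one-step (OLOS) ATE estimate: the mean of per-minibatch
    A-IPTW estimates, each computed with nuisance fits based only on
    earlier minibatches.

    minibatches   : iterable of (W, A, Y) tuples, one per minibatch
    predict_Qbar  : (W, a) -> current estimate of E(Y | A = a, W)
    predict_g     : W -> current estimate of P(A = 1 | W)
    update_models : (W, A, Y) -> None; advances the online fits
    """
    psi_sum, K = 0.0, 0
    for W, A, Y in minibatches:
        Q1, Q0 = predict_Qbar(W, 1), predict_Qbar(W, 0)
        QA = np.where(A == 1, Q1, Q0)
        g1 = predict_g(W)
        gA = np.where(A == 1, g1, 1.0 - g1)
        # one-step estimate on minibatch j, using fits from minibatches < j
        psi_sum += np.mean((2 * A - 1) / gA * (Y - QA) + Q1 - Q0)
        K += 1
        update_models(W, A, Y)  # only now fold this minibatch into the fits
    return psi_sum / K
```

Scoring before updating matters: it keeps each ψ_{n_{j−1}:n_j} a function of fits from the previous minibatches only, as the definition above requires.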
Under regularity conditions, if Q̄_{n_K} and G_{n_K} converge faster than rate n_K^{−1/4} and are both consistent for Q̄_0 and G_0, then the OLOS estimator is asymptotically efficient as K → ∞. Additionally, the OLOS estimator has the double robustness property, so if either of Q̄_{n_K} or G_{n_K} is consistent, then ψ_{n_K} is consistent [16, Theorem 1].
We now turn to estimating Q̄_0 and G_0, both of which are conditional means. To be truly scalable, we ideally want an estimation procedure with constant computational time and storage per minibatch up to K, but we need the estimates to converge fast enough as K → ∞. This rules out estimates of Q̄_0 and G_0 that are fit on data from one or a fixed number of minibatches. Stochastic gradient descent-based methods, however, can achieve an appropriate rate of convergence in some circumstances [8,20].
Stochastic gradient descent (SGD) is an optimization procedure similar to traditional
batch gradient descent, where the gradient of the objective function for the whole dataset is
replaced by the gradient of the objective function at a single observation (or a minibatch).
The convergence rate to the empirical optimum of the objective function in terms of number
of iterations is very poor relative to batch gradient descent. However, a single iteration of
SGD or minibatch gradient descent takes constant time regardless of the size of the dataset,
while a single iteration of batch gradient descent takes O(n) time [4].
SGD can be used to fit the parameters of generalized linear models (GLMs). Despite the slow convergence rate, with an appropriately chosen step size, the parameters of a GLM fit with SGD can achieve √n_K-consistency in a single pass [8]. If curvature information is taken into account, parameters fit by so-called second-order SGD can achieve the same variance as directly optimizing the empirical objective function [8], but this is often computationally infeasible and rarely done in practice [4]. We note that the class of models for which SGD will obtain √n_K-consistency is larger than just the generalized linear models that we use as an example here [6].
Averaged stochastic gradient descent (ASGD) is a variant of SGD where parameter
estimates are computed with SGD and then averaged. With an appropriate step size,
parameters fit by ASGD have been shown to achieve the same variance in a single pass
as those fit by directly optimizing the objective function [10,20]. ASGD is much simpler
to implement than second-order SGD but has not been popular in practice. This may be
because it takes a very large sample size to reach the asymptotic regime [20].
Some variants of SGD allow for a step size for each parameter, such as SGD-QN [3],
Adagrad [7], and Adadelta [21], which tend to work well in practice. For information
about other variants and implementation details, see [5]. We provide a simple concrete
implementation of SGD in Section 23.5.2.
Despite the drawbacks, if Q̄_0 and G_0 can be well approximated by GLMs, SGD-based optimization routines are a good way to compute the estimates in one pass.
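To make the SGD and ASGD updates concrete, here is a minimal single-pass minibatch SGD for logistic regression with iterate averaging. The step-size schedule η_t = c/t^α and the constants c and α are illustrative tuning choices, not values prescribed by the chapter:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(minibatches, p, c=0.5, alpha=0.6):
    """Single-pass minibatch SGD for logistic regression, with iterate
    averaging (ASGD).  `minibatches` yields (X, y) pairs; p is the
    number of features.  Returns the final SGD iterate and the ASGD
    average, which is the estimate one would actually use."""
    beta = np.zeros(p)       # current SGD iterate
    beta_bar = np.zeros(p)   # running average of iterates (ASGD)
    for t, (X, y) in enumerate(minibatches, start=1):
        grad = X.T @ (sigmoid(X @ beta) - y) / len(y)  # mean log-loss gradient
        beta = beta - (c / t ** alpha) * grad          # SGD step
        beta_bar += (beta - beta_bar) / t              # online average
    return beta, beta_bar
```

Each update costs O(mp) regardless of how many minibatches have already been seen, which is what makes the procedure compatible with the one-pass OLOS loop.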
23.5 Example and Simulation
23.5.1 Data-Generating Distribution
We evaluate the statistical performance of the OLOS estimator and discuss practical
implementation details in the context of a simulation study. For each observation, W is
generated by making p = 2000 independent draws from a uniform distribution on [−1, 1]. Given W, A is drawn from a Bernoulli distribution with success probability 1/(1 + exp(0.75Z))