23
Online Estimation of the Average Treatment Effect
Sam Lendle
CONTENTS
23.1 Introduction .................................................................... 429
23.2 Preliminaries ................................................................... 430
23.2.1 Causal Parameter and Assumptions .................................. 430
23.2.2 Statistical Model ....................................................... 430
23.3 Batch Estimation of the ATE .................................................. 431
23.4 Online One-Step Estimation of the ATE ...................................... 432
23.5 Example and Simulation ....................................................... 433
23.5.1 Data-Generating Distribution ......................................... 433
23.5.2 SGD for Logistic Regression ........................................... 434
23.5.3 Results ................................................................. 434
23.6 Discussion ...................................................................... 436
References ............................................................................. 437
23.1 Introduction
Drawing causal inferences from observational data requires making strong assumptions
about the causal process from which the data are generated, followed by a statistical analysis
of the observational dataset. Though we must make causal assumptions, we often know
little about the data-generating distribution. This means we generally cannot make strong
statistical assumptions, so we estimate a statistical parameter in a nonparametric or semiparametric statistical model. Semiparametric efficient estimators, that is, estimators that
achieve the minimum asymptotic variance bound, such as augmented inverse probability of
treatment weighted (A-IPTW) estimators [11] and targeted minimum loss-based estimators
(TMLE) [15,18], have been developed for a variety of statistical parameters with applications
in causal inference.
Typically, the computational efficiency and scalability of these estimators are not taken
into account. Borrowing language from the large-scale machine learning literature, we call
these batch estimators, because they process the entire dataset at one time. They rely
on estimation of one or more parts of the data-generating distribution. With traditional
statistical methods, estimating each of these parts may require many passes through the
data, which can quickly become impractical as the size of datasets grows.
In this chapter, we demonstrate an online method for estimating the average treatment
effect (ATE) that is doubly robust and statistically efficient with only a single pass
through the dataset. In Section 23.2, we introduce the observed data structure, the
causal parameter, causal assumptions required for identification of the parameter, and the
429
430 Handbook of Big Data
statistical parameter. In Section 23.3, we review a batch method for estimating the ATE.
In Section 23.4, we describe an online approach to estimating the ATE. In Section 23.5, we
demonstrate the performance of the online estimator in simulations. We conclude with a
discussion of extensions and future work in Section 23.6.
23.2 Preliminaries
23.2.1 Causal Parameter and Assumptions
We define the ATE using the counterfactual framework [14]. For a single observation, let Y^a be the counterfactual value of some outcome had exposure A been set to level a for a ∈ {0, 1}. These values are called counterfactual because we can only observe a sample's outcome under the treatment that it actually received. Because at least one of Y^1 or Y^0 is not observed, we can never calculate for a given observation the value Y^1 − Y^0, which can be interpreted as the treatment effect for that observation. Under some conditions, however, we can estimate the average of this quantity, E_0(Y^1 − Y^0). This is known as the ATE, where E_0 denotes expectation with respect to the true distribution of the counterfactual random variables.
Before attempting to estimate the ATE, we first consider the structure of our observed dataset. Let O = (W, A, Y) be an observed sample, where W is a vector of covariates measured before A, a binary exposure or treatment, and Y is the observed outcome. We make the counterfactual consistency assumption that the observed outcome Y is equal to the counterfactual under the observed treatment A. That is, we assume Y = Y^A.
The ATE is a function of the distribution of the counterfactuals Y^1 and Y^0. To estimate the ATE, we must be able to write it as a function of the distribution of the observed data. When we can do this, the ATE is said to be identifiable. To do this, we need to make some assumptions. The first is the randomization assumption, where we assume A ⊥ (Y^1, Y^0) | W. That is, we assume that if there are any common causes of the exposure A and the outcome Y, they are measured and included in W. This is sometimes called the no unmeasured confounders assumption. It is an untestable assumption, because it is not possible to check whether it holds using the observed data. Making the randomization assumption requires expert domain knowledge and careful study design. We also make the experimental treatment assignment assumption or positivity assumption: that for any value of W, there is some possibility of either treatment being assigned. Formally, we assume 0 < P_0(A = 1 | W) < 1 almost everywhere. Under these assumptions, we can write the ATE as [12]

E_0(Y^1 − Y^0) = E_0[E_0(Y | A = 1, W) − E_0(Y | A = 0, W)]
23.2.2 Statistical Model
Now that we have posed some causal assumptions that allow us to write the ATE as a
parameter of the distribution of the observed data, we need to specify a statistical model
and target parameter. A statistical model M is a set of possible probability distributions
of the observed data.
Suppose that we observe a dataset of n independent and identically distributed observations, O_1, ..., O_n, with distribution P_0. For a distribution P ∈ M, let p = dP/dμ
be the density of O with respect to some dominating measure μ. We can factorize the
density as
p(o) = Q_W(w) G(a | w) Q_Y(y | a, w)

where:
Q_W is the marginal density of W
G is the conditional probability that A = a, given W
Q_Y is the conditional density of Y, given A and W

Let Q = (Q_Y, Q_W) and Q̄(a, w) = E_{Q_Y}(Y | A = a, W = w). We can parameterize the model as M = {P : Q ∈ Q, G ∈ G}.
The randomization assumption puts no restriction on the distribution of the observed data, and the positivity assumption only requires that G(1 | W) be bounded away from 0 and 1. To ensure the true distribution P_0 is in M, we make no additional assumptions on Q, so Q is nonparametric. In some cases, we may know something about the treatment mechanism G. For instance, we may know that the probability of treatment only depends on a subset of covariates. In that case, we put some restriction on the set G in addition to assuming 0 < G(1 | W) < 1. In general, our model M is semiparametric.
We define the parameter mapping Ψ : M → R as

Ψ(P) = E_P[E_P(Y | A = 1, W) − E_P(Y | A = 0, W)]

where E_P denotes expectation with respect to the distribution P. Let ψ(P) denote the parameter mapping applied to distribution P. The target parameter we wish to estimate is ψ_0 = Ψ(P_0), the parameter mapping applied to the true distribution. We note that Ψ(P) only depends on P through Q, so, recognizing the abuse of notation, we sometimes write Ψ(Q). Throughout, we will use subscript n to denote that a quantity is an estimate based on n observations, and subscript 0 to denote the truth. For example, Q̄_n is an estimate of Q̄_0, defined as E_0(Y | A = a, W = w), where E_0 is expectation with respect to the true distribution P_0.
23.3 Batch Estimation of the ATE
An asymptotically linear estimator is one that can be written as the empirical mean of a mean-zero function, called an influence curve, plus a small remainder. An efficient estimator is
an estimator that achieves the minimum asymptotic variance among the class of regular
estimators. In particular, an efficient estimator is asymptotically linear with the influence
curve equal to the efficient influence curve, which depends on the particular parameter and
model. The asymptotic variance of an efficient estimator is the variance of the efficient
influence curve [2]. For our model M and parameter mapping Ψ, the efficient influence curve is given by

D*(P)(O) = [(2A − 1)/G(A | W)] (Y − Q̄(A, W)) + Q̄(1, W) − Q̄(0, W) − Ψ(Q).

To denote the dependence of D* on P through Q and G, we sometimes also write D*(Q, G).
There are many ways to construct an efficient estimator, for example TMLE or A-IPTW.
We now review the A-IPTW estimator in the batch setting. An A-IPTW estimate is calculated as

ψ_n = (1/n) Σ_{i=1}^n { [(2A_i − 1)/G_n(A_i | W_i)] (Y_i − Q̄_n(A_i, W_i)) + Q̄_n(1, W_i) − Q̄_n(0, W_i) }
where Q̄_n and G_n are estimates of Q̄_0 and G_0, respectively. The A-IPTW estimator treats D*(P) as an estimating function in ψ with nuisance parameters Q and G, and solves for ψ. The A-IPTW is also a one-step estimator, which starts with a plug-in estimator for ψ_0 and takes a step in the direction of the efficient influence curve. That is,

ψ_n = Ψ(Q_n) + (1/n) Σ_{i=1}^n D*(Q_n, G_n)(O_i)

where

Ψ(Q_n) = (1/n) Σ_{i=1}^n Q̄_n(1, W_i) − Q̄_n(0, W_i).

Under regularity conditions, the A-IPTW estimator is efficient if both Q̄_n and G_n are consistent for Q̄_0 and G_0, respectively. Additionally, the A-IPTW estimator is doubly robust, meaning that if either of Q̄_n or G_n is consistent, then ψ_n is consistent.
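Once the nuisance fits are in hand, the batch A-IPTW estimate is a plain average over the dataset. A minimal NumPy sketch, assuming the fitted values of Q̄_n and G_n have already been evaluated at every observation (the function name and argument layout are illustrative, not from the chapter):

```python
import numpy as np

def aiptw_ate(Y, A, g_n, Qbar1, Qbar0):
    """Batch A-IPTW estimate of the ATE.

    Y, A          : outcome and binary treatment, arrays of length n
    g_n           : estimated P(A = 1 | W_i) for each observation
    Qbar1, Qbar0  : estimated E(Y | A = 1, W_i) and E(Y | A = 0, W_i)
    All inputs are assumed precomputed from fitted nuisance models.
    """
    QA = np.where(A == 1, Qbar1, Qbar0)     # Qbar_n(A_i, W_i)
    gA = np.where(A == 1, g_n, 1.0 - g_n)   # G_n(A_i | W_i)
    # (2A - 1)/G_n(A|W) * (Y - Qbar_n(A,W)) + Qbar_n(1,W) - Qbar_n(0,W)
    return np.mean((2 * A - 1) / gA * (Y - QA) + Qbar1 - Qbar0)
```

Note that when the residuals Y − Q̄_n(A, W) are all zero, the estimate reduces to the plug-in mean of Q̄_n(1, W) − Q̄_n(0, W), which matches the one-step interpretation above.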
23.4 Online One-Step Estimation of the ATE
The batch A-IPTW in Section 23.3 has some nice statistical properties. In particular, it is
efficient and doubly robust. Our goal now is to construct an estimator of ψ
0
that has these
same properties, but we only want to make a single pass through the dataset. Additionally,
we only want to process a relatively small number of observations, a minibatch, at one time.
Let 0 = n_0 < n_1 < ··· < n_K = n. Here n = n_K represents the total sample size, and n_j is the sample size accumulated up to minibatch j. Suppose n_j − n_{j−1} is bounded for all j, and for simplicity let n_j − n_{j−1} = m be constant. Let O_{n_i:n_j} denote the observations O_{n_i+1}, O_{n_i+2}, ..., O_{n_j} for i < j.
In Section 23.3, we computed an estimate of ψ_0 with estimates Q̄_n and G_n, which were fit on the full dataset. Now suppose we have estimates Q̄_{n_{j−1}} and G_{n_{j−1}} for Q̄_0 and G_0, respectively, that are based on observations O_{n_0:n_{j−1}}. We will return to the problem of computing Q̄_{n_{j−1}} and G_{n_{j−1}} later. Using those estimates of Q̄_0 and G_0, we compute a new estimate of ψ_0 on the next minibatch as

ψ_{n_{j−1}:n_j} = (1/m) Σ_{i=n_{j−1}+1}^{n_j} { [(2A_i − 1)/G_{n_{j−1}}(A_i | W_i)] (Y_i − Q̄_{n_{j−1}}(A_i, W_i)) + Q̄_{n_{j−1}}(1, W_i) − Q̄_{n_{j−1}}(0, W_i) }.
That is, ψ_{n_{j−1}:n_j} is a one-step estimator computed on minibatch j using initial estimates of Q̄_0 and G_0 from the previous minibatches. We compute the final estimate of ψ_0 by taking the mean of the estimates ψ_{n_{j−1}:n_j} from each minibatch. Let

ψ_{n_K} = (1/K) Σ_{j=1}^K ψ_{n_{j−1}:n_j},

and call this procedure the online one-step (OLOS) estimator.
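The OLOS procedure is a single loop over minibatches: score each minibatch with the current nuisance fits, then fold the minibatch into those fits. A hedged Python sketch, where `predict_Qbar`, `predict_g`, and `update_models` are hypothetical hooks standing in for whatever online regression procedure (e.g., SGD) is used; none of these names come from the chapter:

```python
import numpy as np

def olos_ate(minibatches, predict_Qbar, predict_g, update_models):
    """Online one-step (OLOS) ATE estimate: the mean of per-minibatch
    A-IPTW estimates, each computed with nuisance fits based only on
    earlier minibatches.

    minibatches   : iterable of (W, A, Y) tuples, one per minibatch
    predict_Qbar  : (W, a) -> current estimate of E(Y | A = a, W)
    predict_g     : W -> current estimate of P(A = 1 | W)
    update_models : (W, A, Y) -> None; advances the online fits
    """
    psi_sum, K = 0.0, 0
    for W, A, Y in minibatches:
        Q1, Q0 = predict_Qbar(W, 1), predict_Qbar(W, 0)
        QA = np.where(A == 1, Q1, Q0)
        g1 = predict_g(W)
        gA = np.where(A == 1, g1, 1.0 - g1)
        # one-step estimate on minibatch j, using fits from minibatches < j
        psi_sum += np.mean((2 * A - 1) / gA * (Y - QA) + Q1 - Q0)
        K += 1
        update_models(W, A, Y)  # only now fold this minibatch into the fits
    return psi_sum / K
```

Scoring before updating matters: it keeps each ψ_{n_{j−1}:n_j} a function of fits from the previous minibatches only, as the definition above requires.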
Under regularity conditions, if Q̄_{n_K} and G_{n_K} converge faster than rate n_K^{−1/4} and are both consistent for Q̄_0 and G_0, then the OLOS estimator is asymptotically efficient as K → ∞. Additionally, the OLOS estimator has the double robustness property, so if either of Q̄_{n_K} or G_{n_K} is consistent, then ψ_{n_K} is consistent [16, Theorem 1].
We now turn to estimating Q̄_0 and G_0, both of which are conditional means. To be truly scalable, we ideally want an estimation procedure with constant computational time and storage per minibatch up to K, but we need the estimates to converge fast enough as K → ∞. This rules out estimates of Q̄_0 and G_0 that are fit on data from one or a fixed number of minibatches. Stochastic gradient descent-based methods, however, can achieve an appropriate rate of convergence in some circumstances [8,20].
Stochastic gradient descent (SGD) is an optimization procedure similar to traditional
batch gradient descent, where the gradient of the objective function for the whole dataset is
replaced by the gradient of the objective function at a single observation (or a minibatch).
The convergence rate to the empirical optimum of the objective function in terms of number
of iterations is very poor relative to batch gradient descent. However, a single iteration of
SGD or minibatch gradient descent takes constant time regardless of the size of the dataset,
while a single iteration of batch gradient descent takes O(n) time [4].
SGD can be used to fit the parameters of generalized linear models (GLMs). Despite the slow convergence rate, with an appropriately chosen step size, the parameters of a GLM fit with SGD can achieve √n_K-consistency in a single pass [8]. If curvature information is taken into account, parameters fit by so-called second-order SGD can achieve the same variance as directly optimizing the empirical objective function [8], but this is often computationally infeasible and rarely done in practice [4]. We note that the class of models for which SGD will obtain √n_K-consistency is larger than just the generalized linear models that we use as an example here [6].
Averaged stochastic gradient descent (ASGD) is a variant of SGD where parameter
estimates are computed with SGD and then averaged. With an appropriate step size,
parameters fit by ASGD have been shown to achieve the same variance in a single pass
as those fit by directly optimizing the objective function [10,20]. ASGD is much simpler
to implement than second-order SGD but has not been popular in practice. This may be
because it takes a very large sample size to reach the asymptotic regime [20].
Some variants of SGD allow for a step size for each parameter, such as SGD-QN [3],
Adagrad [7], and Adadelta [21], which tend to work well in practice. For information
about other variants and implementation details, see [5]. We provide a simple concrete
implementation of SGD in Section 23.5.2.
Despite the drawbacks, if Q̄_0 and G_0 can be well approximated by GLMs, SGD-based optimization routines are a good way to compute the estimates in one pass.
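To make the SGD and ASGD updates concrete, here is a minimal single-pass minibatch SGD for logistic regression with iterate averaging. The step-size schedule η_t = c/t^α and the constants c and α are illustrative tuning choices, not values prescribed by the chapter:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(minibatches, p, c=0.5, alpha=0.6):
    """Single-pass minibatch SGD for logistic regression, with iterate
    averaging (ASGD).  `minibatches` yields (X, y) pairs; p is the
    number of features.  Returns the final SGD iterate and the ASGD
    average, which is the estimate one would actually use."""
    beta = np.zeros(p)       # current SGD iterate
    beta_bar = np.zeros(p)   # running average of iterates (ASGD)
    for t, (X, y) in enumerate(minibatches, start=1):
        grad = X.T @ (sigmoid(X @ beta) - y) / len(y)  # mean log-loss gradient
        beta = beta - (c / t ** alpha) * grad          # SGD step
        beta_bar += (beta - beta_bar) / t              # online average
    return beta, beta_bar
```

Each update costs O(mp) regardless of how many minibatches have already been seen, which is what makes the procedure compatible with the one-pass OLOS loop.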
23.5 Example and Simulation
23.5.1 Data-Generating Distribution
We evaluate the statistical performance of the OLOS estimator and discuss practical
implementation details in the context of a simulation study. For each observation, W is
generated by making p = 2000 independent draws from a uniform distribution on [−1, 1]. Given W, A is drawn from a Bernoulli distribution with success probability 1/(1 + exp(0.75Z))