where Z is the sum of the first four components of W. Finally, Y is drawn from a Bernoulli distribution with success probability
$$\frac{1}{1 + \exp(1 + 0.5Z - 0.3A)}.$$
The first four components of W are related to both A and Y, so they are confounders. In general, failing to adjust for confounding variables will result in a biased estimate of the ATE. The other components of W are not confounders and just add noise to the data. The true parameter $\psi_0$ is approximately 0.0602. A naive estimate of $\psi_0$ that fails to adjust for W is approximately 0.026. In this simulation, confounding is strong enough that a naive estimate of $\psi_0$ that does not adjust for W will be quite biased.
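For concreteness, the sketch below simulates data of this form. Only the definition of Z and the outcome probability come from the text; the dimension and distribution of W and the treatment mechanism for A are placeholders standing in for details given elsewhere in the chapter.

```python
# Sketch of the simulation's outcome mechanism. Only Z (the sum of the first
# four components of W) and the success probability 1/(1 + exp(1 + 0.5Z - 0.3A))
# are taken from the text; the dimension and distribution of W and the
# treatment mechanism for A are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, p=10):
    W = rng.normal(size=(n, p))                          # assumed covariate distribution
    Z = W[:, :4].sum(axis=1)                             # Z = sum of first four components of W
    A = rng.binomial(1, 1.0 / (1.0 + np.exp(0.4 * Z)))   # assumed treatment mechanism
    pY = 1.0 / (1.0 + np.exp(1.0 + 0.5 * Z - 0.3 * A))   # outcome mechanism from the text
    Y = rng.binomial(1, pY)
    return W, A, Y
```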
23.5.2 SGD for Logistic Regression
We estimate both $G_0$ and $\bar{Q}_0$ with main term logistic regression, with parameters computed by minibatch gradient descent.
We estimate $G_0$ with a logistic regression model with main terms for each component of W, and $\bar{Q}_0$ with a logistic regression model with main terms for A and each component of W. To investigate the double robustness of the estimator, we misspecify the models for $G_0$ and $\bar{Q}_0$ by leaving out W. The parameters for both $\bar{Q}_0$ and $G_0$ are computed with minibatch gradient descent.
For concreteness, we describe the SGD algorithm for logistic regression to estimate $\bar{Q}_0$. Estimating $G_0$ is analogous. Let $X_i$ be a $d$-dimensional vector of predictor variables including a constant, $A_i$, and, when the model is correctly specified, $W_i$. Let $\theta$ be a vector of the same dimension as $X_i$. For a particular $\theta$, $\bar{Q}(A_i, W_i)$ is computed as
$$\frac{1}{1 + \exp(\theta^\top X_i)}.$$
Initializing $\theta_0$ to a vector of 0s, $\theta_j$ at minibatch $j$ is computed as
$$\theta_j = \theta_{j-1} - \eta_j \frac{1}{m} \sum_{i = n_{j-1} + 1}^{n_j} \left( Y_i - \frac{1}{1 + \exp(\theta_{j-1}^\top X_i)} \right) X_i,$$
where $\eta_j$ is the step size and $m$ is the minibatch size.
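As a concrete illustration, the following sketch implements the update above in NumPy, using the parameterization $\bar{Q}(A_i, W_i) = 1/(1 + \exp(\theta^\top X_i))$; the function and argument names are ours, not the chapter's.

```python
# One minibatch update of the logistic regression parameters, as in the
# formula above. Xb is the m x d design matrix for minibatch j (rows
# n_{j-1}+1, ..., n_j), Yb the corresponding outcomes, theta the current
# parameter vector, and eta the step size eta_j.
import numpy as np

def sgd_update(theta, Xb, Yb, eta):
    m = Xb.shape[0]
    qbar = 1.0 / (1.0 + np.exp(Xb @ theta))   # Qbar(A_i, W_i) at the current theta
    grad = (Yb - qbar) @ Xb / m               # average of (Y_i - Qbar_i) X_i over the minibatch
    return theta - eta * grad                 # theta_j = theta_{j-1} - eta_j * average
```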
We use minibatches of size $m = 100$ and step size
$$\eta_j = \frac{0.05}{1 + 0.05\,\alpha j}$$
for minibatch $j$, where $\alpha$ is 0.01 for $\bar{Q}_0$ and 0.1 for $G_0$.
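A sketch of this schedule, and of the single pass that applies the update from the previous sketch to consecutive minibatches, is given below; `sgd_update` is the function defined above, and the construction of the design matrix X (constant, A, and, when correctly specified, W) is assumed to happen elsewhere.

```python
# Step-size schedule eta_j = 0.05 / (1 + 0.05 * alpha * j) and a single pass
# over the data in minibatches of size m, reusing sgd_update from the
# previous sketch. alpha is 0.01 for Qbar and 0.1 for G.
import numpy as np

def step_size(j, alpha, eta0=0.05):
    return eta0 / (1.0 + eta0 * alpha * j)

def fit_logistic_sgd(X, Y, alpha, m=100):
    theta = np.zeros(X.shape[1])              # theta_0 initialized to a vector of 0s
    for j in range(1, X.shape[0] // m + 1):
        sl = slice((j - 1) * m, j * m)        # observations n_{j-1}+1, ..., n_j
        theta = sgd_update(theta, X[sl], Y[sl], step_size(j, alpha))
    return theta
```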
23.5.3 Results
We constructed datasets of size $n = 10^6$. Using double precision floating point numbers for storage, a single dataset is more than 15 GB uncompressed, so computing a batch estimate of $\psi_0$ on a typical personal computer is non-trivial. For each simulated dataset, the OLOS estimator is run using either correctly or incorrectly specified initial estimators of $\bar{Q}_0$ and $G_0$. The current estimate $\psi_{n_j}$ is recorded at each minibatch $j$.
To investigate the performance of the estimator as a function of sample size, we
compute the observed bias and variance at each minibatch across 1000 simulated datasets.
Figure 23.1 shows the bias, scaled by the square root of the accumulated sample size, as a function of sample size.
For an estimator to be asymptotically linear, the bias should be $o_P(1/\sqrt{n})$, so we want to see the scaled bias converge to zero as $n$ increases. In all cases, bias is quite high when the sample size is small, because the estimates of $\bar{Q}_0$ and $G_0$ are far from the truth for the first few hundred minibatches. When only one of $\bar{Q}_0$ or $G_0$ is correctly specified, the bias moves toward zero slowly, because we are relying on the double robustness of the OLOS estimator. When both $\bar{Q}_0$ and $G_0$ are estimated by a correctly specified model, the scaled bias approaches zero reasonably quickly.
Bias is high in early minibatches, and it takes many steps to recover. To avoid some of the initial bias, it is sometimes helpful to discard the estimates of $\psi_0$ from early minibatches and compute the final estimate of $\psi_0$ as
$$\frac{1}{K - K_0} \sum_{j = K_0 + 1}^{K} \psi_{n_{j-1}:n_j},$$
where $K$ is the total number of minibatches and $K_0$ is the number of discarded minibatches. This allows the estimates of $\bar{Q}_0$ and $G_0$ to get warmed up over the first $K_0$ minibatches, and in experiments we see that this can reduce bias substantially (results not shown).
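A minimal sketch of this burn-in averaging is shown below; `psi` is a hypothetical array holding the per-minibatch estimates recorded over the K minibatches.

```python
# Average the per-minibatch estimates after discarding the first K0
# minibatches as burn-in. psi is assumed to hold the K recorded estimates.
import numpy as np

def averaged_estimate(psi, K0):
    psi = np.asarray(psi)
    return psi[K0:].mean()    # equals (1 / (K - K0)) * sum_{j = K0+1}^{K} psi_j
```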
Figure 23.2 plots the variance of the OLOS estimates scaled by accumulated sample size. The red line denotes the variance of the efficient influence curve at $P_0$, which is approximately 0.95 in this simulation. If an estimator is efficient, its variance scaled by the sample size will approach the variance of the efficient influence curve as $n \to \infty$. As expected, we see that when both $\bar{Q}_0$ and $G_0$ are consistently estimated, the OLOS estimator is efficient. When only one of $\bar{Q}_0$ or $G_0$ is consistently estimated, the OLOS estimator is not efficient in general, but in this simulation we see that the variance is close to the efficiency bound.
FIGURE 23.1
Bias scaled by $\sqrt{n}$, by sample size. The three curves correspond to the model specifications: Q misspecified, G correct; Q correct, G misspecified; Q correct, G correct.
FIGURE 23.2
Variance scaled by $n$, by sample size, with a horizontal line at the optimal asymptotic variance. The three curves correspond to the model specifications: Q misspecified, G correct; Q correct, G misspecified; Q correct, G correct.
23.6 Discussion
The OLOS estimator gives us a way to estimate the ATE in a single pass through a dataset,
getting running estimates along the way. Although bias can be high when a relatively small
amount of data is processed, asymptotically we can do as well as we would with a batch
estimator that typically requires many passes through the dataset.
In the example in Section 23.5, we used main term logistic regression to estimate both $\bar{Q}_0$ and $G_0$. This can easily be extended to any user-specified basis function of W. However, generalized linear models will not include the true $\bar{Q}_0$ and $G_0$ in general, and more flexible online estimators are needed.
SGD-based optimization is a natural way to turn batch algorithms into online algorithms
and is applicable to a wide range of estimation methods. One such class of estimators
is multilayer neural networks, which are incredibly flexible. Typically, because of the
computational complexity of neural networks, they are fit with SGD in the batch setting
with multiple passes through the dataset. In principle, one could fit a neural network as an
online estimator, though a single pass through even very large datasets may not be enough
for the neural network to reach the asymptotic regime.
There are other examples in the literature of estimators that have been extended to
the online setting that are not based on SGD. Examples include bagging and boosting [9],
random forests [1,13], and generalized additive models [19].
It is not clear when one estimator should be chosen over another. There are also typically
some choices of tuning parameters, for example, the level of regularization. For online methods,
fitting procedures may have their own set of choices. For example, there are a number of
variants of SGD for optimization, each of which has some tuning parameters. In the batch
setting, cross-validation can be used to select between many estimators, or a model stacking
approach such as the super learner algorithm [17] can be used to choose a combination of
estimators. A similar approach can be taken in an online setting by using each new minibatch
as an independent validation set to estimate out-of-sample performance before updating an
estimator with that minibatch. In this way, many estimators with different choices of tuning
parameters can be fit concurrently, and the best or a combination can be selected based on
the estimated out-of-sample performance.
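The sketch below illustrates this scheme under stated assumptions: each candidate estimator is an object with hypothetical `loss` and `update` methods (these names are ours, not the chapter's), and every new minibatch is scored before it is used for updating, so the accumulated loss is an honest estimate of out-of-sample performance.

```python
# Online selection among candidate estimators: score each candidate on the
# incoming minibatch before updating it with that minibatch. The `loss` and
# `update` methods are assumed interfaces, not part of the chapter.
import numpy as np

def online_select(candidates, minibatches):
    cum_loss = np.zeros(len(candidates))
    for Xb, Yb in minibatches:
        for k, est in enumerate(candidates):
            cum_loss[k] += est.loss(Xb, Yb)      # out-of-sample score on the new minibatch
            est.update(Xb, Yb)                   # then update on the same minibatch
    return candidates[int(np.argmin(cum_loss))]  # pick the best (or combine by weights)
```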
References
1. Hanady Abdulsalam, David B. Skillicorn, and Patrick Martin. Streaming random
forests. In Database Engineering and Applications Symposium, IDEAS 2007. 11th
International, pp. 225–232. IEEE, Alberta, Canada, 2007.
2. Peter J. Bickel, Chris A.J. Klaassen, Ya’acov Ritov, and Jon A. Wellner. Efficient
and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press,
Baltimore, MD, 1993.
3. Antoine Bordes, Léon Bottou, and Patrick Gallinari. SGD-QN: Careful quasi-Newton
stochastic gradient descent. The Journal of Machine Learning Research, 10:1737–1754,
2009.
4. Léon Bottou. Large-scale machine learning with stochastic gradient descent. In
Proceedings of COMPSTAT, pp. 177–186. Springer, Berlin, Germany, 2010.
5. Léon Bottou. Stochastic gradient descent tricks. In Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade, pp. 421–436.
Springer, Berlin, Germany, 2012.
6. Léon Bottou and Yann Le Cun. On-line learning for very large data sets. Applied
Stochastic Models in Business and Industry, 21(2):137–151, 2005.
7. John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online
learning and stochastic optimization. The Journal of Machine Learning Research,
12:2121–2159, 2011.
8. Noboru Murata. A Statistical Study of On-Line Learning: Online Learning and Neural
Networks. Cambridge University Press, Cambridge, 1998.
9. Nikunj C. Oza. Online bagging and boosting. In IEEE International Conference on
Systems, Man and Cybernetics, volume 3, pp. 2340–2345. IEEE, 2005.
10. Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by
averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
11. James M. Robins, Andrea Rotnitzky, and Lue P. Zhao. Estimation of regression
coefficients when some regressors are not always observed. Journal of the American
Statistical Association, 89(427):846–866, 1994.
12. Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in
observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
13. Amir Saffari, Christian Leistner, Jakob Santner, Martin Godec, and Horst Bischof.
On-line random forests. In IEEE 12th International Conference on Computer Vision
Workshops, pp. 1393–1400. IEEE, Kyoto, Japan, 2009.
14. Janet M. Box-Steffensmeier, Henry E. Brady, and David Collier. The Oxford Handbook
of Political Methodology. Oxford University Press, Oxford, New York, 2008.
15. Mark J. van der Laan. Targeted maximum likelihood based causal inference: Part I.
The International Journal of Biostatistics, 6(2), 2010.
16. Mark J. van der Laan and Samuel D. Lendle. Online Targeted Learning. Working Paper
330, U.C. Berkeley Division of Biostatistics, University of California, Berkeley, CA,
September 2014.
17. Mark J. van der Laan, Eric C. Polley, and Alan E. Hubbard. Super learner. Statistical
Applications in Genetics and Molecular Biology, 6(1):1–23, 2007.
18. Mark J. van der Laan and Sherri Rose. Targeted Learning: Causal Inference for
Observational and Experimental Data. Springer-Verlag, New York, 2011.
19. Simon N. Wood, Yannig Goude, and Simon Shaw. Generalized additive models for
large data sets. Journal of the Royal Statistical Society: Series C (Applied Statistics),
139–155, 2014.
20. Wei Xu. Towards optimal one pass large scale learning with averaged stochastic gradient
descent. CoRR, abs/1107.2490, 2011.
21. Matthew D. Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint
arXiv:1212.5701, 2012.