where Z is the sum of the first four components of W. Finally, Y is drawn from a Bernoulli distribution with success probability
$$\frac{1}{1 + \exp(1 + 0.5Z - 0.3A)}.$$
The first four components of W are related to both A and Y, so they are confounders. In general, failing to adjust for confounding variables will result in a biased estimate of the ATE. The other components of W are not confounders and just add noise to the data. The true parameter $\psi_0$ is approximately 0.0602. A naive estimate of $\psi_0$ that fails to adjust for W is approximately 0.026. In this simulation, confounding is strong enough that a naive estimate of $\psi_0$ that does not adjust for W will be quite biased.
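For concreteness, the sketch below simulates data of this form. Only the definition of Z and the outcome probability come from the text; the dimension and distribution of W and the treatment mechanism for A are placeholders standing in for details given elsewhere in the chapter.

```python
# Sketch of the simulation's outcome mechanism. Only Z (the sum of the first
# four components of W) and the success probability 1/(1 + exp(1 + 0.5Z - 0.3A))
# are taken from the text; the dimension and distribution of W and the
# treatment mechanism for A are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, p=10):
    W = rng.normal(size=(n, p))                          # assumed covariate distribution
    Z = W[:, :4].sum(axis=1)                             # Z = sum of first four components of W
    A = rng.binomial(1, 1.0 / (1.0 + np.exp(0.4 * Z)))   # assumed treatment mechanism
    pY = 1.0 / (1.0 + np.exp(1.0 + 0.5 * Z - 0.3 * A))   # outcome mechanism from the text
    Y = rng.binomial(1, pY)
    return W, A, Y
```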
23.5.2 SGD for Logistic Regression
We estimate both $G_0$ and $\bar{Q}_0$ with main term logistic regression, with parameters computed by minibatch gradient descent.
We estimate $G_0$ with a logistic regression model with main terms for each component of W, and $\bar{Q}_0$ with a logistic regression model with main terms for A and each component of W. To investigate the double robustness of the estimator, we misspecify the models for $G_0$ and $\bar{Q}_0$ by leaving out W. The parameters for both $\bar{Q}_0$ and $G_0$ are computed with minibatch gradient descent.
For concreteness, we describe the SGD algorithm for logistic regression to estimate $\bar{Q}_0$. Estimating $G_0$ is analogous. Let $X_i$ be a $d$-dimensional vector of predictor variables including a constant, $A_i$, and, when the model is correctly specified, $W_i$. Let $\theta$ be a vector of the same dimension as $X_i$. For a particular $\theta$, $\bar{Q}(A_i, W_i)$ is computed as
$$\frac{1}{1 + \exp(\theta^\top X_i)}.$$
Initializing $\theta_0$ to a vector of 0s, $\theta_j$ at minibatch $j$ is computed as
$$\theta_j = \theta_{j-1} - \eta_j \frac{1}{m} \sum_{i = n_{j-1} + 1}^{n_j} \left( Y_i - \frac{1}{1 + \exp(\theta_{j-1}^\top X_i)} \right) X_i,$$
where $\eta_j$ is the step size and $m$ is the minibatch size.
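As a concrete illustration, the following sketch implements the update above in NumPy, using the parameterization $\bar{Q}(A_i, W_i) = 1/(1 + \exp(\theta^\top X_i))$; the function and argument names are ours, not the chapter's.

```python
# One minibatch update of the logistic regression parameters, as in the
# formula above. Xb is the m x d design matrix for minibatch j (rows
# n_{j-1}+1, ..., n_j), Yb the corresponding outcomes, theta the current
# parameter vector, and eta the step size eta_j.
import numpy as np

def sgd_update(theta, Xb, Yb, eta):
    m = Xb.shape[0]
    qbar = 1.0 / (1.0 + np.exp(Xb @ theta))   # Qbar(A_i, W_i) at the current theta
    grad = (Yb - qbar) @ Xb / m               # average of (Y_i - Qbar_i) X_i over the minibatch
    return theta - eta * grad                 # theta_j = theta_{j-1} - eta_j * average
```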
We use minibatches of size $m = 100$ and step size
$$\eta_j = \frac{0.05}{1 + 0.05\,\alpha j}$$
for minibatch $j$, where $\alpha$ is 0.01 for $\bar{Q}_0$ and 0.1 for $G_0$.
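A sketch of this schedule, and of the single pass that applies the update from the previous sketch to consecutive minibatches, is given below; `sgd_update` is the function defined above, and the construction of the design matrix X (constant, A, and, when correctly specified, W) is assumed to happen elsewhere.

```python
# Step-size schedule eta_j = 0.05 / (1 + 0.05 * alpha * j) and a single pass
# over the data in minibatches of size m, reusing sgd_update from the
# previous sketch. alpha is 0.01 for Qbar and 0.1 for G.
import numpy as np

def step_size(j, alpha, eta0=0.05):
    return eta0 / (1.0 + eta0 * alpha * j)

def fit_logistic_sgd(X, Y, alpha, m=100):
    theta = np.zeros(X.shape[1])              # theta_0 initialized to a vector of 0s
    for j in range(1, X.shape[0] // m + 1):
        sl = slice((j - 1) * m, j * m)        # observations n_{j-1}+1, ..., n_j
        theta = sgd_update(theta, X[sl], Y[sl], step_size(j, alpha))
    return theta
```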
23.5.3 Results
We constructed datasets of size $n = 10^6$. Using double precision floating point numbers for storage, a single dataset is more than 15 GB uncompressed, so computing a batch estimate of $\psi_0$ on a typical personal computer is non-trivial. For each simulated dataset, the OLOS estimator is run using either correctly or incorrectly specified initial estimators of $\bar{Q}_0$ and $G_0$. The current estimate $\psi_{n_j}$ is recorded at each minibatch $j$.
To investigate the performance of the estimator as a function of sample size, we
compute the observed bias and variance at each minibatch across 1000 simulated datasets.
Figure 23.1 shows the bias, scaled by the square root of the accumulated sample size, as a function of sample size.
For an estimator to be asymptotically linear, the bias should be $o_P(1/\sqrt{n})$, so we want to see the scaled bias converge to zero as $n$ increases. In all cases, bias is quite high when the sample size is small, because the estimates of $\bar{Q}_0$ and $G_0$ are far from the truth for the first few hundred minibatches. When only one of $\bar{Q}_0$ or $G_0$ is correctly specified, the bias moves toward zero slowly, because we are relying on the double robustness of the OLOS estimator. When both $\bar{Q}_0$ and $G_0$ are estimated by a correctly specified model, the scaled bias approaches zero reasonably quickly.
Bias is high in early minibatches, and it takes many steps to recover. To avoid some of the initial bias, it is sometimes helpful to discard the estimates of $\psi_0$ from early minibatches and compute the final estimate of $\psi_0$ as
$$\frac{1}{K - K_0} \sum_{j = K_0 + 1}^{K} \psi_{n_{j-1}:n_j},$$
where $K$ is the total number of minibatches and $K_0$ is the number of discarded minibatches. This allows the estimates of $\bar{Q}_0$ and $G_0$ to get warmed up over the first $K_0$ minibatches, and in experiments we see that this can reduce bias substantially (results not shown).
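A minimal sketch of this burn-in averaging is shown below; `psi` is a hypothetical array holding the per-minibatch estimates recorded over the K minibatches.

```python
# Average the per-minibatch estimates after discarding the first K0
# minibatches as burn-in. psi is assumed to hold the K recorded estimates.
import numpy as np

def averaged_estimate(psi, K0):
    psi = np.asarray(psi)
    return psi[K0:].mean()    # equals (1 / (K - K0)) * sum_{j = K0+1}^{K} psi_j
```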
Figure 23.2 plots the variance of the OLOS estimates scaled by accumulated sample size. The red line denotes the variance of the efficient influence curve at $P_0$, which is approximately 0.95 in this simulation. If an estimator is efficient, its variance scaled by the sample size will approach the variance of the efficient influence curve as $n \to \infty$. As expected, we see that when both $\bar{Q}_0$ and $G_0$ are consistently estimated, the OLOS estimator is efficient. When only one of $\bar{Q}_0$ or $G_0$ is consistently estimated, the OLOS estimator is not efficient in general, but in this simulation we see that the variance is close to the efficiency bound.
FIGURE 23.1
Bias scaled by $\sqrt{n}$, by sample size. The three curves correspond to the model specifications: Q misspecified, G correct; Q correct, G misspecified; Q correct, G correct.
FIGURE 23.2
Variance scaled by $n$, by sample size, with a horizontal line at the optimal asymptotic variance. The three curves correspond to the model specifications: Q misspecified, G correct; Q correct, G misspecified; Q correct, G correct.
23.6 Discussion
The OLOS estimator gives us a way to estimate the ATE in a single pass through a dataset,
getting running estimates along the way. Although bias can be high when a relatively small
amount of data is processed, asymptotically we can do as well as we would with a batch
estimator that typically requires many passes through the dataset.
In the example in Section 23.5, we used main term logistic regression to estimate both $\bar{Q}_0$ and $G_0$. This can easily be extended to any user-specified basis function of W. However, generalized linear models will not include the true $\bar{Q}_0$ and $G_0$ in general, and more flexible online estimators are needed.
SGD-based optimization is a natural way to turn batch algorithms into online algorithms
and is applicable to a wide range of estimation methods. One such class of estimators
is multilayer neural networks, which are incredibly flexible. Typically, because of the
computational complexity of neural networks, they are fit with SGD in the batch setting
with multiple passes through the dataset. In principle, one could fit a neural network as an
online estimator, though a single pass through even very large datasets may not be enough
for the neural network to reach the asymptotic regime.
There are other examples in the literature of estimators that have been extended to
the online setting that are not based on SGD. Examples include bagging and boosting [9],
random forests [1,13], and generalized additive models [19].
It is not clear when one estimator should be chosen over another. There are also typically
some choices of tuning parameters, for example, the level of regularization. For online methods,
fitting procedures may have their own set of choices. For example, there are a number of
variants of SGD for optimization, each of which has some tuning parameters. In the batch
setting, cross-validation can be used to select between many estimators, or a model stacking
approach such as the super learner algorithm [17] can be used to choose a combination of
estimators. A similar approach can be taken in an online setting by using each new minibatch
as an independent validation set to estimate out-of-sample performance before updating an
estimator with that minibatch. In this way, many estimators with different choices of tuning
parameters can be fit concurrently, and the best or a combination can be selected based on
the estimated out-of-sample performance.
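The sketch below illustrates this scheme under stated assumptions: each candidate estimator is an object with hypothetical `loss` and `update` methods (these names are ours, not the chapter's), and every new minibatch is scored before it is used for updating, so the accumulated loss is an honest estimate of out-of-sample performance.

```python
# Online selection among candidate estimators: score each candidate on the
# incoming minibatch before updating it with that minibatch. The `loss` and
# `update` methods are assumed interfaces, not part of the chapter.
import numpy as np

def online_select(candidates, minibatches):
    cum_loss = np.zeros(len(candidates))
    for Xb, Yb in minibatches:
        for k, est in enumerate(candidates):
            cum_loss[k] += est.loss(Xb, Yb)      # out-of-sample score on the new minibatch
            est.update(Xb, Yb)                   # then update on the same minibatch
    return candidates[int(np.argmin(cum_loss))]  # pick the best (or combine by weights)
```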
References
1. Hanady Abdulsalam, David B. Skillicorn, and Patrick Martin. Streaming random
forests. In Database Engineering and Applications Symposium, IDEAS 2007. 11th
International, pp. 225–232. IEEE, Alberta, Canada, 2007.
2. Peter J. Bickel, Chris A.J. Klaassen, Ya’acov Ritov, and Jon A. Wellner. Efficient
and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press,
Baltimore, MD, 1993.
3. Antoine Bordes, Léon Bottou, and Patrick Gallinari. SGD-QN: Careful quasi-Newton
stochastic gradient descent. The Journal of Machine Learning Research, 10:1737–1754,
2009.
4. Léon Bottou. Large-scale machine learning with stochastic gradient descent. In
Proceedings of COMPSTAT, pp. 177–186. Springer, Berlin, Germany, 2010.
5. Léon Bottou. Stochastic gradient descent tricks. In Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade, pp. 421–436.
Springer, Berlin, Germany, 2012.
6. Léon Bottou and Yann Le Cun. On-line learning for very large data sets. Applied
Stochastic Models in Business and Industry, 21(2):137–151, 2005.
7. John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online
learning and stochastic optimization. The Journal of Machine Learning Research,
12:2121–2159, 2011.
8. Noboru Murata. A Statistical Study of On-Line Learning: Online Learning and Neural
Networks. Cambridge University Press, Cambridge, 1998.
9. Nikunj C. Oza. Online bagging and boosting. In IEEE International Conference on
Systems, Man and Cybernetics, volume 3, pp. 2340–2345. IEEE, 2005.
10. Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by
averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
11. James M. Robins, Andrea Rotnitzky, and Lue P. Zhao. Estimation of regression
coefficients when some regressors are not always observed. Journal of the American
Statistical Association, 89(427):846–866, 1994.
12. Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in
observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
13. Amir Saffari, Christian Leistner, Jakob Santner, Martin Godec, and Horst Bischof.
On-line random forests. In IEEE 12th International Conference on Computer Vision
Workshops, pp. 1393–1400. IEEE, Kyoto, Japan, 2009.
14. Janet M. Box-Steffensmeier, Henry E. Brady, and David Collier. The Oxford Handbook
of Political Methodology. Oxford University Press, Oxford, New York, 2008.
15. Mark J. van der Laan. Targeted maximum likelihood based causal inference: Part I.
The International Journal of Biostatistics, 6(2), 2010.
16. Mark J. van der Laan and Samuel D. Lendle. Online Targeted Learning. Working Paper
330, U.C. Berkeley Division of Biostatistics, University of California, Berkeley, CA,
September 2014.
17. Mark J. van der Laan, Eric C. Polley, and Alan E. Hubbard. Super learner. Statistical
Applications in Genetics and Molecular Biology, 6(1):1–23, 2007.
18. Mark J. van der Laan and Sherri Rose. Targeted Learning: Causal Inference for
Observational and Experimental Data. Springer-Verlag, New York, 2011.
19. Simon N. Wood, Yannig Goude, and Simon Shaw. Generalized additive models for
large data sets. Journal of the Royal Statistical Society: Series C (Applied Statistics),
139–155, 2014.
20. Wei Xu. Towards optimal one pass large scale learning with averaged stochastic gradient
descent. CoRR, abs/1107.2490, 2011.
21. Matthew D. Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint
arXiv:1212.5701, 2012.