Under regularity conditions, and if $\bar{Q}_{n_K}$ and $G_{n_K}$ converge faster than rate $n_K^{-1/4}$ and are both consistent for $\bar{Q}_0$ and $G_0$, then the OLOS estimator is asymptotically efficient as $K \to \infty$. Additionally, the OLOS estimator has the double robustness property, so if either of $\bar{Q}_{n_K}$ or $G_{n_K}$ is consistent, then $\psi_{n_K}$ is consistent [16, Theorem 1].
We now turn to estimating $\bar{Q}_0$ and $G_0$, both of which are conditional means. To be truly scalable, we ideally want an estimation procedure whose computational time and storage per minibatch stay constant in $K$, while the estimates still converge fast enough as $K \to \infty$. This rules out estimates of $\bar{Q}_0$ and $G_0$ that are fit on data from one or a fixed number of minibatches. Stochastic gradient descent-based methods, however, can achieve an appropriate rate of convergence in some circumstances [8,20].
Stochastic gradient descent (SGD) is an optimization procedure similar to traditional
batch gradient descent, where the gradient of the objective function for the whole dataset is
replaced by the gradient of the objective function at a single observation (or a minibatch).
The convergence rate to the empirical optimum of the objective function in terms of number
of iterations is very poor relative to batch gradient descent. However, a single iteration of
SGD or minibatch gradient descent takes constant time regardless of the size of the dataset,
while a single iteration of batch gradient descent takes O(n) time [4].
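To make the cost contrast concrete, the sketch below compares a single full-batch gradient step with a single minibatch SGD step for a linear model with squared-error loss. The loss, function names, and fixed step size are illustrative choices, not the chapter's implementation (which appears in Section 23.5.2).

```python
import numpy as np

def batch_gradient_step(beta, X, y, step_size):
    """One full-batch gradient step: touches all n rows, so each iteration costs O(n)."""
    grad = X.T @ (X @ beta - y) / len(y)
    return beta - step_size * grad

def minibatch_sgd_step(beta, X_mini, y_mini, step_size):
    """One SGD step: uses only the current minibatch, so each iteration costs
    O(minibatch size) regardless of the total sample size."""
    grad = X_mini.T @ (X_mini @ beta - y_mini) / len(y_mini)
    return beta - step_size * grad
```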
SGD can be used to fit the parameters of generalized linear models (GLMs). Despite the slow convergence rate, with an appropriately chosen step size the parameters of a GLM fit with SGD can achieve $\sqrt{n_K}$ consistency in a single pass [8]. If curvature information is taken into account, parameters fit by so-called second-order SGD can achieve the same variance as directly optimizing the empirical objective function [8], but this is often computationally infeasible and rarely done in practice [4]. We note that the class of models for which SGD will obtain $\sqrt{n_K}$ consistency is larger than just the generalized linear models that we use as an example here [6].
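As an illustration of a one-pass fit, the sketch below runs SGD for a logistic-regression GLM over a stream of minibatches, each visited exactly once. The decaying step-size schedule is an assumed, illustrative choice; the references above discuss what "appropriately chosen" means for $\sqrt{n_K}$ consistency.

```python
import numpy as np

def sgd_logistic_one_pass(minibatches, p, step_size=lambda t: 0.5 / np.sqrt(t)):
    """Single pass of SGD for logistic regression.

    minibatches: iterable of (X, y) pairs, each seen once.
    p: number of covariates.
    step_size: decaying schedule indexed by the minibatch count (illustrative choice).
    """
    beta = np.zeros(p)
    for t, (X, y) in enumerate(minibatches, start=1):
        # Gradient of the average negative log-likelihood on this minibatch.
        grad = X.T @ (1.0 / (1.0 + np.exp(-(X @ beta))) - y) / len(y)
        beta -= step_size(t) * grad
    return beta
```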
Averaged stochastic gradient descent (ASGD) is a variant of SGD where parameter
estimates are computed with SGD and then averaged. With an appropriate step size,
parameters fit by ASGD have been shown to achieve the same variance in a single pass
as those fit by directly optimizing the objective function [10,20]. ASGD is much simpler
to implement than second-order SGD but has not been popular in practice. This may be
because it takes a very large sample size to reach the asymptotic regime [20].
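A minimal sketch of the averaging step: run the same one-pass SGD recursion as above and maintain a running average of the iterates, which is what ASGD returns. The step-size exponent is an illustrative value; references [10,20] discuss the schedules under which the averaged iterate attains the stated variance.

```python
import numpy as np

def asgd_logistic_one_pass(minibatches, p, step_size=lambda t: 0.5 / t**0.75):
    """Averaged SGD: plain SGD updates plus a running average of the iterates."""
    beta = np.zeros(p)
    beta_bar = np.zeros(p)
    for t, (X, y) in enumerate(minibatches, start=1):
        grad = X.T @ (1.0 / (1.0 + np.exp(-(X @ beta))) - y) / len(y)
        beta -= step_size(t) * grad
        beta_bar += (beta - beta_bar) / t  # running average of the iterates so far
    return beta_bar  # the averaged iterate is the ASGD estimate
```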
Some variants of SGD allow for a separate step size for each parameter, such as SGD-QN [3], Adagrad [7], and Adadelta [21]; these tend to work well in practice. For information
about other variants and implementation details, see [5]. We provide a simple concrete
implementation of SGD in Section 23.5.2.
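As a sketch of the per-parameter idea, an Adagrad-style update divides each coordinate's step by the square root of that coordinate's accumulated squared gradients, so frequently updated coordinates take smaller steps. The base step size and epsilon below are illustrative defaults, not values recommended by the chapter.

```python
import numpy as np

def adagrad_step(beta, grad, grad_sq_sum, base_step=0.1, eps=1e-8):
    """One Adagrad-style update with a separate effective step size per coordinate."""
    grad_sq_sum = grad_sq_sum + grad ** 2                      # accumulate squared gradients per coordinate
    beta = beta - base_step * grad / (np.sqrt(grad_sq_sum) + eps)
    return beta, grad_sq_sum
```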
Despite the drawbacks, if $\bar{Q}_0$ and $G_0$ can be well approximated by GLMs, SGD-based optimization routines are a good way to compute the estimates in one pass.
23.5 Example and Simulation
23.5.1 Data-Generating Distribution
We evaluate the statistical performance of the OLOS estimator and discuss practical
implementation details in the context of a simulation study. For each observation, W is
generated by making p = 2000 independent draws from a uniform distribution on [−1, 1].
Given W, A is drawn from a Bernoulli distribution with success probability $1/(1+\exp(-0.75Z))$