14
Stochastic Gradient Methods for Principled
Estimation with Large Datasets
Panos Toulis and Edoardo M. Airoldi
CONTENTS
14.1 Introduction
    14.1.1 Outline
14.2 Stochastic Approximation
    14.2.1 Sakrison's Recursive Estimation Method
14.3 Estimation with Stochastic Gradient Methods
    14.3.1 Bias and Variance
    14.3.2 Stability
    14.3.3 Choice of Learning Rate Sequence
    14.3.4 Efficient Computation of Implicit Methods
    14.3.5 Extensions
        14.3.5.1 Second-Order Methods
        14.3.5.2 Averaging
        14.3.5.3 Monte Carlo Stochastic Gradient Descent
14.4 Other Applications
    14.4.1 Online EM Algorithm
    14.4.2 MCMC Sampling
    14.4.3 Reinforcement Learning
    14.4.4 Deep Learning
14.5 Glossary
References
14.1 Introduction
Parameter estimation by optimization of an objective function, such as maximum-likelihood
and maximum a posteriori, is a fundamental idea in statistics and machine learning (Fisher,
1922, Lehmann and Casella, 2003, Hastie et al., 2011). However, widely used optimization-
based estimation algorithms, such as Fisher scoring, the Expectation-Maximization (EM)
algorithm, and iteratively reweighted least squares (Fisher, 1925, Dempster et al., 1977,
Green, 1984), are not scalable to modern datasets with hundreds of millions of data points
and hundreds or thousands of covariates (National Research Council, 2013).
To illustrate, let us consider the problem of estimating the true parameter value $\theta_\star \in \mathbb{R}^p$ from an i.i.d. sample $D = \{X_n, Y_n\}$, for $n = 1, 2, \ldots, N$; $X_n \in \mathbb{R}^p$ is the covariate vector and $Y_n \in \mathbb{R}^d$ is the outcome distributed conditionally on $X_n$ according to the known distribution $f$ and unknown model parameters $\theta_\star$,
$$Y_n \mid X_n \sim f(\cdot\,; X_n, \theta_\star)$$
We assume that the data points $(X_n, Y_n)$ are observed in sequence (streaming data). The log-likelihood, $\log f(Y; X, \theta)$, as a function of the parameter value $\theta$ given a data point $(X, Y)$, will be denoted by $\ell(\theta; Y, X)$; for brevity, we define $\ell(\theta; D) = \sum_{n=1}^{N} \ell(\theta; X_n, Y_n)$ as the complete data log-likelihood.
Traditional estimation methods are typically iterative and have a running time complexity that ranges between $O(Np^3)$ and $O(Np)$ in worst cases and best cases, respectively. Newton–Raphson methods, for instance, update an estimate $\theta_{n-1}$ of the parameters through the recursion
$$\theta_n = \theta_{n-1} - H_{n-1}^{-1} \nabla\ell(\theta_{n-1}; D) \tag{14.1}$$
where $H_n = \nabla\nabla\ell(\theta_n; D)$ is the $p \times p$ Hessian matrix of the complete data log-likelihood. The matrix inversion and the likelihood computation over the dataset $D$ imply complexity $O(Np^{2+\epsilon})$, which makes the algorithm unsuited for estimation with large datasets. Fisher scoring replaces the Hessian matrix in Equation 14.1 with its expected value over a data point $(X_n, Y_n)$, that is, it uses the Fisher information matrix $\mathcal{I}(\theta) = -\mathrm{E}\left(\nabla\nabla\ell(\theta; X_n, Y_n)\right)$.
The advantage of this method is that a steady increase in the likelihood is possible because, for the Fisher scoring step $\Delta\theta = \epsilon\, \mathcal{I}(\theta)^{-1} \nabla\ell(\theta; D)$, the difference
$$\ell(\theta + \Delta\theta; D) - \ell(\theta; D) \approx \epsilon\, \nabla\ell(\theta; D)^\top \mathcal{I}(\theta)^{-1} \nabla\ell(\theta; D) + O(\epsilon^2)$$
can be made positive for an appropriately small value $\epsilon > 0$, because $\mathcal{I}(\theta)$ is positive definite.
However, Fisher scoring is computationally comparable to Newton–Raphson’s, and thus it is
also unsuited for estimation with large datasets. Other general estimation algorithms, such
as EM or iteratively reweighted least squares (Green, 1984), have similar computational
constraints.
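To ground these cost considerations, the following is a minimal sketch of one Newton–Raphson step (Equation 14.1) for the linear normal model of Example 14.1 below; the function name and the use of NumPy are our own illustrative choices, not part of the original algorithms.

```python
import numpy as np

def newton_step(theta, X, Y):
    """One Newton-Raphson step (Equation 14.1) for the linear normal model.

    X is N x p, Y has length N. Forming X.T @ X costs O(N p^2) and the
    linear solve costs O(p^3); both operations touch the entire dataset D,
    which is what becomes prohibitive at scale.
    """
    grad = X.T @ (Y - X @ theta)   # gradient of the complete data log-likelihood
    H = -(X.T @ X)                 # Hessian, constant for this quadratic model
    return theta - np.linalg.solve(H, grad)
```

For this quadratic log-likelihood a single step already lands on the least-squares solution; the point of the sketch is the per-step cost, not the iteration count.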
Quasi-Newton methods are a powerful alternative that is widely used in practice. In
quasi-Newton methods, the Hessian in the Newton–Raphson algorithm is approximated by
a low-rank matrix that is updated at each iteration as new values of the gradient become
available. This yields an algorithm with complexity $O(Np^2)$ or $O(Np)$ in certain favorable cases (Hennig and Kiefel, 2013).
However, estimation with large datasets requires complexity that scales linearly with $N$, the number of data points, but sublinearly with $p$, the number of parameters. The first requirement on $N$ seems hard to overcome, because each data point carries information for $\theta_\star$ by the i.i.d. data assumption. Therefore, graceful scaling with $p$ is necessary.
Such computational requirements have recently sparked interest in procedures that utilize only first-order information, that is, methods that access only the gradient function. A prominent procedure that fits this description is stochastic gradient descent (SGD), defined through the iteration
$$\theta_n^{\rm sgd} = \theta_{n-1}^{\rm sgd} + a_n \nabla\ell(\theta_{n-1}^{\rm sgd}; X_n, Y_n) \tag{14.2}$$
We will refer to Equation 14.2 as SGD with explicit updates, or explicit SGD for short, because the next iterate $\theta_n^{\rm sgd}$ can be computed immediately after the new data point $(X_n, Y_n)$ is observed. The sequence $a_n > 0$ is known as the learning rate sequence, typically defined such that $n a_n \to \alpha > 0$ as $n \to \infty$. The parameter $\alpha > 0$ is the learning rate parameter, and it is crucial for the convergence and stability of explicit SGD.
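As a concrete illustration, here is a minimal sketch of the explicit SGD recursion (Equation 14.2) with the common choice $a_n = \alpha/n$, which satisfies $n a_n \to \alpha$; the function names and the simulated linear-model data (anticipating Example 14.1) are our own assumptions.

```python
import numpy as np

def explicit_sgd(grad_loglik, data_stream, alpha, theta0):
    """Explicit SGD (Equation 14.2): one update per streaming data point.

    grad_loglik(theta, x, y) is the log-likelihood gradient at one point;
    a_n = alpha / n is one learning rate sequence with n * a_n -> alpha.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    for n, (x, y) in enumerate(data_stream, start=1):
        theta += (alpha / n) * grad_loglik(theta, x, y)
    return theta

# Usage on the linear normal model of Example 14.1, where the gradient
# at one data point is (y - x @ theta) * x:
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(10_000, 3))
Y = X @ theta_star + rng.normal(size=10_000)
theta_hat = explicit_sgd(lambda th, x, y: (y - x @ th) * x,
                         zip(X, Y), alpha=1.0, theta0=np.zeros(3))
```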
From a computational perspective, the SGD procedure (Equation 14.2) is appealing because the expensive inversion of $p \times p$ matrices, as in Newton–Raphson, is replaced by a single sequence of scalars $a_n > 0$. Furthermore, the log-likelihood is evaluated at a single data point $(X_n, Y_n)$, and not on the entire dataset $D$.
From a theoretical perspective, the explicit SGD procedure is justified because Equation 14.2 is a special case of the stochastic approximation method of Robbins and Monro (1951). By the theory of stochastic approximation, explicit SGD converges to a point $\theta_\infty$ that satisfies $\mathrm{E}\left(\nabla\ell(\theta_\infty; X, Y)\right) = 0$; under typical regularity conditions, $\theta_\infty$ is exactly the true parameter value $\theta_\star$. As a recursive statistical estimation method, explicit SGD was first proposed by Sakrison (1965) in a simple second-order form, that is, using the Fisher information matrix in iteration (Equation 14.2); the simplicity of SGD has also made it very popular in optimization and machine learning with large datasets (Le Cun and Bottou, 2004, Zhang, 2004, Spall, 2005).
However, the remarkable simplicity of explicit SGD comes at a price, as the SGD procedure requires careful tuning of the learning rate parameter $\alpha$. For small values of $\alpha$, the iterates $\theta_n^{\rm sgd}$ will converge very slowly to $\theta_\star$ (large bias), whereas for large values of $\alpha$, the iterates $\theta_n^{\rm sgd}$ will either have a large asymptotic variance (with respect to random data $D$) or even diverge numerically. In large datasets with many parameters (large $p$), the balance between bias, variance, and stability is very delicate, and nearly impossible to achieve without appropriately modifying Equation 14.2.
Interestingly, the simple modification of explicit SGD defined through the iteration
$$\theta_n^{\rm im} = \theta_{n-1}^{\rm im} + a_n \nabla\ell(\theta_n^{\rm im}; X_n, Y_n) \tag{14.3}$$
can resolve its stability issue virtually at no cost. We will refer to Equation 14.3 as implicit stochastic gradient descent, or implicit SGD for short (Toulis et al., 2014, Toulis and Airoldi, 2015b). Equation 14.3 is implicit because the next iterate $\theta_n^{\rm im}$ appears on both sides of the equation. This equation is a $p$-dimensional fixed-point equation, which is generally hard to solve. However, for a large family of statistical models, it can be reduced to a one-dimensional fixed-point equation, as sketched below. We discuss computational issues of implicit SGD in Section 14.3.4.
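To illustrate the one-dimensional reduction, suppose the gradient has the form $\nabla\ell(\theta; x, y) = g(x^\top\theta;\, y)\, x$, as in generalized linear models, where $g$ is decreasing in its first argument. Then the implicit update moves only along $x$, so $\theta_n^{\rm im} = \theta_{n-1}^{\rm im} + \lambda x$, where the scalar $\lambda$ solves $\lambda = a_n\, g(x^\top\theta_{n-1}^{\rm im} + \lambda \|x\|^2;\, y)$ and is bracketed by $0$ and the explicit step size. The following is a hedged sketch of this computation; the function names and the use of SciPy's `brentq` root finder are our own choices.

```python
import numpy as np
from scipy.optimize import brentq

def implicit_glm_update(theta, x, y, a_n, g):
    """One implicit SGD step (Equation 14.3) when grad loglik = g(x'theta; y) x.

    Because the update direction is along x, theta_n = theta + lam * x,
    where lam solves the one-dimensional fixed-point equation
        lam = a_n * g(x'theta + lam * ||x||^2; y).
    For g decreasing, lam lies between 0 and the explicit step size b.
    """
    xx = x @ x
    b = a_n * g(x @ theta, y)                 # explicit step size brackets lam
    if b == 0.0:
        return theta
    f = lambda lam: lam - a_n * g(x @ theta + lam * xx, y)
    lam = brentq(f, min(0.0, b), max(0.0, b))
    return theta + lam * x

# Usage for logistic regression, where g(eta; y) = y - 1/(1 + exp(-eta)):
g_logit = lambda eta, y: y - 1.0 / (1.0 + np.exp(-eta))
```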
The first intuition for implicit SGD is obtained using a Taylor expansion of the implicit update (Equation 14.3). In particular, assuming a common point $\theta_{n-1}^{\rm sgd} = \theta_{n-1}^{\rm im} = \theta$, a Taylor expansion of Equation 14.3 around $\theta_{n-1}^{\rm im}$ implies
$$\Delta\theta_n^{\rm im} = \left(I + a_n \hat{\mathcal{I}}(\theta; X_n, Y_n)\right)^{-1} \Delta\theta_n^{\rm sgd} + O(a_n^2) \tag{14.4}$$
where $\Delta\theta_n = \theta_n - \theta_{n-1}$ for both the explicit and implicit methods, $\hat{\mathcal{I}}(\theta; X_n, Y_n) = -\nabla\nabla\ell(\theta; X_n, Y_n)$ is the observed Fisher information matrix, and $I$ is the $p \times p$ identity matrix.
Thus, implicit SGD uses updates that are a shrunken version of the explicit ones. The shrinkage factor in Equation 14.4 depends on the observed information up to the $n$th data point and is similar to shrinkage in ridge regression.
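To spell out the expansion behind Equation 14.4 (a sketch, assuming $\ell$ is twice differentiable in $\theta$): linearizing the implicit gradient around the common point $\theta$,
$$\nabla\ell(\theta_n^{\rm im}; X_n, Y_n) \approx \nabla\ell(\theta; X_n, Y_n) - \hat{\mathcal{I}}(\theta; X_n, Y_n)\, \Delta\theta_n^{\rm im},$$
so Equation 14.3 gives $\Delta\theta_n^{\rm im} \approx a_n \nabla\ell(\theta; X_n, Y_n) - a_n \hat{\mathcal{I}}(\theta; X_n, Y_n)\, \Delta\theta_n^{\rm im}$; collecting the $\Delta\theta_n^{\rm im}$ terms and noting that $a_n \nabla\ell(\theta; X_n, Y_n) = \Delta\theta_n^{\rm sgd}$ yields Equation 14.4.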
Naturally, implicit SGD also has a Bayesian interpretation. In particular, if the log-likelihood is continuously differentiable, then the update in Equation 14.3 is equivalent to the update
$$\theta_n^{\rm im} = \arg\max_{\theta \in \mathbb{R}^p} \left\{ a_n \ell(\theta; X_n, Y_n) - \frac{1}{2}\,\|\theta - \theta_{n-1}^{\rm im}\|^2 \right\} \tag{14.5}$$
*Recursive estimation methods using stochastic approximation were originally developed for problems with streaming data. However, these methods are more broadly applicable to estimation with a static dataset, and asymptotically the two scenarios are equivalent. With streaming data the estimates converge to the true parameter value $\theta_\star$, whereas with a static dataset (i.e., with a finite sample) they converge instead to the point that minimizes some predefined empirical loss, for example, based on the likelihood.
The iterate $\theta_n^{\rm im}$ from Equation 14.5 is the posterior mode of the following Bayesian model:
$$\theta \mid \theta_{n-1}^{\rm im} \sim \mathcal{N}(\theta_{n-1}^{\rm im},\, a_n I)$$
$$Y_n \mid X_n, \theta \sim f(\cdot\,; X_n, \theta) \tag{14.6}$$
where $\mathcal{N}$ denotes the normal distribution. Therefore, the learning rate $a_n$ relates to the information received after $n$ data points have been observed, and encodes our trust in the current estimate $\theta_{n-1}^{\rm im}$. The Bayesian formulation (Equation 14.6) demonstrates the flexibility of implicit SGD. For example, depending on the parameter space of $\theta_\star$, the Bayesian model in Equation 14.6 could be different; for instance, if $\theta_\star$ was a scale parameter, then the normal distribution could be replaced by an inverse chi-squared distribution. Furthermore, instead of $a_n I$ as the prior variance, it would be statistically efficient to use the scaled inverse Fisher information matrix $(1/n)\,\mathcal{I}(\theta_{n-1}^{\rm im})^{-1}$, completely analogous to Sakrison's method; we discuss these ideas in Section 14.3.5.1.
There is also a tight connection of Equation 14.5 to proximal methods in optimization. For example, if we replaced the stochastic component $\ell(\theta; X_n, Y_n)$ in Equation 14.5 with the complete data log-likelihood $\ell(\theta; D)$, then Equation 14.5 would essentially be the proximal point algorithm of Rockafellar (1976), which applies to deterministic settings. This algorithm is known for its numerical stability, and has been generalized through the idea of splitting algorithms (Lions and Mercier, 1979); see Parikh and Boyd (2013) for a comprehensive review. The convergence of proximal methods with a stochastic component, as in Equation 14.5, has been analyzed recently, under various forms and assumptions, by Bertsekas (2011), Ryu and Boyd (2014), and Rosasco et al. (2014). From a statistical perspective, Toulis and Airoldi (2015a) derived the asymptotic variance of $\theta_n^{\rm sgd}$ and $\theta_n^{\rm im}$ as estimators of $\theta_\star$, and provided an algorithm to efficiently compute Equation 14.5 for the family of generalized linear models; we show a generalization of this result in Section 14.3.4. In the online learning literature, regret analyses of implicit methods have been given by Kivinen et al. (2006) and Kulis and Bartlett (2010). Further intuitions for proximal methods (Equation 14.5) have been given by Krakowski et al. (2007) and Nemirovski et al. (2009), who showed that proximal methods can better fit the geometry of the parameter space. Finally, Toulis et al. (2015) derived a stochastic approximation framework that encompasses stochastic proximal methods, and showed that it retains the asymptotic properties of the Robbins–Monro framework while being significantly more robust in the non-asymptotic regime.
Arguably, the normalized least mean squares (NLMS) filter (Nagumo and Noda, 1967) was the first statistical method that used an implicit update as in Equation 14.3, and it was shown to be robust to input noise (Slock, 1993). Two other recent stochastic proximal methods are Prox-SVRG (Xiao and Zhang, 2014) and Prox-SAG (Schmidt et al., 2013, Section 6). The main idea in both methods is to replace the gradient in Equation 14.5 with an estimate of the full gradient, averaged over all data points, that has the same expectation as the gradient in Equation 14.3 but smaller variance. Because of their operational complexity, we will not discuss these methods further. Instead, in Section 14.3.5.1, we will discuss a related proximal method, namely AdaGrad (Duchi et al., 2011), which maintains one learning rate for each parameter component and updates these learning rates as new data points are observed.
Example 14.1 Consider the linear normal model, $Y_n \mid X_n \sim \mathcal{N}(X_n^\top \theta_\star, 1)$. The log-likelihood for this model is $\ell(\theta; X_n, Y_n) = -\frac{1}{2}(Y_n - X_n^\top \theta)^2$. Therefore, the explicit SGD procedure will be
$$\theta_n^{\rm sgd} = \theta_{n-1}^{\rm sgd} + a_n (Y_n - X_n^\top \theta_{n-1}^{\rm sgd})\, X_n = (I - a_n X_n X_n^\top)\, \theta_{n-1}^{\rm sgd} + a_n Y_n X_n \tag{14.7}$$
Equation 14.7 is known as the least mean squares (LMS) filter in signal processing, or as the Widrow–Hoff algorithm (Widrow and Hoff, 1960), and it is a special case of explicit SGD.
The stability problems of explicit SGD become apparent by inspection of Equation 14.7; a misspecification of $a_n$ can lead to a poor next iterate $\theta_n^{\rm sgd}$, for example, when $I - a_n X_n X_n^\top$ has large negative eigenvalues; we discuss these issues in Section 14.3.2.
The implicit SGD procedure for the linear model is
$$\theta_n^{\rm im} = \theta_{n-1}^{\rm im} + a_n (Y_n - X_n^\top \theta_n^{\rm im})\, X_n,$$
which can be solved for $\theta_n^{\rm im}$ in closed form:
$$\theta_n^{\rm im} = (I + a_n X_n X_n^\top)^{-1} \theta_{n-1}^{\rm im} + a_n (I + a_n X_n X_n^\top)^{-1} Y_n X_n \tag{14.8}$$
Equation 14.8 is known as the NLMS filter in signal processing (Nagumo and Noda, 1967). In contrast to explicit SGD, the implicit iterate $\theta_n^{\rm im}$ is a weighted average between the previous iterate $\theta_{n-1}^{\rm im}$ and the new observation $Y_n X_n$, and this weighting makes the update stable to misspecifications of the learning rate $a_n$.
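The contrast between Equations 14.7 and 14.8 is easy to see in simulation. The following sketch (our own construction; the dimensions, seed, and deliberately large rate parameter $\alpha$ are illustrative) runs both filters on the same stream; the Sherman–Morrison identity is used to apply $(I + a_n X_n X_n^\top)^{-1}$ without a $p \times p$ solve.

```python
import numpy as np

rng = np.random.default_rng(1)
p, N, alpha = 10, 5000, 20.0            # alpha chosen deliberately too large
theta_star = rng.normal(size=p)

th_sgd = np.zeros(p)                     # explicit SGD / LMS (Equation 14.7)
th_im = np.zeros(p)                      # implicit SGD / NLMS (Equation 14.8)
for n in range(1, N + 1):
    x = rng.normal(size=p)
    y = x @ theta_star + rng.normal()
    a_n = alpha / n
    # Explicit: I - a_n x x' can have large negative eigenvalues early on,
    # so the iterate may explode before a_n decays.
    th_sgd = th_sgd + a_n * (y - x @ th_sgd) * x
    # Implicit: apply (I + a_n x x')^{-1} via the Sherman-Morrison identity.
    v = th_im + a_n * y * x
    th_im = v - (a_n * (x @ v) / (1.0 + a_n * (x @ x))) * x

print("explicit SGD error:", np.linalg.norm(th_sgd - theta_star))  # typically huge
print("implicit SGD error:", np.linalg.norm(th_im - theta_star))   # typically small
```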
14.1.1 Outline
The structure of this chapter is as follows. In Section 14.2, we give an overview of the Robbins–Monro procedure and Sakrison's recursive estimation method, which provide the theoretical basis for SGD methods. In Section 14.3, we introduce a simple generalization of explicit and implicit SGD, and we analyze them as statistical estimation procedures for the model parameters $\theta_\star$ after $n$ data points have been observed. In Section 14.3.1, we give results on the frequentist statistical properties of SGD estimators, that is, their asymptotic bias and variance across multiple realizations of the dataset $D$. We then leverage these results to study optimal learning rate sequences $a_n$ (Section 14.3.3), the loss of statistical information in SGD, and numerical stability (Section 14.3.2). In Section 14.3.5, we illustrate three extensions of the SGD methods, in particular: (1) second-order SGD methods (Section 14.3.5.1), which adaptively approximate the Fisher information matrix; (2) averaged SGD methods, which use larger learning rates together with averaging of the iterates; and (3) Monte Carlo SGD methods, which can be applied when the log-likelihood cannot be efficiently computed. In Section 14.4, we review applications of SGD in statistics and machine learning, namely, online EM, Markov chain Monte Carlo (MCMC) posterior sampling, reinforcement learning, and deep learning.
14.2 Stochastic Approximation
Consider a random variable $H(\theta)$ that depends on a parameter $\theta$; for simplicity, assume that $H(\theta)$ and $\theta$ are real numbers. The regression function, $h(\theta) = \mathrm{E}\left(H(\theta)\right)$, is decreasing but possibly unknown. Robbins and Monro (1951) considered the problem of finding the unique point $\theta_\star$ for which $h(\theta_\star) = 0$. They devised a procedure, known as the Robbins–Monro procedure, in which an estimate $\theta_{n-1}$ of $\theta_\star$ is utilized to sample one new data point $H(\theta_{n-1})$; by definition, $\mathrm{E}\left(H(\theta_{n-1}) \mid \theta_{n-1}\right) = h(\theta_{n-1})$. The estimate is then updated according to the following rule:
$$\theta_n = \theta_{n-1} + a_n H(\theta_{n-1}) \tag{14.9}$$
The scalar $a_n > 0$ is the learning rate and should decay to zero, but not too fast, to guarantee convergence. Robbins and Monro (1951) proved that $\mathrm{E}\left((\theta_n - \theta_\star)^2\right) \to 0$ if

a. $\mathrm{E}\left(H(\theta)^2 \mid \theta\right) < \infty$, for any $\theta$, and

b. $\sum_{i=1}^{\infty} a_i = \infty$ and $\sum_{i=1}^{\infty} a_i^2 < \infty$.

A minimal numerical sketch of the procedure follows.
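As a concrete check of these conditions, the following toy example (our own construction, not from the original paper) applies the rule of Equation 14.9 with $a_n = \alpha/n$, which satisfies condition (b), to the decreasing regression function $h(\theta) = \theta_\star - \theta$ observed with Gaussian noise, which satisfies condition (a).

```python
import numpy as np

rng = np.random.default_rng(2)
theta_star = 3.0

def H(theta):
    # Noisy observation whose regression function h(theta) = theta_star - theta
    # is decreasing with unique root theta_star.
    return (theta_star - theta) + rng.normal()

theta, alpha = 0.0, 1.0
for n in range(1, 10_001):
    theta += (alpha / n) * H(theta)      # Robbins-Monro update (Equation 14.9)

print(theta)                             # should be close to theta_star = 3.0
```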