14
Stochastic Gradient Methods for Principled
Estimation with Large Datasets
Panos Toulis and Edoardo M. Airoldi
CONTENTS
14.1 Introduction
    14.1.1 Outline
14.2 Stochastic Approximation
    14.2.1 Sakrison's Recursive Estimation Method
14.3 Estimation with Stochastic Gradient Methods
    14.3.1 Bias and Variance
    14.3.2 Stability
    14.3.3 Choice of Learning Rate Sequence
    14.3.4 Efficient Computation of Implicit Methods
    14.3.5 Extensions
        14.3.5.1 Second-Order Methods
        14.3.5.2 Averaging
        14.3.5.3 Monte Carlo Stochastic Gradient Descent
14.4 Other Applications
    14.4.1 Online EM Algorithm
    14.4.2 MCMC Sampling
    14.4.3 Reinforcement Learning
    14.4.4 Deep Learning
14.5 Glossary
References
14.1 Introduction
Parameter estimation by optimization of an objective function, such as maximum-likelihood
and maximum a posteriori, is a fundamental idea in statistics and machine learning (Fisher,
1922, Lehmann and Casella, 2003, Hastie et al., 2011). However, widely used optimization-
based estimation algorithms, such as Fisher scoring, the Expectation-Maximization (EM)
algorithm, and iteratively reweighted least squares (Fisher, 1925, Dempster et al., 1977,
Green, 1984), are not scalable to modern datasets with hundreds of millions of data points
and hundreds or thousands of covariates (National Research Council, 2013).
To illustrate, let us consider the problem of estimating the true parameter value $\theta_\star \in \mathbb{R}^p$ from an i.i.d. sample $D = \{X_n, Y_n\}$, for $n = 1, 2, \ldots, N$; $X_n \in \mathbb{R}^p$ is the covariate vector and $Y_n \in \mathbb{R}^d$ is the outcome distributed conditionally on $X_n$ according to the known distribution $f$ and unknown model parameters $\theta_\star$,
$$Y_n \mid X_n \sim f(\cdot\,; X_n, \theta_\star)$$
We assume that the data points $(X_n, Y_n)$ are observed in sequence (streaming data). The log-likelihood, $\log f(Y; X, \theta)$, as a function of the parameter value $\theta$ given a data point $(X, Y)$, will be denoted by $\ell(\theta; Y, X)$; for brevity, we define $\ell(\theta; D) = \sum_{n=1}^{N} \ell(\theta; X_n, Y_n)$ as the complete data log-likelihood.
Traditional estimation methods are typically iterative and have a running time complexity that ranges between $O(Np^3)$ and $O(Np)$ in worst cases and best cases, respectively. Newton–Raphson methods, for instance, update an estimate $\theta_{n-1}$ of the parameters through the recursion
$$\theta_n = \theta_{n-1} - H_{n-1}^{-1} \nabla\ell(\theta_{n-1}; D) \tag{14.1}$$
where $H_n = \nabla\nabla\ell(\theta_n; D)$ is the $p \times p$ Hessian matrix of the complete data log-likelihood. The matrix inversion and the likelihood computation over the dataset $D$ imply complexity $O(Np^{2+\epsilon})$, which makes the algorithm unsuited for estimation with large datasets. Fisher scoring replaces the Hessian matrix in Equation 14.1 with its expected value over a data point $(X_n, Y_n)$, that is, it uses the Fisher information matrix $\mathcal{I}(\theta) = -\mathrm{E}\left(\nabla\nabla\ell(\theta; X_n, Y_n)\right)$.
The advantage of this method is that a steady increase in the likelihood is possible because, for the Fisher scoring step $\Delta\theta = \epsilon\, \mathcal{I}(\theta)^{-1} \nabla\ell(\theta; D)$, the difference
$$\ell(\theta + \Delta\theta; D) - \ell(\theta; D) \approx \epsilon\, \nabla\ell(\theta; D)^\top \mathcal{I}(\theta)^{-1} \nabla\ell(\theta; D) + O(\epsilon^2)$$
can be made positive for an appropriately small value $\epsilon > 0$, because $\mathcal{I}(\theta)$ is positive definite.
However, Fisher scoring is computationally comparable to Newton–Raphson’s, and thus it is
also unsuited for estimation with large datasets. Other general estimation algorithms, such
as EM or iteratively reweighted least squares (Green, 1984), have similar computational
constraints.
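To ground these cost considerations, the following is a minimal sketch of one Newton–Raphson step (Equation 14.1) for the linear normal model of Example 14.1 below; the function name and the use of NumPy are our own illustrative choices, not part of the original algorithms.

```python
import numpy as np

def newton_step(theta, X, Y):
    """One Newton-Raphson step (Equation 14.1) for the linear normal model.

    X is N x p, Y has length N. Forming X.T @ X costs O(N p^2) and the
    linear solve costs O(p^3); both operations touch the entire dataset D,
    which is what becomes prohibitive at scale.
    """
    grad = X.T @ (Y - X @ theta)   # gradient of the complete data log-likelihood
    H = -(X.T @ X)                 # Hessian, constant for this quadratic model
    return theta - np.linalg.solve(H, grad)
```

For this quadratic log-likelihood a single step already lands on the least-squares solution; the point of the sketch is the per-step cost, not the iteration count.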
Quasi-Newton methods are a powerful alternative that is widely used in practice. In
quasi-Newton methods, the Hessian in the Newton–Raphson algorithm is approximated by
a low-rank matrix that is updated at each iteration as new values of the gradient become
available. This yields an algorithm with complexity $O(Np^2)$ or $O(Np)$ in certain favorable cases (Hennig and Kiefel, 2013).
However, estimation with large datasets requires complexity that scales linearly with $N$, the number of data points, but sublinearly with $p$, the number of parameters. The first requirement on $N$ seems hard to overcome, because each data point carries information for $\theta_\star$ by the i.i.d. data assumption. Therefore, graceful scaling with $p$ is necessary.
Such computational requirements have recently sparked interest in procedures that utilize only first-order information, that is, methods that access only the gradient function. A prominent procedure that fits this description is stochastic gradient descent (SGD), defined through the iteration
$$\theta_n^{\rm sgd} = \theta_{n-1}^{\rm sgd} + a_n \nabla\ell(\theta_{n-1}^{\rm sgd}; X_n, Y_n) \tag{14.2}$$
We will refer to Equation 14.2 as SGD with explicit updates, or explicit SGD for short, because the next iterate $\theta_n^{\rm sgd}$ can be computed immediately after the new data point $(X_n, Y_n)$ is observed. The sequence $a_n > 0$ is known as the learning rate sequence, typically defined such that $n a_n \to \alpha > 0$ as $n \to \infty$. The parameter $\alpha > 0$ is the learning rate parameter, and it is crucial for the convergence and stability of explicit SGD.
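As a concrete illustration, here is a minimal sketch of the explicit SGD recursion (Equation 14.2) with the common choice $a_n = \alpha/n$, which satisfies $n a_n \to \alpha$; the function names and the simulated linear-model data (anticipating Example 14.1) are our own assumptions.

```python
import numpy as np

def explicit_sgd(grad_loglik, data_stream, alpha, theta0):
    """Explicit SGD (Equation 14.2): one update per streaming data point.

    grad_loglik(theta, x, y) is the log-likelihood gradient at one point;
    a_n = alpha / n is one learning rate sequence with n * a_n -> alpha.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    for n, (x, y) in enumerate(data_stream, start=1):
        theta += (alpha / n) * grad_loglik(theta, x, y)
    return theta

# Usage on the linear normal model of Example 14.1, where the gradient
# at one data point is (y - x @ theta) * x:
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(10_000, 3))
Y = X @ theta_star + rng.normal(size=10_000)
theta_hat = explicit_sgd(lambda th, x, y: (y - x @ th) * x,
                         zip(X, Y), alpha=1.0, theta0=np.zeros(3))
```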
From a computational perspective, the SGD procedure (Equation 14.2) is appealing because the expensive inversion of $p \times p$ matrices, as in Newton–Raphson, is replaced by a single sequence of scalars $a_n > 0$. Furthermore, the log-likelihood is evaluated at a single data point $(X_n, Y_n)$, and not on the entire dataset $D$.
From a theoretical perspective, the explicit SGD procedure is justified because Equation 14.2 is a special case of the stochastic approximation method of Robbins and Monro (1951). By the theory of stochastic approximation, explicit SGD converges to a point $\theta_\infty$ that satisfies $\mathrm{E}\left(\nabla\ell(\theta_\infty; X, Y)\right) = 0$; under typical regularity conditions, $\theta_\infty$ is exactly the true parameter value $\theta_\star$. As a recursive statistical estimation method, explicit SGD was first proposed by Sakrison (1965) in a simple second-order form, that is, using the Fisher information matrix in iteration (Equation 14.2); the simplicity of SGD has also made it very popular in optimization and machine learning with large datasets (Le Cun and Bottou, 2004, Zhang, 2004, Spall, 2005).
However, the remarkable simplicity of explicit SGD comes at a price, as the SGD procedure requires careful tuning of the learning rate parameter $\alpha$. For small values of $\alpha$, the iterates $\theta_n^{\rm sgd}$ will converge very slowly to $\theta_\star$ (large bias), whereas for large values of $\alpha$, the iterates $\theta_n^{\rm sgd}$ will either have a large asymptotic variance (with respect to random data $D$) or even diverge numerically. In large datasets with many parameters (large $p$), the balance between bias, variance, and stability is very delicate, and nearly impossible to achieve without appropriately modifying Equation 14.2.
Interestingly, the simple modification of explicit SGD defined through the iteration
$$\theta_n^{\rm im} = \theta_{n-1}^{\rm im} + a_n \nabla\ell(\theta_n^{\rm im}; X_n, Y_n) \tag{14.3}$$
can resolve its stability issue virtually at no cost. We will refer to Equation 14.3 as implicit stochastic gradient descent, or implicit SGD for short (Toulis et al., 2014, Toulis and Airoldi, 2015b). Equation 14.3 is implicit because the next iterate $\theta_n^{\rm im}$ appears on both sides of the equation. This equation is a $p$-dimensional fixed-point equation, which is generally hard to solve. However, for a large family of statistical models, it can be reduced to a one-dimensional fixed-point equation, as sketched below. We discuss computational issues of implicit SGD in Section 14.3.4.
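To illustrate the one-dimensional reduction, suppose the gradient has the form $\nabla\ell(\theta; x, y) = g(x^\top\theta;\, y)\, x$, as in generalized linear models, where $g$ is decreasing in its first argument. Then the implicit update moves only along $x$, so $\theta_n^{\rm im} = \theta_{n-1}^{\rm im} + \lambda x$, where the scalar $\lambda$ solves $\lambda = a_n\, g(x^\top\theta_{n-1}^{\rm im} + \lambda \|x\|^2;\, y)$ and is bracketed by $0$ and the explicit step size. The following is a hedged sketch of this computation; the function names and the use of SciPy's `brentq` root finder are our own choices.

```python
import numpy as np
from scipy.optimize import brentq

def implicit_glm_update(theta, x, y, a_n, g):
    """One implicit SGD step (Equation 14.3) when grad loglik = g(x'theta; y) x.

    Because the update direction is along x, theta_n = theta + lam * x,
    where lam solves the one-dimensional fixed-point equation
        lam = a_n * g(x'theta + lam * ||x||^2; y).
    For g decreasing, lam lies between 0 and the explicit step size b.
    """
    xx = x @ x
    b = a_n * g(x @ theta, y)                 # explicit step size brackets lam
    if b == 0.0:
        return theta
    f = lambda lam: lam - a_n * g(x @ theta + lam * xx, y)
    lam = brentq(f, min(0.0, b), max(0.0, b))
    return theta + lam * x

# Usage for logistic regression, where g(eta; y) = y - 1/(1 + exp(-eta)):
g_logit = lambda eta, y: y - 1.0 / (1.0 + np.exp(-eta))
```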
The first intuition for implicit SGD is obtained using a Taylor expansion of the implicit update (Equation 14.3). In particular, assuming a common point $\theta_{n-1}^{\rm sgd} = \theta_{n-1}^{\rm im} = \theta$, a Taylor expansion of Equation 14.3 around $\theta_{n-1}^{\rm im}$ implies
$$\Delta\theta_n^{\rm im} = \left(I + a_n \hat{\mathcal{I}}(\theta; X_n, Y_n)\right)^{-1} \Delta\theta_n^{\rm sgd} + O(a_n^2) \tag{14.4}$$
where $\Delta\theta_n = \theta_n - \theta_{n-1}$ for both the explicit and implicit methods, $\hat{\mathcal{I}}(\theta; X_n, Y_n) = -\nabla\nabla\ell(\theta; X_n, Y_n)$ is the observed Fisher information matrix, and $I$ is the $p \times p$ identity matrix.
Thus, implicit SGD uses updates that are a shrunken version of the explicit ones. The shrinkage factor in Equation 14.4 depends on the observed information up to the $n$th data point and is similar to shrinkage in ridge regression.
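To spell out the expansion behind Equation 14.4 (a sketch, assuming $\ell$ is twice differentiable in $\theta$): linearizing the implicit gradient around the common point $\theta$,
$$\nabla\ell(\theta_n^{\rm im}; X_n, Y_n) \approx \nabla\ell(\theta; X_n, Y_n) - \hat{\mathcal{I}}(\theta; X_n, Y_n)\, \Delta\theta_n^{\rm im},$$
so Equation 14.3 gives $\Delta\theta_n^{\rm im} \approx a_n \nabla\ell(\theta; X_n, Y_n) - a_n \hat{\mathcal{I}}(\theta; X_n, Y_n)\, \Delta\theta_n^{\rm im}$; collecting the $\Delta\theta_n^{\rm im}$ terms and noting that $a_n \nabla\ell(\theta; X_n, Y_n) = \Delta\theta_n^{\rm sgd}$ yields Equation 14.4.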
Naturally, implicit SGD also has a Bayesian interpretation. In particular, if the log-likelihood is continuously differentiable, then the update in Equation 14.3 is equivalent to the update
$$\theta_n^{\rm im} = \arg\max_{\theta \in \mathbb{R}^p} \left\{ a_n \ell(\theta; X_n, Y_n) - \frac{1}{2}\,\|\theta - \theta_{n-1}^{\rm im}\|^2 \right\} \tag{14.5}$$
*Recursive estimation methods using stochastic approximation were originally developed for problems with streaming data. However, these methods are more broadly applicable to estimation with a static dataset, and asymptotically the two scenarios are equivalent. With streaming data the estimates converge to the true parameter value $\theta_\star$, whereas with a static dataset (i.e., with a finite sample) they converge instead to the point that minimizes some predefined empirical loss, for example, based on the likelihood.
The iterate $\theta_n^{\rm im}$ from Equation 14.5 is the posterior mode of the following Bayesian model:
$$\theta \mid \theta_{n-1}^{\rm im} \sim \mathcal{N}(\theta_{n-1}^{\rm im},\, a_n I)$$
$$Y_n \mid X_n, \theta \sim f(\cdot\,; X_n, \theta) \tag{14.6}$$
where $\mathcal{N}$ denotes the normal distribution. Therefore, the learning rate $a_n$ relates to the information received after $n$ data points have been observed, and encodes our trust in the current estimate $\theta_{n-1}^{\rm im}$. The Bayesian formulation (Equation 14.6) demonstrates the flexibility of implicit SGD. For example, depending on the parameter space of $\theta_\star$, the Bayesian model in Equation 14.6 could be different; for instance, if $\theta_\star$ was a scale parameter, then the normal distribution could be replaced by an inverse chi-squared distribution. Furthermore, instead of $a_n I$ as the prior variance, it would be statistically efficient to use the scaled inverse Fisher information matrix $(1/n)\,\mathcal{I}(\theta_{n-1}^{\rm im})^{-1}$, completely analogous to Sakrison's method; we discuss these ideas in Section 14.3.5.1.
There is also a tight connection of Equation 14.5 to proximal methods in optimization. For example, if we replaced the stochastic component $\ell(\theta; X_n, Y_n)$ in Equation 14.5 with the complete data log-likelihood $\ell(\theta; D)$, then Equation 14.5 would essentially be the proximal point algorithm of Rockafellar (1976), which applies to deterministic settings. This algorithm is known for its numerical stability, and has been generalized through the idea of splitting algorithms (Lions and Mercier, 1979); see Parikh and Boyd (2013) for a comprehensive review. The convergence of proximal methods with a stochastic component, as in Equation 14.5, has been analyzed recently, under various forms and assumptions, by Bertsekas (2011), Ryu and Boyd (2014), and Rosasco et al. (2014). From a statistical perspective, Toulis and Airoldi (2015a) derived the asymptotic variance of $\theta_n^{\rm sgd}$ and $\theta_n^{\rm im}$ as estimators of $\theta_\star$, and provided an algorithm to efficiently compute Equation 14.5 for the family of generalized linear models; we show a generalization of this result in Section 14.3.4. In the online learning literature, regret analyses of implicit methods have been given by Kivinen et al. (2006) and Kulis and Bartlett (2010). Further intuitions for proximal methods (Equation 14.5) have been given by Krakowski et al. (2007) and Nemirovski et al. (2009), who showed that proximal methods can better fit the geometry of the parameter space. Finally, Toulis et al. (2015) derived a stochastic approximation framework that encompasses stochastic proximal methods, and showed that it retains the asymptotic properties of the Robbins–Monro framework while being significantly more robust in the non-asymptotic regime.
Arguably, the normalized least mean squares (NLMS) filter (Nagumo and Noda, 1967) was the first statistical method that used an implicit update as in Equation 14.3, and it was shown to be robust to input noise (Slock, 1993). Two other recent stochastic proximal methods are Prox-SVRG (Xiao and Zhang, 2014) and Prox-SAG (Schmidt et al., 2013, Section 6). The main idea in both methods is to replace the gradient in Equation 14.5 with an estimate of the full gradient, averaged over all data points, that has the same expectation as the gradient in Equation 14.3 but smaller variance. Because of their operational complexity, we will not discuss these methods further. Instead, in Section 14.3.5.1, we will discuss a related proximal method, namely AdaGrad (Duchi et al., 2011), which maintains one learning rate for each parameter component and updates these learning rates as new data points are observed.
Example 14.1 Consider the linear normal model, $Y_n \mid X_n \sim \mathcal{N}(X_n^\top \theta_\star, 1)$. The log-likelihood for this model is $\ell(\theta; X_n, Y_n) = -\frac{1}{2}(Y_n - X_n^\top \theta)^2$. Therefore, the explicit SGD procedure will be
$$\theta_n^{\rm sgd} = \theta_{n-1}^{\rm sgd} + a_n (Y_n - X_n^\top \theta_{n-1}^{\rm sgd})\, X_n = (I - a_n X_n X_n^\top)\, \theta_{n-1}^{\rm sgd} + a_n Y_n X_n \tag{14.7}$$
Equation 14.7 is known as the least mean squares (LMS) filter in signal processing, or as the Widrow–Hoff algorithm (Widrow and Hoff, 1960), and it is a special case of explicit SGD.
The stability problems of explicit SGD become apparent by inspection of Equation 14.7; a misspecification of $a_n$ can lead to a poor next iterate $\theta_n^{\rm sgd}$, for example, when $I - a_n X_n X_n^\top$ has large negative eigenvalues; we discuss these issues in Section 14.3.2.
The implicit SGD procedure for the linear model is
$$\theta_n^{\rm im} = \theta_{n-1}^{\rm im} + a_n (Y_n - X_n^\top \theta_n^{\rm im})\, X_n,$$
which can be solved for $\theta_n^{\rm im}$ in closed form:
$$\theta_n^{\rm im} = (I + a_n X_n X_n^\top)^{-1} \theta_{n-1}^{\rm im} + a_n (I + a_n X_n X_n^\top)^{-1} Y_n X_n \tag{14.8}$$
Equation 14.8 is known as the NLMS filter in signal processing (Nagumo and Noda, 1967). In contrast to explicit SGD, the implicit iterate $\theta_n^{\rm im}$ is a weighted average between the previous iterate $\theta_{n-1}^{\rm im}$ and the new observation $Y_n X_n$, and this weighting makes the update stable to misspecifications of the learning rate $a_n$.
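The contrast between Equations 14.7 and 14.8 is easy to see in simulation. The following sketch (our own construction; the dimensions, seed, and deliberately large rate parameter $\alpha$ are illustrative) runs both filters on the same stream; the Sherman–Morrison identity is used to apply $(I + a_n X_n X_n^\top)^{-1}$ without a $p \times p$ solve.

```python
import numpy as np

rng = np.random.default_rng(1)
p, N, alpha = 10, 5000, 20.0            # alpha chosen deliberately too large
theta_star = rng.normal(size=p)

th_sgd = np.zeros(p)                     # explicit SGD / LMS (Equation 14.7)
th_im = np.zeros(p)                      # implicit SGD / NLMS (Equation 14.8)
for n in range(1, N + 1):
    x = rng.normal(size=p)
    y = x @ theta_star + rng.normal()
    a_n = alpha / n
    # Explicit: I - a_n x x' can have large negative eigenvalues early on,
    # so the iterate may explode before a_n decays.
    th_sgd = th_sgd + a_n * (y - x @ th_sgd) * x
    # Implicit: apply (I + a_n x x')^{-1} via the Sherman-Morrison identity.
    v = th_im + a_n * y * x
    th_im = v - (a_n * (x @ v) / (1.0 + a_n * (x @ x))) * x

print("explicit SGD error:", np.linalg.norm(th_sgd - theta_star))  # typically huge
print("implicit SGD error:", np.linalg.norm(th_im - theta_star))   # typically small
```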
14.1.1 Outline
The structure of this chapter is as follows. In Section 14.2, we give an overview of the Robbins–Monro procedure and Sakrison's recursive estimation method, which provide the theoretical basis for SGD methods. In Section 14.3, we introduce a simple generalization of explicit and implicit SGD, and we analyze them as statistical estimation procedures for the model parameters $\theta_\star$ after $n$ data points have been observed. In Section 14.3.1, we give results on the frequentist statistical properties of SGD estimators, that is, their asymptotic bias and variance across multiple realizations of the dataset $D$. We then leverage these results to study optimal learning rate sequences $a_n$ (Section 14.3.3), the loss of statistical information in SGD, and numerical stability (Section 14.3.2). In Section 14.3.5, we illustrate three extensions of the SGD methods, in particular: (1) second-order SGD methods (Section 14.3.5.1), which adaptively approximate the Fisher information matrix; (2) averaged SGD methods, which use larger learning rates together with averaging of the iterates; and (3) Monte Carlo SGD methods, which can be applied when the log-likelihood cannot be efficiently computed. In Section 14.4, we review applications of SGD in statistics and machine learning, namely, online EM, Markov chain Monte Carlo (MCMC) posterior sampling, reinforcement learning, and deep learning.
14.2 Stochastic Approximation
Consider a random variable $H(\theta)$ that depends on a parameter $\theta$; for simplicity, assume that $H(\theta)$ and $\theta$ are real numbers. The regression function, $h(\theta) = \mathrm{E}\left(H(\theta)\right)$, is decreasing but possibly unknown. Robbins and Monro (1951) considered the problem of finding the unique point $\theta_\star$ for which $h(\theta_\star) = 0$. They devised a procedure, known as the Robbins–Monro procedure, in which an estimate $\theta_{n-1}$ of $\theta_\star$ is utilized to sample one new data point $H(\theta_{n-1})$; by definition, $\mathrm{E}\left(H(\theta_{n-1}) \mid \theta_{n-1}\right) = h(\theta_{n-1})$. The estimate is then updated according to the following rule:
$$\theta_n = \theta_{n-1} + a_n H(\theta_{n-1}) \tag{14.9}$$
The scalar $a_n > 0$ is the learning rate and should decay to zero, but not too fast, to guarantee convergence. Robbins and Monro (1951) proved that $\mathrm{E}\left((\theta_n - \theta_\star)^2\right) \to 0$ if

a. $\mathrm{E}\left(H(\theta)^2 \mid \theta\right) < \infty$, for any $\theta$, and

b. $\sum_{i=1}^{\infty} a_i = \infty$ and $\sum_{i=1}^{\infty} a_i^2 < \infty$.

A minimal numerical sketch of the procedure follows.
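As a concrete check of these conditions, the following toy example (our own construction, not from the original paper) applies the rule of Equation 14.9 with $a_n = \alpha/n$, which satisfies condition (b), to the decreasing regression function $h(\theta) = \theta_\star - \theta$ observed with Gaussian noise, which satisfies condition (a).

```python
import numpy as np

rng = np.random.default_rng(2)
theta_star = 3.0

def H(theta):
    # Noisy observation whose regression function h(theta) = theta_star - theta
    # is decreasing with unique root theta_star.
    return (theta_star - theta) + rng.normal()

theta, alpha = 0.0, 1.0
for n in range(1, 10_001):
    theta += (alpha / n) * H(theta)      # Robbins-Monro update (Equation 14.9)

print(theta)                             # should be close to theta_star = 3.0
```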