The iterate $\theta^{\text{im}}_n$ from Equation 14.5 is the posterior mode of the following Bayesian model:
\[
\theta \mid \theta^{\text{im}}_{n-1} \sim \mathcal{N}\!\left(\theta^{\text{im}}_{n-1},\, a_n I\right), \qquad
Y_n \mid X_n, \theta \sim f(\cdot\,;\, X_n, \theta) \qquad (14.6)
\]
where $\mathcal{N}$ denotes the normal distribution. Therefore, the learning rate $a_n$ relates to the information received after $n$ data points have been observed, and encodes our trust in the current estimate $\theta^{\text{im}}_{n-1}$. The Bayesian formulation (Equation 14.6) demonstrates the flexibility of implicit SGD. For example, depending on the parameter space of $\theta_\star$, the Bayesian model in Equation 14.6 could be different; for instance, if $\theta_\star$ were a scale parameter, then the normal distribution could be replaced by an inverse chi-squared distribution. Furthermore, instead of $a_n I$ as the prior variance, it would be statistically efficient to use the inverse Fisher information matrix $(1/n)\,\mathcal{I}(\theta^{\text{im}}_{n-1})^{-1}$, completely analogous to Sakrison's method; we discuss these ideas in Section 14.3.5.1.
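To make the posterior-mode interpretation concrete, the following is a minimal numerical check of Equation 14.6 in the simplest one-dimensional case, a normal model for the mean with unit variance; the specific values of $a_n$, $\theta^{\text{im}}_{n-1}$, and $Y_n$ below are illustrative choices rather than anything prescribed by the chapter.

```python
import numpy as np

# One-dimensional check of the posterior-mode view of implicit SGD (Eq. 14.6).
# Model: prior theta ~ N(theta_prev, a_n), likelihood Y_n | theta ~ N(theta, 1).
# Illustrative values (not from the chapter):
a_n, theta_prev, y_n = 0.3, 0.0, 2.0

# Implicit SGD update (Eq. 14.5) solved in closed form:
# theta_n = theta_prev + a_n * (y_n - theta_n)  =>  theta_n = (theta_prev + a_n*y_n) / (1 + a_n)
theta_implicit = (theta_prev + a_n * y_n) / (1 + a_n)

# Posterior mode of Eq. 14.6: maximize -(theta - theta_prev)^2/(2 a_n) - (y_n - theta)^2/2.
# A closed form exists here, but a grid search makes the check model-agnostic.
grid = np.linspace(-5, 5, 200_001)
log_post = -(grid - theta_prev) ** 2 / (2 * a_n) - (y_n - grid) ** 2 / 2
theta_mode = grid[np.argmax(log_post)]

print(theta_implicit, theta_mode)  # the two values agree up to the grid resolution
```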
There is also a tight connection of Equation 14.5 to proximal methods in optimization.
For example, if we replaced the stochastic component $\ell(\theta; X_n, Y_n)$ in Equation 14.5 with the complete-data log-likelihood $\ell(\theta; D)$, then Equation 14.5 would be essentially the
proximal point algorithm of Rockafellar (1976) that applies to deterministic settings. This
algorithm is known for its numerical stability, and has been generalized through the idea
of splitting algorithms (Lions and Mercier, 1979); see Parikh and Boyd (2013) for a
comprehensive review. The convergence of proximal methods with a stochastic component,
as in Equation 14.5, has been analyzed recently—under various forms and assumptions—
by Bertsekas (2011), Ryu and Boyd (2014), and Rosasco et al. (2014). From a statistical
perspective, Toulis and Airoldi (2015a) derived the asymptotic variance of $\theta^{\text{sgd}}_n$ and $\theta^{\text{im}}_n$ as estimators of $\theta_\star$, and provided an algorithm to efficiently compute Equation 14.5 for the family of generalized linear models; we show a generalization of this result in Section 14.3.4.
In the online learning literature, regret analyses of implicit methods have been given by
Kivinen et al. (2006) and Kulis and Bartlett (2010). Further intuitions for proximal methods
(Equation 14.5) have been given by Krakowski et al. (2007) and Nemirovski et al. (2009),
who showed that proximal methods can better fit the geometry of the parameter space. Finally, Toulis et al. (2015) derived a stochastic approximation framework that encompasses stochastic proximal methods, and showed that it retains the asymptotic properties of the Robbins and Monro framework but is significantly more robust in the non-asymptotic regime.
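To illustrate why Equation 14.5 can be computed efficiently for generalized linear models, the following sketch reduces the implicit update to a one-dimensional root-finding problem when the log-likelihood gradient has the form $(Y_n - h(X_n^\top \theta))X_n$ with a nondecreasing mean function $h$. The bisection routine and all names below are ours, a minimal sketch rather than the exact algorithm of Toulis and Airoldi (2015a).

```python
import numpy as np

def implicit_glm_update(theta_prev, x_n, y_n, a_n, h, n_iter=60):
    """One implicit SGD step (Eq. 14.5) for a GLM-type gradient (y - h(x'theta)) * x.

    The update direction is along x_n, so write theta_n = theta_prev + xi * x_n and
    solve the scalar equation xi = a_n * (y_n - h(x_n'theta_prev + xi * ||x_n||^2)).
    """
    eta = float(x_n @ theta_prev)          # current linear predictor
    norm2 = float(x_n @ x_n)
    r = a_n * (y_n - h(eta))               # explicit-SGD step; the root xi lies between 0 and r
    lo, hi = (0.0, r) if r >= 0 else (r, 0.0)
    for _ in range(n_iter):                # bisection on g(xi) = a_n*(y_n - h(eta + xi*norm2)) - xi,
        mid = 0.5 * (lo + hi)              # which is decreasing because h is nondecreasing
        if a_n * (y_n - h(eta + mid * norm2)) - mid >= 0:
            lo = mid
        else:
            hi = mid
    xi = 0.5 * (lo + hi)
    return theta_prev + xi * x_n

# Illustrative use with a logistic-regression observation (h = sigmoid):
rng = np.random.default_rng(0)
theta = np.zeros(3)
x, y, a = rng.normal(size=3), 1.0, 0.5
theta = implicit_glm_update(theta, x, y, a, h=lambda t: 1.0 / (1.0 + np.exp(-t)))
print(theta)
```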
Arguably, the normalized least mean squares (NLMS) filter (Nagumo and Noda, 1967) was the first statistical method that used an implicit update as in Equation 14.3 and was
shown to be robust to input noise (Slock, 1993). Two other recent stochastic proximal methods are Prox-SVRG (Xiao and Zhang, 2014) and Prox-SAG (Schmidt et al., 2013, Section 6). The main idea in both methods is to replace the gradient in Equation 14.5 with an estimate of the full gradient, averaged over all data points, which has the same expectation as the gradient of Equation 14.3 but smaller variance. Because of their operational complexity, we will not discuss these methods further. Instead, in Section 14.3.5.1, we will discuss a related proximal method, AdaGrad (Duchi et al., 2011), which maintains one learning rate for each parameter component and updates these learning rates as new data points are observed.
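For concreteness, the following is a minimal sketch of the diagonal AdaGrad idea just mentioned: every parameter component keeps its own learning rate, which shrinks with the accumulated squared gradients of that component. The step size eta, the constant eps, and the simulated data are illustrative choices, not values from Duchi et al. (2011).

```python
import numpy as np

def adagrad_step(theta, grad, grad_sq_sum, eta=0.1, eps=1e-8):
    """One diagonal-AdaGrad ascent step on a log-likelihood gradient.

    grad_sq_sum accumulates the element-wise squared gradients seen so far, so each
    component of theta effectively has its own learning rate eta / sqrt(grad_sq_sum).
    """
    grad_sq_sum = grad_sq_sum + grad ** 2
    theta = theta + eta * grad / (np.sqrt(grad_sq_sum) + eps)
    return theta, grad_sq_sum

# Illustrative use with the linear-model gradient (y - x'theta) * x of Example 14.1 below:
rng = np.random.default_rng(1)
theta, g2 = np.zeros(2), np.zeros(2)
for _ in range(500):
    x = rng.normal(size=2)
    y = x @ np.array([1.0, -2.0]) + rng.normal()
    theta, g2 = adagrad_step(theta, (y - x @ theta) * x, g2)
print(theta)  # moves toward the true parameter (1, -2)
```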
Example 14.1 Consider the linear normal model, $Y_n \mid X_n \sim \mathcal{N}(X_n^\top \theta_\star, 1)$. The log-likelihood for this model is $\ell(\theta; X_n, Y_n) = -\frac{1}{2}\left(Y_n - X_n^\top \theta\right)^2$. Therefore, the explicit SGD procedure will be
\[
\theta^{\text{sgd}}_n = \theta^{\text{sgd}}_{n-1} + a_n \left(Y_n - X_n^\top \theta^{\text{sgd}}_{n-1}\right) X_n = \left(I - a_n X_n X_n^\top\right) \theta^{\text{sgd}}_{n-1} + a_n Y_n X_n \qquad (14.7)
\]
Equation 14.7 is known as the least mean squares filter (LMS) in signal processing, or as the
Widrow–Hoff algorithm (Widrow and Hoff, 1960), and it is a special case of explicit SGD.
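As a quick illustration, the following sketch runs the LMS recursion of Equation 14.7 on data simulated from the model of Example 14.1; the learning-rate sequence $a_n = 1/(n+10)$, the dimension, and the true parameter are illustrative choices.

```python
import numpy as np

# Explicit SGD / LMS filter (Eq. 14.7) on data simulated from the model of Example 14.1,
# Y_n | X_n ~ N(X_n'theta_star, 1). The rate a_n = 1/(n + 10) is an illustrative choice.
rng = np.random.default_rng(2)
theta_star = np.array([1.0, -2.0, 0.5])
theta = np.zeros(3)

for n in range(1, 5001):
    x = rng.normal(size=3)
    y = x @ theta_star + rng.normal()
    a_n = 1.0 / (n + 10)
    theta = theta + a_n * (y - x @ theta) * x   # Eq. 14.7, written in its first form

print(theta)  # approaches theta_star = (1, -2, 0.5)
```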