Fabian, V. (1973). Asymptotically efficient stochastic approximation: The RM case. Annals
of Statistics, 1:486–495.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 222:309–368.
Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh,
Scotland.
Gardner, W. A. (1984). Learning characteristics of stochastic gradient descent algorithms:
A general study, analysis, and critique. Signal Processing, 6(2):113–133.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741.
George, A. P. and Powell, W. B. (2006). Adaptive stepsizes for recursive estimation with
applications in approximate dynamic programming. Machine Learning, 65(1):167–198.
Girolami, M. and Calderhead, B. (2011). Riemann manifold Langevin and Hamiltonian
Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 73(2):123–214.
Gosavi, A. (2009). Reinforcement learning: A tutorial survey and recent advances. INFORMS Journal on Computing, 21(2):178–192.
Green, P. J. (1984). Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society. Series B (Methodological), 46:149–192.
Hastie, T., Tibshirani, R., and Friedman, J. (2011). The Elements of Statistical Learning:
Data Mining, Inference, and Prediction, 2nd edition. Springer, New York.
Hennig, P. and Kiefel, M. (2013). Quasi-Newton methods: A new direction. The Journal of Machine Learning Research, 14(1):843–865.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence.
Neural Computation, 14(8):1771–1800.
Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational
inference. The Journal of Machine Learning Research, 14(1):1303–1347.
Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems, pp. 315–323. Curran Associates, Inc.
Karoui, N. E. (2008). Spectrum estimation for large dimensional covariance matrices using
random matrix theory. Annals of Statistics, 36(6):2757–2790.
Kivinen, J., Warmuth, M. K., and Hassibi, B. (2006). The p-norm generalization of the
LMS algorithm for adaptive filtering. IEEE Transactions on Signal Processing, 54(5):
1782–1793.
Korattikara, A., Chen, Y., and Welling, M. (2014). Austerity in MCMC land: Cutting
the Metropolis-Hastings budget. In Proceedings of the 31st International Conference on
Machine Learning, pp. 181–189.
Krakowski, K. A., Mahony, R. E., Williamson, R. C., and Warmuth, M. K. (2007). A geometric view of non-linear online stochastic gradient descent. Author website.
Kulis, B. and Bartlett, P. L. (2010). Implicit online learning. In Proceedings of the 27th
International Conference on Machine Learning, pp. 575–582.
Lai, T. L. and Robbins, H. (1979). Adaptive design and stochastic approximation. Annals
of Statistics, 7:1196–1221.
Lange, K. (1995). A gradient algorithm locally equivalent to the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 57(2):425–437.
Le Cun, Y. and Bottou, L. (2004). Large scale online learning. Advances in Neural Information Processing Systems, 16:217–224.
Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation, volume 31. Springer,
New York.
Lehmann, E. L. and Casella, G. (2003). Theory of Point Estimation, 2nd edition. Springer,
New York.
Lions, P.-L. and Mercier, B. (1979). Splitting algorithms for the sum of two nonlinear
operators. SIAM Journal on Numerical Analysis, 16(6):964–979.
Liu, Z., Almhana, J., Choulakian, V., and McGorman, R. (2006). Online EM algorithm for
mixture with application to internet traffic modeling. Computational Statistics & Data
Analysis, 50(4):1052–1071.
Ljung, L., Pflug, G., and Walk, H. (1992). Stochastic Approximation and Optimization of
Random Systems, volume 17. Springer Basel AG, Basel, Switzerland.
Moulines, E. and Bach, F. R. (2011). Non-asymptotic analysis of stochastic approximation
algorithms for machine learning. In Advances in Neural Information Processing Systems,
pp. 451–459.
Murata, N. (1998). A statistical study of online learning. Online Learning and Neural
Networks. Cambridge University Press, Cambridge.
Nagumo, J.-I. and Noda, A. (1967). A learning method for system identification. IEEE
Transactions on Automatic Control, 12(3):282–287.
National Research Council (2013). Frontiers in Massive Data Analysis. National Academies
Press, Washington, DC.
Neal, R. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte
Carlo, volume 2, pp. 113–162. Chapman & Hall/CRC Press, Boca Raton, FL.
Neal, R. M. and Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental,
sparse, and other variants. In Jordan, M. I. (ed.), Learning in Graphical Models, pp. 355–
368. Springer, Cambridge, MA.
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic
approximation approach to stochastic programming. SIAM Journal on Optimization,
19(4):1574–1609.
Nevelson, M. B. and Khasminskiĭ, R. Z. (1973). Stochastic Approximation and Recursive Estimation, volume 47. American Mathematical Society, Providence, RI.
Nowlan, S. J. (1991). Soft competitive adaptation: Neural network learning algorithms based on fitting statistical mixtures. PhD thesis, Carnegie Mellon University, Pittsburgh, PA.
Parikh, N. and Boyd, S. (2013). Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231.
Pillai, N. S. and Smith, A. (2014). Ergodicity of approximate MCMC chains with
applications to large datasets. arXiv preprint: arXiv:1405.0182.
Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by
averaging. SIAM Journal on Control and Optimization, 30(4):838–855.
Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of
Mathematical Statistics, 22:400–407.
Rockafellar, R. T. (1976). Monotone operators and the proximal point algorithm. SIAM
Journal on Control and Optimization, 14(5):877–898.
Rosasco, L., Villa, S., and Vũ, B. C. (2014). Convergence of stochastic proximal gradient algorithm. arXiv preprint: arXiv:1403.5074.
Ruppert, D. (1988a). Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, Ithaca, NY.
Ruppert, D. (1988b). Stochastic approximation. Technical report, Cornell University
Operations Research and Industrial Engineering, Ithaca, NY.
Ryu, E. K. and Boyd, S. (2014). Stochastic proximal iteration: A non-asymptotic improvement upon stochastic gradient descent. Author website, early draft.
Sacks, J. (1958). Asymptotic distribution of stochastic approximation procedures. The
Annals of Mathematical Statistics, 29(2):373–405.
Sakrison, D. J. (1965). Efficient recursive estimation; application to estimating the parameters of a covariance function. International Journal of Engineering Science, 3(4):461–483.
Salakhutdinov, R., Mnih, A., and Hinton, G. (2007). Restricted Boltzmann machines for
collaborative filtering. In Proceedings of the 24th International Conference on Machine
Learning, pp. 791–798. ACM, New York.
Sato, I. and Nakagawa, H. (2014). Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Ito process. JMLR W&CP, 32(1):982–990.
Sato, M.-A. and Ishii, S. (2000). Online EM algorithm for the normalized Gaussian network.
Neural Computation, 12(2):407–432.
Schaul, T., Zhang, S., and LeCun, Y. (2012). No more pesky learning rates. In Proceedings
of the 30th International Conference on Machine Learning, pp. 343–351.
Schmidt, M., Le Roux, N., and Bach, F. (2013). Minimizing finite sums with the stochastic
average gradient. Technical report, HAL 00860051.
Schraudolph, N., Yu, J., and Günter, S. (2007). A stochastic quasi-Newton method for online convex optimization. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, pp. 436–443.
Slock, D. T. (1993). On the convergence behavior of the LMS and the normalized LMS
algorithms. IEEE Transactions on Signal Processing, 41(9):2811–2825.
Spall, J. C. (2000). Adaptive stochastic approximation by the simultaneous perturbation
method. IEEE Transactions on Automatic Control, 45(10):1839–1853.
Spall, J. C. (2005). Introduction to Stochastic Search and Optimization: Estimation,
Simulation, and Control, volume 65. John Wiley & Sons, Hoboken, NJ.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine
Learning, 3(1):9–44.
Tamar, A., Toulis, P., Mannor, S., and Airoldi, E. M. (2014). Implicit temporal differences.
In Neural Information Processing Systems, Workshop on Large-Scale Reinforcement
Learning.
Taylor, G. W., Hinton, G. E., and Roweis, S. T. (2006). Modeling human motion
using binary latent variables. In Advances in Neural Information Processing Systems,
pp. 1345–1352.
Titterington, D. M. (1984). Recursive parameter estimation using incomplete data. Journal
of the Royal Statistical Society. Series B (Methodological), 46:257–267.
Toulis, P. and Airoldi, E. M. (2015a). Implicit stochastic gradient descent. arXiv preprint:
arXiv:1408.2923.
Toulis, P. and Airoldi, E. M. (2015b). Scalable estimation strategies based on stochastic
approximations: Classical results and new insights. Statistics and Computing, 25(4):
781–795.
Toulis, P., Rennie, J., and Airoldi, E. M. (2014). Statistical analysis of stochastic gradient
methods for generalized linear models. JMLR W&CP, 32(1):667–675.
Toulis, P. and Airoldi, E. M. (2015a). Implicit stochastic approximation. arXiv preprint: arXiv:1510.00967.
Toulis, P., Tran, D., and Airoldi, E. M. (2015b). Towards stability and optimality in stochastic gradient descent. arXiv preprint: arXiv:1505.02417.
Tran, D., Lan, T., Toulis, P., and Airoldi, E. M. (2015a). Stochastic Gradient Descent for
Scalable Estimation. R package version 0.1.
Tran, D., Toulis, P., and Airoldi, E. M. (2015b). Stochastic gradient descent methods for
estimation with large data sets. arXiv preprint: arXiv:1509.06459.
Venter, J. (1967). An extension of the Robbins-Monro procedure. The Annals of Mathematical Statistics, 38:181–190.
Wang, C., Chen, X., Smola, A., and Xing, E. (2013). Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems, pp. 181–189.
Wang, M. and Bertsekas, D. P. (2013). Stabilization of stochastic iterative methods for
singular and nearly singular linear systems. Mathematics of Operations Research, 39(1):
1–30.
Wei, C. (1987). Multivariate adaptive stochastic approximation. Annals of Statistics,
15:1115–1130.
Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin
dynamics. In Proceedings of the 28th International Conference on Machine Learning,
pp. 681–688.
Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. IRE WESCON Convention
Record, 4:96–104. (Defense Technical Information Center.)
Xiao, L. and Zhang, T. (2014). A proximal stochastic gradient method with progressive
variance reduction. SIAM Journal on Optimization, 24:2057–2075.
Xu, W. (2011). Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv preprint: arXiv:1107.2490.
Younes, L. (1999). On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics, 65(3–4):177–228.
Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient
descent algorithms. In Proceedings of the 21st International Conference on Machine
Learning, p. 116. ACM, New York.