Fabian, V. (1973). Asymptotically efficient stochastic approximation: The RM case. Annals
of Statistics, 1:486–495.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 222:309–368.
Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh,
Scotland.
Gardner, W. A. (1984). Learning characteristics of stochastic gradient descent algorithms:
A general study, analysis, and critique. Signal Processing, 6(2):113–133.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741.
George, A. P. and Powell, W. B. (2006). Adaptive stepsizes for recursive estimation with
applications in approximate dynamic programming. Machine Learning, 65(1):167–198.
Girolami, M. and Calderhead, B. (2011). Riemann manifold Langevin and Hamiltonian
Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 73(2):123–214.
Gosavi, A. (2009). Reinforcement learning: A tutorial survey and recent advances. INFORMS Journal on Computing, 21(2):178–192.
Green, P. J. (1984). Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society. Series B (Methodological), 46:149–192.
Hastie, T., Tibshirani, R., and Friedman, J. (2011). The Elements of Statistical Learning:
Data Mining, Inference, and Prediction, 2nd edition. Springer, New York.
Hennig, P. and Kiefel, M. (2013). Quasi-Newton methods: A new direction. The Journal of Machine Learning Research, 14(1):843–865.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence.
Neural Computation, 14(8):1771–1800.
Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational
inference. The Journal of Machine Learning Research, 14(1):1303–1347.
Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems, pp. 315–323. Curran Associates, Inc.
Karoui, N. E. (2008). Spectrum estimation for large dimensional covariance matrices using
random matrix theory. Annals of Statistics, 36(6):2757–2790.
Kivinen, J., Warmuth, M. K., and Hassibi, B. (2006). The p-norm generalization of the
LMS algorithm for adaptive filtering. IEEE Transactions on Signal Processing, 54(5):
1782–1793.
Korattikara, A., Chen, Y., and Welling, M. (2014). Austerity in MCMC land: Cutting
the Metropolis-Hastings budget. In Proceedings of the 31st International Conference on
Machine Learning, pp. 181–189.
Krakowski, K. A., Mahony, R. E., Williamson, R. C., and Warmuth, M. K. (2007). A geometric view of non-linear online stochastic gradient descent. Author website.
Kulis, B. and Bartlett, P. L. (2010). Implicit online learning. In Proceedings of the 27th
International Conference on Machine Learning, pp. 575–582.
Lai, T. L. and Robbins, H. (1979). Adaptive design and stochastic approximation. Annals
of Statistics, 7:1196–1221.
Lange, K. (1995). A gradient algorithm locally equivalent to the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 57(2):425–437.
Le Cun, Y. and Bottou, L. (2004). Large scale online learning. Advances in Neural Information Processing Systems, 16:217–224.
Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation, volume 31. Springer,
New York.
Lehmann, E. L. and Casella, G. (2003). Theory of Point Estimation, 2nd edition. Springer,
New York.
Lions, P.-L. and Mercier, B. (1979). Splitting algorithms for the sum of two nonlinear
operators. SIAM Journal on Numerical Analysis, 16(6):964–979.
Liu, Z., Almhana, J., Choulakian, V., and McGorman, R. (2006). Online EM algorithm for
mixture with application to internet traffic modeling. Computational Statistics & Data
Analysis, 50(4):1052–1071.
Ljung, L., Pflug, G., and Walk, H. (1992). Stochastic Approximation and Optimization of
Random Systems, volume 17. Springer Basel AG, Basel, Switzerland.
Moulines, E. and Bach, F. R. (2011). Non-asymptotic analysis of stochastic approximation
algorithms for machine learning. In Advances in Neural Information Processing Systems,
pp. 451–459.
Murata, N. (1998). A statistical study of online learning. Online Learning and Neural
Networks. Cambridge University Press, Cambridge.
Nagumo, J.-I. and Noda, A. (1967). A learning method for system identification. IEEE
Transactions on Automatic Control, 12(3):282–287.
National Research Council (2013). Frontiers in Massive Data Analysis. National Academies
Press, Washington, DC.
Neal, R. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte
Carlo, volume 2, pp. 113–162. Chapman & Hall/CRC Press, Boca Raton, FL.
Neal, R. M. and Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental,
sparse, and other variants. In Jordan, M. I. (ed.), Learning in Graphical Models, pp. 355–
368. Springer, Cambridge, MA.
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic
approximation approach to stochastic programming. SIAM Journal on Optimization,
19(4):1574–1609.
Nevelson, M. B. and Khasminskiĭ, R. Z. (1973). Stochastic Approximation and Recursive Estimation, volume 47. American Mathematical Society, Providence, RI.
Nowlan, S. J. (1991). Soft competitive adaptation: Neural network learning algorithms based on fitting statistical mixtures. PhD thesis, Carnegie Mellon University, Pittsburgh, PA.
Parikh, N. and Boyd, S. (2013). Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231.
Pillai, N. S. and Smith, A. (2014). Ergodicity of approximate MCMC chains with
applications to large datasets. arXiv preprint: arXiv:1405.0182.
Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by
averaging. SIAM Journal on Control and Optimization, 30(4):838–855.
Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of
Mathematical Statistics, 22:400–407.
Rockafellar, R. T. (1976). Monotone operators and the proximal point algorithm. SIAM
Journal on Control and Optimization, 14(5):877–898.
Rosasco, L., Villa, S., and Vũ, B. C. (2014). Convergence of stochastic proximal gradient algorithm. arXiv preprint: arXiv:1403.5074.
Ruppert, D. (1988a). Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, Ithaca, NY.
Ruppert, D. (1988b). Stochastic approximation. Technical report, Cornell University
Operations Research and Industrial Engineering, Ithaca, NY.
Ryu, E. K. and Boyd, S. (2014). Stochastic proximal iteration: A non-asymptotic improvement upon stochastic gradient descent. Author website, early draft.
Sacks, J. (1958). Asymptotic distribution of stochastic approximation procedures. The
Annals of Mathematical Statistics, 29(2):373–405.
Sakrison, D. J. (1965). Efficient recursive estimation; application to estimating the parameters of a covariance function. International Journal of Engineering Science, 3(4):461–483.
Salakhutdinov, R., Mnih, A., and Hinton, G. (2007). Restricted Boltzmann machines for
collaborative filtering. In Proceedings of the 24th International Conference on Machine
Learning, pp. 791–798. ACM, New York.
Sato, I. and Nakagawa, H. (2014). Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Ito process. JMLR W&CP, 32(1):982–990.
Sato, M.-A. and Ishii, S. (2000). Online EM algorithm for the normalized Gaussian network.
Neural Computation, 12(2):407–432.
Schaul, T., Zhang, S., and LeCun, Y. (2012). No more pesky learning rates. In Proceedings
of the 30th International Conference on Machine Learning, pp. 343–351.
Schmidt, M., Le Roux, N., and Bach, F. (2013). Minimizing finite sums with the stochastic
average gradient. Technical report, HAL 00860051.
Schraudolph, N., Yu, J., and Günter, S. (2007). A stochastic quasi-Newton method for online convex optimization. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, pp. 436–443.
Slock, D. T. (1993). On the convergence behavior of the LMS and the normalized LMS
algorithms. IEEE Transactions on Signal Processing, 41(9):2811–2825.
Spall, J. C. (2000). Adaptive stochastic approximation by the simultaneous perturbation
method. IEEE Transactions on Automatic Control, 45(10):1839–1853.
Spall, J. C. (2005). Introduction to Stochastic Search and Optimization: Estimation,
Simulation, and Control, volume 65. John Wiley & Sons, Hoboken, NJ.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine
Learning, 3(1):9–44.
Tamar, A., Toulis, P., Mannor, S., and Airoldi, E. M. (2014). Implicit temporal differences.
In Neural Information Processing Systems, Workshop on Large-Scale Reinforcement
Learning.
Taylor, G. W., Hinton, G. E., and Roweis, S. T. (2006). Modeling human motion
using binary latent variables. In Advances in Neural Information Processing Systems,
pp. 1345–1352.
Titterington, D. M. (1984). Recursive parameter estimation using incomplete data. Journal
of the Royal Statistical Society. Series B (Methodological), 46:257–267.
Toulis, P. and Airoldi, E. M. (2015a). Implicit stochastic gradient descent. arXiv preprint:
arXiv:1408.2923.
Toulis, P. and Airoldi, E. M. (2015b). Scalable estimation strategies based on stochastic
approximations: Classical results and new insights. Statistics and Computing, 25(4):
781–795.
Toulis, P., Rennie, J., and Airoldi, E. M. (2014). Statistical analysis of stochastic gradient
methods for generalized linear models. JMLR W&CP, 32(1):667–675.
Toulis, P. and Airoldi, E. M. (2015a). Implicit stochastic approximation. arXiv preprint: arXiv:1510.00967.
Toulis, P., Tran, D., and Airoldi, E. M. (2015b). Towards stability and optimality in stochastic gradient descent. arXiv preprint: arXiv:1505.02417.
Tran, D., Lan, T., Toulis, P., and Airoldi, E. M. (2015a). Stochastic Gradient Descent for
Scalable Estimation. R package version 0.1.
Tran, D., Toulis, P., and Airoldi, E. M. (2015b). Stochastic gradient descent methods for
estimation with large data sets. arXiv preprint: arXiv:1509.06459.
Venter, J. (1967). An extension of the Robbins-Monro procedure. The Annals of Mathematical Statistics, 38:181–190.
Wang, C., Chen, X., Smola, A., and Xing, E. (2013). Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems, pp. 181–189.
Wang, M. and Bertsekas, D. P. (2013). Stabilization of stochastic iterative methods for
singular and nearly singular linear systems. Mathematics of Operations Research, 39(1):
1–30.
Wei, C. (1987). Multivariate adaptive stochastic approximation. Annals of Statistics,
15:1115–1130.
Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin
dynamics. In Proceedings of the 28th International Conference on Machine Learning,
pp. 681–688.
Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. IRE WESCON Convention
Record, 4:96–104. (Defense Technical Information Center.)
Xiao, L. and Zhang, T. (2014). A proximal stochastic gradient method with progressive
variance reduction. SIAM Journal on Optimization, 24:2057–2075.
Xu, W. (2011). Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv preprint: arXiv:1107.2490.
Younes, L. (1999). On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics, 65(3–4):177–228.
Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient
descent algorithms. In Proceedings of the 21st International Conference on Machine
Learning, p. 116. ACM, New York.