Stochastic Gradient Methods For Large-Scale Machine Learning

Tutorial description

This tutorial provides an accessible introduction to the mathematical properties of stochastic gradient methods and their consequences for large-scale machine learning. After reviewing the computational needs for solving optimization problems in two typical examples of large-scale machine learning, namely the training of sparse linear classifiers and of deep neural networks, we present the theory of the simple yet versatile stochastic gradient algorithm, explain its theoretical and practical behavior, and highlight the opportunities available for designing improved algorithms. We then provide specific examples of advanced algorithms to illustrate the two essential directions for improving stochastic gradient methods: managing the noise and making use of second-order information.
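For readers who have not yet encountered the method, the sketch below shows the basic stochastic gradient iteration on a least-squares example; the function names, step size, and synthetic data are illustrative assumptions, not material from the tutorial itself.

```python
import numpy as np

def sgd(grad_fn, w0, data, lr=0.01, epochs=1, seed=0):
    """Plain stochastic gradient descent.

    grad_fn(w, example) returns the gradient of the per-example loss at w.
    Each iteration steps along the gradient of a single, randomly chosen example,
    i.e., a noisy estimate of the full gradient.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    n = len(data)
    for _ in range(epochs):
        for i in rng.permutation(n):          # one pass over the data in random order
            w -= lr * grad_fn(w, data[i])     # update with one noisy gradient
    return w

# Illustrative use: least-squares regression, one example per (x, y) pair.
def least_squares_grad(w, example):
    x, y = example
    return (w @ x - y) * x

X = np.random.randn(100, 3)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
w_hat = sgd(least_squares_grad, np.zeros(3), list(zip(X, y)), lr=0.05, epochs=20)
```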

Goals

The importance of optimization for machine learning is well recognized. The goal of this tutorial is to provide young researchers with sufficient material to understand current advances in their proper context: What are the basic algorithms and their properties? What are the avenues for improvement? What kinds of improvement can be expected, and what kinds cannot? What are the connections between popular variants of stochastic gradient algorithms?

Target audience

The tutorial primarily targets young machine learning researchers with a working knowledge of multivariate analysis. Its contents will also be of interest to researchers who are familiar with many specific contributions in optimization for machine learning, but would appreciate a more unified perspective on the topic.

Presenters

The current plan is to split the tutorial evenly among the three presenters: Leon Bottou, Frank E. Curtis, and Jorge Nocedal.

About the speakers

Leon Bottou has written numerous papers on the use of stochastic gradient methods for machine learning, including “Stochastic Gradient Learning in Neural Networks” (1991), “Online Algorithms and Stochastic Approximations” (1998), and “The Tradeoffs of Large Scale Learning” (2007). Frank E. Curtis is a well-known researcher in optimization whose work focuses on algorithms for solving large-scale nonlinear problems. Jorge Nocedal is one of the best-known researchers in optimization; he is particularly recognized for his work on limited-memory methods (L-BFGS) and for his classic book “Numerical Optimization” (Nocedal and Wright, 1999, 2006).