Overview: Optimization for Deep Learning

Deep learning pervades many of the recent breakthroughs in artificial intelligence, including computer vision, speech recognition, and natural language processing. These methods are inspired by biological neural networks - populations of neurons that transmit signals to each other when activated. Based on this biological inspiration, researchers have designed artificial neural network models that mimic this behavior by interleaving compositions of linear and nonlinear functions, where each linear function represents the connections between consecutive layers of artificial neurons and the nonlinear function (applied component-wise) represents the activation of each neuron. After passing these models through a loss function, which describes the goodness-of-fit between a model's predicted value and true value, we obtain a highly non-convex and non-smooth optimization problem that is immensely difficult to optimize efficiently. The goal of my research is to develop highly scalable, efficient optimization algorithms for training deep networks that yield solutions that generalize well.

Scalable Second-Order Methods for Deep Learning and Stochastic Optimization

Despite many breakthroughs in the development of deep learning models, the common approach for training neural networks remains the stochastic gradient (SG) method (with momentum). Due to its use of small batch sizes in a sequential manner, the SG method suffers from the lack of opportunities for parallelism. This has led to recent work investigating in progressively increasing the batch size during the optimization while matching the generalization performance of small batch methods with known steplength (or learning rate) schedules. Larger batch sizes open up new opportunities for incorporating known second-order information to potentially expedite the training process. I am interested in understanding how to incorporate second-order information into learning algorithms (and more broadly, stochastic optimization algorithms) in a scalable and parallelizable manner, while understanding the strengths and limitations of such approaches.

Adaptive Gradient Methods

Another set of algorithms for training deep neural networks has emerged that utilize diagonal scalings to expedite or simplify training. These adaptive gradient methods scale the stochastic gradient by dividing by the average of past squared gradients. These methods originated from minimizing the regret bound for the online gradient method over a set of preconditioning matrices. However, these methods remain poorly understood. In particular, what are these diagonal scalings approximating? How do they improve the performance of the stochastic gradient methods? Are these diagonal scaling methods appropriate in non-stochastic contexts?

Optimization and Generalization

As described above, the training of neural networks is a highly non-convex problem. The purpose of training these complex models (on a given dataset) is for the purpose of predicting the behavior of new, unseen data. On one hand, the different minima that exist over the optimization landscape for neural networks may correspond to different models that generalize differently. On the other hand, due to overparameterization and symmetries or invariances that exist within the neural network, there exist manifolds of solutions that correspond to the same model. We are not just interested in finding any solution, but solutions that generalize well.

Most statistical learning theory has depended on defining the complexity or capacity of a statistical model to bound the gap between the solution of the problem on the given dataset and the data's true distribution. However, current theory has proven insufficient in describing the generalization behavior of neural networks due to the (current) inability to characterize the effective capacity of the network. Some researchers have hypothesized that optimization algorithms play a key role in biasing towards solutions that generalize well, through a form of ‘‘implicit regularization’’ or something else. I want to understand how optimization methods are relevant in determining the generalization properties of neural networks to design better learning algorithms for machine learning.