## Research

## Overview: Optimization for Deep Learning

Deep learning pervades many of the recent breakthroughs in artificial intelligence, including computer vision, speech recognition, and natural language processing. These methods are inspired by biological neural networks: populations of neurons that transmit signals to each other when activated. Based on this biological inspiration, researchers have designed artificial neural networks whose training gives rise to challenging large-scale optimization problems.

## Scalable Second-Order Methods for Deep Learning and Stochastic Optimization

Despite many breakthroughs in the development of deep learning models, the common approach for training neural networks remains the stochastic gradient (SG) method (with momentum). Because it processes small batches sequentially, the SG method offers limited opportunities for parallelism. This has led to recent work investigating progressively increasing the batch size during optimization while matching the generalization performance of small-batch methods with known steplength (or learning rate) schedules. Larger batch sizes open up new opportunities for incorporating second-order information to potentially expedite the training process. I am interested in understanding how to incorporate second-order information into learning algorithms (and, more broadly, stochastic optimization algorithms) in a scalable and parallelizable manner, while understanding the strengths and limitations of such approaches.

## Adaptive Gradient Methods

Another set of algorithms for training deep neural networks has emerged that utilizes diagonal scalings to expedite or simplify training. These methods adapt a per-coordinate steplength based on past gradient information.

## Optimization and Generalization

As described above, the training of neural networks is a highly non-convex problem.
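To make the two update rules above concrete, here is a minimal NumPy sketch (on an illustrative toy least-squares problem of my own construction, not any specific method from my work) contrasting the SG method with momentum against an AdaGrad-style diagonal scaling, where each coordinate of the step is rescaled by accumulated squared gradients:

```python
import numpy as np

# Toy least-squares problem f(w) = 0.5 * ||A w - b||^2, with stochastic
# mini-batch gradients drawn from rows of (A, b). Noiseless, so both
# methods can drive the error toward zero.
rng = np.random.default_rng(0)
A = rng.standard_normal((256, 10))
w_star = rng.standard_normal(10)
b = A @ w_star

def minibatch_grad(w, batch_size=8):
    # Gradient of the loss on a random mini-batch of rows.
    idx = rng.integers(0, A.shape[0], size=batch_size)
    Ab, bb = A[idx], b[idx]
    return Ab.T @ (Ab @ w - bb) / batch_size

def sgd_momentum(steps=2000, lr=0.01, beta=0.9):
    """SG method with (heavy-ball) momentum."""
    w = np.zeros(10)
    v = np.zeros(10)
    for _ in range(steps):
        g = minibatch_grad(w)
        v = beta * v + g     # accumulate a velocity
        w = w - lr * v       # step along the velocity
    return w

def adagrad(steps=2000, lr=0.5, eps=1e-8):
    """AdaGrad-style update: diagonal scaling from past gradients."""
    w = np.zeros(10)
    s = np.zeros(10)
    for _ in range(steps):
        g = minibatch_grad(w)
        s += g * g                           # running sum of squared gradients
        w = w - lr * g / (np.sqrt(s) + eps)  # per-coordinate steplength
    return w

print("momentum error:", np.linalg.norm(sgd_momentum() - w_star))
print("adagrad  error:", np.linalg.norm(adagrad() - w_star))
```

The only difference between the two loops is how the raw mini-batch gradient is turned into a step: momentum averages past gradients, while the diagonal scaling adapts the steplength coordinate-by-coordinate.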
The purpose of training these complex models on a given dataset is to obtain solutions that generalize beyond it. Most statistical learning theory has depended on defining the complexity or capacity of a statistical model to bound the gap between the solution of the problem on the given dataset and its performance on the data's true distribution. However, current theory has proven insufficient to describe the generalization behavior of neural networks due to the (current) inability to characterize the effective capacity of the network. Some researchers have hypothesized that optimization algorithms play a key role in biasing towards solutions that generalize well, through a form of ‘‘implicit regularization’’ or some other mechanism. I want to understand how optimization methods shape the generalization properties of neural networks in order to design better learning algorithms for machine learning.