Research Interests

I am currently interested in nonlinear optimization with applications to machine learning. Most machine learning and statistical learning problems can be posed as large-scale stochastic optimization problems over large datasets, where the sheer size of the data constrains how efficiently the problem can be solved. In addition, these methods must find solutions corresponding to models that generalize well to new data. This motivates the development of new learning algorithms that yield solutions which generalize well, exploit modern computer architectures and parallel programming effectively, and scale to ever-larger datasets. By developing better algorithms for learning from massive amounts of data, I hope to help design better intelligent systems for tasks such as image and speech recognition and image and video recovery.

Current Projects

Optimization for Deep Learning

Stochastic gradient descent (SGD) is often considered the prototypical optimization algorithm for machine learning applications due to its computational efficiency and the generalizability of its solutions. SGD and its variants have been particularly effective in the training of deep neural networks, which have spurred many recent advancements in artificial intelligence. I am interested in the intersection between deep learning and optimization: why do certain optimization algorithms learn and generalize better than others? Can we develop better optimization algorithms for deep learning?
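To make the basic algorithm concrete, here is a minimal mini-batch SGD sketch on a toy least-squares problem. This is purely illustrative (the problem, data, and function names are my own invention, not tied to any project mentioned above): each step moves the iterate against a gradient estimated from a small random batch of data.

```python
import numpy as np

def sgd(grad, w0, n, lr=0.1, epochs=200, batch_size=2, seed=0):
    """Minimal mini-batch SGD: w <- w - lr * grad(w, batch_indices)."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            w -= lr * grad(w, batch)
    return w

# Toy least-squares problem: minimize (1/2n) ||X w - y||^2
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])  # consistent system, solved exactly by w = [1, 2]

def grad(w, batch):
    # Gradient of the least-squares loss restricted to the mini-batch
    Xb, yb = X[batch], y[batch]
    return Xb.T @ (Xb @ w - yb) / len(batch)

w = sgd(grad, np.zeros(2), n=len(y))
```

Because the toy system is consistent, the stochastic gradient noise vanishes at the solution and the iterates converge to w = [1, 2] even with a constant step size; on noisy problems a decaying step size would typically be used.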

Coordinate Descent Methods

Coordinate descent (CD) methods are a fairly old class of algorithms that have recently gained popularity due to their effectiveness in solving large-scale optimization problems in machine learning, compressed sensing, image processing, and computational statistics. In particular, this class of algorithms solves optimization problems by successively minimizing along each coordinate or block of coordinates, following a given index-selection and update scheme.
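As a small illustration of the single-coordinate case (a sketch of my own, with an invented example problem), cyclic exact coordinate descent on a convex quadratic f(x) = (1/2) x'Ax - b'x minimizes f along one coordinate at a time; for a symmetric positive definite A this coincides with the Gauss-Seidel iteration for solving Ax = b.

```python
import numpy as np

def coordinate_descent(A, b, x0, sweeps=100):
    """Cyclic exact coordinate descent for f(x) = 0.5 x'Ax - b'x,
    with A symmetric positive definite. Each update minimizes f
    exactly along coordinate i (equivalent to Gauss-Seidel on Ax = b)."""
    x = np.array(x0, dtype=float)
    n = len(b)
    for _ in range(sweeps):
        for i in range(n):
            # Exact 1-D minimizer: x_i = (b_i - sum_{j != i} A_ij x_j) / A_ii
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = coordinate_descent(A, b, np.zeros(2))
```

Each inner update costs only O(n), so a full sweep matches the cost of one gradient step; the index scheme here is cyclic, but randomized or greedy selection rules fit the same template.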

Since only a block of coordinates is updated at each iteration, variants of CD lend themselves well to parallel and distributed architectures and to asynchrony, which reduce communication and idle time during computation. This is particularly useful for leveraging modern hardware in an era in which single-threaded CPU speed has stopped improving while the number of cores per CPU has grown rapidly.

Although these methods have existed for a long time, open questions remain: how to incorporate second-order information, discover new update schemes, and develop practical greedy selection rules. I plan to investigate these directions.