Research

Current Research Interests

  1. Theory of Optimization (Convex, Non-Convex, and Distributed).

  2. SGD Generalization via SDEs and Markov Process Theory.

  3. Theory of Neural Networks.

  4. Explicit and Implicit Bias in Optimization Algorithms.

  5. Sampling from General Distributions (Heavy-tailed, Multimodal, and Discrete) and Optimization.

  6. Statistical Learning beyond the Standard Models and Assumptions (Non-iid data, Heavy-tailed Noise, and Distribution Shift).

  7. Learning under Resource Constraints (Active Learning and Coresets).

  8. Reinforcement Learning and Optimization.

  9. Machine Learning for Healthcare Applications.

Past Research Works

  1. Convex Optimization Theory.

  2. Kernel Methods.

  3. Learning with Distributions.

  4. Non-Parametric Methods.

  5. Computer Vision.

Marie-Curie Fellowship

Implicit and Explicit Bias in Optimization Algorithms

In modern machine learning, successful models are often highly overparameterized, possessing more trainable parameters than training examples. The training loss then has many minimizers, which can overfit the training data and fare poorly on new examples. Surprisingly, when such models are trained with simple algorithms like (stochastic) gradient descent, the specific minimizers these algorithms reach often perform exceptionally well on new data. By selecting a particular minimizer of the training loss, the optimization algorithm therefore introduces an additional learning bias. These biases come in two forms: explicit and implicit. Explicit biases are deliberately built into the objective function to induce solutions with favorable generalization properties. Implicit biases, in contrast, arise from the dynamics of the optimization process itself, without any explicit instruction. Studying the models shaped by these biases is pivotal for understanding both the successes and the limitations of overparameterized models, and can potentially give rise to new algorithms.
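
As a simple, standard illustration of this distinction (a textbook example, not one drawn from the cited works): for overparameterized least squares with X in R^{n x d} of full row rank and d > n, gradient descent with a suitable step size initialized at zero converges to the minimum l2-norm interpolator, an implicit bias, whereas adding a penalty to the objective is an explicit bias:

  \min_{w \in \mathbb{R}^d} \tfrac{1}{2}\|Xw - y\|_2^2, \quad d > n   (overparameterized: every w with Xw = y is a minimizer)

  w_{\mathrm{GD}} = \arg\min\{\, \|w\|_2 : Xw = y \,\} = X^{\top}(XX^{\top})^{-1} y   (implicit bias of gradient descent started at w_0 = 0)

  \min_{w \in \mathbb{R}^d} \tfrac{1}{2}\|Xw - y\|_2^2 + \lambda \|w\|_1   (explicit bias via an added penalty)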

Contributions: A series of papers [1, 2, 3, 4] has highlighted the correlation between flatter minima and better generalization. Our recent work [5] takes a complementary route: it injects noise into the model parameters to encourage exploration of wider minima, with the goal of understanding the explicit regularization induced by small perturbations across a spectrum of machine learning models, from simple models to over-parametrized neural networks. Our analysis shows how such perturbations, particularly in simple models, give rise to explicit regularization in terms of different norms, such as the l1-norm, the group l1-norm, and the nuclear norm. Our earlier work [6] establishes a connection between the implicit and explicit biases of optimization algorithms. This insight, obtained through duality, not only speeds up optimization in convex over-parametrized models but also shows that stochastic mirror descent implicitly converges to the minimum mirror-map solution. In a recent study [7], we investigate the implicit bias of mirror gradient flow and mirror Langevin dynamics for convex problems from an optimal-control perspective; this is the first variational formulation of continuous-time mirror descent and the mirror Langevin algorithm. The result characterizes mirror descent as the closed-loop solution of a specific optimal control problem, whose Bellman value function is the Bregman divergence between the initial condition and the global minimizer of the objective. This outcome [7] extends well beyond our earlier findings [6]. Together, this line of research uncovers the interplay between the two kinds of bias, which is central to how optimization algorithms select well-generalizing solutions across problem domains.
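
The noise-injection mechanism underlying [5] can be sketched in a few lines. The diagonal linear network, problem sizes, and hyperparameters below are illustrative assumptions of this sketch, not the setup of the paper:

  import numpy as np

  rng = np.random.default_rng(0)

  # Illustrative sparse regression instance (sizes and constants are assumptions).
  n, d = 40, 100
  X = rng.standard_normal((n, d))
  w_star = np.zeros(d)
  w_star[:5] = 1.0                                  # sparse ground truth
  y = X @ w_star + 0.05 * rng.standard_normal(n)

  # Diagonal linear network: the prediction uses w = u * v (elementwise product).
  def grads(u, v):
      g = X.T @ (X @ (u * v) - y) / n               # d(loss)/dw for 0.5/n * ||Xw - y||^2
      return g * v, g * u                           # chain rule w.r.t. u and v

  def noisy_gd(sigma, step=0.02, iters=10_000):
      u = 1.0 + 0.1 * rng.standard_normal(d)
      v = 1.0 + 0.1 * rng.standard_normal(d)
      for _ in range(iters):
          # Noise injection: evaluate the gradient at randomly perturbed parameters.
          # In expectation this is a gradient of the smoothed loss E[f(u + e, v + e')],
          # which for this model class acts like an l1-type explicit regularizer
          # on w = u * v (the effect analyzed in [5]).
          gu, gv = grads(u + sigma * rng.standard_normal(d),
                         v + sigma * rng.standard_normal(d))
          u -= step * gu
          v -= step * gv
      return u * v

  for sigma in (0.0, 0.3):
      w_hat = noisy_gd(sigma)
      print(f"sigma={sigma}: mean |w| off the true support = {np.abs(w_hat[5:]).mean():.3f}")

The only change relative to plain gradient descent is that the gradient is evaluated at a perturbed copy of the parameters, which in expectation optimizes the Gaussian-smoothed loss.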

  1. Flat Minima
    Sepp Hochreiter and Jürgen Schmidhuber
    Neural computation, 9(1):1–42, 1997.

  2. Wide flat minima and optimal generalization in classifying high-dimensional Gaussian mixtures
    Carlo Baldassi, Enrico M Malatesta, Matteo Negri, and Riccardo Zecchina
    Journal of Statistical Mechanics: Theory and Experiment, 2020(12):124012, 2020.

  3. Unveiling the structure of wide flat minima in neural networks
    Carlo Baldassi, Clarissa Lauditi, Enrico M Malatesta, Gabriele Perugini, and Riccardo Zecchina
    Physical Review Letters, 127(27):278301, 2021.

  4. Entropic gradient descent algorithms and wide flat minima
    Fabrizio Pittorino, Carlo Lucibello, Christoph Feinauer, Gabriele Perugini, Carlo Baldassi, Elizaveta Demyanenko, and Riccardo Zecchina
    Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124015, 2021.

  5. Explicit regularization in overparametrized models via noise injection
    Antonio Orvieto*, Anant Raj*, Hans Kersting, and Francis Bach
    International Conference on Artificial Intelligence and Statistics, pages 7265–7287. PMLR, 2023.

  6. Explicit regularization of stochastic gradient methods through duality
    Anant Raj and Francis Bach
    International Conference on Artificial Intelligence and Statistics, pages 1882–1890. PMLR, 2021.

  7. Variational principles for mirror descent and mirror Langevin dynamics
    Belinda Tzen, Anant Raj, Maxim Raginsky, and Francis Bach
    IEEE Control Systems Letters, 2023.

Generalization of SGD for Non-Convex Learning

An important research challenge in machine learning theory is to understand how the noise in stochastic gradients affects generalization performance when training a model in a non-convex landscape. This noise can be beneficial: it helps the optimizer escape local minima and explore rugged landscapes, which can lead to better generalization. However, the structure of the noise depends on the model parameters, the loss landscape, and the model architecture, and these dependencies are correlated, which makes it difficult to isolate the effect of noise on generalization. Recent work [4, 5] has used Stochastic Differential Equations (SDEs) as proxies for SGD, enabling insights into the interplay between noise and generalization in non-convex landscapes. Stochastic Gradient Langevin Dynamics, in which the noise is Gaussian, is a commonly employed SDE for this purpose. However, not all gradient noise adheres to Gaussian assumptions: a series of studies [6, 7] revealed heavy-tailed noise behavior in constant-step-size SGD training of neural networks, caused by multiplicative gradient noise.
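
For concreteness, the two SDE proxies can be written as follows (standard notation, chosen here for illustration): the Gaussian-driven Langevin diffusion analyzed in [4], and a heavy-tailed counterpart driven by an alpha-stable Lévy process (as studied in [1, 2]), which reduces to the Brownian case when \alpha = 2:

  \mathrm{d}W_t = -\nabla f(W_t)\,\mathrm{d}t + \sqrt{2\beta^{-1}}\,\mathrm{d}B_t   (Langevin dynamics, Gaussian noise)

  \mathrm{d}W_t = -\nabla f(W_t)\,\mathrm{d}t + \sigma\,\mathrm{d}L_t^{\alpha}   (\alpha-stable Lévy noise, \alpha \in (0, 2])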

Contributions: In our recent work [1], we establish new connections between the tail behavior of SGD and its generalization properties. This is achieved by studying the algorithmic stability of a heavy-tailed SDE that emulates the heavy-tailed behavior observed in SGD, in the context of a quadratic optimization problem. Notably, our analysis reveals a non-monotonic relation between the tail index and generalization, an insight that aligns with prior empirical findings. Building on this foundation, we extend the analysis to general loss functions, including non-convex ones, in [2]: there we develop Wasserstein stability bounds for heavy-tailed SDEs and convert them into generalization bounds. Importantly, these results rest on fundamental principles and do not require intricate assumptions. To bridge the gap between Markov chain theory and SGD theory, in recent work we use stability results for Markov chains [3] to derive generalization bounds for (noisy) stochastic gradient descent, covering SDEs as well, for a broad class of loss functions that includes a subset of non-convex functions. This is the first work to apply the perturbation theory of Markov chains to the analysis of the generalization properties of SGD variants.
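
To illustrate the phenomenon this analysis builds on, here is a minimal sketch (a one-dimensional toy model with assumed constants, not the setting of [1]): it runs constant-step-size online SGD on a least-squares problem and estimates the tail index of the iterate fluctuations with a Hill estimator.

  import numpy as np

  rng = np.random.default_rng(1)

  # One-dimensional least squares with constant-step online SGD (illustrative setup;
  # the step size is chosen large enough that the multiplicative factor
  # (1 - step * x^2) produces heavy tails while the chain remains ergodic).
  step, iters, burn_in = 0.8, 300_000, 10_000
  w_star, noise_std = 1.0, 0.1

  w, errs = 0.0, np.empty(iters)
  for k in range(iters):
      x = rng.standard_normal()                    # fresh data point at each step
      y = x * w_star + noise_std * rng.standard_normal()
      w -= step * (x * w - y) * x                  # SGD step on 0.5 * (x*w - y)^2
      errs[k] = abs(w - w_star)

  # Hill estimator of the tail index from the largest order statistics of
  # |w_k - w_star|; a smaller value means heavier tails. Iterates are correlated,
  # so this is only a rough diagnostic of the power-law behavior studied in [1, 6].
  tail = np.sort(errs[burn_in:])[-2000:]
  alpha_hat = 1.0 / np.mean(np.log(tail[1:] / tail[0]))
  print(f"estimated tail index: {alpha_hat:.2f}")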

  1. Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares
    Anant Raj, Melih Barsbey, Mert Gürbüzbalaban, Lingjiong Zhu, and Umut Şimşekli
    International Conference on Algorithmic Learning Theory (ALT), 2023.

  2. Algorithmic Stability of Heavy-Tailed SGD with General Loss Functions
    Anant Raj*, Lingjiong Zhu*, Mert Gürbüzbalaban, and Umut Şimşekli
    International Conference on Machine Learning (ICML), 2023.

  3. Uniform-in-Time Wasserstein Stability Bounds for (Noisy) Stochastic Gradient Descent
    Lingjiong Zhu, Mert Gürbüzbalaban, Anant Raj, and Umut Şimşekli
    Advances in Neural Information Processing Systems (NeurIPS), 2023.

  4. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis
    Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky
    Conference on Learning Theory (COLT), 2017.

  5. On Learning Rates and Schrödinger Operators
    Bin Shi, Weijie J. Su, and Michael I. Jordan
    arXiv preprint arXiv:2004.06977, 2020.

  6. The heavy-tail phenomenon in SGD
    Mert Gurbuzbalaban, Umut Simsekli, and Lingjiong Zhu
    International Conference on Machine Learning, pages 3964–3975. PMLR, 2021.

  7. Multiplicative noise and heavy tails in stochastic optimization
    Liam Hodgkinson and Michael Mahoney
    International Conference on Machine Learning, pages 4262–4274. PMLR, 2021.

Learning in Neural Networks via Gradient Descent

Currently, we are focusing on how stochastic gradient descent (SGD) and gradient descent (GD) optimize overparametrized deep neural networks. While progress has been made over the past few years in understanding feature learning in neural networks trained with SGD and GD, it remains constrained by narrow and strong assumptions. We aim to study the feature learning process in complex deep neural networks and foundation models trained with GD or SGD, with the goal of obtaining near-optimal information-theoretic guarantees for feature learning. Furthermore, we are investigating feature learning in generative models, exploring what kinds of score functions can be learned with gradient descent.
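
As a toy sketch of the kind of feature learning we have in mind (the single-index teacher, network sizes, and alignment diagnostic below are illustrative assumptions, not results): train a two-layer ReLU network with full-batch gradient descent and measure how well the first-layer weights align with the teacher's hidden direction.

  import numpy as np

  rng = np.random.default_rng(2)

  # Single-index teacher y = relu(<x, u>) with a hidden direction u (illustrative).
  n, d, m, lr, epochs = 512, 32, 64, 0.05, 3000
  u = rng.standard_normal(d)
  u /= np.linalg.norm(u)
  X = rng.standard_normal((n, d))
  y = np.maximum(X @ u, 0.0)

  W = rng.standard_normal((m, d)) / np.sqrt(d)     # first-layer weights
  a = rng.standard_normal(m) / np.sqrt(m)          # second-layer weights

  for _ in range(epochs):
      H = np.maximum(X @ W.T, 0.0)                 # hidden ReLU features, shape (n, m)
      r = (H @ a - y) / n                          # scaled residual of 0.5/n * ||pred - y||^2
      grad_a = H.T @ r                             # gradient w.r.t. the second layer
      grad_W = (np.outer(r, a) * (H > 0)).T @ X    # chain rule through the ReLU
      a -= lr * grad_a
      W -= lr * grad_W

  # Feature-learning diagnostic: cosine alignment of each learned first-layer
  # weight with the teacher direction u (random directions give roughly 1/sqrt(d)).
  align = np.abs(W @ u) / (np.linalg.norm(W, axis=1) + 1e-12)
  print(f"mean |cosine(w_j, u)|: {align.mean():.2f}")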