Scaling Stochastic Momentum from Theory to LLMs
CMSA NEW TECHNOLOGIES IN MATHEMATICS
Given the massive scale of modern ML models, we now often get only a single shot to train them effectively. This limits our ability to sweep architectures and hyperparameters, making it essential to understand how learning algorithms scale so insights from small models transfer to large ones.
In this talk, I present a framework for analyzing scaling laws of stochastic momentum methods using a power-law random features model, leveraging tools from high-dimensional probability and random matrix theory. We show that standard SGD with momentum does not improve scaling exponents, while dimension-adapted Nesterov acceleration (DANA), which explicitly adapts momentum to model size and to data/target complexity, achieves strictly better loss and compute scaling. DANA does this by rescaling its momentum parameters with the model dimension, effectively matching the optimizer's memory to the problem geometry.
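As a rough sketch of the idea only, and not the update rule from the paper, the snippet below shows a stochastic heavy-ball style step whose momentum coefficient is tied to the model dimension; the schedule `beta = 1 - 1/(step + d**kappa)` and the exponent `kappa` are illustrative placeholders for the dimension- and complexity-dependence described above.

```python
import torch

def dimension_adapted_momentum_step(params, grads, momenta, lr, d, step, kappa=1.0):
    """Illustrative dimension-adapted momentum update (a sketch, not DANA itself).

    The momentum coefficient approaches 1 as the model dimension `d` grows,
    so the optimizer's effective memory lengthens with problem size; `kappa`
    is a hypothetical exponent standing in for the data/target-complexity
    dependence mentioned in the abstract.
    """
    beta = 1.0 - 1.0 / (step + d ** kappa)    # hypothetical dimension-tied schedule
    with torch.no_grad():
        for p, g, m in zip(params, grads, momenta):
            m.mul_(beta).add_(g)              # accumulate gradient into a long-memory buffer
            p.add_(m, alpha=-lr)              # parameter step along the momentum direction
```

The point of the sketch is only the structure: momentum memory that scales with the model dimension rather than being held at a fixed constant.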
Motivated by these theoretical insights, I introduce logarithmic-time scheduling for large language models and propose ADANA, an AdamW-like optimizer with growing memory and explicit damping. Across transformer scales (45M to 2.6B parameters), ADANA yields up to 40% compute savings over tuned AdamW, with gains that improve at scale.
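Purely to illustrate the ingredients named above (growing memory, explicit damping, logarithmic-time scheduling), and not the actual ADANA update, an AdamW-style step might be modified along the following lines; every schedule and constant here is an assumption.

```python
import math
import torch

def adana_like_step(p, g, m, v, lr, d, step,
                    wd=0.1, damping=1e-2, eps=1e-8, kappa=1.0):
    """Schematic AdamW-like step with growing memory and explicit damping.

    This is a hedged sketch, not the ADANA optimizer: the dimension-tied
    beta1, the explicit damping term, and the logarithmic-time learning-rate
    factor are all placeholder choices for illustration.
    """
    beta1 = 1.0 - 1.0 / (step + d ** kappa)           # first-moment memory grows with d and t
    beta2 = 0.999                                     # standard second-moment averaging
    lr_t = lr / math.log(step + math.e)               # illustrative logarithmic-time schedule

    with torch.no_grad():
        m.mul_(beta1).add_(g, alpha=1.0 - beta1)      # long-memory first moment
        v.mul_(beta2).addcmul_(g, g, value=1.0 - beta2)
        update = m / (v.sqrt() + eps) + damping * g   # explicit damping: mix in the raw gradient
        p.add_(update, alpha=-lr_t)
        p.mul_(1.0 - lr_t * wd)                       # decoupled weight decay, AdamW-style
```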
Based on joint work with Damien Ferbach, Elliot Paquette, Katie Everett, and Gauthier Gidel.
Zoom: https://harvard.zoom.us/j/91864143060?pwd=liDbUVYXs47QsYhxdzXYowl8vpQGy1.1
