This Saturday, 14:00-16:00, we will host an online talk by Kaiyue Wen. The talk is about optimizers.
Talk Details
Kaiyue Wen is a second-year PhD student at Stanford University, advised by Tengyu Ma and Percy Liang. He received his undergraduate degree from the Yao Class at Tsinghua University. His research focuses on understanding language model training by combining experiments with theory.
Title: Fantastic Pretraining Optimizers and Where to Find Them I & II: Benchmarking Optimizers & Trying to Maintain Sustainable Speedup
Abstract: AdamW has long remained the default optimizer for large language model (LLM) pretraining, despite the emergence of numerous alternatives claiming 1.4× to 2× speedups. In this talk, we present a systematic empirical study (https://arxiv.org/abs/2509.02046) that critically re-evaluates eleven leading deep learning optimizers—including scalar-based methods like Lion and Mars, and matrix-based preconditioners like Muon, SOAP, and Kron—across model scales ranging from 130M to 1.2B parameters and varying data-to-model ratios (1× to 8× Chinchilla optimal). We identify two critical methodological flaws in prior comparisons: unequal hyperparameter tuning and misleading evaluation setups.
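For concreteness, here is a back-of-the-envelope reading of the data-to-model ratios mentioned above. This is our own illustration, not part of the study, and it assumes the common rule of thumb of roughly 20 training tokens per parameter for Chinchilla-optimal training:

```python
# Rough token budgets implied by "1x to 8x Chinchilla optimal".
# Assumption (not from the paper): ~20 tokens per parameter is Chinchilla-optimal.
CHINCHILLA_TOKENS_PER_PARAM = 20

def token_budget(n_params: float, chinchilla_multiple: float) -> float:
    """Training tokens for a model with n_params at a given Chinchilla multiple."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params * chinchilla_multiple

for n_params in (130e6, 1.2e9):
    for mult in (1, 8):
        print(f"{n_params / 1e6:.0f}M params, {mult}x Chinchilla: "
              f"~{token_budget(n_params, mult) / 1e9:.1f}B tokens")
```

Under this assumption, the grid spans roughly 2.6B tokens (130M model at 1×) up to about 192B tokens (1.2B model at 8×).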
Our rigorous benchmarking reveals that many claimed gains are artifacts of weak baselines. While matrix-based optimizers consistently outperform scalar baselines at smaller scales (achieving ~1.3× speedup), this advantage diminishes significantly as model size increases, dropping to ~1.1× for 1.2B-parameter models. Furthermore, we show that the “best” optimizer is regime-dependent: while Muon excels in data-scarce settings, full-matrix preconditioners like SOAP and Kron dominate in over-trained regimes. We also observe interesting phenomena across optimizers related to the learning rate schedule and weight decay.
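As a hedged illustration of what a speedup number like "~1.3×" typically means in this context (our own sketch, not the paper's evaluation code): one common convention is to compare how many training tokens each optimizer needs to reach the same validation loss as the AdamW baseline.

```python
import numpy as np

def tokens_to_reach(loss_curve: np.ndarray, tokens: np.ndarray, target_loss: float) -> float:
    """First token count at which a (roughly decreasing) loss curve hits target_loss."""
    idx = int(np.argmax(loss_curve <= target_loss))
    if loss_curve[idx] > target_loss:
        raise ValueError("target loss never reached")
    return float(tokens[idx])

def speedup_vs_baseline(baseline_loss, baseline_tokens, candidate_loss, candidate_tokens):
    """Speedup = baseline tokens / candidate tokens needed to reach the baseline's final loss."""
    target = baseline_loss[-1]
    return (tokens_to_reach(baseline_loss, baseline_tokens, target)
            / tokens_to_reach(candidate_loss, candidate_tokens, target))
```

The exact matching criterion (final loss, intermediate checkpoints, or a fitted curve) is a design choice; the sketch above only fixes one simple convention for illustration.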
Finally, we discuss insights from our recent follow-up blog post (https://whenwen.github.io/wd_blog/public/index.html), which summarizes the theory behind the counterintuitive phenomena found in our paper: weight decay induces an ‘effective learning rate schedule’ that differs from the explicit one. Motivated by this theory, we propose a new meta-optimizer called Hyperball, which normalizes both the optimizer update and the weight matrix in Frobenius norm. The new method offers the following benefits: (1) it eliminates the need to tune weight decay; (2) it enables native hyperparameter transfer across width and depth; and (3) it yields more sustainable speedups across model scales than training with decoupled weight decay.
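The abstract only states that Hyperball normalizes both the optimizer update and the weight matrix in Frobenius norm. Below is a minimal sketch of what such an update rule could look like, written as our own illustrative reading: the function name, the inner update, and the choice of target norm are all hypothetical, and this is not the authors' implementation.

```python
import torch

def hyperball_style_step(weight: torch.Tensor,
                         raw_update: torch.Tensor,
                         lr: float,
                         target_weight_norm: float) -> torch.Tensor:
    """Illustrative sketch (hypothetical, not the paper's code):
    normalize the update to unit Frobenius norm, take a step,
    then rescale the weight back to a fixed Frobenius norm."""
    eps = 1e-8
    update = raw_update / (raw_update.norm() + eps)          # unit-Frobenius-norm update
    new_weight = weight - lr * update                        # step along the normalized direction
    new_weight = target_weight_norm * new_weight / (new_weight.norm() + eps)  # re-normalize weight
    return new_weight
```

Constraining the weight matrix to a fixed Frobenius norm is one plausible way such a rule could remove the decoupled weight decay term and its tuning, which is consistent with benefit (1) above.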
All students are welcome to attend!
[Time and venue, once more] This Saturday, December 13, 14:00-16:00 Beijing time, online. Tencent Meeting: 293-358-412