This Saturday, 15:00–17:00, we will host two [in-person] talks by Binghui Li (李炳辉) and Zihan Qiu (邱子涵) in Room 112, Tsinghua Xuetang (清华学堂). The talks cover machine learning theory and model architectures. Students are welcome to enjoy snacks and/or chat freely before and after the talks.
Abstract of Talk 1
Binghui Li (李炳辉) is a third-year PhD student at Peking University, advised by Prof. Liwei Wang (王立威) and Prof. Lei Wu (吴磊). His talk is titled "Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules."
Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire loss dynamics obey similar laws and, crucially, how the learning rate schedule (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model. The key insight is a novel intrinsic-time viewpoint, which captures the training progress more faithfully than iteration count. We then establish a Functional Scaling Law (FSL) that captures the full loss trajectory under arbitrary LRSs, with the schedule’s influence entering through a simple convolutional functional. We further instantiate the theory for three representative LRSs—constant, exponential decay, and warmup–stable–decay (WSD)—and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena: (i) higher-capacity models are more data- and compute-efficient; (ii) learning-rate decay improves training efficiency; and (iii) WSD-type schedules outperform pure decay. Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.
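For readers unfamiliar with the schedules named in the abstract, below is a minimal Python sketch of the three representative LRSs: constant, exponential decay, and warmup–stable–decay (WSD). This is an illustration only, not code from the talk; the function names, peak/final learning rates, and phase fractions are all assumptions chosen for clarity.

```python
# Illustrative sketch of the three learning-rate schedules named in the abstract.
# All hyperparameter values below are assumptions, not the talk's settings.

def constant_lr(step, total_steps, peak_lr=1e-3):
    # Constant schedule: the same learning rate at every step.
    return peak_lr

def exponential_decay_lr(step, total_steps, peak_lr=1e-3, final_lr=1e-5):
    # Exponential decay from peak_lr at step 0 to final_lr at the last step.
    ratio = final_lr / peak_lr
    return peak_lr * ratio ** (step / max(total_steps - 1, 1))

def wsd_lr(step, total_steps, peak_lr=1e-3, final_lr=1e-5,
           warmup_frac=0.05, decay_frac=0.2):
    # Warmup-Stable-Decay (WSD): linear warmup, a long constant plateau,
    # then a final decay phase down to final_lr.
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    decay_steps = max(int(total_steps * decay_frac), 1)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < stable_end:
        return peak_lr
    # Linear decay over the final phase.
    progress = (step - stable_end) / decay_steps
    return peak_lr + (final_lr - peak_lr) * progress
```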
Abstract of Talk 2
Zihan Qiu (邱子涵) is an alumnus of the Yao Class, class of 2020 (计科 03), and currently works in the pre-training group of the Qwen team, focusing on model architecture and pre-training strategies. He will present work on how gating mechanisms affect softmax attention and the overall behavior of the model. Gating is used widely across network architectures: from early LSTMs and highway networks, to SwiGLU, to linear-attention variants such as Mamba and RetNet, and to standard softmax attention in models such as AlphaFold2. Through detailed ablation studies, the paper examines how various forms of gating in softmax attention affect model performance and behavior. It finds that applying gating after the softmax attention and before the attention output layer is the most effective, and further shows that this is because gating adds non-linearity to the attention computation and provides query-dependent sparsity. Moreover, the paper finds that this sparsity eliminates massive activations and the attention-sink phenomenon, improving training stability and length extrapolation. The work was accepted as an oral at NeurIPS 2025 (https://arxiv.org/abs/2505.06708) and has been adopted in the open-source model Qwen3-Next (https://huggingface.co/collections/Qwen/qwen3-next). He will also share his experience doing pre-training research at Qwen and introduce the Qwen team.
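To make the mechanism concrete, here is a minimal PyTorch sketch of the variant the abstract singles out: an elementwise, input-dependent sigmoid gate applied to the softmax-attention output before the output projection. This is an assumption-laden illustration of the idea, not the paper's or Qwen3-Next's actual implementation; the module name, layer sizes, and gate placement details are illustrative only.

```python
# Sketch only (assumptions, not the paper's exact code): a query-dependent
# sigmoid gate applied after softmax attention and before the output projection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Linear(dim, dim)      # gate computed from the input (query side)
        self.out_proj = nn.Linear(dim, dim)  # attention output layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, head_dim) for multi-head attention.
        q = q.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn_out = attn_out.transpose(1, 2).reshape(b, t, d)
        # Elementwise sigmoid gate after attention, before the output projection:
        # this is where the abstract says gating is most effective.
        gated = torch.sigmoid(self.gate(x)) * attn_out
        return self.out_proj(gated)
```

In this sketch, the gate is a function of the current token's input, so each query can scale or suppress its own attention output, which is the "query-dependent sparsity" referred to in the abstract.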
All students are welcome to attend!
[Time and venue, once more] This Saturday, November 1, 15:00–17:00 Beijing time, Room 112, Tsinghua Xuetang (清华学堂112). Click here for a time-zone conversion. Tencent Meeting: 819-589-898