This Saturday, 15:00–17:00, we will host two [in-person] talks by Binghui Li (李炳辉) and Zihan Qiu (邱子涵) in Room 112, Tsinghua Xuetang (清华学堂). The talks cover machine learning theory and model architectures. Students are welcome to enjoy snacks and/or chat freely before and after the talks.
Abstract of Talk 1
Binghui Li (李炳辉) is a third-year PhD student at Peking University, advised by Prof. Liwei Wang (王立威) and Prof. Lei Wu (吴磊). His talk is titled "Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules."
Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire loss dynamics obey similar laws and, crucially, how the learning rate schedule (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model. The key insight is a novel intrinsic-time viewpoint, which captures the training progress more faithfully than iteration count. We then establish a Functional Scaling Law (FSL) that captures the full loss trajectory under arbitrary LRSs, with the schedule’s influence entering through a simple convolutional functional. We further instantiate the theory for three representative LRSs—constant, exponential decay, and warmup–stable–decay (WSD)—and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena: (i) higher-capacity models are more data- and compute-efficient; (ii) learning-rate decay improves training efficiency; and (iii) WSD-type schedules outperform pure decay. Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.
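For readers unfamiliar with the schedules named in the abstract, below is a minimal Python sketch of the three representative LRSs: constant, exponential decay, and warmup–stable–decay (WSD). This is an illustration only, not code from the talk; the function names, peak/final learning rates, and phase fractions are all assumptions chosen for clarity.

```python
# Illustrative sketch of the three learning-rate schedules named in the abstract.
# All hyperparameter values below are assumptions, not the talk's settings.

def constant_lr(step, total_steps, peak_lr=1e-3):
    # Constant schedule: the same learning rate at every step.
    return peak_lr

def exponential_decay_lr(step, total_steps, peak_lr=1e-3, final_lr=1e-5):
    # Exponential decay from peak_lr at step 0 to final_lr at the last step.
    ratio = final_lr / peak_lr
    return peak_lr * ratio ** (step / max(total_steps - 1, 1))

def wsd_lr(step, total_steps, peak_lr=1e-3, final_lr=1e-5,
           warmup_frac=0.05, decay_frac=0.2):
    # Warmup-Stable-Decay (WSD): linear warmup, a long constant plateau,
    # then a final decay phase down to final_lr.
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    decay_steps = max(int(total_steps * decay_frac), 1)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < stable_end:
        return peak_lr
    # Linear decay over the final phase.
    progress = (step - stable_end) / decay_steps
    return peak_lr + (final_lr - peak_lr) * progress
```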
Abstract of Talk 2
Zihan Qiu (邱子涵) is an alumnus of the Yao Class, class of 2020 (计科 03), and currently works in the pre-training group of the Qwen team, focusing on model architecture and pre-training strategies. He will present work on how gating mechanisms affect softmax attention and the overall behavior of the model. Gating is used widely across network architectures: from early LSTMs and highway networks, to SwiGLU, to linear-attention variants such as Mamba and RetNet, and to standard softmax attention in models such as AlphaFold2. Through detailed ablation studies, the paper examines how various forms of gating in softmax attention affect model performance and behavior. It finds that applying gating after the softmax attention and before the attention output layer is the most effective, and further shows that this is because gating adds non-linearity to the attention computation and provides query-dependent sparsity. Moreover, the paper finds that this sparsity eliminates massive activations and the attention-sink phenomenon, improving training stability and length extrapolation. The work was accepted as an oral at NeurIPS 2025 (https://arxiv.org/abs/2505.06708) and has been adopted in the open-source model Qwen3-Next (https://huggingface.co/collections/Qwen/qwen3-next). He will also share his experience doing pre-training research at Qwen and introduce the Qwen team.
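To make the mechanism concrete, here is a minimal PyTorch sketch of the variant the abstract singles out: an elementwise, input-dependent sigmoid gate applied to the softmax-attention output before the output projection. This is an assumption-laden illustration of the idea, not the paper's or Qwen3-Next's actual implementation; the module name, layer sizes, and gate placement details are illustrative only.

```python
# Sketch only (assumptions, not the paper's exact code): a query-dependent
# sigmoid gate applied after softmax attention and before the output projection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Linear(dim, dim)      # gate computed from the input (query side)
        self.out_proj = nn.Linear(dim, dim)  # attention output layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, head_dim) for multi-head attention.
        q = q.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn_out = attn_out.transpose(1, 2).reshape(b, t, d)
        # Elementwise sigmoid gate after attention, before the output projection:
        # this is where the abstract says gating is most effective.
        gated = torch.sigmoid(self.gate(x)) * attn_out
        return self.out_proj(gated)
```

In this sketch, the gate is a function of the current token's input, so each query can scale or suppress its own attention output, which is the "query-dependent sparsity" referred to in the abstract.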
All students are welcome to attend!
[Time and venue, once more] This Saturday, November 1, 15:00–17:00 Beijing time, Room 112, Tsinghua Xuetang (清华学堂112). Click here for a time-zone conversion. Tencent Meeting: 819-589-898