Moonlight: DPU-based in-network checkpoint structure for large model training

Zhao Weiyue* **; Wu Jingya*; Lu Wenyan*; Li Huawei*; Li Xiaowei*; Yan Guihai*

Article abstract

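Zhao Weiyue* **, Wu Jingya*, Lu Wenyan*, Li Huawei*, Li Xiaowei*, Yan Guihai*. Moonlight: DPU-based in-network checkpoint structure for large model training[J]. Chinese High Technology Letters (高技术通讯, Chinese edition), 2026, 36(3): 230-243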

Moonlight: DPU-based in-network checkpoint structure for large model training

DOI: 10.3772/j.issn.1002-0470.2026.03.002

Keywords: checkpoint; remote direct memory access (RDMA); data processing unit (DPU); large model training

Funding:

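Author	Affiliation
Zhao Weiyue* **	(* State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) (** University of Chinese Academy of Sciences, Beijing 100049)
Wu Jingya*
Lu Wenyan*
Li Huawei*
Li Xiaowei*
Yan Guihai*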

Abstract:

Large models have a vast number of parameters, and training them requires thousands of neural network accelerators working together for days or even months. Because parallel training algorithms require frequent data synchronization, an error on a single training node can wipe out training progress and waste substantial computing resources. Large-model training clusters typically use checkpoints to recover from errors. However, current checkpoint schemes incur heavy host central processing unit (CPU) usage and memory overhead, and checkpoint saving efficiency is limited by the low bandwidth of remote storage. To address these issues, this paper proposes Moonlight, a new in-network checkpoint scheme built on the data processing unit (DPU). Moonlight offloads the checkpoint workload from the host to network devices, provides checkpoint management and control functions through hardware structures designed for this purpose, and uses network-device storage to implement multi-level checkpoint storage, yielding an efficient checkpoint structure for large model training. Experimental results show that: (1) Moonlight effectively offloads the checkpoint workload from the host: the checkpoint data plane incurs no host CPU overhead, and host memory overhead is negligible; (2) Moonlight saves checkpoints efficiently: its backend storage bandwidth is 4.10 times that of existing commercial solutions, and its saving efficiency for checkpoint data packets is 1.96 times that of the baseline scheme.
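The idea of multi-level checkpoint storage mentioned in the abstract can be illustrated with a toy model. The sketch below is not Moonlight's design, which the abstract does not detail; it is a minimal, purely in-memory illustration that assumes a small fast tier (standing in for storage on a network device) and a large slow tier (standing in for remote storage). The class name `TwoTierCheckpointStore` and all of its methods are hypothetical.

```python
from collections import OrderedDict


class TwoTierCheckpointStore:
    """Toy two-level checkpoint store (hypothetical; not Moonlight's actual design).

    `near` stands in for small, fast storage on a network device;
    `far` stands in for large, slower remote storage.
    """

    def __init__(self, near_capacity: int):
        if near_capacity < 1:
            raise ValueError("near_capacity must be at least 1")
        self.near_capacity = near_capacity
        self.near = OrderedDict()  # step -> bytes, oldest entry first
        self.far = {}              # step -> bytes

    def save(self, step: int, blob: bytes) -> None:
        """Write to the near tier; demote the oldest entries once it is full."""
        self.near[step] = blob
        self.near.move_to_end(step)
        while len(self.near) > self.near_capacity:
            old_step, old_blob = self.near.popitem(last=False)
            self.far[old_step] = old_blob

    def flush(self) -> None:
        """Copy every near-tier checkpoint to the far tier for durability."""
        self.far.update(self.near)

    def fail_near_tier(self) -> None:
        """Simulate losing the near tier, for example after a device reset."""
        self.near.clear()

    def latest(self):
        """Return (step, blob, tier) for the newest checkpoint, or None."""
        steps = set(self.near) | set(self.far)
        if not steps:
            return None
        step = max(steps)
        if step in self.near:
            return step, self.near[step], "near"
        return step, self.far[step], "far"
```

In this model, recovery reads the newest checkpoint from the fast tier when it is available and falls back to the slow tier otherwise. How often `flush` runs decides how much progress is lost if the fast tier fails.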
