SIGMA: An AI-Empowered Training Stack on Early-Life Hardware

Lei Qu, Lianhai Ren, Peng Cheng, Rui Gao, Ruizhe Wang, Tianyu Chen, Xiao Liu, Xingjian Zhang, Yeyun Gong, Yifan Xiong, Yucheng Ding, Yuting Jiang, Zhenghao Lin, Zhongxin Guo, Ziyue Yang

From arXiv, 22 pages, 7 figures

An increasing variety of AI accelerators is being considered for large-scale training. However, enabling large-scale training on early-life AI accelerators faces three core challenges: frequent system disruptions and undefined failure modes that undermine reliability; numerical errors and training instabilities that threaten correctness and convergence; and the complexity of parallelism optimization, combined with unpredictable local noise, that degrades efficiency. SIGMA addresses these challenges as an open-source training stack designed to improve the reliability, stability, and efficiency of large-scale distributed training on early-life AI hardware. Its core is the LUCIA TRAINING PLATFORM (LTP), a system optimized for clusters with early-life AI accelerators. Since its launch in March 2025, LTP has improved training reliability and operational productivity: over the past five months it has achieved 94.45% effective cluster accelerator utilization while substantially reducing node recycling and job-recovery times. Building on LTP, the LUCIA TRAINING FRAMEWORK (LTF) trained SIGMA-MOE, a 200B MoE model, on 2,048 AI accelerators, achieving 21.08% MFU (model FLOPs utilization), state-of-the-art downstream accuracy, and only one stability incident over a 75-day period. Together, these results position SIGMA as a robust, cost-effective alternative to established accelerator stacks for large-scale training. The source code of SIGMA is available at https://github.com/microsoft/LuciaTrainingPlatform.

