An increasing variety of AI accelerators is being considered for large-scale training. However, enabling large-scale training on early-life AI accelerators faces three core challenges: frequent system disruptions and undefined failure modes that undermine reliability; numerical errors and training instabilities that threaten correctness and convergence; and the complexity of parallelism optimization, compounded by unpredictable local noise, that degrades efficiency. To address these challenges, we present SIGMA, an open-source training stack designed to improve the reliability, stability, and efficiency of large-scale distributed training on early-life AI hardware. At the core of this initiative is the LUCIA TRAINING PLATFORM (LTP), a system optimized for clusters built on early-life AI accelerators. Since its launch in March 2025, LTP has significantly improved training reliability and operational productivity. Over the past five months, it has sustained 94.45% effective cluster accelerator utilization while substantially reducing node-recycling and job-recovery times. Building on LTP, the LUCIA TRAINING FRAMEWORK (LTF) successfully trained SIGMA-MOE, a 200B-parameter mixture-of-experts (MoE) model, on 2,048 AI accelerators. The run achieved 21.08% model FLOPs utilization (MFU) and state-of-the-art downstream accuracy, with only one stability incident over a 75-day period. Together, these advances show that SIGMA not only tackles the critical challenges of large-scale training but also sets a new benchmark for AI infrastructure and platform innovation, offering a robust, cost-effective alternative to prevailing established accelerator stacks and significantly advancing AI capabilities and scalability. The source code of SIGMA is available at https://github.com/microsoft/LuciaTrainingPlatform.
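
For context on the MFU figure, below is a minimal sketch of how model FLOPs utilization is commonly computed. The function and every numeric value in it are illustrative assumptions for a generic MoE training run, not measurements from the SIGMA-MOE training described above.

```python
# Minimal sketch of the standard MFU calculation (illustrative only).
# MFU = achieved model FLOPs per second / aggregate peak FLOPs per second.

def mfu(active_params: float, tokens_per_second: float,
        num_accelerators: int, peak_flops_per_accelerator: float) -> float:
    """Estimate MFU using the common ~6*N*T approximation for training
    FLOPs, where N is the number of parameters active per token (for MoE
    models, only the activated experts count, not all 200B parameters)
    and T is the training throughput in tokens per second."""
    achieved_flops = 6.0 * active_params * tokens_per_second
    peak_flops = num_accelerators * peak_flops_per_accelerator
    return achieved_flops / peak_flops

# Hypothetical example: 20B active parameters, 1.08M tokens/s on 2,048
# accelerators rated at 300 TFLOP/s each (all values assumed).
print(f"MFU = {mfu(20e9, 1.08e6, 2048, 300e12):.2%}")
```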