We report on the first large-scale mixture-of-experts (MoE) pretraining study conducted entirely on AMD hardware, using MI300X GPUs with Pollara networking. From this study we distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts over Pollara. To our knowledge, this is the first such characterization at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks, and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-overlooked utilities such as fault tolerance and checkpoint reshaping, and give a detailed account of our training recipe. We also preview our model architecture and base model, ZAYA1 (an MoE with 760M active and 8.3B total parameters, available at https://huggingface.co/Zyphra/ZAYA1-base), which will be improved further in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models of its scale and larger, such as Qwen3-4B and Gemma3-12B, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, networking, and software stack is sufficiently mature and well-optimized for competitive large-scale pretraining.
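To make the collective-microbenchmark methodology concrete, the following is a minimal sketch (not the paper's actual harness) of how such a sweep over collectives and message sizes can be timed with PyTorch's distributed API; on a ROCm build of PyTorch the "nccl" backend maps to RCCL, which runs over the Pollara fabric. The script name, message-size range, dtype, and timing loop are illustrative assumptions.

```python
# bench_collectives.py -- illustrative sketch, launch with e.g.:
#   torchrun --nproc-per-node=8 bench_collectives.py
import os
import torch
import torch.distributed as dist


def time_collective(fn, warmup=5, iters=20):
    """Time one collective with device events; return mean latency in ms."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters


def main():
    dist.init_process_group(backend="nccl")  # RCCL on ROCm builds
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Sweep message sizes from 1 KiB to 256 MiB (bf16 elements).
    for n_bytes in [2 ** p for p in range(10, 29)]:
        n_elems = n_bytes // 2
        x = torch.randn(n_elems, dtype=torch.bfloat16, device="cuda")
        gathered = torch.empty(n_elems * world, dtype=torch.bfloat16, device="cuda")

        results = {
            "all_reduce": time_collective(lambda: dist.all_reduce(x)),
            "all_gather": time_collective(lambda: dist.all_gather_into_tensor(gathered, x)),
            "broadcast": time_collective(lambda: dist.broadcast(x, src=0)),
        }
        if n_elems % world == 0:
            scattered = torch.empty(n_elems // world, dtype=torch.bfloat16, device="cuda")
            results["reduce_scatter"] = time_collective(
                lambda: dist.reduce_scatter_tensor(scattered, x))

        if rank == 0:
            line = ", ".join(f"{k}={v:.3f} ms" for k, v in results.items())
            print(f"msg={n_bytes / 2**20:.3f} MiB: {line}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Running this across different node counts yields latency/bandwidth curves per collective and message size, which is the shape of data the cluster characterization above refers to.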