Optimizing large language model (LLM) training on distributed domain-specific accelerator systems is challenging due to its complex optimization space. Existing methods rely on time-consuming manual tuning or resource-intensive black-box searches, which struggle to keep pace with the rapidly evolving LLM domain, leading to slow development and underutilized hardware. To address this, we introduce ASAP, an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training. ASAP is a multi-agent system, comprising Coordinator, Analyzer, and Proposal agents, that integrates LLM reasoning with insights from performance profiling tools, roofline analysis, and a knowledge base of best practices and successful past optimizations from human experts. The proposed design automates the diagnosis of performance bottlenecks and recommends optimized sharding configurations with supporting reasoning, effectively improving the efficiency of distributed LLM training. Experiments show that ASAP-generated sharding configurations achieve up to a 28% reduction in training step time and a 1.43× throughput improvement. When combined with additional optimization from human experts, throughput increases to 2.58× over the baseline. ASAP thus promises a scalable and explainable methodology for AI-assisted performance engineering in large-scale LLM training.