Efficient MoE Serving in the Memory-Bound Regime: Balance Activated Experts, Not Tokens

Yanpeng Yu, Haiyue Ma, Krish Agarwal, Nicolai Oswald, Qijing Huang, Hugo Linsenmaier, Chunhui Mei, Ritchie Zhao, Ritika Borkar, Bita Darvish Rouhani, David Nellans, Ronny Krashinsky, Anurag Khandelwal

Expert Parallelism (EP) permits Mixture of Experts (MoE) models to scale beyond a single GPU. To address load imbalance across GPUs in EP, existing approaches aim to balance the number of tokens each GPU processes. Surprisingly, we find that this objective degrades performance rather than improving it when processing is memory-bound, a common occurrence in MoE serving, especially in the decode phase. Our analysis reveals that balancing the number of tokens processed per GPU increases the number of activated experts, exacerbating memory pressure in the memory-bound regime. We propose Minimum Expert Token ROuting (METRO), a novel token-routing algorithm for high-performance expert-parallel MoE serving in the memory-bound regime that balances the number of activated experts per GPU rather than token counts. METRO achieves near-optimal routing quality with minimal computational overhead by jointly optimizing algorithmic efficiency and leveraging the GPU's parallel processing power. To guarantee routing quality, METRO also employs a novel allGather scheme to gather global top-k knowledge, which has minimal overhead compared to conventional allToAll. Our evaluation of METRO against EPLB on both real systems (vLLM over 8 A100 GPUs) and a proprietary simulator (8-16 B200 GPUs) shows that METRO reduces decode latency by 11-22% and improves total token throughput by 3-21% for Qwen3 and DeepSeek-V3 serving, where prefill and decode phases are co-deployed. In addition, by trading latency headroom for throughput, METRO improves decode throughput by up to 4.11x over EPLB at a fixed decode SLO.
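
To make the central observation concrete, the following toy sketch contrasts the two routing objectives on a hand-built batch. It is not the METRO algorithm, whose details the abstract does not give. It assumes, for illustration only, that some experts are replicated on more than one GPU, that each token needs a single expert (top-1), and that the two greedy policies `route_token_balanced` and `route_expert_balanced` are reasonable stand-ins for "balance tokens" and "balance activated experts". All names and the placement in `REPLICAS` are hypothetical.

```python
from collections import defaultdict

# Hypothetical placement: logical expert id -> GPUs holding a replica of it.
REPLICAS = {0: [0, 1], 1: [0, 1], 2: [0], 3: [1]}
# Hypothetical decode batch: the expert each token was routed to (top-1).
TOKENS = [0, 0, 0, 0, 1, 1, 2, 3]


def route_token_balanced(token_experts, replicas):
    """Greedy stand-in for token balancing: each token goes to the
    replica GPU that currently holds the fewest tokens."""
    tokens_per_gpu = defaultdict(int)
    active = defaultdict(set)  # gpu -> logical experts activated on it
    for expert in token_experts:
        gpu = min(replicas[expert], key=lambda g: (tokens_per_gpu[g], g))
        tokens_per_gpu[gpu] += 1
        active[gpu].add(expert)
    return dict(tokens_per_gpu), {g: len(s) for g, s in active.items()}


def route_expert_balanced(token_experts, replicas):
    """Greedy stand-in for activated-expert balancing: all tokens of an
    expert share one replica, placed on the GPU with the fewest
    activated experts so far."""
    tokens_per_gpu = defaultdict(int)
    active = defaultdict(set)
    chosen = {}  # logical expert -> GPU selected for this batch
    for expert in token_experts:
        if expert not in chosen:
            chosen[expert] = min(
                replicas[expert], key=lambda g: (len(active[g]), g)
            )
        gpu = chosen[expert]
        tokens_per_gpu[gpu] += 1
        active[gpu].add(expert)
    return dict(tokens_per_gpu), {g: len(s) for g, s in active.items()}


if __name__ == "__main__":
    for policy in (route_token_balanced, route_expert_balanced):
        tokens, activated = policy(TOKENS, REPLICAS)
        print(policy.__name__, "tokens:", tokens, "activated:", activated)
```

On this batch the token-balanced policy gives each GPU 4 tokens but activates 3 experts on each, while the expert-balanced policy accepts a 5/3 token split and activates only 2 experts per GPU. If per-GPU time in the memory-bound regime is dominated by loading the weights of activated experts, as the abstract argues, the second assignment is the cheaper one despite its token imbalance.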
