The evolution of Large Language Model (LLM) serving toward complex, distributed architectures, specifically the Prefill/Decode-separated (P/D-separated), large-scale Data Parallelism + Expert Parallelism (DP+EP) paradigm, introduces distinct scheduling challenges. Unlike traditional deployments where schedulers can treat instances as black boxes, DP+EP architectures exhibit high internal synchronization costs. We identify that immediate request dispatching in such systems leads to severe in-engine queuing and parallelization bubbles, degrading Time-to-First-Token (TTFT). To address this, we propose Staggered Batch Scheduling (SBS), a mechanism that deliberately buffers requests to form optimal execution batches. This temporal decoupling eliminates internal queuing bubbles without compromising throughput. Furthermore, leveraging the scheduling window created by buffering, we introduce a Load-Aware Global Allocation strategy that balances computational load across DP units for both the Prefill and Decode phases. Deployed on a production H800 cluster serving DeepSeek-V3, our system reduces TTFT by 30%-40% and improves throughput by 15%-20% compared to state-of-the-art immediate-scheduling baselines.
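The two mechanisms above (buffering requests into staggered batches, then allocating each batch to the least-loaded DP unit) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; all names (`SBSScheduler`, `window_ms`, `maybe_dispatch`, the use of prompt length as a prefill-cost proxy) are assumptions introduced here for exposition.

```python
# Illustrative sketch of Staggered Batch Scheduling (SBS) with
# load-aware global allocation. Hypothetical names and cost model;
# the actual system's policies are not specified in the abstract.
import heapq


class SBSScheduler:
    def __init__(self, num_dp_units, batch_size, window_ms):
        self.batch_size = batch_size  # target batch size per dispatch
        self.window_ms = window_ms    # max buffering (staggering) delay
        self.buffer = []              # pending (arrival_ms, prompt_len) requests
        # Min-heap of (accumulated_load, dp_unit_id) for load-aware allocation.
        self.dp_loads = [(0, i) for i in range(num_dp_units)]
        heapq.heapify(self.dp_loads)

    def submit(self, arrival_ms, prompt_len):
        """Buffer the request instead of dispatching it immediately."""
        self.buffer.append((arrival_ms, prompt_len))

    def maybe_dispatch(self, now_ms):
        """Dispatch when the batch is full or the staggering window expires."""
        if not self.buffer:
            return None
        oldest_arrival = self.buffer[0][0]
        if (len(self.buffer) < self.batch_size
                and now_ms - oldest_arrival < self.window_ms):
            return None  # keep buffering: fuller batches avoid in-engine bubbles
        batch = self.buffer[:self.batch_size]
        self.buffer = self.buffer[self.batch_size:]
        # Load-aware global allocation: pick the least-loaded DP unit.
        load, unit = heapq.heappop(self.dp_loads)
        load += sum(p for _, p in batch)  # prompt tokens approximate prefill cost
        heapq.heappush(self.dp_loads, (load, unit))
        return unit, batch
```

The window bounds the extra latency a buffered request can incur, so the scheduler trades a small, controlled delay for fuller batches and balanced DP-unit load.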