Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference

The evolution of Large Language Model (LLM) serving towards complex, distributed architectures, specifically the Prefill/Decode (P/D)-separated, large-scale data-parallel plus expert-parallel (DP+EP) paradigm, introduces distinct scheduling challenges. Unlike traditional deployments, where schedulers can treat instances as black boxes, DP+EP architectures incur high internal synchronization costs. We identify that dispatching requests immediately in such systems leads to severe in-engine queuing and parallelization bubbles, which degrade Time-to-First-Token (TTFT). To address this, we propose Staggered Batch Scheduling (SBS), a mechanism that deliberately buffers requests to form optimal execution batches. This temporal decoupling eliminates internal queuing bubbles without compromising throughput. Furthermore, leveraging the scheduling window created by buffering, we introduce a Load-Aware Global Allocation strategy that balances computational load across DP units in both the Prefill and Decode phases. Deployed on a production H800 cluster serving DeepSeek-V3, our system reduces TTFT by 30%-40% and improves throughput by 15%-20% compared with state-of-the-art immediate-scheduling baselines.
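The two ideas in the abstract, buffering requests into a batch and then allocating that batch across DP units by load, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the system's actual design: the names `Request`, `StaggeredBatchScheduler`, and `allocate` are hypothetical, the fixed time window stands in for whatever batching trigger the real scheduler uses, prompt token count stands in for computational load, and the greedy longest-first assignment is a generic load-balancing heuristic swapped in because the abstract does not specify the allocation algorithm.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Request:
    # Hypothetical request record; prompt_tokens is an assumed proxy for prefill cost.
    req_id: str
    prompt_tokens: int


def allocate(batch: List[Request], num_dp_units: int) -> Dict[int, List[Request]]:
    """Greedy longest-first assignment: give each request, largest first,
    to the DP unit with the smallest accumulated token load.
    This is a generic heuristic, not the paper's stated algorithm."""
    heap = [(0, unit) for unit in range(num_dp_units)]  # (load, unit id)
    heapq.heapify(heap)
    assignment: Dict[int, List[Request]] = {u: [] for u in range(num_dp_units)}
    for req in sorted(batch, key=lambda r: r.prompt_tokens, reverse=True):
        load, unit = heapq.heappop(heap)
        assignment[unit].append(req)
        heapq.heappush(heap, (load + req.prompt_tokens, unit))
    return assignment


@dataclass
class StaggeredBatchScheduler:
    num_dp_units: int
    window_s: float  # assumed fixed buffering window, in seconds
    _buffer: List[Request] = field(default_factory=list)
    _window_start: Optional[float] = None

    def submit(self, req: Request, now: float) -> None:
        # Buffer the request instead of dispatching it immediately.
        if not self._buffer:
            self._window_start = now
        self._buffer.append(req)

    def maybe_dispatch(self, now: float) -> Optional[Dict[int, List[Request]]]:
        # Release the whole buffer as one batch once the window has elapsed.
        if not self._buffer or now - self._window_start < self.window_s:
            return None
        batch, self._buffer = self._buffer, []
        self._window_start = None
        return allocate(batch, self.num_dp_units)
```

A caller would invoke `submit` as requests arrive and poll `maybe_dispatch` with the current time. A `None` result means the window is still open or the buffer is empty. Otherwise the result maps each DP unit index to the requests assigned to it.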
