Large Language Models (LLMs) are increasingly deployed in both latency-sensitive online services and cost-sensitive offline workloads. Co-locating these workloads on shared serving instances can improve resource utilization, but directly applying this approach to Prefill/Decode (P/D) disaggregated systems introduces severe load imbalance, as fluctuating request mixes alter the intrinsic P/D ratio. Existing dynamic adjustment techniques cannot keep up with the bursty traffic patterns of online services. We propose a latency-constrained disaggregated architecture that separates cluster resources into latency-strict and latency-relaxed pools based on task latency requirements. This design enables flexible placement of offline decode tasks, mitigating P/D imbalance while preserving online performance. To fully exploit this flexibility, we propose (1) a bottleneck-aware scheduler guided by a Roofline-based performance model, and (2) a fast preemption mechanism that strictly enforces Service Level Objectives (SLOs) for online requests. Experiments on real-world traces show that, compared to existing offline serving approaches, our method improves offline throughput by up to 3x while maintaining online request SLOs.
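The Roofline model referenced above classifies a workload phase as compute-bound or memory-bandwidth-bound from its arithmetic intensity. A minimal sketch of that classification, using hypothetical hardware numbers rather than the paper's calibrated model:

```python
# Roofline-style bottleneck check (illustrative sketch only; the peak
# numbers below are hypothetical placeholders, not the paper's model).

PEAK_FLOPS = 312e12  # hypothetical GPU peak compute, FLOP/s
PEAK_BW = 2.0e12     # hypothetical memory bandwidth, bytes/s

def attainable_flops(arithmetic_intensity: float) -> float:
    """Roofline: attainable performance is capped by the lower of the
    compute roof and the bandwidth roof at this intensity (FLOP/byte)."""
    return min(PEAK_FLOPS, PEAK_BW * arithmetic_intensity)

def bottleneck(arithmetic_intensity: float) -> str:
    """Classify a phase as compute- or memory-bound relative to the
    ridge point, where the two roofs intersect."""
    ridge = PEAK_FLOPS / PEAK_BW
    return "compute-bound" if arithmetic_intensity >= ridge else "memory-bound"

# Prefill reuses each weight across many prompt tokens (high intensity),
# while decode streams all weights for every generated token (low intensity).
print(bottleneck(300.0))  # prefill-like intensity -> compute-bound
print(bottleneck(1.0))    # decode-like intensity  -> memory-bound
```

This is the intuition a bottleneck-based scheduler can exploit: pairing memory-bound decode tasks with compute-bound prefill work keeps both roofs of the machine busy.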