The widespread deployment of large language models (LLMs) in interactive applications requires serving systems that can handle thousands of concurrent requests with diverse Service Level Objective (SLO) requirements. A critical yet often overlooked dimension is the inherent difference in priority among clients: business-critical functions, for instance, demand stronger performance guarantees because fulfilling their requests yields significantly greater business value. Existing LLM serving schedulers, however, do not jointly optimize SLO attainment and client-level priority. To bridge this gap, we first formalize multi-priority request scheduling as a service gain maximization problem, in which meeting the latency requirement of a request contributes a gain that depends on the request's priority. We then propose PROSERVE, a unified two-tier scheduling framework designed to maximize overall service gain. At the engine level, SlideBatching dynamically adapts batch formation and request ordering to the current load, using a sliding boundary mechanism to balance deadline-first and density-first strategies. At the service level, GoRouting performs gain-oriented, capability-aware dispatching across distributed instances, proactively reserving capacity for future high-priority or long requests. Extensive evaluation on four open-source datasets and a real-world industrial trace shows that PROSERVE consistently outperforms state-of-the-art baselines, improving system gain by up to 35% and SLO attainment by up to 52%.
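To make the service gain objective concrete, the following is a minimal sketch under simplifying assumptions, not the PROSERVE implementation. It assumes requests are served one at a time (no batching), that each request earns a fixed gain only if it finishes by its deadline, and that service times are known in advance. The names `Request`, `service_gain`, `deadline_first`, and `density_first` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    gain: float          # assumed gain earned if the latency SLO is met
    deadline: float      # latency SLO in seconds, measured from time 0
    service_time: float  # assumed known processing time in seconds


def service_gain(order):
    """Total gain when requests are served sequentially in the given order."""
    clock, total = 0.0, 0.0
    for req in order:
        clock += req.service_time
        if clock <= req.deadline:  # gain counts only if the SLO is met
            total += req.gain
    return total


def deadline_first(requests):
    """Earliest-deadline-first ordering."""
    return sorted(requests, key=lambda r: r.deadline)


def density_first(requests):
    """Highest gain per unit of service time first."""
    return sorted(requests, key=lambda r: r.gain / r.service_time, reverse=True)


# Overloaded queue: not every deadline can be met.
overloaded = [Request(1.0, 2.0, 2.0), Request(5.0, 3.0, 2.0), Request(1.0, 4.0, 1.0)]
# Lightly loaded queue: every deadline can be met with the right order.
light = [Request(5.0, 10.0, 2.0), Request(1.0, 2.0, 2.0)]

for name, queue in [("overloaded", overloaded), ("light", light)]:
    print(name,
          "deadline-first:", service_gain(deadline_first(queue)),
          "density-first:", service_gain(density_first(queue)))
```

In this toy setting, density-first ordering collects more gain on the overloaded queue, while deadline-first ordering collects more on the lightly loaded one. Neither fixed policy wins under both loads, which is the kind of trade-off a load-adaptive scheduler has to balance.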