Deploying multiple models within shared GPU clusters is promising for improving resource efficiency in large language model (LLM) serving. Existing multi-LLM serving systems optimize GPU utilization at the cost of degraded inference performance, especially time-to-first-token (TTFT). We identify the root cause of this compromise as their unawareness of future workload characteristics. In contrast, recent analyses of real-world traces have shown the high periodicity and long-term predictability of LLM serving workloads. We propose universal GPU workers to enable one-for-many GPU prewarming that loads models with knowledge of future workloads. Based on universal GPU workers, we design and build WarmServe, a multi-LLM serving system that (1) mitigates cluster-wide prewarming interference by adopting an evict-aware model placement strategy, (2) prepares universal GPU workers in advance via proactive prewarming, and (3) manages GPU memory with a zero-overhead memory switching mechanism. Evaluation on real-world datasets shows that WarmServe improves TTFT by up to 50.8$\times$ compared to the state-of-the-art autoscaling-based system, while serving up to 2.5$\times$ more requests than the GPU-sharing system.
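To make the proactive-prewarming idea concrete, below is a minimal sketch (not WarmServe's actual implementation) of a scheduler that uses a periodic workload forecast to load a model onto an idle worker a few time slots before its predicted traffic spike. All names, thresholds, and data structures here are hypothetical illustrations of the general technique the abstract describes.

```python
# Hypothetical sketch of forecast-driven proactive prewarming.
# Assumption: per-model request-rate predictions are available per time slot,
# reflecting the periodicity of real-world LLM serving workloads.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelForecast:
    name: str
    predicted_rps: list  # predicted requests/sec for each upcoming slot


@dataclass
class Worker:
    loaded: Optional[str] = None  # model currently resident on this GPU worker


def plan_prewarm(forecasts, workers, slot, threshold=1.0, lead=2):
    """Assign idle workers to models predicted to exceed `threshold` rps
    `lead` slots from now, so weights are resident before traffic arrives."""
    def future_rps(f):
        idx = min(slot + lead, len(f.predicted_rps) - 1)
        return f.predicted_rps[idx]

    actions = []
    idle = [w for w in workers if w.loaded is None]
    # Consider the hottest predicted models first.
    for f in sorted(forecasts, key=future_rps, reverse=True):
        already_loaded = any(w.loaded == f.name for w in workers)
        if future_rps(f) >= threshold and not already_loaded and idle:
            w = idle.pop()
            w.loaded = f.name  # stands in for the actual weight-loading step
            actions.append((w, f.name))
    return actions


if __name__ == "__main__":
    forecasts = [
        ModelForecast("llama-7b", [0.2, 0.5, 3.0, 4.0]),  # spike predicted at slot 2
        ModelForecast("qwen-14b", [0.1, 0.1, 0.2, 0.3]),
    ]
    workers = [Worker(), Worker()]
    # At slot 0 with lead 2, llama-7b is prewarmed ahead of its spike.
    print(plan_prewarm(forecasts, workers, slot=0))
```

The key property this sketch tries to capture is that loading happens off the request path: by the time a predicted spike arrives, the model is already resident, so requests avoid the cold-start penalty that inflates TTFT in reactive autoscaling systems.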