Global cloud service providers handle inference workloads for Large Language Models (LLMs) that span latency-sensitive (e.g., chatbots) and latency-insensitive (e.g., report writing) tasks, resulting in diverse and often conflicting Service Level Agreement (SLA) requirements. Managing such mixed workloads is challenging due to the complexity of the inference serving stack, which encompasses multiple models, GPU hardware, and global data centers. Existing solutions often silo fast and slow tasks onto separate GPU resource pools with different SLAs, but this leads to significant under-utilization of expensive accelerators due to load mismatch. In this article, we characterize the LLM serving workloads at Microsoft Office 365, one of the largest users of LLMs within the Microsoft Azure cloud, serving over 10 million requests per day, and highlight key observations across workloads in different data center regions and across time. This is one of the first such public studies of Internet-scale LLM workloads. We use these insights to propose SageServe, a comprehensive LLM serving framework that dynamically adapts to workload demands using multi-timescale control knobs. It combines short-term request routing across data centers with longer-lead-time actions, namely scaling of GPU VMs and model placement, and co-optimizes routing and resource allocation using a traffic forecast model and an Integer Linear Programming (ILP) formulation. We evaluate SageServe through real runs and realistic simulations on 10 million production requests across three regions and four open-source models. We achieve up to 25% savings in GPU-hours compared to the current baseline deployment and reduce GPU-hour wastage due to inefficient auto-scaling by 80%, resulting in a potential monthly cost savings of up to $2.5 million, while maintaining tail latency and meeting SLAs.
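To make the co-optimization concrete, the following is a minimal sketch of a joint routing-and-scaling ILP of the kind the abstract describes, written with the PuLP solver library. All region names, capacities, costs, and forecast values are illustrative assumptions, not parameters from the paper; the actual SageServe formulation includes additional knobs such as model placement and SLA-class constraints.

```python
# Illustrative sketch (assumed parameters throughout): jointly choose
# cross-region request routing (short-term knob) and per-region GPU VM
# counts (long-term knob) to minimize cost, subject to forecast demand.
import pulp

regions = ["region-A", "region-B", "region-C"]                 # hypothetical data centers
forecast = {"region-A": 120, "region-B": 80, "region-C": 40}   # assumed forecast req/s per origin
vm_capacity = 50      # assumed req/s one GPU VM sustains within SLA
vm_cost = 10.0        # assumed cost per GPU VM-hour
route_penalty = 0.01  # assumed penalty for serving traffic cross-region

prob = pulp.LpProblem("sageserve_sketch", pulp.LpMinimize)

# x[o][d]: req/s originating in region o that are served in region d.
x = pulp.LpVariable.dicts("route", (regions, regions), lowBound=0)
# g[d]: integer number of GPU VMs provisioned in region d.
g = pulp.LpVariable.dicts("vms", regions, lowBound=0, cat=pulp.LpInteger)

# Objective: GPU VM cost plus a small penalty for cross-region routing.
prob += (
    pulp.lpSum(vm_cost * g[d] for d in regions)
    + pulp.lpSum(route_penalty * x[o][d]
                 for o in regions for d in regions if o != d)
)

# All forecast demand from each origin must be routed somewhere.
for o in regions:
    prob += pulp.lpSum(x[o][d] for d in regions) == forecast[o]

# Traffic served in a region must fit within its provisioned capacity.
for d in regions:
    prob += pulp.lpSum(x[o][d] for o in regions) <= vm_capacity * g[d]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for d in regions:
    print(d, "VMs:", int(g[d].value()))
```

Under these toy numbers the solver consolidates load so that VM counts just cover aggregate demand rather than provisioning each region for its own peak, which is the intuition behind the reported GPU-hour savings.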