The rapid growth of large language model (LLM) sizes and the increasing demand for long-context inference have made memory a critical bottleneck in GPU-accelerated serving systems. Although high-bandwidth memory (HBM) on GPUs offers fast access, its limited capacity forces systems to rely on host memory (CPU DRAM) to hold larger working sets such as the KVCache. However, maximum DRAM capacity is constrained by the limited number of memory channels per CPU socket. To overcome this limitation, current systems often adopt RDMA-based disaggregated memory pools, which introduce significant challenges, including high access latency, complex communication protocols, and synchronization overhead. Fortunately, the emerging Compute Express Link (CXL) technology opens new opportunities for KVCache design. In this paper, we propose Beluga, a novel memory architecture that enables GPUs and CPUs to access a shared, large-scale memory pool through CXL switches. By supporting native load/store access semantics over the CXL fabric, our design delivers near-local-memory latency while reducing programming complexity and minimizing synchronization overhead. We conduct a systematic characterization of a commercial CXL switch-based memory pool and distill a set of design guidelines. Building on Beluga, we design and implement Beluga-KVCache, a system tailored for managing the large-scale KVCache in LLM inference. Compared to RDMA-based solutions, Beluga-KVCache achieves an 89.6% reduction in Time-To-First-Token (TTFT) and a 7.35x throughput improvement in the vLLM inference engine. To the best of our knowledge, Beluga is the first system that enables GPUs to directly access large-scale memory pools through CXL switches, marking a significant step toward low-latency, shared access to vast memory resources by GPUs.
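To make the load/store access model concrete, the sketch below shows one plausible way a GPU kernel could operate directly on memory backed by a switch-attached CXL pool. This is purely illustrative and is not the Beluga API: it assumes the pool is exposed to the host as a CPU-less NUMA node (the node ID used here is hypothetical) and that the region can be pinned and mapped for GPU access with standard libnuma and CUDA host-registration calls.

```cuda
// Minimal sketch (illustrative, not Beluga's actual interface): assumes the
// CXL switch-attached pool appears to the host as a memory-only NUMA node,
// so it can be allocated with numa_alloc_onnode() and exposed to the GPU
// via cudaHostRegister(). Build with: nvcc -lnuma
#include <cuda_runtime.h>
#include <numa.h>
#include <stdio.h>

// The kernel issues plain load/store instructions against the CXL-backed
// buffer: no DMA descriptors, no RDMA verbs, no completion polling.
__global__ void touch_kv_block(float *kv, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) kv[i] = kv[i] * 2.0f;   // direct load + store over the fabric
}

int main(void) {
    const size_t n = 1 << 20;
    const int cxl_node = 2;            // hypothetical NUMA node ID of the pool

    // Allocate a buffer from the CXL pool, then pin and map it for the GPU.
    float *kv = (float *)numa_alloc_onnode(n * sizeof(float), cxl_node);
    for (size_t i = 0; i < n; i++) kv[i] = 1.0f;
    cudaHostRegister(kv, n * sizeof(float), cudaHostRegisterMapped);

    float *kv_dev;
    cudaHostGetDevicePointer((void **)&kv_dev, kv, 0);

    touch_kv_block<<<(unsigned)((n + 255) / 256), 256>>>(kv_dev, n);
    cudaDeviceSynchronize();

    printf("kv[0] = %f\n", kv[0]);     // expected: 2.0, written by the GPU
    cudaHostUnregister(kv);
    numa_free(kv, n * sizeof(float));
    return 0;
}
```

The point of the illustration is that the GPU touches pooled memory with ordinary memory instructions, which is what removes the explicit message exchange and synchronization protocol that RDMA-based disaggregation requires.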