Language models (LMs) underpin emerging mobile and embedded AI applications such as meeting and video summarization and document analysis, which often require processing multiple long-context inputs. Running an LM locally on-device improves privacy, enables offline use, and reduces cost, but long-context inference quickly hits a \emph{memory capacity wall} as the key-value (KV) cache grows linearly with context length and batch size. Existing KV-cache offloading schemes are designed to transfer cache data from GPU memory to CPU memory; they are therefore ill-suited to embedded and mobile systems, where the CPU and GPU (or NPU) typically share a unified memory and the non-volatile secondary storage (disk) offers limited I/O bandwidth. We present KVSwap, a software framework tailored for local devices that achieves high memory efficiency while effectively leveraging disk storage. KVSwap stores the full KV cache on disk, uses highly compact in-memory metadata to predict which cache entries to preload, overlaps computation with hardware-aware disk access, and orchestrates read patterns to match storage-device characteristics. Our evaluation shows that, across representative LMs and storage types, KVSwap delivers higher throughput than existing KV-cache offloading schemes under tight memory budgets while maintaining generation quality.