The proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+ context length support has resulted in increasing demands for on-chip memory to hold large KV caches. Techniques such as StreamingLLM and SnapKV demonstrate how to control KV cache size while maintaining model accuracy, yet these techniques are rarely used in industrial deployments built on frameworks like vLLM or SGLang. The reason is twofold: on one hand, the static graphs and continuous batching employed by these frameworks make it difficult to accommodate modifications to the standard multi-head attention algorithm; on the other hand, the accuracy implications of such techniques for modern instruction-following and reasoning models are not well understood, obscuring the case for implementing them. In this paper, we study these accuracy implications on Llama-3.1-8B-Instruct and DeepSeek-R1, and develop SnapStream, a KV cache compression method that can be deployed at scale. We demonstrate the efficacy of SnapStream in a 16-way tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators running at 128k context length and up to 1832 tokens per second in a real production setting. SnapStream improves on-chip memory usage by $4\times$ while introducing minimal accuracy degradation on LongBench-v2, AIME24, and LiveCodeBench. To the best of our knowledge, this is the first implementation of sparse KV attention techniques deployed in a production inference system with static graphs and continuous batching.
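For intuition, the sketch below illustrates the kind of StreamingLLM/SnapKV-style compression step the abstract refers to: a few initial attention-sink tokens and a recent window are always retained, while middle tokens are ranked by their attention mass under an observation window of recent queries and pruned to a fixed budget. This is a minimal single-head NumPy sketch under stated assumptions, not the paper's implementation; the function and parameter names (`compress_kv_cache`, `n_sink`, `n_recent`, `n_select`) are hypothetical.

```python
import numpy as np

def compress_kv_cache(keys, values, queries_obs,
                      n_sink=4, n_recent=1024, n_select=2048):
    """Hypothetical sketch of sink + window + importance-based KV compression.

    keys, values: (seq_len, d) cached K/V for one attention head
    queries_obs:  (obs_len, d) queries from a recent observation window,
                  used to score which middle tokens to keep (SnapKV-style)
    Keeps: the first n_sink "attention sink" tokens (StreamingLLM-style),
           the n_recent most recent tokens, and the n_select middle tokens
           with the highest attention mass under the observation queries.
    """
    seq_len, d = keys.shape
    budget = n_sink + n_select + n_recent
    if seq_len <= budget:
        return keys, values  # cache already fits the budget

    middle = np.arange(n_sink, seq_len - n_recent)

    # Softmax attention of observation-window queries over the full cache.
    logits = queries_obs @ keys.T / np.sqrt(d)          # (obs_len, seq_len)
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)

    # Average attention weight each middle token receives.
    scores = attn[:, middle].mean(axis=0)               # (len(middle),)

    # Keep sinks, the top-scoring middle tokens (in original order),
    # and the recent window.
    top = middle[np.argsort(scores)[-n_select:]]
    keep = np.concatenate([np.arange(n_sink), np.sort(top),
                           np.arange(seq_len - n_recent, seq_len)])
    return keys[keep], values[keep]
```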