Recent reasoning large language models (LLMs) excel at complex tasks but face significant computational and memory costs due to long sequence lengths. KV cache compression has emerged as an effective approach to improving reasoning efficiency. However, existing methods focus on prompt compression or token eviction based on local attention scores, overlooking the long-term importance of tokens. We propose G-KV, a KV cache eviction method with a global scoring mechanism that combines local and historical attention scores to assess token importance more accurately. In addition, we introduce post-training techniques, including reinforcement learning and distillation, to adapt models to compressed KV cache settings. The code is available at: https://github.com/microsoft/G-KV.
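The global scoring idea can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: it assumes the historical component is tracked as an exponential moving average of per-step attention scores (the blending rule, the decay factor `beta`, and all function names are assumptions for illustration only), and evicts the lowest-scoring cached tokens down to a fixed budget.

```python
import numpy as np

def update_global_scores(global_scores, local_attn, beta=0.9):
    """Blend historical and current-step (local) attention into a global score.

    The EMA form and beta=0.9 are illustrative assumptions, not G-KV's formula.
    """
    return beta * global_scores + (1.0 - beta) * local_attn

def evict_to_budget(global_scores, budget):
    """Keep the `budget` cached tokens with the highest global scores."""
    keep = np.argsort(global_scores)[-budget:]
    return np.sort(keep)  # indices of retained tokens, in cache order

# Toy example: 6 cached tokens, attention observed over 2 decode steps.
scores = np.zeros(6)
for step_attn in [np.array([0.3, 0.1, 0.2, 0.1, 0.2, 0.1]),
                  np.array([0.1, 0.05, 0.4, 0.15, 0.2, 0.1])]:
    scores = update_global_scores(scores, step_attn)

kept = evict_to_budget(scores, budget=4)
```

A purely local policy would rank tokens by the last step's attention alone; the averaged score lets a token that was consistently attended to in earlier steps survive a step where it is momentarily unimportant.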