Large Language Models (LLMs) have achieved unprecedented success across a wide range of applications, but their substantial memory requirements pose significant challenges to current memory system designs, especially during inference. Our work targets architectures built around a last-level cache (LLC), including GPUs (e.g., NVIDIA GPUs) and AI accelerators. We introduce LLaMCAT, a novel approach to optimizing the LLC for LLM inference. LLaMCAT combines Miss Status Holding Register (MSHR)- and load-balance-aware cache arbitration with thread throttling to meet the stringent bandwidth demands of KV cache accesses and to minimize the resulting cache stalls. We also propose a hybrid simulation framework that integrates analytical models with cycle-level simulators via memory traces, balancing architectural detail and simulation efficiency. Experiments demonstrate that LLaMCAT achieves an average speedup of 1.26x when the system is bottlenecked mainly by miss-handling throughput, whereas the baselines, which are not optimized for this scenario, mostly cause slowdowns. When the cache size is also limited, our policy achieves a 1.58x speedup over the unoptimized version and a 1.26x improvement over the best baseline (dyncta). Overall, LLaMCAT is the first to target the MSHR contention specific to LLM decoding, a gap in prior work, and it presents a practical solution for accelerating LLM inference on future hardware platforms.
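To make the throttling idea concrete, below is a minimal Python sketch of an MSHR-occupancy-driven thread-throttling controller. It assumes an interval-based control loop that samples per-partition MSHR occupancy and reservation failures; all class, function, and threshold names (e.g., `throttle_decision`, `high_watermark`) are illustrative assumptions and do not reflect the exact policy implemented in LLaMCAT.

```python
# Hypothetical sketch of MSHR-aware thread throttling (not the paper's exact policy).
from dataclasses import dataclass

@dataclass
class LLCPartitionStats:
    mshr_capacity: int       # total MSHR entries in this LLC partition
    mshr_occupied: int       # entries currently tracking outstanding misses
    reservation_fails: int   # requests rejected because no MSHR entry was free

def throttle_decision(stats: LLCPartitionStats,
                      active_threads: int,
                      min_threads: int = 4,
                      max_threads: int = 64,
                      high_watermark: float = 0.9,
                      fail_threshold: int = 128) -> int:
    """Return the thread count to use for the next sampling interval.

    Throttle down when MSHRs are nearly full or many requests were rejected;
    ramp back up when miss-handling resources have headroom.
    """
    occupancy = stats.mshr_occupied / stats.mshr_capacity
    if occupancy >= high_watermark or stats.reservation_fails > fail_threshold:
        # Miss-handling throughput is the bottleneck: issue fewer concurrent
        # KV-cache streams so outstanding misses fit in the available MSHRs.
        return max(min_threads, active_threads // 2)
    if occupancy < 0.5 and stats.reservation_fails == 0:
        # Headroom available: allow more threads to issue loads again.
        return min(max_threads, active_threads + 4)
    return active_threads

# Example interval: 58 of 64 MSHR entries occupied and 200 rejected requests,
# so the controller halves the active thread count from 32 to 16.
stats = LLCPartitionStats(mshr_capacity=64, mshr_occupied=58, reservation_fails=200)
print(throttle_decision(stats, active_threads=32))  # -> 16
```

The sketch only illustrates the control direction (throttle on MSHR pressure, recover when idle); the arbitration between cache partitions and the load-balance term are orthogonal mechanisms described in the paper itself.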