Large Language Models (LLMs) have achieved impressive results across a wide range of tasks, yet their high computational demands pose deployment challenges, especially on consumer-grade hardware. Mixture of Experts (MoE) models offer an efficient alternative by selectively activating only a subset of parameters for each input, which reduces computation requirements. Despite this efficiency, state-of-the-art MoE models still require memory well beyond typical consumer GPU capacities. Traditional offloading methods, which transfer model weights between CPU and GPU on demand, introduce latency that limits inference performance. This paper presents a novel CPU-GPU collaborative inference framework that maintains an expert cache on the GPU to reduce data transfers and accelerate inference on cache hits, while handling cache misses by offloading the corresponding expert computations to the CPU, where they benefit from multithreading optimizations. Our evaluations demonstrate performance improvements and highlight the potential of CPU-GPU collaboration to maximize hardware utilization for single-request inference on consumer-grade systems. The implementation of our framework is available at https://github.com/elsa-lab/MoE-CPU-GPU-Collaborative-Inference.
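To make the hit/miss dispatch concrete, the sketch below illustrates the policy described above: an expert whose weights are resident in a GPU-side cache is executed on the GPU, while a missed expert is executed directly on the CPU rather than having its weights transferred. This is a minimal PyTorch illustration under our own assumptions, not the framework's actual implementation; the names (`make_expert`, `run_expert`, `gpu_cache`), the thread count, and the model dimensions are all hypothetical.

```python
# Minimal sketch of cache-hit / cache-miss expert dispatch (illustrative only).
import copy

import torch
import torch.nn as nn

torch.set_num_threads(8)  # assumption: CPU thread count used for missed experts


def make_expert(d_model: int, d_ff: int) -> nn.Sequential:
    """A standard two-layer feed-forward expert."""
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))


def run_expert(expert_id: int,
               x_cpu: torch.Tensor,
               cpu_experts: dict,
               gpu_cache: dict) -> torch.Tensor:
    """Run one routed expert on the GPU if its weights are cached there, else on the CPU."""
    cached = gpu_cache.get(expert_id)
    if cached is not None:
        # Cache hit: only the small activations cross the PCIe bus, not expert weights.
        return cached(x_cpu.to("cuda")).to("cpu")
    # Cache miss: compute on the CPU-resident copy of the expert instead of
    # transferring its weights, relying on CPU multithreading for throughput.
    return cpu_experts[expert_id](x_cpu)


if __name__ == "__main__":
    d_model, d_ff, num_experts, cache_size = 64, 256, 8, 2
    cpu_experts = {i: make_expert(d_model, d_ff) for i in range(num_experts)}
    # Assumption: the GPU cache is pre-populated with frequently routed experts.
    gpu_cache = ({i: copy.deepcopy(cpu_experts[i]).to("cuda") for i in range(cache_size)}
                 if torch.cuda.is_available() else {})

    x = torch.randn(1, d_model)      # single-request (batch size 1) activation
    with torch.no_grad():
        for expert_id in (0, 5):     # with this setup, expert 0 hits the cache; expert 5 misses
            y = run_expert(expert_id, x, cpu_experts, gpu_cache)
            print(f"expert {expert_id}: output shape {tuple(y.shape)}")
```

In this sketch only activations move across the CPU-GPU boundary on a hit, and nothing moves on a miss, which is the property that avoids the weight-transfer latency of traditional offloading.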