Compass: Mapping Space Exploration for Multi-Chiplet Accelerators Targeting LLM Inference Serving Workloads

Large Language Models (LLMs) impose massive computational demands, driving the need for scalable multi-chiplet accelerators. However, existing mapping space exploration efforts for such accelerators focus primarily on traditional CNN/Transformer workloads and do not adequately support the dynamic behavior of real-world LLM inference serving, where request types are mixed and sequence lengths vary. To bridge this gap, we first propose a mapping encoding scheme based on a computation execution graph. The scheme decouples micro-batches from layers, which enables fine-grained execution control on heterogeneous chiplets and allows various parallelism strategies to be represented flexibly. Second, building on this scheme, we develop the Compass framework, which integrates an evaluation engine with a genetic algorithm-based mapping generation engine to search the mapping space efficiently. Compared with state-of-the-art approaches, our solution reduces the energy-delay product (EDP) by 63.12% on average.
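
To make the search loop concrete, the following is a minimal sketch of a genetic algorithm that assigns each (micro-batch, layer) pair to a chiplet and minimizes a toy EDP. It is not the Compass implementation: the chiplet parameters, the per-layer work values, the cost model (energy summed over all tasks, delay taken as the busiest chiplet's total time, with dependencies and communication ignored), and every function name are assumptions made purely for illustration.

```python
import random

# Hypothetical toy setup; none of these numbers come from the paper.
NUM_MICRO_BATCHES = 4
NUM_LAYERS = 6
# (time per unit of work, energy per unit of work) for each chiplet type.
CHIPLETS = [(1.0, 3.0), (2.0, 1.0), (1.5, 1.8)]
# Relative amount of work in each layer.
LAYER_WORK = [4.0, 2.0, 6.0, 3.0, 5.0, 1.0]


def random_mapping(rng):
    """A mapping is a grid: mapping[micro_batch][layer] = chiplet index."""
    return [[rng.randrange(len(CHIPLETS)) for _ in range(NUM_LAYERS)]
            for _ in range(NUM_MICRO_BATCHES)]


def edp(mapping):
    """Toy energy-delay product (assumed cost model, not the paper's)."""
    busy = [0.0] * len(CHIPLETS)  # accumulated time per chiplet
    energy = 0.0
    for row in mapping:
        for layer, chip in enumerate(row):
            time_per_unit, energy_per_unit = CHIPLETS[chip]
            busy[chip] += time_per_unit * LAYER_WORK[layer]
            energy += energy_per_unit * LAYER_WORK[layer]
    return energy * max(busy)


def crossover(a, b, rng):
    """Uniform crossover at micro-batch granularity."""
    return [list(ra if rng.random() < 0.5 else rb) for ra, rb in zip(a, b)]


def mutate(mapping, rng, rate=0.1):
    """Reassign each (micro-batch, layer) gene with probability `rate`."""
    return [[rng.randrange(len(CHIPLETS)) if rng.random() < rate else gene
             for gene in row] for row in mapping]


def search(generations=50, pop_size=30, elite=4, seed=0):
    """Elitist genetic search; returns the lowest-EDP mapping found."""
    rng = random.Random(seed)
    pop = [random_mapping(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=edp)
        parents = pop[:pop_size // 2]
        children = []
        while len(children) < pop_size - elite:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        # Elites are carried over unchanged, so the best EDP never worsens.
        pop = pop[:elite] + children
    return min(pop, key=edp)


if __name__ == "__main__":
    best = search()
    print("best toy EDP:", edp(best))
    for mb, row in enumerate(best):
        print(f"micro-batch {mb}: layers -> chiplets {row}")
```

Because each gene addresses one micro-batch and one layer independently, a single encoding of this kind can express different placements for different micro-batches, which is the property the decoupled scheme is described as providing. A realistic evaluation engine would replace `edp` with a model that accounts for dependencies, communication, and variable sequence lengths.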