Large Language Models (LLMs) impose massive computational demands, driving the need for scalable multi-chiplet accelerators. However, existing mapping space exploration efforts for such accelerators focus primarily on traditional CNN/Transformer workloads and do not adequately support the dynamic behaviors of real-world LLM inference serving, namely mixed request types and variable sequence lengths. To bridge this gap, we first propose a mapping encoding scheme based on a computation execution graph. The scheme decouples micro-batches from layers, which enables fine-grained execution control on heterogeneous chiplets and allows various parallelism strategies to be represented flexibly. Second, building on this scheme, we develop the Compass framework, which integrates an evaluation engine with a genetic algorithm-based mapping generation engine to search the mapping space efficiently. Compared with state-of-the-art works, our solution reduces the energy-delay product (EDP) by 63.12% on average.
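To make the idea of a genetic search over a mapping encoding concrete, the following is a minimal sketch and not the Compass implementation. It assumes a flat encoding with one gene per (micro-batch, layer) pair that selects a chiplet, and a toy cost model in which delay is the load of the busiest chiplet and data dependencies are ignored. The chiplet latency and energy figures, the problem sizes, and every name (`CHIPLETS`, `edp`, `genetic_search`) are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy problem; none of these numbers come from the paper.
NUM_MICRO_BATCHES = 2
NUM_LAYERS = 4
NUM_TASKS = NUM_MICRO_BATCHES * NUM_LAYERS
# (latency per task, energy per task) for each heterogeneous chiplet type.
CHIPLETS = [(1.0, 4.0), (2.0, 1.5), (1.5, 2.5)]


def edp(mapping):
    """Energy-delay product of a mapping under the toy cost model.

    mapping[i] is the chiplet index for task i, where task i stands for the
    (micro-batch, layer) pair (i // NUM_LAYERS, i % NUM_LAYERS).
    Delay is the total latency of the busiest chiplet; dependencies between
    layers are ignored (an assumption of this sketch).
    """
    busy = [0.0] * len(CHIPLETS)
    energy = 0.0
    for chiplet in mapping:
        latency, task_energy = CHIPLETS[chiplet]
        busy[chiplet] += latency
        energy += task_energy
    return energy * max(busy)


def genetic_search(generations=40, pop_size=20, mutation_rate=0.1, seed=0):
    """Return the lowest-EDP mapping found by a simple elitist genetic algorithm."""
    rng = random.Random(seed)
    population = [
        [rng.randrange(len(CHIPLETS)) for _ in range(NUM_TASKS)]
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        # Selection: keep the better half unchanged (elitism).
        population.sort(key=edp)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, NUM_TASKS)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: reassign a gene to a random chiplet with small probability.
            child = [
                rng.randrange(len(CHIPLETS)) if rng.random() < mutation_rate else gene
                for gene in child
            ]
            children.append(child)
        population = parents + children
    return min(population, key=edp)


if __name__ == "__main__":
    best = genetic_search()
    print("best mapping:", best, "EDP:", edp(best))
```

Because each (micro-batch, layer) pair has its own gene, the same encoding can express placing every layer of a micro-batch on one chiplet or spreading layers across chiplets in a pipeline-like fashion. A realistic evaluation engine would replace `edp` with a model that accounts for dependencies, communication, and variable sequence lengths.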