The proliferation of large language models has driven demand for long-context inference on resource-constrained edge platforms. However, deploying these models on Neural Processing Units (NPUs) presents significant challenges due to an architectural mismatch: the quadratic complexity of standard attention conflicts with NPU memory and compute patterns. This paper presents a comprehensive performance analysis of causal inference operators on a modern NPU, benchmarking quadratic attention against sub-quadratic alternatives, including structured state-space models and causal convolutions. Our analysis reveals a spectrum of critical bottlenecks: quadratic attention becomes severely memory-bound with catastrophic cache inefficiency, while the sub-quadratic variants range from compute-bound on programmable vector cores to limited by data movement. These findings provide essential insights for co-designing hardware-aware models and optimization strategies to enable efficient long-context inference on edge platforms.
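To make the complexity contrast concrete, the following is a minimal NumPy sketch, not the paper's benchmarking code: causal attention materializes an L×L score matrix, costing O(L²d) compute and O(L²) intermediate memory, whereas a diagonal state-space scan carries only an n-element state at O(L·n) compute. All shapes and parameter names here (A, B, C, n) are illustrative assumptions.

```python
import numpy as np

def causal_attention(q, k, v):
    """Standard causal self-attention: O(L^2 * d) FLOPs, O(L^2) score matrix."""
    L, d = q.shape
    scores = q @ k.T / np.sqrt(d)                   # (L, L) score matrix
    mask = np.tril(np.ones((L, L), dtype=bool))     # causal (lower-triangular) mask
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # (L, d) output

def diagonal_ssm_scan(x, A, B, C):
    """Diagonal state-space recurrence: O(L * n) FLOPs, O(n) live state."""
    h = np.zeros_like(A)                            # hidden state, size n
    y = np.empty(x.shape[0])
    for t, xt in enumerate(x):
        h = A * h + B * xt                          # h_t = A h_{t-1} + B x_t
        y[t] = C @ h                                # y_t = C . h_t
    return y

# Illustrative sizes; L^2 score entries vs. an n-element recurrent state.
L, d, n = 1024, 64, 16
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((L, d)) for _ in range(3))
out_attn = causal_attention(q, k, v)                # quadratic in L
A = np.full(n, 0.9)                                 # stable decay on the diagonal
B, C = rng.standard_normal(n), rng.standard_normal(n)
out_ssm = diagonal_ssm_scan(rng.standard_normal(L), A, B, C)  # linear in L
```

Under these assumptions, the sketch mirrors the bottlenecks the abstract describes: the attention kernel's L×L scratch matrix grows quadratically and overwhelms on-chip caches at long context, while the scan's working set is a fixed n-element state, shifting the cost toward per-step compute on programmable cores.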