Hogwild! Inference: Parallel LLM Generation via Concurrent Attention

Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long inference-time computations. In human problem solving, a common strategy to expedite work is collaboration: dividing the problem into sub-tasks, exploring different strategies concurrently, and so on. Recent research has shown that LLMs can also operate in parallel by implementing explicit cooperation frameworks, such as voting mechanisms or the explicit creation of independent sub-tasks that can be executed in parallel. However, no single framework of this kind suits every type of task, which limits their applicability. In this work, we propose a different design approach: we run LLM "workers" in parallel, let them synchronize via a concurrently updated attention cache, and prompt them to decide how best to collaborate. Our approach allows the LLM instances to come up with their own collaboration strategy for the problem at hand, all the while "seeing" each other's memory in the concurrent Key-Value (KV) cache. We implement this approach via Hogwild! Inference: a parallel LLM inference engine where multiple instances of the same LLM run in parallel with the same attention cache, with "instant" access to each other's memory. Hogwild! Inference takes advantage of Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. We find that modern reasoning-capable LLMs can perform inference with a shared KV cache out of the box, without additional fine-tuning.
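
The abstract says RoPE lets the engine avoid recomputation but does not spell out the mechanism. The sketch below illustrates only the general RoPE property that makes this plausible: rotations compose, so a cached block of rotated keys can be moved to a new position offset with one extra rotation, and attention scores depend only on relative positions. It is not the paper's engine. The helper name `rope_rotate`, the base of 10000, and the half-split channel layout are assumptions chosen for illustration.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Rotate each row of x (seq, dim) by the angles for its position.

    Hypothetical helper: pairs channel i with channel i + dim/2.
    """
    half = x.shape[1] // 2
    inv_freq = base ** (-np.arange(half) / half)             # (half,)
    angles = np.asarray(positions)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

rng = np.random.default_rng(0)
dim, block_len, shift = 8, 4, 7
keys = rng.standard_normal((block_len, dim))
query = rng.standard_normal((1, dim))

# Keys rotated once at positions 0..block_len-1 and stored in the cache.
cached = rope_rotate(keys, np.arange(block_len))

# Move the cached block by `shift` positions with one extra rotation ...
shifted = rope_rotate(cached, np.full(block_len, shift))
# ... versus rotating the raw keys from scratch at the new positions.
recomputed = rope_rotate(keys, np.arange(block_len) + shift)
shift_matches_recompute = np.allclose(shifted, recomputed)

# Scores depend only on relative position: shifting query and keys by the
# same offset leaves the query-key dot products unchanged.
q_pos = 20
scores_before = rope_rotate(query, [q_pos]) @ cached.T
scores_after = rope_rotate(query, [q_pos + shift]) @ shifted.T
scores_relative_only = np.allclose(scores_before, scores_after)
```

Under these assumptions, both checks hold up to floating-point tolerance, which is why cached keys written by one worker could in principle be viewed at different offsets by another without re-running the model on those tokens.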
