Many state-of-the-art LLMs are trained to think before giving their answer. Reasoning can greatly improve language model capabilities and safety, but it also makes models less interactive: given a new input, a model must stop thinking before it can respond. Real-world use cases such as voice-based or embedded assistants require an LLM agent to respond and adapt to additional information in real time, which is incompatible with strictly sequential interactions. In contrast, humans can listen, think, and act asynchronously: we begin thinking about a problem while reading it and continue thinking while formulating the answer. In this work, we augment reasoning-capable LLMs to operate in a similar way without additional training. Our method uses the properties of rotary position embeddings to enable LLMs built for sequential interactions to think, listen, and generate outputs simultaneously. We evaluate our approach on math, commonsense, and safety reasoning and find that it can generate accurate thinking-augmented answers in real time, reducing time to first non-thinking token from minutes to ≤5 s and overall real-time delays by 6-11×.
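The rotary-embedding property this kind of method builds on is that attention scores under RoPE depend only on the *relative* offset between query and key positions, so position indices can be reassigned or interleaved across streams without retraining. A minimal numpy sketch of that invariance (illustrative dimensions and base, not the paper's code):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to a vector x at position pos.

    Each consecutive pair of dimensions (2i, 2i+1) is rotated by the
    angle pos * base**(-2i/d), as in standard RoPE.
    """
    d = x.shape[0]                               # d must be even
    theta = base ** (-np.arange(0, d, 2) / d)    # per-pair frequencies
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

# Same relative offset (3), different absolute positions:
s1 = rope(q, 10) @ rope(k, 7)
s2 = rope(q, 103) @ rope(k, 100)
assert np.isclose(s1, s2)  # score is invariant to a shared position shift
```

Because each pair-wise rotation is orthogonal, ⟨R_m q, R_n k⟩ = ⟨q, R_{n−m} k⟩, which is what allows concurrently arriving tokens (listening, thinking, answering) to be placed at chosen position indices without disturbing the attention pattern a sequentially trained model expects.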