Large Language Models (LLMs) often suffer from computational inefficiency and error propagation in multi-step reasoning tasks. While recent advances in prompting and post-training have enabled LLMs to perform step-wise reasoning, they still tend to explore unproductive solution paths without effective backtracking or strategy adjustment. In this paper, we propose Meta-Reasoner, a new framework that empowers LLMs to "think about how to think" by dynamically adapting reasoning strategies at inference time. Our approach employs contextual multi-armed bandits (CMABs) to learn an adaptive policy: it evaluates the current state of the LLM's reasoning and selects the strategy most likely to lead to a successful outcome, such as whether to backtrack, switch to a new approach, or restart the problem-solving process. This meta-guidance avoids exploring unproductive paths during inference and thereby improves computational efficiency. We evaluate Meta-Reasoner on math problems (e.g., Game-of-24, TheoremQA) and scientific tasks (e.g., SciBench). Results show that our method outperforms previous SOTA methods by 9-12\% in accuracy while reducing inference time by 28-35\% under the same compute budget. Additional experiments on creative writing demonstrate that our approach generalizes to diverse reasoning-intensive tasks.
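To make the contextual-bandit idea concrete, the following is a minimal sketch of how a strategy selector of this kind could be implemented with a LinUCB-style update rule. The strategy names, the feature dimension, and the `embed_progress` helper are illustrative assumptions for this sketch, not the paper's exact design; in practice the context vector would summarize the LLM's reasoning state and the reward would come from downstream verification of the chosen strategy's outcome.

```python
import numpy as np

# Illustrative strategy set; the actual strategy inventory may differ.
STRATEGIES = ["continue", "backtrack", "switch_approach", "restart"]

class LinUCBStrategySelector:
    """Minimal LinUCB-style contextual bandit over reasoning strategies.

    Each arm keeps a ridge-regression estimate of expected reward given a
    context vector summarizing the current reasoning state.
    """

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha
        self.A = {s: np.eye(dim) for s in STRATEGIES}    # per-arm covariance
        self.b = {s: np.zeros(dim) for s in STRATEGIES}  # per-arm reward sums

    def select(self, context):
        scores = {}
        for s in STRATEGIES:
            A_inv = np.linalg.inv(self.A[s])
            theta = A_inv @ self.b[s]
            # Expected reward plus an optimism bonus (exploration term).
            scores[s] = theta @ context + self.alpha * np.sqrt(context @ A_inv @ context)
        return max(scores, key=scores.get)

    def update(self, strategy, context, reward):
        self.A[strategy] += np.outer(context, context)
        self.b[strategy] += reward * context

# Hypothetical helper: in a real system this would encode a textual progress
# summary (e.g., with a sentence encoder); here it is a deterministic stub.
def embed_progress(summary: str, dim: int = 8) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(summary)) % (2**32))
    return rng.normal(size=dim)

selector = LinUCBStrategySelector(dim=8)
ctx = embed_progress("two candidate equations tried, both off by one")
choice = selector.select(ctx)             # e.g., "backtrack"
selector.update(choice, ctx, reward=1.0)  # reward from verifying the outcome
```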