Recent advances in Large Reasoning Models (LRMs), such as OpenAI's o1/o3 and DeepSeek-R1, have demonstrated remarkable performance on specialized reasoning tasks through human-like deliberative thinking and long chain-of-thought reasoning. However, our systematic evaluation across model families (DeepSeek, Qwen, and LLaMA) and scales (7B to 32B) reveals that acquiring these deliberative reasoning capabilities significantly reduces LRMs' foundational capabilities, with notable declines in helpfulness and harmlessness and a substantial increase in inference cost. Importantly, we demonstrate that adaptive reasoning -- employing modes such as Zero-Thinking, Less-Thinking, and Summary-Thinking -- can effectively alleviate these drawbacks. Our empirical insights underscore the critical need for more versatile LRMs that can dynamically allocate inference-time compute according to specific task characteristics.
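As a rough illustration of how such adaptive modes might be realized with an R1-style open LRM, the sketch below controls the model's `<think>` block and generation budget at decoding time. This is an assumed mechanism, not the paper's exact protocol; the chat-template tokens, mode names' mechanics, and token budgets are all hypothetical.

```python
# Hypothetical sketch of adaptive reasoning modes for an R1-style LRM.
# Mode names follow the abstract; the prompt and budget mechanics below
# are assumptions, not the authors' implementation.

def build_prompt(question: str, mode: str, summary: str | None = None) -> tuple[str, int]:
    """Return (prompt_prefix, max_new_tokens) for a given thinking mode."""
    base = f"<|user|>{question}<|assistant|>"
    if mode == "zero-thinking":
        # Pre-fill an empty think block so the model answers directly,
        # skipping deliberative reasoning entirely.
        return base + "<think>\n\n</think>\n", 512
    if mode == "less-thinking":
        # Allow thinking, but cap the overall generation budget
        # so the chain of thought stays short.
        return base + "<think>\n", 1024
    if mode == "summary-thinking":
        # Inject a short, externally produced reasoning summary in place of
        # a full self-generated chain of thought.
        assert summary is not None, "summary-thinking requires a summary string"
        return base + f"<think>\n{summary}\n</think>\n", 512
    # Default: full deliberative reasoning with a generous budget.
    return base + "<think>\n", 8192


if __name__ == "__main__":
    prompt, budget = build_prompt("What is 17 * 24?", "zero-thinking")
    print(budget, prompt)
```

In this sketch, the dynamic allocation of inference-time compute reduces to choosing a mode (and thus a token budget) per query, e.g. Zero-Thinking for simple factual or safety-sensitive prompts and full reasoning only for genuinely hard problems.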