Large reasoning models (LRMs) achieve strong accuracy through test-time scaling, generating longer chains of thought or sampling multiple candidate solutions, but at a steep cost in tokens and latency. We argue that memory is a core ingredient of efficient reasoning: when evidence already exists, models should think less, reusing structured memory instead of recomputing derivations. We present ENGRAM-R, an inference-time memory layer that integrates typed retrieval with compact fact-card representations and explicit citation control. On the LoCoMo benchmark, ENGRAM-R reduces input tokens by 85% and reasoning tokens by 75% relative to a full-context baseline while maintaining high accuracy. On a multi-hop slice of the LongMemEval benchmark, it achieves similar efficiency with substantial accuracy gains. These results show that memory is not only critical for long-horizon correctness but also a practical lever for efficient reasoning under tight compute, memory, and latency budgets.
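The abstract names three mechanisms (typed retrieval, compact fact cards, explicit citation control) without specifying their form. The sketch below is a minimal illustration of what such an inference-time memory layer could look like, not ENGRAM-R's actual design: the field names, the `typed_retrieve` overlap scorer, and the prompt format are all assumptions introduced here for exposition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FactCard:
    """A compact, typed memory record (illustrative; all fields are assumptions)."""
    card_id: str    # stable id used for explicit citation, e.g. "F3"
    fact_type: str  # type tag consumed by typed retrieval, e.g. "event"
    subject: str
    predicate: str
    obj: str
    source: str     # pointer back to the originating context span

def typed_retrieve(store, query_terms, fact_type, k=3):
    """Return the top-k cards of the requested type, scored by term overlap."""
    terms = {t.lower() for t in query_terms}
    candidates = [c for c in store if c.fact_type == fact_type]
    def score(card):
        text = f"{card.subject} {card.predicate} {card.obj}".lower()
        return sum(t in text for t in terms)
    return sorted(candidates, key=score, reverse=True)[:k]

def build_prompt(question, cards):
    """Compact prompt: cited cards stand in for the full conversation context."""
    evidence = "\n".join(
        f"[{c.card_id}] {c.subject} {c.predicate} {c.obj}" for c in cards
    )
    return (f"Evidence cards:\n{evidence}\n\n"
            f"Question: {question}\n"
            f"Answer, citing card ids like [F1] for every claim.")

if __name__ == "__main__":
    store = [
        FactCard("F1", "event", "Alice", "moved to", "Lisbon", "session 2, turn 14"),
        FactCard("F2", "preference", "Alice", "prefers", "tea over coffee", "session 1, turn 3"),
        FactCard("F3", "event", "Alice", "started a job at", "a design studio", "session 4, turn 9"),
    ]
    cards = typed_retrieve(store, ["where", "Alice", "moved"], fact_type="event")
    print(build_prompt("Where does Alice live now?", cards))
```

The efficiency claim in the abstract follows from this shape: the model sees a handful of short cards rather than the full history (cutting input tokens), and the citation requirement discourages re-deriving facts in the chain of thought (cutting reasoning tokens).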