Probing the effectiveness of World Models for Spatial Reasoning through Test-time Scaling

Vision-Language Models (VLMs) remain limited in spatial reasoning tasks that require multi-view understanding and embodied perspective shifts. Recent approaches such as MindJourney attempt to close this gap through test-time scaling, in which a world model imagines action-conditioned trajectories and a heuristic verifier selects helpful views from those trajectories. In this work, we systematically examine how such test-time verifiers behave across benchmarks, uncovering both their promise and their pitfalls. Our uncertainty-based analyses show that MindJourney's verifier provides little meaningful calibration, and that random scoring often reduces answer entropy equally well, thus exposing systematic action biases and unreliable reward signals. To mitigate these issues, we introduce a Verification through Spatial Assertions (ViSA) framework that grounds the test-time reward in verifiable, frame-anchored micro-claims. This principled verifier consistently improves spatial reasoning on the SAT-Real benchmark and corrects trajectory-selection biases through more balanced exploratory behavior. However, on the challenging MMSI-Bench, none of the verifiers, including ours, achieves consistent scaling, suggesting that current world models form an information bottleneck where imagined views fail to enrich fine-grained reasoning. Together, these findings chart the bad, good, and ugly aspects of test-time verification for world-model-based reasoning. Our code is available at https://github.com/chandar-lab/visa-for-mindjourney.
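
To make the entropy comparison above concrete, the following is a minimal Python sketch of how one might measure whether a verifier's view selection reduces answer entropy more than random scoring does. It is not the authors' implementation: the function names (`answer_entropy`, `select_views`, `entropy_reduction`), the view labels, and the toy probability vectors are all hypothetical, and a real pipeline would obtain the answer distributions from a VLM queried with the selected views.

```python
import math
import random
from typing import Callable, List, Sequence


def answer_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in bits) of a distribution over answer options."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    normalized = [p / total for p in probs]
    return -sum(p * math.log2(p) for p in normalized if p > 0)


def select_views(
    views: Sequence[str], score_fn: Callable[[str], float], k: int
) -> List[str]:
    """Keep the k views with the highest score under score_fn."""
    return sorted(views, key=score_fn, reverse=True)[:k]


def entropy_reduction(before: Sequence[float], after: Sequence[float]) -> float:
    """Positive values mean the answer distribution became more peaked."""
    return answer_entropy(before) - answer_entropy(after)


if __name__ == "__main__":
    # Hypothetical imagined views, labeled by the action that produced them.
    views = ["forward", "turn_left", "turn_right", "backward"]

    # Random-scoring baseline: every view receives an arbitrary score.
    rng = random.Random(0)
    random_pick = select_views(views, lambda view: rng.random(), k=2)

    # Hypothetical heuristic verifier that happens to prefer turning actions.
    toy_scores = {"forward": 0.2, "turn_left": 0.9, "turn_right": 0.8, "backward": 0.1}
    verifier_pick = select_views(views, toy_scores.__getitem__, k=2)

    # Toy answer distributions over four options, before and after adding views.
    prior = [0.25, 0.25, 0.25, 0.25]
    posterior = [0.7, 0.1, 0.1, 0.1]

    print("random baseline picked:", random_pick)
    print("toy verifier picked:", verifier_pick)
    print("entropy reduction (bits):", entropy_reduction(prior, posterior))
```

Under this framing, a verifier carries a useful reward signal only if the views it selects yield a larger entropy reduction than the views chosen by the random baseline, averaged over many questions.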
