Humans learn by observing, interacting with environments, and internalizing physics and causality. We ask whether an agent can similarly acquire human-like physical reasoning through interaction and keep improving with experience. To study this, we introduce a Game-to-Unseen (G2U) benchmark of 1,000+ heterogeneous games with significant visual domain gaps. Existing approaches, including VLMs and world models, struggle to capture the underlying physics and causality: rather than modeling core mechanisms, they overfit to visual details. VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyzing physics and causality. We therefore propose IPR (Interactive Physical Reasoner), which uses world-model rollouts to score and reinforce a VLM's policy, and introduce PhysCode, a physics-centric action code that aligns semantic intent with dynamics, providing a shared action space for prediction and reasoning. Pretrained on 1,000+ games, IPR performs robustly across levels from primitive intuition to goal-driven reasoning, and even surpasses GPT-5 overall. We find that performance improves with more training games and interaction steps, and that the model transfers zero-shot to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning. Further demos and project details can be found at https://mybearyzhang.github.io/ipr-1.
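The core loop described above, using world-model rollouts to score candidate actions and derive a reinforcement signal for the policy, can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `ToyWorldModel`, its dynamics, and the candidate-action set are all hypothetical stand-ins for the learned world model and the VLM policy's proposals.

```python
class ToyWorldModel:
    """Toy deterministic dynamics: each step moves the state by the action value.
    Stands in for a learned world model that imagines future states."""

    def __init__(self, goal: float):
        self.goal = goal

    def rollout(self, state: float, action: float, horizon: int = 5) -> float:
        # Imagine `horizon` steps of repeating the action; score the final
        # state by (negative) distance to the goal, so closer is better.
        for _ in range(horizon):
            state += action
        return -abs(self.goal - state)

def score_and_pick(model: ToyWorldModel, state: float, candidates: list[float]):
    """Score each candidate action via an imagined rollout and return the best
    (score, action) pair; such scores would serve as the reinforcement signal
    for the policy that proposed the candidates."""
    return max((model.rollout(state, a), a) for a in candidates)

model = ToyWorldModel(goal=10.0)
best_score, best_action = score_and_pick(model, state=0.0, candidates=[-1.0, 1.0, 2.0])
# Action 2.0 reaches the goal exactly after 5 imagined steps.
```

In the full system the candidates would come from the VLM's policy and the rollouts from a learned dynamics model over PhysCode actions, but the score-then-reinforce structure is the same.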