Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM's reasoning. Unlike classical retrieval-augmented generation methods, retrieval in R4 operates directly in 4D space, enabling episodic and collaborative reasoning without training. Experiments on embodied question answering and navigation benchmarks demonstrate that R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.
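The retrieval step described above, decomposing a query into semantic, spatial, and temporal keys against a database of object-level entries anchored in metric space and time, can be sketched as follows. All names (`MemoryEntry`, `retrieve`) and the substring-based semantic matching are illustrative assumptions; R4 itself presumably uses embedding-based similarity rather than string matching.

```python
from dataclasses import dataclass
import math

@dataclass
class MemoryEntry:
    # Hypothetical object-level record: a semantic description anchored
    # in metric space (x, y, z) and in time t.
    description: str
    position: tuple   # (x, y, z) in meters
    timestamp: float  # seconds since episode start

def retrieve(db, semantic_key=None, spatial_key=None, radius=2.0,
             time_range=None):
    """Filter a 4D database by semantic, spatial, and temporal keys.

    Any key may be None, leaving that dimension unconstrained. The
    substring check stands in for an embedding similarity search.
    """
    results = []
    for entry in db:
        if semantic_key is not None and semantic_key not in entry.description:
            continue  # semantic key mismatch
        if spatial_key is not None and math.dist(entry.position, spatial_key) > radius:
            continue  # outside the spatial neighborhood
        if time_range is not None:
            t0, t1 = time_range
            if not (t0 <= entry.timestamp <= t1):
                continue  # outside the temporal window
        results.append(entry)
    return results

db = [
    MemoryEntry("red mug on kitchen counter", (1.0, 0.5, 0.9), 12.0),
    MemoryEntry("red mug moved to sink", (2.5, 0.4, 0.8), 95.0),
    MemoryEntry("blue chair near window", (4.0, 2.0, 0.0), 30.0),
]

# "Where was the red mug before t = 60 s?" -> semantic + temporal keys
hits = retrieve(db, semantic_key="red mug", time_range=(0.0, 60.0))
print([h.description for h in hits])  # -> ['red mug on kitchen counter']
```

The retrieved entries would then be serialized into the VLM's context for reasoning; because each entry carries explicit coordinates and timestamps, the same database can be appended to by multiple agents and shared between them.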