Theory of Mind (ToM) is the ability to infer others' mental states, such as beliefs, desires, and intentions. Current vision-language embodied agents lack ToM-based decision making, and existing benchmarks focus solely on human mental states while ignoring the agent's own perspective, which hinders coherent decision and action generation. To address this, we propose MindPower, a robot-centric framework that integrates Perception, Mental Reasoning, Decision Making, and Action. Given multimodal inputs, MindPower first perceives the environment and human states, then performs ToM reasoning to model both itself and others, and finally generates decisions and actions guided by the inferred mental states. Furthermore, we introduce Mind-Reward, a novel optimization objective that encourages vision-language models (VLMs) to produce consistent ToM reasoning and behavior. Our model outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation.
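To make the staged flow concrete (perception, then ToM reasoning over both the robot and the human, then decision, then action), the following is a minimal toy sketch rather than the MindPower implementation. Every name in it (`Perception`, `MentalStates`, `perceive`, `reason_about_minds`, `decide`, `act`, `run_pipeline`) and the dictionary-based observation format are hypothetical, and simple rules stand in for the VLM at each stage. Mind-Reward is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Perception:
    object_location: str    # where the robot currently sees the object
    human_last_saw_at: str  # where the human last observed the object
    human_activity: str     # what the human appears to be doing


@dataclass
class MentalStates:
    robot_belief: str   # the robot's own belief about the object's location
    human_belief: str   # the belief the robot attributes to the human
    human_desire: str   # the goal the robot attributes to the human


def perceive(observation: dict) -> Perception:
    # Hypothetical stand-in for perception; a real agent would query a VLM
    # on multimodal input instead of reading a pre-structured dictionary.
    return Perception(
        object_location=observation["object_location"],
        human_last_saw_at=observation["human_last_saw_at"],
        human_activity=observation["human_activity"],
    )


def reason_about_minds(p: Perception) -> MentalStates:
    # Toy ToM reasoning: model the robot's own belief and the human's belief
    # separately, so that a mismatch (a false belief) can be detected.
    return MentalStates(
        robot_belief=p.object_location,
        human_belief=p.human_last_saw_at,
        human_desire=p.human_activity,
    )


def decide(m: MentalStates) -> str:
    # The decision is conditioned on the inferred mental states.
    if m.human_belief != m.robot_belief:
        return "correct_false_belief"
    return "no_intervention"


def act(decision: str, m: MentalStates) -> list[str]:
    # Expand the decision into a short sequence of symbolic actions.
    if decision == "correct_false_belief":
        return [f"go_to({m.robot_belief})", "pick_up(object)", "hand_to(human)"]
    return []


def run_pipeline(observation: dict) -> tuple[str, list[str]]:
    perception = perceive(observation)
    minds = reason_about_minds(perception)
    decision = decide(minds)
    return decision, act(decision, minds)
```

In this toy setting, an observation where the human last saw an object in the drawer while the robot sees it on the shelf yields the decision `correct_false_belief`; when the two beliefs agree, the robot does not intervene.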