The emergence of Multimodal Large Language Models (MLLMs) has propelled the development of autonomous agents that operate on Graphical User Interfaces (GUIs) using pure visual input. A fundamental challenge is robustly grounding natural language instructions. This requires both precise spatial alignment, which accurately locates the coordinates of each element, and, more critically, correct semantic alignment, which matches the instructions to the functionally appropriate UI element. Although Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective at improving spatial alignment for these MLLMs, we find that inefficient exploration bottlenecks semantic alignment, which prevents models from learning difficult semantic associations. To address this exploration problem, we present Adaptive Exploration Policy Optimization (AEPO), a new policy optimization framework. AEPO employs a multi-answer generation strategy to enforce broader exploration, which is then guided by a theoretically grounded Adaptive Exploration Reward (AER) function derived from first principles of efficiency, η = U/C. Our AEPO-trained models, InfiGUI-G1-3B and InfiGUI-G1-7B, establish new state-of-the-art results across multiple challenging GUI grounding benchmarks, achieving significant relative improvements of up to 9.0% over the naive RLVR baseline on benchmarks designed to test generalization and semantic understanding. Resources are available at https://github.com/InfiXAI/InfiGUI-G1.
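To make the efficiency principle concrete, the minimal Python sketch below shows one way a reward of the form η = U/C could be scored for a multi-answer grounding attempt. The function name, the choice of U as a binary hit indicator, and C as the number of proposed answers are illustrative assumptions for this sketch, not the exact AER formulation used by AEPO.

```python
# Illustrative sketch only: one plausible reading of the efficiency principle
# eta = U / C behind the Adaptive Exploration Reward. Variable choices here
# (U = binary hit, C = number of proposed answers) are assumptions, not the paper's AER.

def adaptive_exploration_reward(predictions, target_bbox):
    """Score a multi-answer generation: utility earned per unit of exploration cost.

    predictions: list of (x, y) candidate click points, in the order proposed.
    target_bbox: (x_min, y_min, x_max, y_max) of the ground-truth element.
    """
    def hits(point):
        x, y = point
        x_min, y_min, x_max, y_max = target_bbox
        return x_min <= x <= x_max and y_min <= y <= y_max

    utility = 1.0 if any(hits(p) for p in predictions) else 0.0  # U: did any answer ground correctly?
    cost = float(len(predictions))                               # C: how many answers were spent?
    return utility / cost if cost > 0 else 0.0                   # eta = U / C


# Example: three candidates are proposed; the second falls inside the target element.
reward = adaptive_exploration_reward(
    predictions=[(12, 40), (220, 105), (300, 410)],
    target_bbox=(200, 90, 260, 120),
)
print(reward)  # 1/3 -> broader exploration is rewarded only insofar as it pays off
```

Under this reading, generating more candidate answers widens exploration but is penalized through the cost term, so the reward favors policies that find the correct element with as few guesses as possible.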