Graphical user interface (GUI) grounding, which maps natural-language instructions to actionable screen regions, is a key function of computer-use agents. Existing approaches based on Multimodal Large Language Models (MLLMs) typically formulate it as a text-based coordinate generation task, yet directly generating precise coordinates from visual inputs remains challenging and computationally intensive. An intuitive way to implement GUI grounding is to first select the visual patches relevant to the instruction and then determine the precise click location within those patches. Based on the observation that general MLLMs have some native grounding capability embedded in their attention, we propose GUI-AIMA, an attention-based, coordinate-free supervised fine-tuning framework for efficient GUI grounding. GUI-AIMA aligns the intrinsic multimodal attention of MLLMs with patch-wise grounding signals. These signals are calculated adaptively for diverse user instructions by multi-head aggregation on simplified query-visual attention matrices. In addition, its coordinate-free design makes it easy to integrate a plug-and-play zoom-in stage. GUI-AIMA-3B was trained with only 85k screenshots, demonstrating exceptional data efficiency and verifying that light training can trigger the native grounding capability of MLLMs. It achieves state-of-the-art performance among 3B models, attaining an average accuracy of 59.6% on ScreenSpot-Pro, 63.8% on OSWorld-G, and 91.5% on ScreenSpot-v2. Project page: https://github.com/sjz5202/GUI-AIMA
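The two-step idea described in the abstract (score visual patches by attention, then click inside the best patch) can be illustrated with a minimal sketch. This is not the GUI-AIMA implementation: the function names `patch_scores` and `click_point` are hypothetical, the attention tensor is assumed to be already extracted with shape (heads, query tokens, patches), and the uniform head weights are a placeholder assumption standing in for the adaptive, instruction-dependent weighting the abstract describes.

```python
import numpy as np


def patch_scores(attn, head_weights=None):
    """Aggregate query-to-visual attention into one score per patch.

    attn: array of shape (heads, query_tokens, patches).
    head_weights: optional array of shape (heads,). Uniform weights are an
    assumption of this sketch, not the adaptive weighting used by GUI-AIMA.
    """
    attn = np.asarray(attn, dtype=float)
    per_head = attn.mean(axis=1)  # (heads, patches): average over query tokens
    if head_weights is None:
        head_weights = np.full(attn.shape[0], 1.0 / attn.shape[0])
    scores = np.asarray(head_weights, dtype=float) @ per_head  # (patches,)
    return scores / scores.sum()  # normalize to a distribution over patches


def click_point(scores, grid_hw, image_wh):
    """Return the pixel center (x, y) of the highest-scoring patch.

    grid_hw: (rows, cols) of the patch grid, in row-major order.
    image_wh: (width, height) of the screenshot in pixels.
    """
    rows, cols = grid_hw
    width, height = image_wh
    row, col = divmod(int(np.argmax(scores)), cols)
    x = (col + 0.5) * width / cols
    y = (row + 0.5) * height / rows
    return x, y


if __name__ == "__main__":
    # Toy input: 2 heads, 3 instruction tokens, a 2x3 patch grid.
    attn = np.full((2, 3, 6), 0.1)
    attn[:, :, 4] = 0.5  # all heads attend mostly to patch 4
    scores = patch_scores(attn)
    print(click_point(scores, grid_hw=(2, 3), image_wh=(300, 200)))
```

Because the output is a distribution over patches rather than generated coordinate text, a zoom-in stage could in principle crop the screenshot around the selected patch and rerun the same scoring on the crop; that step is omitted here.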