There is a growing demand for mobile user interface (UI) automation, driven by its broad applications across industries. With the advent of vision-language models (VLMs), GUI automation has progressed from generating text-based instructions for humans to autonomously executing tasks, streamlining automation workflows. Recent approaches leverage VLMs for this problem because they 1) process on-screen content directly, 2) remain independent of device-specific APIs by relying on human-like actions (e.g., clicks, typing), and 3) apply real-world contextual knowledge for task understanding. However, these models often struggle to accurately identify widgets and determine actions because vision encoder features carry limited spatial information. Additionally, top-performing models are often large, requiring extensive training and incurring inference delays. In this work, we introduce AFRAgent, an InstructBLIP-based multimodal architecture that achieves superior performance in GUI automation while being less than one-fourth the size of its nearest competitor. To enhance image embeddings in the large language model (LLM) pipeline, we propose an adaptive feature renormalization technique (a token-level affine transformation) that effectively enriches low-resolution image embeddings by fusing in high-resolution details. We evaluate AFRAgent on the Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.
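To make the idea of a token-level affine transformation concrete, the following is a minimal, hypothetical PyTorch sketch of how low-resolution image tokens could be renormalized and modulated by scale and shift parameters predicted from high-resolution features. The module name (AdaptiveFeatureRenorm), the pooling used to align the two token grids, and the dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class AdaptiveFeatureRenorm(nn.Module):
    """Hypothetical sketch: renormalize low-res tokens, then apply a
    per-token affine transform (scale/shift) predicted from high-res context."""

    def __init__(self, dim: int, hires_dim: int):
        super().__init__()
        # normalization without learned affine; the affine comes from hi-res context
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # predicts per-token scale (gamma) and shift (beta)
        self.to_scale_shift = nn.Linear(hires_dim, 2 * dim)

    def forward(self, lowres_tokens: torch.Tensor, hires_tokens: torch.Tensor) -> torch.Tensor:
        # lowres_tokens: (B, N, dim); hires_tokens: (B, M, hires_dim)
        # assumption: M is a multiple of N, so hi-res tokens can be pooled onto
        # the low-res token grid by simple averaging
        B, N, _ = lowres_tokens.shape
        ctx = hires_tokens.view(B, N, -1, hires_tokens.size(-1)).mean(dim=2)
        gamma, beta = self.to_scale_shift(ctx).chunk(2, dim=-1)
        # token-level affine transformation of the renormalized embeddings
        return (1 + gamma) * self.norm(lowres_tokens) + beta


# usage sketch with made-up shapes
# afr = AdaptiveFeatureRenorm(dim=768, hires_dim=1024)
# out = afr(torch.randn(2, 64, 768), torch.randn(2, 256, 1024))  # -> (2, 64, 768)
```

The (1 + gamma) parameterization keeps the module close to identity at initialization, a common choice for FiLM/AdaIN-style modulation; it is a design assumption here rather than a detail taken from the abstract.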