There is a growing demand for mobile user interface (UI) automation, driven by its broad applications across industries. With the advent of vision-language models (VLMs), GUI automation has progressed from generating text-based instructions for humans to autonomously executing tasks, streamlining automation workflows. Recent approaches leverage VLMs for this problem because they 1) process on-screen content directly, 2) remain independent of device-specific APIs by relying on human-like actions (e.g., clicks, typing), and 3) apply real-world contextual knowledge for task understanding. However, these models often struggle to accurately identify widgets and determine actions because vision encoder features carry limited spatial information. Additionally, top-performing models are often large, requiring extensive training and incurring inference delays. In this work, we introduce AFRAgent, an InstructBLIP-based multimodal architecture that achieves superior performance in GUI automation while being less than one-fourth the size of its nearest competitor. To enhance image embeddings in the large language model (LLM) pipeline, we propose an adaptive feature renormalization technique (a token-level affine transformation) that effectively enriches low-resolution image embeddings by fusing in high-resolution details. We evaluate AFRAgent on the Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.
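To make the idea of a token-level affine transformation concrete, the following is a minimal, hypothetical PyTorch sketch of how low-resolution image tokens could be renormalized and modulated by scale and shift parameters predicted from high-resolution features. The module name (AdaptiveFeatureRenorm), the pooling used to align the two token grids, and the dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class AdaptiveFeatureRenorm(nn.Module):
    """Hypothetical sketch: renormalize low-res tokens, then apply a
    per-token affine transform (scale/shift) predicted from high-res context."""

    def __init__(self, dim: int, hires_dim: int):
        super().__init__()
        # normalization without learned affine; the affine comes from hi-res context
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # predicts per-token scale (gamma) and shift (beta)
        self.to_scale_shift = nn.Linear(hires_dim, 2 * dim)

    def forward(self, lowres_tokens: torch.Tensor, hires_tokens: torch.Tensor) -> torch.Tensor:
        # lowres_tokens: (B, N, dim); hires_tokens: (B, M, hires_dim)
        # assumption: M is a multiple of N, so hi-res tokens can be pooled onto
        # the low-res token grid by simple averaging
        B, N, _ = lowres_tokens.shape
        ctx = hires_tokens.view(B, N, -1, hires_tokens.size(-1)).mean(dim=2)
        gamma, beta = self.to_scale_shift(ctx).chunk(2, dim=-1)
        # token-level affine transformation of the renormalized embeddings
        return (1 + gamma) * self.norm(lowres_tokens) + beta


# usage sketch with made-up shapes
# afr = AdaptiveFeatureRenorm(dim=768, hires_dim=1024)
# out = afr(torch.randn(2, 64, 768), torch.randn(2, 256, 1024))  # -> (2, 64, 768)
```

The (1 + gamma) parameterization keeps the module close to identity at initialization, a common choice for FiLM/AdaIN-style modulation; it is a design assumption here rather than a detail taken from the abstract.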