Prune4Web: DOM Tree Pruning Programming for Web Agent

Web automation uses intelligent agents to carry out high-level tasks by mimicking human interaction with web interfaces. Although recent Large Language Model (LLM)-based web agents are capable, navigating complex real-world webpages efficiently remains difficult because Document Object Model (DOM) structures are prohibitively large, often 10,000 to 100,000 tokens. Existing strategies either truncate the DOM crudely, which risks discarding critical information, or rely on inefficient heuristics and separate ranking models; neither achieves a good balance between precision and scalability. To address these challenges, we introduce Prune4Web, a novel paradigm that shifts DOM processing from resource-intensive LLM reading to efficient programmatic pruning. Central to our approach is DOM Tree Pruning Programming, in which an LLM generates executable Python scoring scripts that dynamically filter DOM elements based on semantic cues from decomposed sub-tasks. The LLM therefore never has to ingest the raw, massive DOM; traversal and scoring are delegated to lightweight, interpretable programs. This reduces the number of candidate elements for grounding by 25x to 50x, which facilitates precise action localization and mitigates attention dilution. We further propose a specialized data annotation pipeline and a two-turn dialogue training strategy that jointly optimizes the Planner, Programmatic Filter, and Grounder within a unified framework. Extensive experiments demonstrate state-of-the-art performance. Notably, on our low-level grounding task, Prune4Web improves accuracy from 46.8% to 88.28%, underscoring its efficacy in real-world web automation.
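To make the idea of programmatic pruning concrete, the following is a minimal sketch of what a keyword-based scoring script might look like. It is not the authors' implementation: the `Element` record, the `score_element` and `prune` functions, the scoring weights, and the sample page are all hypothetical, chosen only to illustrate how a small program, rather than an LLM reading the full DOM, can narrow the candidate set.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical flattened DOM node: tag, visible text, and attributes."""
    node_id: int
    tag: str
    text: str = ""
    attrs: dict = field(default_factory=dict)


def score_element(el, keywords):
    """Score one element against sub-task keywords (weights are illustrative)."""
    text = el.text.lower().strip()
    attr_blob = " ".join(str(v) for v in el.attrs.values()).lower()
    score = 0.0
    for kw in keywords:
        kw = kw.lower()
        if kw == text:
            score += 3.0  # exact match on visible text
        elif kw in text:
            score += 2.0  # partial match on visible text
        if kw in attr_blob:
            score += 1.0  # match inside an attribute value
    return score


def prune(elements, keywords, top_k=3):
    """Keep the top_k elements with a positive score, highest score first."""
    scored = [(score_element(el, keywords), el) for el in elements]
    kept = [pair for pair in scored if pair[0] > 0]
    # sort() is stable, so ties keep their original document order.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [el for _, el in kept[:top_k]]


# Toy page and sub-task keywords (assumed, for illustration only).
elements = [
    Element(0, "a", "Home", {"href": "/"}),
    Element(1, "input", "", {"id": "search-box", "placeholder": "Search products"}),
    Element(2, "button", "Search", {"type": "submit"}),
    Element(3, "div", "Free shipping on orders over $50"),
    Element(4, "a", "Advanced search", {"href": "/search/advanced"}),
]

candidates = prune(elements, ["search"], top_k=3)
print([el.node_id for el in candidates])  # prints [2, 4, 1]
```

In this sketch only the short candidate list would be passed on for grounding, so the model that selects the final element sees three nodes instead of the whole page. In the framework described above, the keywords and weights would come from the LLM-generated script for each sub-task rather than being fixed by hand.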
