Automated web application testing is a critical component of modern software development, with frameworks like Selenium widely adopted for validating functionality through browser automation. An essential aspect of such testing is the ability to interact with and validate web forms, a task that requires syntactically correct, executable scripts with high coverage of input fields. Despite its importance, this task remains underexplored in the context of large language models (LLMs), and no public benchmark or dataset exists to systematically evaluate LLMs on form interaction generation. This paper introduces a novel method for training LLMs to generate high-quality Selenium test cases, specifically targeting form interaction testing. We curate both synthetic and human-annotated datasets for training and evaluation, covering diverse real-world forms and testing scenarios. We define clear metrics for syntax correctness, script executability, and input field coverage. Our empirical study demonstrates that our approach significantly outperforms strong baselines, including GPT-4o and other popular LLMs, across all evaluation metrics. Our work lays the groundwork for future research on LLM-based web testing and provides resources to support ongoing progress in this area.
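To make the target task concrete, the minimal sketch below illustrates the kind of Selenium form-interaction test case the trained model is expected to produce: a syntactically correct, executable script that exercises each input field of a form and checks the outcome. The URL, field names, and expected confirmation text are hypothetical placeholders, not drawn from our datasets.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Hypothetical target: a simple contact form (placeholder URL and field names).
driver = webdriver.Chrome()
try:
    driver.get("https://example.com/contact")

    # Fill every input field in the form (input field coverage).
    driver.find_element(By.NAME, "name").send_keys("Jane Doe")
    driver.find_element(By.NAME, "email").send_keys("jane@example.com")
    driver.find_element(By.NAME, "message").send_keys("Hello from an automated test.")

    # Submit the form and validate the response (placeholder confirmation text).
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
    assert "Thank you" in driver.page_source
finally:
    driver.quit()
```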