Leveraging Test-Driven Development with Large Language Models for Reliable and Verifiable Spreadsheet Code Generation: A Research Framework

Large Language Models (LLMs), such as ChatGPT, are increasingly used to generate both traditional software code and spreadsheet logic. Despite their impressive generative capabilities, these models frequently exhibit critical issues such as hallucinations, subtle logical inconsistencies, and syntactic errors. These risks are particularly acute in high-stakes domains like financial modelling and scientific computation, where accuracy and reliability are paramount. This position paper proposes a structured research framework that integrates the proven software engineering practice of Test-Driven Development (TDD) with LLM-driven generation to enhance the correctness and reliability of generated outputs and users' confidence in them. We hypothesise that a "test-first" methodology provides both technical constraints and cognitive scaffolding, guiding LLM outputs towards more accurate, verifiable, and comprehensible solutions. Our framework is applicable across diverse programming contexts, from spreadsheet formula generation to scripting languages such as Python and strongly typed languages like Rust. It includes an explicitly outlined experimental design with clearly defined participant groups, evaluation metrics, and illustrative TDD-based prompting examples. By emphasising test-driven thinking, we aim to improve computational thinking, prompt engineering skills, and user engagement, particularly benefiting spreadsheet users who often lack formal programming training yet face serious consequences from logical errors. We invite collaboration to refine and empirically evaluate this approach, ultimately aiming to establish responsible and reliable LLM integration in both educational and professional development practices.
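To make the "test-first" idea concrete, the following is a minimal Python sketch of how tests written before generation can act as a constraint on a candidate solution. It is an illustration only and is not taken from the paper: the compound-interest task, the test cases, and the names `TEST_CASES`, `run_tests`, `future_value`, and `future_value_simple` are all assumptions, and the two candidate functions merely stand in for code an LLM might return.

```python
from typing import Callable, List, Tuple

# Step 1 (written first): the tests any generated code must satisfy.
# Each case is (principal, annual_rate, years, expected_future_value).
# Hypothetical task: compound interest, as in the spreadsheet formula =A1*(1+B1)^C1
TEST_CASES: List[Tuple[float, float, int, float]] = [
    (1000.0, 0.05, 0, 1000.0),   # zero years: value is unchanged
    (1000.0, 0.00, 10, 1000.0),  # zero rate: value is unchanged
    (1000.0, 0.05, 1, 1050.0),   # one year of growth
    (1000.0, 0.10, 2, 1210.0),   # compounding over two years
]


def run_tests(candidate: Callable[[float, float, int], float]) -> List[str]:
    """Run every test case and return a list of human-readable failures."""
    failures = []
    for principal, rate, years, expected in TEST_CASES:
        actual = candidate(principal, rate, years)
        if abs(actual - expected) > 1e-6:
            failures.append(
                f"({principal}, {rate}, {years}): expected {expected}, got {actual}"
            )
    return failures


# Step 2: candidate implementations, standing in for LLM-generated code.
def future_value(principal: float, rate: float, years: int) -> float:
    """Compound interest: the behaviour the tests describe."""
    return principal * (1 + rate) ** years


def future_value_simple(principal: float, rate: float, years: int) -> float:
    """A plausible but wrong candidate that uses simple interest instead."""
    return principal * (1 + rate * years)


if __name__ == "__main__":
    # Step 3: accept a candidate only if it passes every pre-written test.
    for candidate in (future_value, future_value_simple):
        failures = run_tests(candidate)
        status = "accepted" if not failures else f"rejected: {failures}"
        print(f"{candidate.__name__}: {status}")
```

In this sketch the simple-interest candidate agrees with the correct one on three of the four cases and is caught only by the two-year compounding case, which is the kind of subtle logical error the tests are meant to expose before the output is trusted.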
