Assessing risk of bias (RoB) in randomized controlled trials (RCTs) is essential for trustworthy evidence synthesis, but the process is resource-intensive and prone to variability across reviewers. Large language models (LLMs) offer a route to automation, but existing methods rely on manually engineered prompts that are difficult to reproduce, generalize, or evaluate. This study introduces a programmable RoB assessment pipeline that replaces ad hoc prompt design with structured, code-based optimization using DSPy and its GEPA module. GEPA refines LLM reasoning through Pareto-guided search and produces inspectable execution traces, enabling transparent replication of every step in the optimization process. We evaluated the method on 100 RCTs from published meta-analyses across seven RoB domains. GEPA-generated prompts were applied to both open-weight models (Mistral Small 3.1 and GPT-oss-20b) and commercial models (GPT-5 Nano and GPT-5 Mini). GEPA-generated prompts performed best in domains with clearer methodological reporting, such as Random Sequence Generation, with similar results for Allocation Concealment and Blinding of Participants; the commercial models performed slightly better overall. We also compared GEPA with three manually designed prompts using Claude 3.5 Sonnet. GEPA achieved the highest overall accuracy, improved performance by 30% to 40% in Random Sequence Generation and Selective Reporting, and performed comparably to the manual prompts in the other domains. These findings suggest that GEPA can produce consistent and reproducible prompts for RoB assessment, supporting the structured and principled use of LLMs in evidence synthesis.
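To make the idea of Pareto-guided search concrete, the following is a minimal sketch of how candidate prompts might be retained when no other candidate is at least as good in every RoB domain. It is a simplified illustration under stated assumptions, not GEPA's actual implementation: the function names `dominates` and `pareto_front`, the candidate names, and all scores are hypothetical and are not results from the study.

```python
from typing import Dict, List

# Hypothetical per-domain accuracies for each candidate prompt.
# Values are illustrative only, not results from the study.
Scores = Dict[str, List[float]]


def dominates(a: List[float], b: List[float]) -> bool:
    """True if a is at least as good as b in every domain and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: Scores) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    front = []
    for name, scores in candidates.items():
        dominated = any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front


# Three hypothetical prompts scored on three domains.
candidates = {
    "prompt_a": [0.80, 0.60, 0.70],
    "prompt_b": [0.70, 0.75, 0.70],
    "prompt_c": [0.65, 0.55, 0.60],  # worse than prompt_a in every domain
}
front = pareto_front(candidates)
print(front)
```

Keeping every non-dominated candidate, rather than only the one with the best average, preserves prompts that are strong in a single domain so that a later refinement step can build on them.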