With the growing adoption of large language models (LLMs), agentic workflows, which compose multiple LLM calls with tools, retrieval, and reasoning steps, are increasingly replacing traditional applications. However, such workflows are inherently error-prone: incorrect or partially correct output at one step can propagate or even amplify through subsequent stages, compounding the impact on the final output. Recent work proposes integrating verifiers that validate LLM outputs or actions, such as self-reflection, debate, or LLM-as-a-judge mechanisms. Yet verifying every step introduces significant latency and cost overheads. In this work, we seek to answer three key questions: which nodes in a workflow are most error-prone and thus warrant costly verification, how to select the most appropriate verifier for each node, and how to apply verification with minimal impact on latency. Our solution, Sherlock, addresses these questions using counterfactual analysis on agentic workflows: it identifies error-prone nodes and selectively attaches cost-optimal verifiers only where necessary. At runtime, Sherlock speculatively executes downstream tasks to reduce latency overhead, while verification runs in the background. If verification fails, execution is rolled back to the last verified output. Compared to the non-verifying baseline, Sherlock delivers an 18.3% accuracy gain on average across benchmarks. Sherlock reduces workflow execution time by up to 48.7% over non-speculative execution and lowers verification cost by 26.0% compared to the Monte Carlo search-based method, demonstrating that principled, fault-aware verification effectively balances efficiency and reliability in agentic workflows.
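To make the runtime behavior concrete, the following is a minimal sketch of the speculate-then-roll-back pattern the abstract describes, assuming a linear workflow of nodes; the `Step`, `run`, `verifier`, and `execute_workflow` names are hypothetical illustrations, not Sherlock's actual API.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Step:
    run: Callable[[Any], Any]                              # the LLM/tool call for this node
    verifier: Optional[Callable[[Any, Any], bool]] = None  # attached only to error-prone nodes


def execute_workflow(steps: list[Step], task: Any) -> Any:
    """Run downstream steps speculatively while verification executes in the
    background; if a verification fails, discard the speculative work and
    retry from the last verified output."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        i, state = 0, task  # index of the next node and the last verified output
        while i < len(steps):
            step = steps[i]
            output = step.run(state)
            if step.verifier is None:
                # Unverified node: commit its output directly.
                i, state = i + 1, output
                continue
            # Launch verification in the background and speculate downstream
            # until the next verified node or until the check completes.
            check = pool.submit(step.verifier, state, output)
            spec_i, spec_state = i + 1, output
            while (spec_i < len(steps)
                   and steps[spec_i].verifier is None
                   and not check.done()):
                spec_state = steps[spec_i].run(spec_state)
                spec_i += 1
            if check.result():
                # Verified: commit the speculative progress.
                i, state = spec_i, spec_state
            # Otherwise: roll back, i.e. keep `state` (the last verified
            # output) and re-execute node i on the next loop iteration.
            # A real system would cap retries or switch verifiers here.
        return state
```

In this sketch, speculation stops at the next node that itself carries a verifier, so no verification is skipped; a rejected check simply leaves the committed state untouched, which mirrors the "roll back to the last verified output" behavior described above.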