Multi-Objective Agentic Rewrites for Unstructured Data Processing

One year ago, we open-sourced DocETL, a declarative system for LLM-powered data processing that, as of November 2025, has 3.2K GitHub stars and users across domains (e.g., journalism, law, medicine, policy, finance, and urban planning). In DocETL, users build pipelines by composing operators described in natural language, also known as semantic operators, with an LLM executing each operator's logic. However, due to complexity in the operator or the data it operates on, LLMs often give inaccurate results. To address this challenge, DocETL introduced rewrite directives, or abstract rules that guide LLM agents in rewriting pipelines by decomposing operators or data. For example, decomposing a single filter("is this email sent from an executive and discussing fraud?") into the conjunction of two separate semantic filters may improve accuracy. However, DocETL only optimizes for accuracy, not cost. How do we optimize for both? We present MOAR (Multi-Objective Agentic Rewrites), a new optimizer for DocETL. To target cost optimization, we introduce two new categories of directives and extend all three existing categories with new ones, bringing the total to over 30 directives -- more than doubling what DocETL originally had. Moreover, since operators can interact with each other unpredictably due to LLM behavior, optimizing operators or sub-pipelines individually can yield suboptimal overall plans. Recognizing this, we design a new global search algorithm that explores rewrites in the context of entire pipelines. Since the space of rewrites is infinite -- pipelines can be rewritten in many ways, and each rewritten pipeline can itself be rewritten -- our algorithm adapts a multi-armed bandit framework to prioritize which pipelines to rewrite. Across six workloads, MOAR achieves 27% higher accuracy than ABACUS, the next-best optimizer, while matching its best accuracy at 55% of its cost.
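To make the two search ideas in the abstract concrete, the following is a minimal sketch, not MOAR's actual algorithm. It assumes each candidate pipeline has already been scored for accuracy and cost on a sample, keeps the candidates that are not dominated on both objectives, and uses the standard UCB1 rule as a stand-in for the bandit-style choice of which pipeline to rewrite next. The names `Candidate`, `pareto_frontier`, and `ucb_select`, the reward bookkeeping, and the exploration constant are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate pipeline plan (hypothetical structure, for illustration)."""
    name: str
    accuracy: float          # measured on a sample, higher is better
    cost: float              # e.g. dollars spent on the sample, lower is better
    visits: int = 0          # how many times this plan was chosen for rewriting
    reward_sum: float = 0.0  # accumulated reward from rewrites of this plan


def pareto_frontier(cands):
    """Return the candidates not dominated in (accuracy up, cost down)."""
    frontier = []
    for c in cands:
        dominated = any(
            o.accuracy >= c.accuracy
            and o.cost <= c.cost
            and (o.accuracy > c.accuracy or o.cost < c.cost)
            for o in cands
        )
        if not dominated:
            frontier.append(c)
    return frontier


def ucb_select(cands, exploration=1.4):
    """Pick the next plan to rewrite with UCB1 (an assumed stand-in rule)."""
    # Try every plan once before trusting any average.
    for c in cands:
        if c.visits == 0:
            return c
    total = sum(c.visits for c in cands)

    def score(c):
        mean = c.reward_sum / c.visits
        bonus = exploration * math.sqrt(math.log(total) / c.visits)
        return mean + bonus

    return max(cands, key=score)


# Toy data: plan "a" is both less accurate and more expensive than "b".
plans = [
    Candidate("a", accuracy=0.6, cost=1.0),
    Candidate("b", accuracy=0.7, cost=0.5),
    Candidate("c", accuracy=0.8, cost=2.0),
]
frontier_names = [p.name for p in pareto_frontier(plans)]
first_pick = ucb_select(plans).name
```

In this toy setup the frontier keeps the plans that trade accuracy against cost, while the selection rule favors plans that have been explored less. How a real optimizer would define the reward for a rewrite is left open here.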
