Data provenance has numerous applications in data preparation pipelines. It can be used to debug faulty pipelines, interpret results, verify fairness, and identify data quality issues that may affect the sources feeding the pipeline. In this paper, we present an indexing mechanism to efficiently capture and query pipeline provenance. Our solution uses tensors to capture the fine-grained provenance of data processing operations with minimal memory overhead. In addition to record-level lineage relationships, we provide finer granularity at the attribute level. We achieve this by augmenting the tensors, which capture retrospective provenance, with prospective provenance information that connects the input and output schemas of each data processing operation. We show how these two types of provenance (retrospective and prospective) can be combined to answer a broad range of provenance queries efficiently, and we evaluate the effectiveness of the approach on both real and synthetic data.
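The combination of retrospective and prospective provenance described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `OpProvenance` class, the `backward` function, and the toy pipeline are hypothetical names introduced here, and the sparse binary tensor is approximated by a set of its non-zero coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpProvenance:
    """Provenance of one pipeline operation (hypothetical structure)."""
    # Retrospective provenance: non-zero coordinates (output_row, input_row)
    # of a sparse binary tensor linking output records to input records.
    record_links: frozenset
    # Prospective provenance: output attribute -> input attributes it derives from.
    schema_links: dict


def backward(ops, out_row, out_attr):
    """Trace one output cell back to the source cells that contributed to it."""
    cells = {(out_row, out_attr)}
    # Walk the pipeline from the last operation to the first.
    for op in reversed(ops):
        cells = {
            (i, a)
            for (row, attr) in cells
            for (o, i) in op.record_links if o == row
            for a in op.schema_links.get(attr, ())
        }
    return cells


# Toy pipeline (assumed for illustration): a filter keeps source rows 0 and 2,
# then a derived attribute "bmi" is computed from "weight" and "height".
filter_op = OpProvenance(
    record_links=frozenset({(0, 0), (1, 2)}),
    schema_links={"name": ("name",), "weight": ("weight",), "height": ("height",)},
)
derive_op = OpProvenance(
    record_links=frozenset({(0, 0), (1, 1)}),
    schema_links={"name": ("name",), "bmi": ("weight", "height")},
)

source_cells = backward([filter_op, derive_op], out_row=1, out_attr="bmi")
print(sorted(source_cells))  # [(2, 'height'), (2, 'weight')]
```

In this sketch, the record links alone would only say that output row 1 comes from source row 2; the schema links narrow the answer to the `weight` and `height` cells of that row, which is the attribute-level granularity the abstract refers to.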