Provenance plays a crucial role in scientific workflow execution, for instance by providing data for failure analysis, real-time monitoring, or statistics on resource utilization for right-sizing allocations. The workflows themselves, however, are becoming increasingly complex in terms of the components involved. Furthermore, they are executed on distributed cluster infrastructures, which makes the real-time collection, integration, and analysis of provenance data challenging. Existing provenance systems struggle to balance scalability, real-time processing, online provenance analytics, and integration across different components and compute resources. Moreover, most provenance solutions are not workflow-aware: because they target arbitrary workloads, they miss the optimization and analysis opportunities that a workflow specification offers, since the specification dictates, to some degree, task execution order and provides logical-level abstractions of physical tasks. In this paper, we present HyProv, a hybrid provenance management system that combines centralized and federated paradigms to offer scalable, online, and workflow-aware queries over workflow provenance traces. HyProv uses a centralized component to efficiently manage the small and stable provenance that is specific to the workflow specification, and complements it with federated querying over different scalable monitoring and provenance databases for the large-scale execution logs. This enables low-latency access to current execution data. Furthermore, the design supports complex provenance queries, which we exemplify for the workflow system Airflow in combination with the resource manager Kubernetes. Our experiments indicate that HyProv scales to large workflows, answers provenance queries with sub-second latencies, and adds only modest CPU and memory overhead to the cluster.
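To make the hybrid idea concrete, the following is a minimal sketch of how a centrally held workflow specification could drive federated lookups over execution data. It is not HyProv's implementation: the names `WorkflowSpec`, `HybridProvenanceStore`, `register_source`, and `query` are hypothetical, and the federated backends are modelled as plain in-memory callables rather than real monitoring or provenance databases.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class WorkflowSpec:
    """Small, stable specification-level provenance (hypothetical model)."""
    name: str
    # Maps each logical task to the logical tasks it directly depends on.
    dependencies: Dict[str, List[str]] = field(default_factory=dict)


# Assumption: a federated source is any callable that returns the rows it
# holds for a given logical task name.
Source = Callable[[str], List[dict]]


class HybridProvenanceStore:
    """Keeps specifications centrally and fans queries out to sources."""

    def __init__(self) -> None:
        self.specs: Dict[str, WorkflowSpec] = {}
        self.sources: Dict[str, Source] = {}

    def register_spec(self, spec: WorkflowSpec) -> None:
        self.specs[spec.name] = spec

    def register_source(self, name: str, source: Source) -> None:
        self.sources[name] = source

    def upstream(self, workflow: str, task: str) -> List[str]:
        """Transitive upstream tasks, answered from the central spec only."""
        deps = self.specs[workflow].dependencies
        seen: List[str] = []
        stack = list(deps.get(task, []))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.append(current)
                stack.extend(deps.get(current, []))
        return sorted(seen)

    def query(self, workflow: str, task: str) -> Dict[str, Dict[str, List[dict]]]:
        """Use the spec to select tasks, then ask every source about them."""
        tasks = [task] + self.upstream(workflow, task)
        return {
            t: {name: source(t) for name, source in self.sources.items()}
            for t in tasks
        }


# Usage with made-up data standing in for execution logs and metrics.
store = HybridProvenanceStore()
store.register_spec(WorkflowSpec(
    "etl",
    {"load": ["transform"], "transform": ["extract"], "extract": []},
))
exec_log = {"extract": [{"state": "success"}], "transform": [{"state": "failed"}]}
metrics = {"extract": [{"cpu": 0.4}], "transform": [{"cpu": 0.9}]}
store.register_source("execution", lambda t: exec_log.get(t, []))
store.register_source("metrics", lambda t: metrics.get(t, []))
result = store.query("etl", "load")
```

In this sketch the dependency traversal touches only the small central structure, while the larger per-task records stay in their sources and are fetched on demand, which mirrors the division of labor described above.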