Extracting anomaly causality facilitates diagnostics once monitoring systems detect system faults. Identifying anomaly causes in large systems requires investigating a broad set of monitoring variables across multiple subsystems. However, learning graphical causal models (GCMs) carries a significant computational burden that restricts the applicability of most existing methods in real-time and large-scale deployments. In addition, modern monitoring applications for large systems often generate large amounts of binary alarm flags, and the distinct characteristics of binary anomaly data -- the meaning of state transitions and data sparsity -- challenge existing causality learning mechanisms. This study proposes an anomaly causal discovery approach (\textsc{AnomalyCD}) that addresses the accuracy and computational challenges of generating GCMs from temporal binary flag datasets. \textsc{AnomalyCD} combines several strategies, including anomaly data-aware causality testing, sparse data and prior link compression, and edge pruning adjustment. We validate the performance of the approach on two datasets: monitoring sensor data of the readout-box system of the Compact Muon Solenoid experiment at CERN, and a public dataset for information technology monitoring. The results on temporal GCMs demonstrate a considerable reduction of computation overhead and a moderate enhancement of accuracy on the binary anomaly datasets. Source code: https://github.com/muleina/AnomalyCD