End-to-End Neural Diarization (EEND) systems produce frame-level probabilistic speaker activity estimates, yet because evaluation focuses primarily on Diarization Error Rate (DER), the reliability and calibration of these confidence scores have been largely neglected. For fusing multiple diarization systems, DOVER-Lap remains the only established approach; it operates at the segment level on hard decisions. We propose working with continuous probability outputs instead, which enables more sophisticated fusion and calibration techniques that can leverage model uncertainty and complementary strengths across different architectures. This paper presents the first comprehensive framework for calibrating and fusing EEND models at the probability level. We investigate two output formulations (multilabel and powerset representations) and their impact on calibration and fusion effectiveness. Through extensive experiments on the CallHome two-speaker benchmark, we demonstrate that proper calibration yields substantial improvements even for individual models (up to 19% relative DER reduction), in some cases mitigating the absence of domain adaptation. We find that joint calibration in powerset space consistently outperforms independent per-speaker calibration, that fusion substantially improves over individual models, and that the Fuse-then-Calibrate ordering generally outperforms both calibrating before fusion and uncalibrated fusion, while requiring calibration of only a single combined model. Our best configuration outperforms DOVER-Lap in terms of DER while providing the reliable confidence estimates essential for downstream applications. This work establishes best practices for probability-level fusion of EEND systems and demonstrates the advantages of leveraging soft outputs over hard decisions.
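To make the Fuse-then-Calibrate idea concrete, the sketch below shows one possible recipe for two-speaker EEND outputs in the powerset representation: per-frame posteriors from several systems are averaged (linear pooling), and a single temperature-scaling calibrator is then fitted on held-out frames of the fused model. This is a minimal sketch under stated assumptions; the four-class ordering, the linear averaging rule, and the choice of temperature scaling as the calibrator are illustrative and not necessarily the exact configuration used in the paper.

```python
# Minimal sketch (not the paper's exact recipe) of Fuse-then-Calibrate for
# two-speaker EEND outputs in the powerset representation. Assumes each system
# provides per-frame log-probabilities over 4 powerset classes
# {silence, spk1, spk2, spk1+spk2}; shapes and class order are illustrative.
import numpy as np


def fuse_powerset_logprobs(logprobs_per_system):
    """Average frame-level posteriors of several systems (linear pooling).

    logprobs_per_system: list of arrays, each of shape (T, 4).
    Returns fused log-probabilities of shape (T, 4).
    """
    probs = np.stack([np.exp(lp) for lp in logprobs_per_system])  # (S, T, 4)
    fused = probs.mean(axis=0)
    return np.log(fused + 1e-12)


def fit_temperature(logits, labels, grid=np.linspace(0.25, 4.0, 151)):
    """Pick the temperature minimising NLL on held-out frames (grid search).

    logits: (T, 4) unnormalised scores (or log-probs) of the fused model.
    labels: (T,) integer powerset class per frame.
    """
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        scaled = logits / T
        # Log partition function via a numerically stable log-sum-exp.
        max_s = scaled.max(axis=1, keepdims=True)
        log_z = np.log(np.exp(scaled - max_s).sum(axis=1)) + max_s[:, 0]
        nll = -np.mean(scaled[np.arange(len(labels)), labels] - log_z)
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T


def calibrated_posteriors(logits, T):
    """Apply the learned temperature and return normalised probabilities."""
    scaled = logits / T
    scaled = scaled - scaled.max(axis=1, keepdims=True)
    p = np.exp(scaled)
    return p / p.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_frames = 500
    # Toy stand-ins for two EEND systems' per-frame powerset log-probabilities.
    sys_a = np.log(rng.dirichlet(np.ones(4), size=n_frames))
    sys_b = np.log(rng.dirichlet(np.ones(4), size=n_frames))
    fused = fuse_powerset_logprobs([sys_a, sys_b])      # Fuse ...
    dev_labels = rng.integers(0, 4, size=n_frames)      # held-out reference
    T_opt = fit_temperature(fused, dev_labels)          # ... then Calibrate
    cal = calibrated_posteriors(fused, T_opt)
    print(f"temperature={T_opt:.2f}, calibrated shape={cal.shape}")
```

Because calibration is applied after fusion, only one calibrator has to be estimated regardless of how many systems are combined, which is the practical appeal of the Fuse-then-Calibrate ordering highlighted in the abstract.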