Metric Analysis for Spatial Semantic Segmentation of Sound Scenes

Spatial semantic segmentation of sound scenes (S5) consists of jointly performing audio source separation and sound event classification from a multichannel audio mixture. To evaluate S5 systems, one can consider two individual metrics, i.e., one for source separation and another for sound event classification, but this approach makes it challenging to compare S5 systems. Thus, a joint class-aware signal-to-distortion ratio (CA-SDR) metric was proposed to evaluate S5 systems. In this work, we first compare the CA-SDR with the classical SDR on scenarios with only classification errors. We then analyze the cases where the metric might not allow proper comparison of the systems. To address this problem, we propose a modified version of the CA-SDR which first focuses on class-agnostic SDR and then accounts for the wrongly labeled sources. We also analyze the performance of the two metrics under cross-contamination between separated audio sources. Finally, we propose a first set of penalties in an attempt to make the metric more reflective of the labeling and separation errors.

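To make the contrast between a class-agnostic SDR and a class-aware one concrete, the following is a minimal sketch and not the definition used in this work. It assumes the standard SDR formula, 10 log10(||s||^2 / ||s - ŝ||^2), and a simplified class-aware average in which a label present only in the references or only in the estimates contributes a fixed penalty. The function names `sdr_db` and `class_aware_sdr`, the dictionary-based interface, and the 0 dB default penalty are all illustrative assumptions.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-12):
    """Classical SDR in dB: 10*log10(||s||^2 / ||s - s_hat||^2)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_power = np.sum(reference ** 2)
    distortion_power = np.sum((reference - estimate) ** 2)
    # eps avoids division by zero for a perfect estimate or a silent reference.
    return 10.0 * np.log10((signal_power + eps) / (distortion_power + eps))


def class_aware_sdr(references, estimates, penalty_db=0.0):
    """Average SDR over the union of reference and estimated labels.

    references, estimates: dict mapping a class label to a waveform.
    A label found on only one side (missed or spurious class) scores
    `penalty_db`. Both the rule and the default value are assumptions
    made for illustration.
    """
    labels = sorted(set(references) | set(estimates))
    if not labels:
        raise ValueError("at least one reference or estimated label is required")
    scores = []
    for label in labels:
        if label in references and label in estimates:
            scores.append(sdr_db(references[label], estimates[label]))
        else:
            scores.append(penalty_db)
    return float(np.mean(scores))


# Toy scene: two sources, each estimated with a 10% amplitude error.
t = np.arange(1600) / 16000.0
dog = np.sin(2 * np.pi * 440.0 * t)
speech = np.sin(2 * np.pi * 220.0 * t)
refs = {"dog": dog, "speech": speech}

# Same separation quality, but the second system mislabels "speech" as "music".
correct = {"dog": 0.9 * dog, "speech": 0.9 * speech}
mislabeled = {"dog": 0.9 * dog, "music": 0.9 * speech}

score_correct = class_aware_sdr(refs, correct)
score_mislabeled = class_aware_sdr(refs, mislabeled)
print(f"correct labels:    {score_correct:.2f} dB")
print(f"one label swapped: {score_mislabeled:.2f} dB")
```

In this toy scene both systems separate the sources equally well, yet the single label swap adds one missed and one spurious class to the average, so the class-aware score falls well below the class-agnostic SDR of the separated signals. This is the kind of interaction between labeling and separation errors that the abstract sets out to analyze.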