Evaluating the quality of a clustering result in the absence of ground-truth cluster labels is fundamental to research in data mining. However, most cluster validation indices (CVIs) do not capture the noise assignments made by density-based clustering methods such as DBSCAN or HDBSCAN, even though correctly identifying noise is crucial for successful clustering. In this paper, we propose DISCO, a Density-based Internal Score for Clusterings with nOise, the first CVI to explicitly assess the quality of noise assignments rather than merely counting them. DISCO builds on the established idea of the Silhouette Coefficient, but adopts density-connectivity to evaluate clusters of arbitrary shape and introduces explicit noise evaluation: it rewards correctly assigned noise labels and penalizes noise labels where a cluster label would have been more appropriate. Because DISCO is defined pointwise, noise evaluation integrates seamlessly into the final clustering score, and the per-point scores enable explainable evaluations of the clustered data. In contrast to most state-of-the-art CVIs, DISCO is well-defined for edge cases that regularly appear in the output of clustering algorithms, such as singleton clusters or a single cluster plus noise.
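As a point of reference for the Silhouette idea that DISCO builds on, the following is a minimal sketch of the classical, distance-based Silhouette Coefficient in plain Python. It is not DISCO: it uses Euclidean distance rather than density-connectivity and has no notion of noise. The function name `silhouette_per_point`, the toy data, and the use of `-1` as the noise label (the DBSCAN convention) are illustrative assumptions, as is the common convention of scoring singleton clusters as 0.

```python
import math


def silhouette_per_point(points, labels):
    """Classical per-point Silhouette: s(i) = (b - a) / max(a, b).

    a = mean distance from point i to the other members of its own cluster
    b = smallest mean distance from point i to the members of another cluster
    Illustrative baseline only; this is NOT the DISCO score.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the classical Silhouette needs at least two clusters")

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, lab in enumerate(labels):
        if len(clusters[lab]) == 1:
            # Assumed convention: a singleton cluster gets a score of 0.
            scores.append(0.0)
            continue
        a = mean_dist(i, clusters[lab])
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy data: two compact clusters and one outlier labeled as noise (-1).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 30)]
labels = [0, 0, 0, 1, 1, 1, -1]
scores = silhouette_per_point(points, labels)
```

In this sketch the label `-1` is handled like any other cluster label, so the outlier is scored as a singleton cluster and receives 0.0 whether or not marking it as noise was appropriate. A labeling with only one distinct label is rejected outright. These are the kinds of gaps in noise handling and edge cases that the abstract describes.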