Hierarchical clustering seeks to uncover nested structures in data by constructing a tree of clusters, where deeper levels reveal finer-grained relationships. Traditional methods, including linkage approaches, face three major limitations: (i) they always return a hierarchy, even if none exists, (ii) they are restricted to binary trees, even if the true hierarchy is non-binary, and (iii) they are highly sensitive to the choice of linkage function. In this paper, we address these issues by introducing the notion of a valid hierarchy and defining a partial order over the set of valid hierarchies. We prove the existence of a finest valid hierarchy, that is, the hierarchy that encodes the maximum information consistent with the similarity structure of the data set. In particular, the finest valid hierarchy is not constrained to binary structures and, when no hierarchical relationships exist, collapses to a star tree. We propose a simple two-step algorithm that first constructs a binary tree via a linkage method and then prunes it to enforce validity. We establish necessary and sufficient conditions on the linkage function under which this procedure exactly recovers the finest valid hierarchy, and we show that all linkage functions satisfying these conditions yield the same hierarchy after pruning. Notably, classical linkage rules such as single, complete, and average satisfy these conditions, whereas Ward's linkage fails to do so.
翻译:层次聚类旨在通过构建聚类树来揭示数据中的嵌套结构,其中更深层次展现出更细粒度的关联关系。传统方法(包括链接方法)面临三个主要局限:(i)即使不存在层次结构,它们也总是返回一个层次树;(ii)即使真实层次结构是非二叉的,它们仍受限于二叉树;(iii)对链接函数的选择高度敏感。本文通过引入有效层次结构的概念,并在有效层次结构集合上定义偏序关系,以解决这些问题。我们证明了最精细有效层次结构的存在性,即该层次结构编码了与数据集相似性结构一致的最大信息量。特别地,最精细有效层次结构不受限于二叉结构,且当不存在层次关系时,会退化为星形树。我们提出一种简单的两步算法:首先通过链接方法构建二叉树,然后通过剪枝强制满足有效性条件。我们建立了链接函数所需满足的充要条件,使得该过程能精确恢复最精细有效层次结构,并证明所有满足这些条件的链接函数在剪枝后均产生相同的层次结构。值得注意的是,经典链接规则(如单链接、全链接和平均链接)均满足这些条件,而Ward链接法则无法满足。