Tree ensembles have demonstrated state-of-the-art predictive performance across a wide range of problems involving tabular data. Nevertheless, the black-box nature of tree ensembles is a strong limitation, especially for applications with critical decisions at stake. The Hoeffding or ANOVA functional decomposition is a powerful explainability method, as it breaks down black-box models into a unique sum of lower-dimensional functions, provided that input variables are independent. In standard learning settings, input variables are often dependent, and the Hoeffding decomposition is generalized through hierarchical orthogonality constraints. This generalization leads to unique and sparse decompositions with well-defined main effects and interactions. However, the practical estimation of this decomposition from a data sample is still an open problem. Therefore, we introduce the TreeHFD algorithm to estimate the Hoeffding decomposition of a tree ensemble from a data sample. We show the convergence of TreeHFD, along with the main properties of orthogonality, sparsity, and causal variable selection. The high performance of TreeHFD is demonstrated through experiments on both simulated and real data, using our treehfd Python package (https://github.com/ThalesGroup/treehfd). In addition, we empirically show that the widely used TreeSHAP method, based on Shapley values, is strongly connected to the Hoeffding decomposition.
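To make the idea of a "unique sum of lower-dimensional functions" concrete, the following is a minimal sketch of the classical Hoeffding (ANOVA) decomposition in the independent-input case only. It is not the TreeHFD algorithm and does not use the treehfd package. The toy model `F`, the grids `x1` and `x2`, and the assumption that each input is uniform on its grid are all illustrative choices, made so that expectations reduce to plain averages.

```python
import numpy as np

# Assumption: two independent inputs, each uniform on a small discrete grid,
# so that every expectation below is a simple average over grid points.
x1 = np.linspace(-1.0, 1.0, 5)
x2 = np.linspace(0.0, 2.0, 7)
X1, X2 = np.meshgrid(x1, x2, indexing="ij")

# Hypothetical "black-box" model evaluated on the full grid.
F = X1 + 2.0 * X2 + X1 * X2

# Intercept: overall mean of the model output.
f0 = F.mean()
# Main effects: conditional mean given one input, minus the intercept.
f1 = F.mean(axis=1) - f0  # function of x1 only
f2 = F.mean(axis=0) - f0  # function of x2 only
# Second-order interaction: what remains after removing lower-order terms.
f12 = F - f0 - f1[:, None] - f2[None, :]

# The components sum back to the original model.
recon = f0 + f1[:, None] + f2[None, :] + f12
```

In this toy setting the components are centered and the interaction averages to zero along each input, which is the orthogonality property that makes the decomposition unique under independence. With dependent inputs these simple averages no longer give a valid decomposition, which is the case the hierarchical orthogonality constraints are meant to handle.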