Discriminative Attribution from Counterfactuals

We present a method for neural network interpretability that combines feature attribution with counterfactual explanations to generate attribution maps highlighting the most discriminative features between pairs of classes. We show that this method can be used to quantitatively evaluate the performance of feature attribution methods in an objective manner, thus preventing potential observer bias. We evaluate the proposed method on three diverse datasets, including a challenging artificial dataset and real-world biological data. We show quantitatively and qualitatively that the highlighted features are substantially more discriminative than those extracted using conventional attribution methods, and we argue that this type of explanation is better suited for understanding fine-grained class differences as learned by a deep neural network.
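The abstract does not spell out how attribution and counterfactuals are combined or how the quantitative evaluation is run, so the following is only a rough sketch under stated assumptions. It assumes a toy linear softmax classifier in place of a deep network, a hand-made counterfactual instead of a generated one, and a gradient-times-difference attribution that uses the counterfactual as the baseline. The evaluation step (copying the top-attributed features from the real input into the counterfactual and checking whether the prediction flips) is likewise an illustrative guess at what an observer-independent score could look like. All names (`predict`, `discriminative_attribution`, `swap_top_k`) are hypothetical.

```python
import numpy as np


def predict(W, x):
    """Class probabilities of a toy linear softmax classifier (stand-in for a deep network)."""
    logits = W @ x
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()


def discriminative_attribution(W, x_real, x_cf, cls_real, cls_cf):
    """Gradient-times-difference attribution with the counterfactual as baseline.

    Assumption: for a linear model the gradient of
    (logit[cls_real] - logit[cls_cf]) with respect to the input is constant.
    """
    grad = W[cls_real] - W[cls_cf]
    return grad * (x_real - x_cf)


def swap_top_k(x_real, x_cf, attribution, k):
    """Copy the k highest-attributed features of the real input into the counterfactual."""
    top = np.argsort(attribution)[::-1][:k]
    hybrid = x_cf.copy()
    hybrid[top] = x_real[top]
    return hybrid


# Toy data (made up for illustration): two classes, four features.
W = np.array([[1.0, 0.0, 0.5, 0.5],
              [0.0, 2.0, 0.5, 0.5]])
x_real = np.array([2.0, 0.0, 1.0, 1.0])  # classified as class 0
x_cf = np.array([0.5, 2.0, 1.0, 1.0])    # counterfactual, classified as class 1

attr = discriminative_attribution(W, x_real, x_cf, cls_real=0, cls_cf=1)
hybrid = swap_top_k(x_real, x_cf, attr, k=1)

print("attribution:", attr)
print("hybrid prediction:", int(np.argmax(predict(W, hybrid))))
```

In this toy case the two features shared by both inputs receive zero attribution, and restoring only the single highest-attributed feature is enough to move the counterfactual back to the real class. Measuring how few features are needed for that flip is one way such a comparison between attribution methods could be made without relying on visual inspection.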
