Causal Explanations for Disparate Trends: Where and Why?

During data analysis, we are often perplexed by certain disparities observed between two groups of interest within a dataset. To better understand an observed disparity, we need explanations that can pinpoint the data regions where the disparity is most pronounced, along with its causes, i.e., factors that alleviate or exacerbate the disparity. This task is complex and tedious, particularly for large and high-dimensional datasets, and calls for an automatic system for discovering explanations (data regions and causes) of an observed disparity. It is critical that explanations for disparities are not only interpretable but also actionable, enabling users to make informed, data-driven decisions. This requires explanations to go beyond surface-level correlations and instead capture causal relationships. We introduce ExDis, a framework for discovering causal Explanations for Disparities between two groups of interest. ExDis identifies data regions (subpopulations) where disparities are most pronounced (or reversed), and associates with each identified data region the specific factors that causally contribute to the disparity within it. We formally define the ExDis framework and the associated optimization problem, analyze its complexity, and develop an efficient algorithm to solve the problem. Through extensive experiments over three real-world datasets, we demonstrate that ExDis generates meaningful causal explanations, outperforms prior methods, and scales effectively to handle large, high-dimensional datasets.

