During data analysis, we are often perplexed by disparities observed between two groups of interest within a dataset. To better understand an observed disparity, we need explanations that pinpoint the data regions where the disparity is most pronounced, along with its causes, i.e., factors that alleviate or exacerbate the disparity. This task is complex and tedious, particularly for large and high-dimensional datasets, and calls for an automatic system that discovers explanations (data regions and causes) of an observed disparity. It is critical that explanations for disparities be not only interpretable but also actionable, enabling users to make informed, data-driven decisions. This requires explanations to go beyond surface-level correlations and instead capture causal relationships. We introduce ExDis, a framework for discovering causal Explanations for Disparities between two groups of interest. ExDis identifies data regions (subpopulations) where disparities are most pronounced (or reversed), and associates with each identified region the specific factors that causally contribute to the disparity within it. We formally define the ExDis framework and the associated optimization problem, analyze its complexity, and develop an efficient algorithm to solve it. Through extensive experiments on three real-world datasets, we demonstrate that ExDis generates meaningful causal explanations, outperforms prior methods, and scales effectively to large, high-dimensional datasets.