Modern neural networks often encode unwanted concepts alongside task-relevant information, raising fairness and interpretability concerns. Existing post-hoc approaches can remove undesired concepts but often degrade useful signal in the process. We introduce SPLINCE (Simultaneous Projection for LINear concept removal and Covariance prEservation), which eliminates sensitive concepts from representations while exactly preserving the covariance between the representations and a target label. SPLINCE achieves this via an oblique projection that 'splices out' the unwanted direction while protecting important label correlations. Theoretically, it is the unique solution that removes linear concept predictability and maintains target covariance with minimal embedding distortion. Empirically, SPLINCE outperforms baselines on benchmarks such as Bias in Bios and Winobias, removing protected attributes while minimally damaging main-task information.
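To make the two constraints concrete, the following NumPy sketch builds one oblique projection P that satisfies both: it zeroes the cross-covariance between the embeddings and a concept z, and leaves the cross-covariance with a label y unchanged. This is an illustration under stated assumptions, not the SPLINCE construction itself: the abstract does not give the closed form, and this sketch makes no attempt at the minimal-distortion property. The function names (`cross_cov`, `oblique_projection`), the synthetic data, and the specific formula P = I - A (AᵀMA)⁻¹ AᵀM are all assumptions introduced here. The formula also assumes the concept and label covariance directions are linearly independent.

```python
import numpy as np

def cross_cov(X, t):
    """Empirical cross-covariance between rows of X (n x d) and targets t."""
    t = np.asarray(t, dtype=float).reshape(len(X), -1)
    Xc = X - X.mean(axis=0)
    tc = t - t.mean(axis=0)
    return Xc.T @ tc / len(X)

def oblique_projection(sigma_xz, sigma_xy):
    """Illustrative oblique projection (NOT the paper's formula).

    A = sigma_xz (d x k) spans the directions to remove.
    B = sigma_xy (d x m) spans the directions to keep fixed.
    M is the orthogonal projector onto the complement of col(B), so M @ B = 0.
    P = I - A (A^T M A)^{-1} A^T M then satisfies P @ A = 0 and P @ B = B.
    Requires A^T M A to be invertible (col(A) and col(B) independent).
    """
    A, B = sigma_xz, sigma_xy
    d = A.shape[0]
    M = np.eye(d) - B @ np.linalg.pinv(B)
    return np.eye(d) - A @ np.linalg.solve(A.T @ M @ A, A.T @ M)

# Synthetic data (hypothetical): embeddings X carry both a concept z and a label y.
rng = np.random.default_rng(0)
n, d = 500, 8
z = rng.integers(0, 2, size=n).astype(float)          # concept to remove
y = ((z + rng.normal(size=n)) > 0.5).astype(float)    # label correlated with z
X = (rng.normal(size=(n, d))
     + np.outer(z, rng.normal(size=d))
     + np.outer(y, rng.normal(size=d)))

P = oblique_projection(cross_cov(X, z), cross_cov(X, y))
X_clean = X @ P.T  # apply the projection to every embedding

cov_z_after = cross_cov(X_clean, z)   # should vanish up to rounding
cov_y_before = cross_cov(X, y)
cov_y_after = cross_cov(X_clean, y)   # should match cov_y_before
```

Because P is not symmetric in general, it projects along the concept direction rather than orthogonally to it, which is what allows the label covariance to survive when the two directions are correlated. An orthogonal projection onto the complement of the concept direction would satisfy only the first constraint.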