The challenge of \textbf{imbalanced regression} arises when standard Empirical Risk Minimization (ERM) biases models toward high-frequency regions of the data distribution, causing severe degradation on rare but high-impact ``tail'' events. Existing strategies such as loss re-weighting or synthetic over-sampling often introduce noise, distort the underlying distribution, or add substantial algorithmic complexity. We introduce \textbf{PARIS} (Pruning Algorithm via the Representer theorem for Imbalanced Scenarios), a principled framework that mitigates imbalance by \emph{optimizing the training set itself}. PARIS leverages the representer theorem for neural networks to compute a \textbf{closed-form representer deletion residual}, which quantifies the exact change in validation loss caused by removing a single training point \emph{without retraining}. Combined with an efficient Cholesky rank-one downdating scheme, PARIS performs fast, iterative pruning that eliminates uninformative or performance-degrading samples. On a real-world space weather application, PARIS reduces the training set by up to 75\% while preserving or improving overall RMSE, outperforming re-weighting, synthetic over-sampling, and boosting baselines. Our results demonstrate that representer-guided dataset pruning is a powerful, interpretable, and computationally efficient approach to rare-event regression.
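For intuition, here is a minimal sketch of how such a closed-form deletion residual can arise, under the simplifying assumption (ours, for illustration; not necessarily the paper's exact construction) that the network's final layer is viewed as kernel ridge regression with Gram matrix $K$, ridge parameter $\lambda$, and dual solution $\alpha = (K + \lambda I)^{-1} y$. Writing $G = (K + \lambda I)^{-1}$, deleting training point $i$ shifts every validation prediction by a rank-one correction, with no retraining:
\[
f_{-i}(x_v) = f(x_v) - \frac{\alpha_i}{G_{ii}} \sum_{j=1}^{n} G_{ji}\, k(x_j, x_v),
\qquad
\Delta_i = \sum_{v} \left[ \big(f_{-i}(x_v) - y_v\big)^2 - \big(f(x_v) - y_v\big)^2 \right].
\]
A negative $\Delta_i$ flags point $i$ as performance-degrading, and hence a pruning candidate.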
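The same computation in a short NumPy/SciPy sketch (the function names, squared-error validation loss, and \texttt{lam} value are our illustrative choices, not taken from the paper): \texttt{deletion\_residuals} scores every training point in one pass, and \texttt{chol\_delete} shows the $O(n^2)$ rank-one Cholesky downdate that keeps iterative pruning cheap after each deletion.
\begin{verbatim}
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def deletion_residuals(K, y_tr, K_val, y_val, lam=1e-2):
    """Closed-form change in validation MSE from deleting each training
    point, under a kernel-ridge view of the last layer (illustration).

    K     : (n, n) training Gram matrix
    K_val : (m, n) validation-vs-training kernel cross-matrix
    """
    n = K.shape[0]
    c_and_low = cho_factor(K + lam * np.eye(n))
    alpha = cho_solve(c_and_low, y_tr)        # dual coefficients
    G = cho_solve(c_and_low, np.eye(n))       # (K + lam*I)^{-1}
    f_val = K_val @ alpha                     # current validation predictions
    base = np.mean((f_val - y_val) ** 2)
    deltas = np.empty(n)
    for i in range(n):
        # rank-one shift of all validation predictions if point i is deleted
        shift = (alpha[i] / G[i, i]) * (K_val @ G[:, i])
        deltas[i] = np.mean((f_val - shift - y_val) ** 2) - base
    return deltas  # negative delta: deleting point i improves validation MSE

def chol_delete(L, i):
    """O(n^2) downdate of a lower Cholesky factor of A = L L^T after
    deleting row and column i of A (sketch of the rank-one scheme)."""
    u = L[i + 1:, i].copy()                  # sub-diagonal of deleted column
    M = np.delete(np.delete(L, i, 0), i, 1)  # drop row/column i of the factor
    for k in range(len(u)):                  # rank-one update of the trailing
        j = i + k                            # block: M33 M33^T += u u^T
        r = np.hypot(M[j, j], u[k])
        c, s = r / M[j, j], u[k] / M[j, j]
        M[j, j] = r
        M[j + 1:, j] = (M[j + 1:, j] + s * u[k + 1:]) / c
        u[k + 1:] = c * u[k + 1:] - s * M[j + 1:, j]
    return M
\end{verbatim}
In an iterative loop one would delete the $\arg\min_i \Delta_i$ point, downdate the factor with \texttt{chol\_delete} rather than refactorizing, and rescore; this is the spirit of the Cholesky rank-one scheme the abstract describes.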