Recent advances in post-training paradigms have achieved remarkable success in high-level generation tasks, yet their potential for low-level vision remains largely unexplored. Existing image restoration (IR) methods rely on pixel-level hard fitting to ground-truth images and consequently suffer from over-smoothing and poor generalization. To address these limitations, we propose IRPO, a GRPO-based post-training paradigm for low-level vision that systematically explores both data formulation and reward modeling. We first establish a data formulation principle for low-level post-training, showing that selecting samples that underperform during the pre-training stage yields the best performance and improved efficiency. Furthermore, we design a reward criteria system that balances objective accuracy and human perceptual preference through three complementary components: a General Reward for structural fidelity, an Expert Reward leveraging Qwen-VL for perceptual alignment, and a Restoration Reward for task-specific low-level quality. Comprehensive experiments on six in-domain and five out-of-domain (OOD) low-level benchmarks demonstrate that IRPO achieves state-of-the-art results across diverse degradation types, surpassing the AdaIR baseline by 0.83 dB on in-domain tasks and 3.43 dB in OOD settings. Our code is available at https://github.com/HaoxuanXU1024/IRPO.
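To make the three-component reward concrete, the following is a minimal sketch of how a composite reward and GRPO-style group-relative advantage could be computed. The weighted-sum combination, the specific weights, the L1/PSNR proxies for the General and Restoration rewards, and the placeholder Expert score (which in the paper would come from Qwen-VL) are all illustrative assumptions, not the paper's actual implementation.

```python
import torch

def general_reward(restored: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Structural-fidelity term, approximated here by negative L1 distance."""
    return -torch.mean(torch.abs(restored - gt), dim=(1, 2, 3))

def expert_reward(restored: torch.Tensor) -> torch.Tensor:
    """Perceptual-alignment term. The paper queries Qwen-VL for a preference
    score; this stand-in simply returns zeros of the right shape."""
    return torch.zeros(restored.shape[0])

def restoration_reward(restored: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Task-specific low-level quality term, approximated by PSNR
    (assumes images are scaled to [0, 1])."""
    mse = torch.mean((restored - gt) ** 2, dim=(1, 2, 3)).clamp_min(1e-8)
    return 10.0 * torch.log10(1.0 / mse)

def irpo_reward(restored, gt, w=(1.0, 1.0, 0.1)):
    """Weighted sum of the three components (weights here are arbitrary)."""
    return (w[0] * general_reward(restored, gt)
            + w[1] * expert_reward(restored)
            + w[2] * restoration_reward(restored, gt))

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantage: normalize rewards within the group of candidate
    restorations sampled for the same degraded input."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

In this sketch, a batch dimension indexes the group of candidate restorations for one degraded input, so `grpo_advantages` ranks candidates against each other rather than against an absolute threshold, which is the defining property of group-relative policy optimization.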