Perceptual similarity scores that align with human vision are critical for both training and evaluating computer vision models. Deep perceptual losses, such as LPIPS, achieve good alignment but rely on complex, highly non-linear discriminative features with unknown invariances, while hand-crafted measures like SSIM are interpretable but miss key perceptual properties. We introduce the Structured Uncertainty Similarity Score (SUSS), which models each image through a set of perceptual components, each represented by a structured multivariate Normal distribution. These components are trained in a generative, self-supervised manner to assign high likelihood to human-imperceptible augmentations. The final score is a weighted sum of component log-probabilities, with weights learned from human perceptual datasets. Unlike feature-based methods, SUSS learns image-specific linear transformations of residuals in pixel space, enabling transparent inspection through decorrelated residuals and sampling. SUSS aligns closely with human perceptual judgments, shows strong perceptual calibration across diverse distortion types, and provides localized, interpretable explanations of its similarity assessments. We further demonstrate stable optimization behavior and competitive performance when using SUSS as a perceptual loss for downstream imaging tasks.
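The scoring mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation: the component structure, the lower-triangular precision factors `L`, the toy dimension `D`, and the weights are all illustrative assumptions. It only shows the general shape of the computation: a residual between two images is passed through an image-specific linear transformation (here, multiplication by `L.T`, which decorrelates it), its log-probability is evaluated under a zero-mean structured Gaussian, and the per-component log-probabilities are combined with learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # flattened patch dimension (illustrative, not from the paper)

def component_logprob(residual, L):
    # Parameterize the precision as Lambda = L @ L.T with L lower-triangular.
    # Then z = L.T @ r is the decorrelated residual, and the Gaussian
    # log-density is: sum(log diag(L)) - 0.5*||z||^2 - (D/2)*log(2*pi).
    z = L.T @ residual
    return (np.sum(np.log(np.diag(L)))
            - 0.5 * z @ z
            - 0.5 * len(residual) * np.log(2 * np.pi))

# Two toy "perceptual components", each a structured (lower-triangular)
# precision factor. In SUSS these would be learned per image.
components = []
for _ in range(2):
    A = rng.normal(size=(D, D))
    L = np.tril(A, -1) * 0.1 + np.diag(np.abs(np.diag(A)) + 0.5)
    components.append(L)

# Hypothetical component weights; the paper learns these from human
# perceptual datasets.
weights = np.array([0.7, 0.3])

x = rng.normal(size=D)               # reference patch (flattened)
y = x + 0.01 * rng.normal(size=D)    # mildly distorted copy
r = y - x

score = sum(w * component_logprob(r, L)
            for w, L in zip(weights, components))
print(score)
```

Because each component is a zero-mean Gaussian over the residual, a smaller (more imperceptible) distortion yields a higher weighted log-probability, which is the sense in which the score rewards human-imperceptible changes.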