When applied sequentially to video, frame-based networks often exhibit temporal inconsistency, for example outputs that flicker between frames. This problem is amplified when the network inputs contain time-varying corruptions. In this work, we introduce a general approach for adapting frame-based models for stable and robust inference on video. We describe a class of stability adapters that can be inserted into virtually any architecture, along with a resource-efficient training process that can be performed with a frozen base network. We introduce a unified conceptual framework for describing temporal stability and corruption robustness, centered on a proposed accuracy-stability-robustness loss. By analyzing the theoretical properties of this loss, we identify the conditions under which it produces well-behaved stabilizer training. Our experiments validate our approach on several vision tasks, including denoising (NAFNet), image enhancement (HDRNet), monocular depth estimation (Depth Anything v2), and semantic segmentation (DeepLabv3+). Our method improves temporal stability and robustness against a range of image corruptions (including compression artifacts, noise, and adverse weather) while preserving or improving prediction quality.