When applied sequentially to video, frame-based networks often exhibit temporal inconsistency, for example outputs that flicker between frames. This problem is amplified when the network inputs contain time-varying corruptions. In this work, we introduce a general approach for adapting frame-based models for stable and robust inference on video. We describe a class of stability adapters that can be inserted into virtually any architecture, along with a resource-efficient training process that can be performed with a frozen base network. We introduce a unified conceptual framework for describing temporal stability and corruption robustness, centered on a proposed accuracy-stability-robustness loss. By analyzing the theoretical properties of this loss, we identify the conditions under which it produces well-behaved stabilizer training. Our experiments validate our approach on several vision tasks, including denoising (NAFNet), image enhancement (HDRNet), monocular depth estimation (Depth Anything v2), and semantic segmentation (DeepLabv3+). Our method improves temporal stability and robustness against a range of image corruptions (including compression artifacts, noise, and adverse weather) while preserving or improving prediction quality.