Real-world visual data rarely presents as isolated, static instances. Instead, it often evolves gradually over time through variations in pose, lighting, object state, or scene context. However, conventional classifiers are typically trained under the assumption of temporal independence, limiting their ability to capture such dynamics. We propose a simple yet effective framework that equips standard feedforward classifiers with temporal reasoning, all without modifying model architectures or introducing recurrent modules. At the heart of our approach is a novel Support-Exemplar-Query (SEQ) learning paradigm, which structures training data into temporally coherent trajectories. These trajectories enable the model to learn class-specific temporal prototypes and align prediction sequences via a differentiable soft-DTW loss. A multi-term objective further promotes semantic consistency and temporal smoothness. By interpreting input sequences as evolving feature trajectories, our method introduces a strong temporal inductive bias through loss design alone. This proves highly effective in both static and temporal tasks: it enhances performance on fine-grained and ultra-fine-grained image classification, and delivers precise, temporally consistent predictions in video anomaly detection. Despite its simplicity, our approach bridges static and temporal learning in a modular and data-efficient manner, requiring only a simple classifier on top of pre-extracted features.
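To make the alignment term concrete, here is a minimal NumPy sketch of the standard soft-DTW recursion used to compare a prediction sequence against a class prototype trajectory; it is an illustrative reference implementation under our own naming, not the paper's code, and the squared-Euclidean cost and smoothing parameter gamma are assumptions.

```python
import numpy as np

def soft_dtw(pred_seq, prototype_seq, gamma=1.0):
    """Illustrative soft-DTW between a prediction sequence and a class
    prototype trajectory (both arrays of shape [T, d]), using a
    squared-Euclidean cost and a soft-min with smoothing gamma."""
    n, m = len(pred_seq), len(prototype_seq)
    # Pairwise squared-Euclidean cost matrix between the two trajectories.
    D = np.array([[np.sum((p - q) ** 2) for q in prototype_seq] for p in pred_seq])
    # Dynamic-programming table for the accumulated soft-DTW cost.
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Soft-min over the three DTW predecessors (match, insertion, deletion).
            prev = np.array([R[i - 1, j - 1], R[i - 1, j], R[i, j - 1]])
            softmin = -gamma * np.logaddexp.reduce(-prev / gamma)
            R[i, j] = D[i - 1, j - 1] + softmin
    return R[n, m]
```

Because the soft-min is smooth, this quantity is differentiable in the prediction sequence, which is what allows it to serve as a training loss; in practice one would use an autodiff framework rather than this plain-NumPy version.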