Predicting future dynamics is crucial for applications like autonomous driving and robotics, where understanding the environment is key. Existing pixel-level methods are computationally expensive and often focus on irrelevant details. To address these challenges, we introduce DINO-Foresight, a novel framework that operates in the semantic feature space of pretrained Vision Foundation Models (VFMs). Our approach trains a masked feature transformer in a self-supervised manner to predict the evolution of VFM features over time. By forecasting these features, we can apply off-the-shelf, task-specific heads for various scene understanding tasks. In this framework, VFM features are treated as a latent space, to which different heads attach to perform specific tasks for future-frame analysis. Extensive experiments demonstrate the strong performance, robustness, and scalability of our framework. Project page and code at https://dino-foresight.github.io/.
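To make the abstract's pipeline concrete, here is a minimal, hypothetical sketch of feature-space forecasting: a transformer consumes frozen VFM features from past frames and predicts the features of a future frame, to which a task head could then be attached. All class names, dimensions, and architectural choices below (e.g. `FeatureForecaster`, a plain `nn.TransformerEncoder`, learnable mask tokens) are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch only: the real DINO-Foresight architecture may differ.
import torch
import torch.nn as nn

class FeatureForecaster(nn.Module):
    """Predicts next-frame VFM features from a context of past-frame features."""
    def __init__(self, dim=384, depth=4, heads=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Learnable placeholder tokens standing in for the (masked) future frame.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, past_feats):
        # past_feats: (B, T, N, D) — T context frames, N patch tokens, D channels
        B, T, N, D = past_feats.shape
        future = self.mask_token.expand(B, N, D)
        tokens = torch.cat([past_feats.reshape(B, T * N, D), future], dim=1)
        out = self.encoder(tokens)
        return out[:, -N:]  # predicted patch features for the future frame

# Frozen VFM features in, predicted future features out; an off-the-shelf
# head (e.g. segmentation) could then run on the prediction.
model = FeatureForecaster()
past = torch.randn(2, 4, 196, 384)  # 2 clips, 4 context frames, 14x14 patches
pred = model(past)                  # shape: (2, 196, 384)
```

Operating on features rather than pixels keeps the forecaster small and lets the same predicted latent serve several downstream heads, which is the abstract's central design point.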