Current methods for dense 3D point tracking in dynamic scenes typically rely on pairwise processing, require known camera poses, or assume a temporal ordering of input frames, constraining their flexibility and applicability. Meanwhile, recent advances have enabled efficient 3D reconstruction from large-scale, unposed image collections, underscoring the opportunity for unified approaches to dynamic scene understanding. Motivated by this, we propose DePT3R, a novel framework that simultaneously performs dense point tracking and 3D reconstruction of dynamic scenes from multiple images in a single forward pass. This multi-task learning is achieved by extracting deep spatio-temporal features with a powerful backbone and regressing pixel-wise maps with dense prediction heads. Crucially, DePT3R operates without requiring camera poses, substantially enhancing its adaptability and efficiency, which is especially important in dynamic environments with rapid changes. We validate DePT3R on several challenging dynamic-scene benchmarks, demonstrating strong performance and significant improvements in memory efficiency over existing state-of-the-art methods. Data and code are available at the open repository: https://github.com/StructuresComp/DePT3R
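The abstract describes a shared backbone whose features are regressed into pixel-wise maps by dense prediction heads. The snippet below is a minimal, hedged sketch of that pattern only; the module sizes, tensor shapes, and head names (TinyBackbone, DenseHeads, pointmap_head, track_head) are illustrative assumptions and not the authors' implementation.

```python
# Hedged sketch (assumed shapes and modules, not the DePT3R architecture):
# one shared feature extractor feeds two dense regression heads, producing
# per-pixel 3D point maps and per-pixel 3D track offsets for an unordered,
# unposed set of frames in a single forward pass.
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stand-in spatio-temporal feature extractor (illustrative only)."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) -> per-frame features (B, T, C, H, W)
        b, t, c, h, w = frames.shape
        feats = self.net(frames.reshape(b * t, c, h, w))
        return feats.reshape(b, t, -1, h, w)

class DenseHeads(nn.Module):
    """Two dense prediction heads sharing the same backbone features."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.pointmap_head = nn.Conv2d(feat_dim, 3, 1)  # per-pixel XYZ
        self.track_head = nn.Conv2d(feat_dim, 3, 1)     # per-pixel 3D motion

    def forward(self, feats: torch.Tensor):
        b, t, c, h, w = feats.shape
        flat = feats.reshape(b * t, c, h, w)
        points = self.pointmap_head(flat).reshape(b, t, 3, h, w)
        tracks = self.track_head(flat).reshape(b, t, 3, h, w)
        return points, tracks

if __name__ == "__main__":
    frames = torch.randn(1, 4, 3, 64, 64)  # unposed, unordered frame set
    feats = TinyBackbone()(frames)
    points, tracks = DenseHeads()(feats)
    print(points.shape, tracks.shape)      # (1, 4, 3, 64, 64) each
```

Keeping both outputs on the same shared features is what makes the multi-task prediction a single forward pass rather than two separate networks.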