We present Any4D, a scalable multi-view transformer for metric-scale, dense, feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors when available, such as RGB-D frames, IMU-based egomotion, and radar Doppler measurements. A key innovation enabling this flexible framework is a modular representation of the 4D scene: per-view 4D predictions are encoded as egocentric factors (depth maps and camera intrinsics) expressed in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) expressed in global world coordinates. We achieve superior performance across diverse setups, both in accuracy (2-3X lower error) and compute efficiency (15X faster), opening avenues for multiple downstream applications.
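To make the factored representation concrete, the following minimal sketch (not Any4D's actual code; all function and variable names are hypothetical) shows how per-view egocentric factors (depth map, intrinsics) and allocentric factors (camera-to-world extrinsics, scene flow) can be composed into world-space 4D point maps:

```python
# Minimal sketch of the modular 4D representation described above.
# Egocentric factors (depth, intrinsics) live in camera coordinates;
# allocentric factors (extrinsics, scene flow) live in world coordinates.
import numpy as np

def unproject_depth(depth, K):
    """Lift an (H, W) metric depth map to camera-space points (H, W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def compose_4d(depth, K, T_cam2world, scene_flow):
    """Compose one view's factors into world-space points at t and t+1.

    depth:        (H, W) egocentric metric depth
    K:            (3, 3) camera intrinsics
    T_cam2world:  (4, 4) allocentric camera pose (camera -> world)
    scene_flow:   (H, W, 3) allocentric per-pixel 3D motion in the world frame
    """
    pts_cam = unproject_depth(depth, K)          # geometry in camera frame
    R, t = T_cam2world[:3, :3], T_cam2world[:3, 3]
    pts_world = pts_cam @ R.T + t                # static geometry in world frame
    pts_world_next = pts_world + scene_flow      # dynamic geometry at next frame
    return pts_world, pts_world_next
```

Decoupling the egocentric and allocentric factors in this way lets each sensor or modality supply (or supervise) only the factors it constrains, which is one plausible reading of why the representation supports heterogeneous inputs.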