Self-supervised pre-training has revolutionized foundation models for language, individual 2D images, and videos, but it remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer, which infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with Explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments demonstrate that E-RayZer significantly outperforms RayZer on pose estimation and matches or even surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv3, CroCo v2, VideoMAE V2, and RayZer) when transferred to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.