DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture

The Image-based Joint-Embedding Predictive Architecture (I-JEPA) learns visual representations by predicting latent embeddings of masked regions from visible context. However, it treats all regions uniformly and independently, lacking an explicit notion of where or in what order predictions should be made. Inspired by human visual perception, which deploys attention selectively and sequentially from the most informative to secondary regions, we propose DSeq-JEPA, a Discriminative Sequential Joint-Embedding Predictive Architecture that bridges predictive and autoregressive self-supervised learning, integrating JEPA-style latent prediction with GPT-style sequential reasoning. Specifically, DSeq-JEPA (i) first identifies primary discriminative regions based on a transformer-derived saliency map, emphasizing the distribution of visual importance, and then (ii) predicts subsequent regions in this discriminative order, progressively forming a curriculum-like semantic progression from primary to secondary cues -- a form of GPT-style pre-training. Extensive experiments across diverse tasks, including image classification (ImageNet), fine-grained visual categorization (iNaturalist21, CUB-200-2011, Stanford-Cars), detection and segmentation (MS-COCO, ADE20K), and low-level reasoning tasks (Clevr/Count, Clevr/Dist), demonstrate that DSeq-JEPA consistently learns more discriminative and generalizable representations than I-JEPA variants. Project page: https://github.com/SkyShunsuke/DSeq-JEPA.
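
The two steps described above, ranking regions by saliency and then predicting them in that order, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the saliency map is computed or how patches are grouped into regions, so the function names `discriminative_order` and `sequential_steps`, the equal-size grouping, and the choice to use the most salient group as the initial context are all assumptions made for illustration.

```python
import numpy as np


def discriminative_order(saliency, num_regions):
    """Group patch indices into `num_regions` regions, most salient first.

    `saliency` is a hypothetical per-patch score map (any shape); how the
    real method derives it from a transformer is not specified here.
    """
    flat = np.asarray(saliency, dtype=float).ravel()
    # Stable sort on negated scores: descending saliency, ties keep index order.
    ranked = np.argsort(-flat, kind="stable")
    # Assumption: regions are near-equal-size chunks of the ranked patches.
    return [group.tolist() for group in np.array_split(ranked, num_regions)]


def sequential_steps(regions):
    """Yield (context, target) pairs in discriminative order.

    Assumption: the first (most salient) region is the initial context, and
    each later region is predicted from all regions ranked before it.
    """
    context = list(regions[0])
    for target in regions[1:]:
        yield list(context), list(target)
        context.extend(target)


if __name__ == "__main__":
    toy_saliency = [[0.1, 0.9], [0.5, 0.3]]  # 2x2 grid of patch scores
    for context, target in sequential_steps(discriminative_order(toy_saliency, 4)):
        print("context:", context, "-> predict:", target)
```

In an actual training loop, each `(context, target)` pair would presumably drive one latent-prediction step, with the predictor regressing target-patch embeddings from the context embeddings; that part is omitted because the abstract gives no details about it.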