Segment Anything Model 3 (SAM3) has advanced open-vocabulary segmentation through promptable concept segmentation, allowing users to segment all instances corresponding to a given concept, typically specified with short noun-phrase (NP) prompts. While this marks the first integration of language-level concepts within the SAM family, real-world usage often requires far richer expressions that include attributes, spatial relations, functionalities, actions, states, and even implicit reasoning over instances. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and then perform iterative mask filtering. However, these NP-level concepts remain overly coarse, often failing to precisely represent a specific instance. In this work, we present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3's existing vision-language representations, enabling direct instruction-following segmentation without sacrificing its original concept-driven capabilities. Furthermore, we design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset with diverse instruction-mask pairs. Experiments show that SAM3-I delivers compelling performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding. We open-source SAM3-I and provide practical fine-tuning workflows, enabling researchers to adapt it to domain-specific applications. The source code is available here.
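To make the difference between concept-level and instruction-level prompting concrete, the sketch below shows how such an interface might be called. It is a minimal illustration only: the `sam3_i` module, `build_sam3_i`, `predict_instruction`, and the checkpoint name are assumptions for this example, not the released API.

```python
# Hypothetical usage sketch; names below are assumptions, not the released SAM3-I API.
from PIL import Image

from sam3_i import build_sam3_i  # assumed module and builder name

# Load an image and an assumed SAM3-I predictor from a checkpoint.
image = Image.open("kitchen.jpg")
predictor = build_sam3_i(checkpoint="sam3_i.pt")

# Concept-level prompt (SAM3-style noun phrase): returns every matching instance.
concept_masks = predictor.predict_instruction(image, "mug")

# Instruction-level prompt: attributes, spatial relations, and implicit reasoning
# narrow the result to the specific instances the instruction describes.
instruction_masks = predictor.predict_instruction(
    image,
    "the chipped mug closest to the sink that someone could still drink from",
)

# Under this assumed interface, each result is a per-instance binary mask (H, W).
for mask in instruction_masks:
    print(mask.shape)
```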