Pan-sharpening aims to generate high-resolution multispectral (HRMS) images by integrating a high-resolution panchromatic (PAN) image with its corresponding low-resolution multispectral (MS) image. To achieve effective fusion, it is crucial to fully exploit the complementary information between the two modalities. Traditional CNN-based methods typically rely on channel-wise concatenation with fixed convolutional operators, which limits their adaptability to diverse spatial and spectral variations. While cross-attention mechanisms enable global interactions, they are computationally inefficient and may dilute fine-grained correspondences, making it difficult to capture complex semantic relationships. Recent advances in the Multimodal Diffusion Transformer (MMDiT) architecture have demonstrated impressive success in image generation and editing tasks. Unlike cross-attention, MMDiT employs in-context conditioning to facilitate more direct and efficient cross-modal information exchange. In this paper, we propose MMMamba, a cross-modal in-context fusion framework for pan-sharpening, with the flexibility to support image super-resolution in a zero-shot manner. Built upon the Mamba architecture, our design ensures linear computational complexity while maintaining strong cross-modal interaction capacity. Furthermore, we introduce a novel multimodal interleaved (MI) scanning mechanism that facilitates effective information exchange between the PAN and MS modalities. Extensive experiments demonstrate the superior performance of our method compared to existing state-of-the-art (SOTA) techniques across multiple tasks and benchmarks.
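The abstract does not spell out the scanning order used by the multimodal interleaved (MI) mechanism, so the following is only a minimal sketch of one plausible reading: tokens from the two modalities alternate in a single sequence before a linear-time scan processes it. It assumes the MS image has already been resampled to the PAN grid so both modalities yield the same number of tokens. The function names `interleave_tokens` and `split_tokens` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def interleave_tokens(pan: Sequence[T], ms: Sequence[T]) -> List[T]:
    """Merge two equal-length token sequences as [pan0, ms0, pan1, ms1, ...].

    Assumption: one PAN token and one MS token exist per spatial location.
    """
    if len(pan) != len(ms):
        raise ValueError("PAN and MS token sequences must have equal length")
    merged: List[T] = []
    for p, m in zip(pan, ms):
        merged.append(p)
        merged.append(m)
    return merged


def split_tokens(merged: Sequence[T]) -> Tuple[List[T], List[T]]:
    """Invert interleave_tokens: even positions are PAN, odd positions are MS."""
    return list(merged[0::2]), list(merged[1::2])


# Toy example with string placeholders standing in for feature vectors.
pan_tokens = ["P0", "P1", "P2"]
ms_tokens = ["M0", "M1", "M2"]

sequence = interleave_tokens(pan_tokens, ms_tokens)
print(sequence)  # ['P0', 'M0', 'P1', 'M1', 'P2', 'M2']

# After the sequence model has processed the merged stream,
# the two modalities can be separated again.
pan_out, ms_out = split_tokens(sequence)
```

Under this reading, tokens describing the same location sit next to each other in the scan, whereas plain concatenation (`pan_tokens + ms_tokens`) would separate them by the full sequence length. Whether the paper uses this exact pattern is an assumption here.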