Recently, image editing based on Diffusion Transformer (DiT) models has advanced rapidly. However, existing editing methods often lack effective control over the degree of editing, limiting their ability to produce more customized results. To address this limitation, we investigate the MM-Attention mechanism within the DiT model and observe that the Query and Key tokens share a bias vector that depends only on the layer. We interpret this bias as representing the model's inherent editing behavior, while the delta between each token and its corresponding bias encodes content-specific editing signals. Based on this insight, we propose Group Relative Attention Guidance (GRAG), a simple yet effective method that reweights the delta values of different tokens to modulate the model's focus on the input image relative to the editing instruction, enabling continuous and fine-grained control over editing intensity without any tuning. Extensive experiments on existing image editing frameworks demonstrate that GRAG can be integrated with as few as four lines of code and consistently enhances editing quality. Moreover, compared with the commonly used Classifier-Free Guidance, GRAG achieves smoother and more precise control over the degree of editing. Our code will be released at https://github.com/little-misfit/GRAG-Image-Editing.
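To make the core idea concrete, the following is a minimal sketch of how such a delta reweighting could be applied inside a DiT-style MM-Attention layer, assuming Query tokens of shape (batch, heads, tokens, dim) and an `img_slice` selecting the image-token group; the group mean is used here as a stand-in for the layer-dependent shared bias, and the names `grag_reweight`, `img_slice`, and `scale` are illustrative assumptions rather than the authors' actual API.

```python
import torch

def grag_reweight(q: torch.Tensor, img_slice: slice, scale: float) -> torch.Tensor:
    """Sketch of GRAG-style delta reweighting on Query tokens.

    q:         (batch, heads, tokens, dim) Query tensor of an MM-Attention layer.
    img_slice: slice selecting the image-token group along the token axis.
    scale:     >1 strengthens, <1 weakens the focus on the input image
               relative to the editing instruction; 1.0 recovers the
               unmodified attention.
    """
    q = q.clone()
    group = q[:, :, img_slice, :]                    # image-token queries
    bias = group.mean(dim=2, keepdim=True)           # estimate of the group-shared bias
    q[:, :, img_slice, :] = bias + scale * (group - bias)  # reweight per-token deltas
    return q
```

In use, such a hook would be applied to the Query (and analogously the Key) projections before the attention scores are computed, with `scale` swept continuously to trade off fidelity to the input image against adherence to the edit instruction.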