Training-free image editing with large diffusion models has become practical, yet faithfully performing complex non-rigid edits (e.g., pose or shape changes) remains highly challenging. We identify a key underlying cause: attention collapse in existing attention-sharing mechanisms, where either positional embeddings or semantic features dominate visual content retrieval, leading to over-editing or under-editing. To address this issue, we introduce SynPS, a method that Synergistically leverages Positional embeddings and Semantic information for faithful non-rigid image editing. We first propose an editing measurement that quantifies the required editing magnitude at each denoising step. Based on this measurement, we design an attention synergy pipeline that dynamically modulates the influence of positional embeddings, enabling SynPS to balance semantic modification against fidelity preservation. By adaptively integrating positional and semantic cues, SynPS effectively avoids both over- and under-editing. Extensive experiments on public and newly curated benchmarks demonstrate the superior performance and faithfulness of our approach.