arxiv_cs_gr April 24, 2026

LooseRoPE: Content-aware Attention Manipulation for Semantic Harmonization

Translated: 2026/4/24 19:53:54

LooseRoPE, diffusion-models, attention-mechanism, image-editing, positional-encoding

Original Content

arXiv:2601.05127v2 Announce Type: replace

Abstract: Recent diffusion-based image editing methods commonly rely on text or high-level instructions to guide the generation process, offering intuitive but coarse control. In contrast, we focus on explicit, prompt-free editing, where the user directly specifies the modification by cropping and pasting an object or sub-object into a chosen location within an image. This operation affords precise spatial and visual control, yet it introduces a fundamental challenge: preserving the identity of the pasted object while harmonizing it with its new context. We observe that attention maps in diffusion-based editing models inherently govern whether image regions are preserved or adapted for coherence. Building on this insight, we introduce LooseRoPE, a saliency-guided modulation of rotational positional encoding (RoPE) that loosens the positional constraints to continuously control the attention field of view. By relaxing RoPE in this manner, our method smoothly steers the model's focus between faithful preservation of the input image and coherent harmonization of the inserted object, enabling a balanced trade-off between identity retention and contextual blending. Our approach provides a flexible and intuitive framework for image editing, achieving seamless compositional results without textual descriptions or complex user input.
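
The abstract does not give the formula for "loosening" RoPE, so the following is only an illustrative sketch and not the authors' implementation. It assumes that loosening can be modeled as multiplying the rotation angle by a factor `scale` in [0, 1], and it uses a 1D position index and a single global factor instead of a per-region saliency map. The names `rope_rotate`, `attention_weights`, and `scale` are hypothetical. With identical content in every token, standard RoPE (`scale=1.0`) concentrates attention on nearby positions, while a smaller `scale` spreads it out, which is one way to picture a wider "attention field of view".

```python
import cmath
import math


def rope_rotate(vec, position, scale=1.0, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (1D RoPE).

    `scale` is a hypothetical loosening factor (an assumption of this sketch):
    1.0 gives standard RoPE, 0.0 removes all positional information.
    """
    dim = len(vec)
    rotated = []
    for i in range(dim // 2):
        freq = base ** (-2.0 * i / dim)      # standard RoPE frequency schedule
        angle = scale * position * freq      # loosened rotation angle
        z = complex(vec[2 * i], vec[2 * i + 1]) * cmath.exp(1j * angle)
        rotated.extend([z.real, z.imag])
    return rotated


def attention_weights(query, keys, q_pos, k_positions, scale=1.0):
    """Softmax attention of one query over `keys` after applying (loosened) RoPE."""
    q = rope_rotate(query, q_pos, scale)
    logits = []
    for key, pos in zip(keys, k_positions):
        k = rope_rotate(key, pos, scale)
        logits.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(len(query)))
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Identical content everywhere, so only position can change the weights.
content = [1.0, 0.0, 1.0, 0.0]
keys = [content] * 4
positions = [0, 1, 2, 3]

tight = attention_weights(content, keys, 0, positions, scale=1.0)  # standard RoPE
loose = attention_weights(content, keys, 0, positions, scale=0.0)  # fully loosened
print("standard RoPE:", [round(w, 3) for w in tight])
print("fully loosened:", [round(w, 3) for w in loose])
```

In this toy setting, intermediate values of `scale` interpolate between a position-focused and a position-agnostic attention pattern. A saliency-guided variant, as the abstract describes, would presumably choose the amount of loosening per region rather than globally.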