arxiv_cs_lg April 24, 2026

From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation

Translated: 2026/4/24 20:08:59

reinforcement-learning subject-driven-image-generation grpo diffusion-process arxiv

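Japanese Translation (rendered in English)

arXiv:2510.18263v2 Announce Type: replace Abstract: Subject-driven image generation models face a fundamental trade-off between preserving identity (fidelity) and conforming to the prompt (editability). Online reinforcement learning (RL), and GRPO in particular, offers a promising solution, but we found that applying GRPO naively causes competitive degradation. This is because simple linear reward aggregation with fixed weights produces conflicting gradient signals and is misaligned with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a new framework with two main innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicting reward signals and amplifies synergistic ones, yielding sharper and more decisive gradients; and (ii) Time-Aware Dynamic Weighting (TDW), which aligns optimization pressure with the model's temporal dynamics by prioritizing prompt following in the early stages and identity preservation in the later stages. Large-scale experiments show that our method substantially outperforms the naive GRPO baseline and successfully mitigates competitive degradation. Our model achieves a superior balance, generating images that retain key identity features while accurately matching complex text prompts.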

Original Content

arXiv:2510.18263v2 Announce Type: replace Abstract: Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). While online reinforcement learning (RL), specifically GRPO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation, as the simple linear aggregation of rewards with static weights causes conflicting gradient signals and a misalignment with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicted reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient; and (ii) Time-Aware Dynamic Weighting (TDW), which aligns the optimization pressure with the model's temporal dynamics by prioritizing prompt-following in the early stages and identity preservation in the later stages. Extensive experiments demonstrate that our method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.
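The abstract names its two mechanisms but gives no formulas, so the following is only an illustrative sketch of the ideas, not the paper's method. The within-group standardization in `group_relative_advantages` is the usual GRPO-style advantage; everything else is an assumption made for illustration: the function names `identity_weight` and `synergy_shaped_advantage`, the linear schedule, the `bonus` factor, and the convention that `t` is the fraction of denoising completed (0 at the noisiest step, 1 at the final step).

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def linear_reward(r_id, r_text, w_id=0.5):
    """Naive baseline from the abstract: static-weight linear aggregation."""
    return w_id * r_id + (1.0 - w_id) * r_text


def identity_weight(t, w_early=0.3, w_late=0.7):
    """Hypothetical TDW-style schedule (assumed linear).

    t is the assumed fraction of denoising completed, in [0, 1].
    Early steps favor prompt following (low identity weight);
    late steps favor identity preservation (high identity weight).
    """
    return w_early + (w_late - w_early) * t


def synergy_shaped_advantage(a_id, a_text, w_id, bonus=0.5):
    """Hypothetical SARS-style shaping (assumed form, not the paper's formula).

    Amplifies the combined signal when both advantages share a sign,
    and subtracts a penalty when they disagree.
    """
    base = w_id * a_id + (1.0 - w_id) * a_text
    agreement = a_id * a_text
    if agreement > 0:  # synergy: both rewards move the same way
        return base * (1.0 + bonus)
    if agreement < 0:  # conflict: one reward improves, the other degrades
        return base - bonus * min(abs(a_id), abs(a_text))
    return base


if __name__ == "__main__":
    # Made-up reward scores for one group of four sampled images.
    id_scores = [0.9, 0.8, 0.4, 0.3]
    text_scores = [0.2, 0.7, 0.8, 0.1]
    a_id = group_relative_advantages(id_scores)
    a_text = group_relative_advantages(text_scores)
    for t in (0.1, 0.9):
        w = identity_weight(t)
        shaped = [synergy_shaped_advantage(i, p, w) for i, p in zip(a_id, a_text)]
        print(f"t={t:.1f} w_id={w:.2f} shaped={[round(s, 3) for s in shaped]}")
```

Under these assumptions, a sample that scores above the group average on both rewards gets a larger update than linear aggregation would give it, while a sample that trades one reward for the other is pushed down, which is the qualitative behavior the abstract attributes to SARS.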