arxiv_cs_cv February 10, 2026

SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens

Translated: 2026/3/15 18:04:22

diffusion-models, image-generation, multimodal-ai, image-editing, transformer-architecture

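Translation (from Japanese)

arXiv:2602.07564v1 (Announce Type: new)

Abstract: Recent unified models such as Bagel have shown that paired image-editing data can effectively align multiple visual tasks within a single diffusion transformer. These models, however, are limited to single-condition inputs and lack the flexibility to synthesize results from multiple heterogeneous sources. This work proposes SIGMA (Selective-Interleaved Generation with Multi-Attribute Tokens), a unified post-training framework that enables interleaved multi-condition generation within diffusion transformers. SIGMA introduces selective multi-attribute tokens for style, content, subject, and identity, so that the model can interpret and compose diverse visual conditions within an interleaved text-image sequence. Post-trained on the Bagel unified backbone with 700K interleaved examples, SIGMA supports compositional editing, selective attribute transfer, and fine-grained multimodal alignment. Extensive experiments show that SIGMA improves controllability, cross-condition consistency, and visual quality across diverse editing and generation tasks, with notable gains over Bagel on compositional tasks.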

Original Content

arXiv:2602.07564v1 (Announce Type: new)

Abstract: Recent unified models such as Bagel demonstrate that paired image-edit data can effectively align multiple visual tasks within a single diffusion transformer. However, these models remain limited to single-condition inputs and lack the flexibility needed to synthesize results from multiple heterogeneous sources. We present SIGMA (Selective-Interleaved Generation with Multi-Attribute Tokens), a unified post-training framework that enables interleaved multi-condition generation within diffusion transformers. SIGMA introduces selective multi-attribute tokens, including style, content, subject, and identity tokens, which allow the model to interpret and compose multiple visual conditions in an interleaved text-image sequence. Through post-training on the Bagel unified backbone with 700K interleaved examples, SIGMA supports compositional editing, selective attribute transfer, and fine-grained multimodal alignment. Extensive experiments show that SIGMA improves controllability, cross-condition consistency, and visual quality across diverse editing and generation tasks, with substantial gains over Bagel on compositional tasks.
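The abstract does not say how the attribute tokens are represented or placed in the sequence, so the following is only a minimal sketch of one way an interleaved text-image prompt with per-image attribute markers might be assembled. Everything in it is an assumption made for illustration: the `<style>`-style token strings, the `[IMG:...]` placeholder, and the names `TextSegment`, `ImageSegment`, `attribute_token`, and `build_interleaved_sequence` are hypothetical and are not taken from SIGMA or Bagel. Only the four attribute names come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Union

# The four attributes named in the abstract. Their string form is an assumption.
ATTRIBUTES = ("style", "content", "subject", "identity")


@dataclass(frozen=True)
class TextSegment:
    """A span of instruction text in the interleaved prompt."""
    text: str


@dataclass(frozen=True)
class ImageSegment:
    """A conditioning image plus the single attribute to take from it."""
    image_ref: str  # hypothetical stand-in for an image path or id
    attribute: str  # one of ATTRIBUTES


Segment = Union[TextSegment, ImageSegment]


def attribute_token(attribute: str) -> str:
    """Return a hypothetical marker token such as '<style>'."""
    if attribute not in ATTRIBUTES:
        raise ValueError(f"unknown attribute: {attribute!r}")
    return f"<{attribute}>"


def build_interleaved_sequence(segments: List[Segment]) -> List[str]:
    """Flatten segments in order, placing an attribute marker before each image."""
    sequence: List[str] = []
    for segment in segments:
        if isinstance(segment, TextSegment):
            sequence.append(segment.text)
        else:
            sequence.append(attribute_token(segment.attribute))
            sequence.append(f"[IMG:{segment.image_ref}]")
    return sequence


if __name__ == "__main__":
    prompt = [
        TextSegment("Place"),
        ImageSegment("dog.png", "subject"),
        TextSegment("in a scene painted like"),
        ImageSegment("monet.png", "style"),
    ]
    print(build_interleaved_sequence(prompt))
```

In this sketch each image carries exactly one attribute, which is one reading of "selective": the marker states which property of the reference image should influence the output. A real system would map the markers and images to embeddings rather than strings.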