arxiv_cs_lg April 20, 2026

SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

diffusion-models, self-correction, post-training, reinforcement-learning, alignment

arXiv:2604.12617v2
Announce Type: replace

Abstract: The post-training pipeline for diffusion models currently has two stages: supervised fine-tuning (SFT) on curated data and reinforcement learning (RL) with reward models. A fundamental gap separates them. SFT optimizes the denoiser only on ground-truth states sampled from the forward noising process; once inference deviates from these ideal states, subsequent denoising relies on out-of-distribution generalization rather than learned correction. This is the same exposure bias that afflicts autoregressive models, but accumulated along the denoising trajectory instead of the token sequence. RL can in principle address this mismatch, yet its terminal reward signal is sparse, suffers from credit-assignment difficulty, and risks reward hacking. We propose SOAR (Self-Correction for Optimal Alignment and Refinement), a bias-correction post-training method that fills this gap. Starting from a real sample, SOAR performs a single stop-gradient rollout with the current model, re-noises the resulting off-trajectory state, and supervises the model to steer back toward the original clean target. The method is on-policy, reward-free, and provides dense per-timestep supervision with no credit-assignment problem. On SD3.5-Medium, SOAR improves GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 over SFT, while simultaneously raising all model-based preference scores. In controlled reward-specific experiments, SOAR surpasses Flow-GRPO in final metric value on both aesthetic and text-image alignment tasks, despite having no access to a reward model. Since SOAR's base loss subsumes the standard SFT objective, it can directly replace SFT as a stronger first post-training stage after pretraining, while remaining fully compatible with subsequent RL alignment.
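The abstract describes the training step only in words, so the following is a toy sketch of how one such training pair might be built, not the authors' implementation. It assumes a rectified-flow interpolation `x_t = (1 - t) * x0 + t * noise`, reads the "single rollout" as one jump to the model's own clean estimate, and defines the target as the velocity that carries the off-trajectory state back to the original sample. The names `interpolate`, `soar_style_pair`, `mse`, and the times `t` and `s` are hypothetical; the abstract does not specify the schedule, rollout length, or loss.

```python
import random


def interpolate(x0, noise, t):
    """Assumed forward noising: x_t = (1 - t) * x0 + t * noise."""
    return [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]


def soar_style_pair(x0, model, t, s, rng):
    """Build one (input state, target velocity) pair.

    The rollout output is plain data here, which mirrors a stop-gradient:
    nothing in the returned pair depends differentiably on the model.
    """
    # 1) Noise the real sample to an on-trajectory state at time t.
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = interpolate(x0, noise, t)

    # 2) Single rollout with the current model (assumption: one jump
    #    from x_t to the model's own estimate of the clean sample).
    v = model(x_t, t)
    x0_hat = [x - t * vi for x, vi in zip(x_t, v)]

    # 3) Re-noise the model's estimate to time s. Because x0_hat may
    #    differ from x0, this state is off the ground-truth trajectory.
    fresh = [rng.gauss(0.0, 1.0) for _ in x0]
    x_s = interpolate(x0_hat, fresh, s)

    # 4) Target velocity that steers x_s back to the ORIGINAL sample,
    #    chosen so that x_s - s * v_target == x0 (requires s > 0).
    v_target = [(x - a) / s for x, a in zip(x_s, x0)]
    return x_s, v_target


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    rng = random.Random(0)
    x0 = [1.0, -2.0, 0.5]

    def toy_model(x, t):
        # Stand-in for the current denoiser; deliberately imperfect.
        return [0.3 * xi for xi in x]

    x_s, v_target = soar_style_pair(x0, toy_model, t=0.8, s=0.4, rng=rng)
    # The loss a trainer would minimize at the off-trajectory state.
    print(mse(toy_model(x_s, 0.4), v_target))
```

Under these assumptions, a model whose clean estimate is exact gives `x0_hat == x0`, and the target reduces to `fresh - x0`, the usual flow-matching target. That is one way a loss of this kind could contain the standard SFT objective as a special case.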