arxiv_cs_lg April 24, 2026

ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control

Translated: 2026/4/24 20:00:57

diffusion-models, reinforcement-learning, multi-objective-optimization, generative-ai, arxiv

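Translated Summary (from the Japanese)

arXiv:2604.20816v1. Announce type: new. Abstract: Reinforcement learning (RL) post-training has become the standard approach for aligning generative models with human preferences, but most methods rely on a single scalar reward. When multiple criteria are in play, the existing practice of "early scalarization" collapses the rewards into a fixed weighted sum at training time. This locks the model into a single trade-off point during training and offers no inference-time control over inherently conflicting goals (for example, prompt adherence versus source fidelity in image editing). We introduce ParetoSlider, a multi-objective reinforcement learning (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. Because the model is trained with continuously varying preference weights as a conditioning signal, users can navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluated ParetoSlider on three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. A single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative goals.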

Original Content

arXiv:2604.20816v1. Announce Type: new. Abstract: Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of "early scalarization" collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals, such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. Our single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative goals.
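
To make the contrast in the abstract concrete, the sketch below sets a fixed weighted sum ("early scalarization") against a setup in which preference weights vary per training sample. It is an illustration under stated assumptions, not the paper's implementation. The abstract does not say how the weights are sampled or how they enter the reward, so the uniform sampling over the simplex, the linear scalarization, the example reward values, and every function name here are hypothetical.

```python
import random

# Hypothetical per-sample rewards: [prompt adherence, source fidelity].
FIXED_WEIGHTS = [0.5, 0.5]  # assumption: one weighting chosen before training


def scalarize(rewards, weights):
    """Collapse a reward vector into one scalar with a weighted sum."""
    if len(rewards) != len(weights):
        raise ValueError("rewards and weights must have the same length")
    return sum(w * r for w, r in zip(weights, rewards))


def early_scalarized_reward(rewards):
    """Early scalarization: every sample is scored with the same fixed weights."""
    return scalarize(rewards, FIXED_WEIGHTS)


def sample_preference(num_objectives, rng):
    """Draw weights uniformly from the probability simplex.

    Normalizing independent Exponential(1) draws yields Dirichlet(1, ..., 1).
    This sampling scheme is an assumption, not taken from the paper.
    """
    draws = [rng.expovariate(1.0) for _ in range(num_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


def preference_conditioned_batch(reward_vectors, rng):
    """Pair each sample with its own weights and the resulting scalar reward.

    In a preference-conditioned setup the same weights would also be fed to
    the model as a conditioning signal; that part is not modeled here.
    """
    batch = []
    for rewards in reward_vectors:
        weights = sample_preference(len(rewards), rng)
        batch.append((weights, scalarize(rewards, weights)))
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    rewards = [[0.8, 0.3], [0.2, 0.9]]  # made-up values for illustration
    print([early_scalarized_reward(r) for r in rewards])
    for weights, reward in preference_conditioned_batch(rewards, rng):
        print(weights, reward)
```

With fixed weights, the training signal only ever reflects one trade-off. With sampled weights, the same reward vector is scored differently depending on the preference drawn, which is what would let a conditioned model learn a family of trade-offs rather than a single point.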