arxiv_cs_ai April 24, 2026

RIFT: Repurposing Negative Samples via Reward-Informed Fine-Tuning

Translated: 2026/4/24 20:33:59

llm, fine-tuning, alignment, reinforcement-learning, data-efficiency

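Translated Summary

arXiv:2601.09253v2 Announce Type: replace-cross Abstract: Supervised Fine-Tuning (SFT) and Rejection Sampling Fine-Tuning (RFT) are standard methods for LLM alignment, but both are data-inefficient: they either require costly expert data or discard valuable negative samples. To address this, we propose Reward Informed Fine-Tuning (RIFT), a simple yet effective framework that makes use of all self-generated samples. Unlike RFT's hard thresholding, RIFT repurposes negative trajectories and reweights the loss with scalar rewards, so the model learns from both the positive and the negative trajectories in its own outputs. To avoid the training collapse caused by naive reward integration, where direct multiplication produces an unbounded loss, we introduce a stabilized loss formulation that ensures numerical robustness and optimization efficiency. Extensive experiments on mathematical benchmarks across various base models show that RIFT consistently outperforms RFT. Our results suggest that RIFT is a robust, data-efficient alternative for alignment with self-generated data of mixed quality.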

Original Content

arXiv:2601.09253v2 Announce Type: replace-cross Abstract: While Supervised Fine-Tuning (SFT) and Rejection Sampling Fine-Tuning (RFT) are standard for LLM alignment, they either rely on costly expert data or discard valuable negative samples, leading to data inefficiency. To address this, we propose Reward Informed Fine-Tuning (RIFT), a simple yet effective framework that utilizes all self-generated samples. Unlike the hard thresholding of RFT, RIFT repurposes negative trajectories, reweighting the loss with scalar rewards to learn from both the positive and negative trajectories from the model outputs. To overcome the training collapse caused by naive reward integration, where direct multiplication yields an unbounded loss, we introduce a stabilized loss formulation that ensures numerical robustness and optimization efficiency. Extensive experiments on mathematical benchmarks across various base models show that RIFT consistently outperforms RFT. Our results demonstrate that RIFT is a robust and data-efficient alternative for alignment using mixed-quality, self-generated data.
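The abstract says that multiplying the loss directly by a scalar reward yields an unbounded loss. The Python sketch below illustrates why, under stated assumptions: a trajectory is represented only by its per-token probabilities, rewards are signed scalars, and the function names are hypothetical. The abstract does not specify RIFT's stabilized formulation, so `bounded_loss` is not the paper's loss. It substitutes an unlikelihood-style penalty, -log(1 - p), purely to show what a loss with a floor looks like.

```python
import math

EPS = 1e-12  # keeps log() away from 0 in the penalty term; an illustrative choice


def sequence_nll(token_probs):
    """Negative log-likelihood of one trajectory; probabilities assumed in (0, 1]."""
    return -sum(math.log(p) for p in token_probs)


def naive_loss(token_probs, reward):
    """Naive integration: scale the NLL directly by the scalar reward.

    With reward < 0 this equals -|reward| * NLL, which keeps decreasing
    as the token probabilities shrink, so it has no lower bound.
    """
    return reward * sequence_nll(token_probs)


def bounded_loss(token_probs, reward):
    """Hypothetical bounded alternative (an assumption, NOT the paper's loss)."""
    if reward >= 0:
        # Positive trajectory: ordinary reward-weighted NLL.
        return reward * sequence_nll(token_probs)
    # Negative trajectory: penalise probability mass via -log(1 - p),
    # which is >= 0 and approaches 0 as p -> 0, so the loss has a floor.
    penalty = -sum(math.log(max(1.0 - p, EPS)) for p in token_probs)
    return -reward * penalty


if __name__ == "__main__":
    # Shrinking the probability of a negative trajectory (reward = -1).
    for p in (1e-1, 1e-3, 1e-6):
        print(p, naive_loss([p, p], -1.0), bounded_loss([p, p], -1.0))
```

Running the script shows the naive value growing more negative as the probabilities shrink, while the bounded variant approaches zero from above. For positive rewards the two functions coincide.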