arxiv_cs_lg April 24, 2026

HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

Translated: 2026/4/24 20:03:36

llm, reinforcement-learning, preference-optimization, reasoning, human-alignment

Original Content

arXiv:2604.20140v1 Announce Type: cross

Abstract: Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes for the likelihood of generating preferred over dispreferred responses in their entirety and lacks the granularity to provide feedback on subsections of many-step solutions typical of reasoning tasks. Existing methods excel at either stable preference learning (e.g., DPO variants like KTO and RSO) or structured reasoning (e.g., ReMA's multi-agent RL framework, Tree of Thoughts), but fail to merge these complementary strengths. We propose HiPO (Hierarchical Preference Optimization), an extension of DPO that separates responses into reasoning segments (query clarification and context, reasoning steps, and answer) and computes loss as a weighted sum of the DPO loss for each segment. Our approach enables segment-specific training while maintaining DPO's computational efficiency and training stability. We demonstrate that for multiple 7B LLMs fine-tuned using HiPO and DPO on the Math Stack Exchange preference dataset, the models trained with HiPO outperform the others on a variety of common math benchmarks and achieve greater organization, logical flow, and consistency as measured by GPT-4.1.
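
The weighted-sum loss described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the segment keys (`context`, `reasoning`, `answer`), the weights, the value of `beta`, and the idea that each segment's log-probability is the sum of its token log-probabilities under the policy and a frozen reference model are all hypothetical, since the abstract does not specify them. Only the per-pair DPO term itself, -log sigmoid(beta * margin), is the standard formulation.

```python
import math
from typing import Dict, NamedTuple

# Hypothetical segment names, loosely following the abstract's three parts.
SEGMENTS = ("context", "reasoning", "answer")


class SegmentLogps(NamedTuple):
    """Summed token log-probabilities for one segment (assumed layout)."""
    policy_chosen: float
    policy_rejected: float
    ref_chosen: float
    ref_rejected: float


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def hipo_loss(segments: Dict[str, SegmentLogps],
              weights: Dict[str, float],
              beta: float = 0.1) -> float:
    """Weighted sum of per-segment DPO losses (illustrative sketch only)."""
    return sum(
        weight * dpo_loss(*segments[name], beta=beta)
        for name, weight in weights.items()
    )


if __name__ == "__main__":
    # Made-up log-probabilities, purely for demonstration.
    example = {
        "context": SegmentLogps(-12.0, -13.0, -12.5, -12.5),
        "reasoning": SegmentLogps(-40.0, -48.0, -42.0, -44.0),
        "answer": SegmentLogps(-3.0, -6.0, -4.0, -4.0),
    }
    example_weights = {"context": 0.2, "reasoning": 0.5, "answer": 0.3}
    print(hipo_loss(example, example_weights))
```

With a single segment of weight 1.0 the sketch reduces to ordinary DPO on that segment, which is one way to read the claim that the method extends DPO while keeping its cost profile. How chosen and rejected responses are split into aligned segments is left open here because the abstract does not describe it.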