arxiv_cs_ai February 10, 2026

InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning

Translated: 2026/3/1 14:34:02

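Japanese Translation (rendered in English)

Large reasoning models achieve strong performance by scaling up chain-of-thought at inference time, but this approach suffers from quadratic cost, context length limits, and reasoning that degrades because of lost-in-the-middle effects. Iterative reasoning eases these problems by periodically summarizing intermediate thoughts, but existing methods rely on supervised learning or fixed heuristics. As a result, they cannot optimize when to summarize, what to preserve, or how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory. It builds on iteration boundaries controlled by the model and on explicit summarization. InftyThink+ uses a two-stage training scheme: a supervised cold start, followed by trajectory-level reinforcement learning that trains the model to make strategic decisions about summarizing and continuing. In experiments on DeepSeek-R1-Distill-Qwen-1.5B, InftyThink+ improves accuracy on AIME24 by 21%, clearly outperforms conventional long chain-of-thought reinforcement learning, and generalizes better to out-of-distribution benchmarks. It also substantially reduces inference latency and speeds up reinforcement learning training, improving reasoning efficiency along with performance.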

Original Content

arXiv:2602.06960v2
Announce Type: replace-cross

Abstract: Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme with supervised cold-start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
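
The iterative loop the abstract describes (reason, summarize, resume from the summary) can be illustrated with a small Python sketch. This is not the authors' implementation: the `Step` interface, `iterative_reason`, `toy_step`, and `quadratic_cost` are hypothetical names introduced here, the model call is replaced by a stand-in function, and the reinforcement learning stage is not shown. The cost function is only a toy proxy for quadratic attention cost that ignores the prompt and summary tokens carried into each iteration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interface (an assumption, not from the paper): given the question
# and the previous summary (None on the first iteration), return
# (reasoning_text, new_summary, final_answer). final_answer stays None while
# the model chooses to continue into another iteration.
Step = Callable[[str, Optional[str]], Tuple[str, str, Optional[str]]]


def iterative_reason(
    question: str, step: Step, max_iters: int = 8
) -> Tuple[Optional[str], List[int]]:
    """Run reasoning in segments, carrying only a summary between segments."""
    summary: Optional[str] = None
    segment_lengths: List[int] = []
    for _ in range(max_iters):
        reasoning, summary, answer = step(question, summary)
        # Whitespace-separated words stand in for tokens in this sketch.
        segment_lengths.append(len(reasoning.split()))
        if answer is not None:
            return answer, segment_lengths
    # Iteration budget exhausted without a final answer.
    return None, segment_lengths


def quadratic_cost(lengths: List[int]) -> int:
    """Toy proxy for self-attention cost: sum of squared segment lengths."""
    return sum(n * n for n in lengths)


def toy_step(question: str, summary: Optional[str]):
    """Stand-in for a model call: finishes on the third iteration."""
    count = 0 if summary is None else int(summary)
    count += 1
    reasoning = "thinking " * 5
    answer = "done" if count == 3 else None
    return reasoning, str(count), answer


if __name__ == "__main__":
    print(iterative_reason("What is 1 + 2?", toy_step))
    # One 8000-token chain versus four 2000-token segments.
    print(quadratic_cost([8000]), quadratic_cost([2000] * 4))
```

Under this proxy, a single 8000-token chain costs 64,000,000 units while four 2000-token segments cost 16,000,000 in total, which shows why bounding each segment reduces the quadratic term. A real system would also pay for the summary and prompt tokens in every iteration, so the actual saving depends on how compact the summaries are.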