arxiv_cs_lg 2026/2/10

Reinforcement Inference: Leveraging Uncertainty for Self-Correcting Language Model Reasoning

Translated: 2026/3/15 8:08:37

reinforcement-inference large-language-models uncertainty-aware inference-time-control mmlu-pro

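Japanese Translation (rendered in English)

arXiv:2602.08520v1 Announce Type: cross Abstract: Modern large language models (LLMs) are commonly evaluated and deployed under a one-shot, greedy inference protocol, particularly in professional settings that require deterministic behavior. This regime can systematically underestimate the true capability of a fixed model: many errors stem not from missing knowledge but from committing too early under internal ambiguity. We introduce Reinforcement Inference, an entropy-aware inference-time control strategy that uses the model's own uncertainty to selectively invoke a second, more deliberate reasoning attempt, which yields stronger performance without any retraining. On 12,032 MMLU-Pro questions spanning 14 subjects, with DeepSeek-v3.2 in a zero-shot setting under deterministic decoding, Reinforcement Inference raises accuracy from 60.72% to 84.03% at the cost of only 61.06% additional inference calls. A 100% re-asking ablation reaches 84.35%, indicating that uncertainty-aware selection captures most of the attainable improvement while using substantially less compute. In addition, a prompt-only ablation falls below the baseline, which suggests that the gains do not come merely from generic prompting such as "your output had high entropy, think step by step". Our results not only provide a practical inference-time upgrade but also point to a broader entropy-aware paradigm for measuring and expanding model capability. Because modern decoder-based models generate output autoregressively, entropy and related confidence measures arise naturally as first-class control signals during generation. The gap between one-shot greedy inference and uncertainty-conditioned deliberation serves as a diagnostic lens on an LLM's latent reasoning limits and motivates future training objectives that explicitly constrain the alignment between correctness and confidence.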
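
The accuracy and compute trade-off reported in the abstract can be made concrete with a little arithmetic. The sketch below uses only the four figures from the abstract. The one assumption, not stated in the source, is that the 100% re-asking ablation adds exactly one extra call per question; the function names are illustrative.

```python
# Figures reported in the abstract, in percent.
BASELINE_ACC = 60.72     # one-shot greedy accuracy
SELECTIVE_ACC = 84.03    # Reinforcement Inference accuracy
FULL_REASK_ACC = 84.35   # 100% re-asking ablation accuracy
SELECTIVE_EXTRA = 61.06  # additional inference calls with selective re-asking

# Assumption (not in the source): re-asking every question adds exactly
# one extra call per question, i.e. 100% additional calls.
FULL_EXTRA = 100.0


def gain_share(baseline: float, selective: float, full: float) -> float:
    """Fraction of the full re-asking accuracy gain kept by selective re-asking."""
    return (selective - baseline) / (full - baseline)


def total_call_ratio(selective_extra: float, full_extra: float) -> float:
    """Total calls under selective re-asking relative to re-asking everything."""
    return (100.0 + selective_extra) / (100.0 + full_extra)


share = gain_share(BASELINE_ACC, SELECTIVE_ACC, FULL_REASK_ACC)
ratio = total_call_ratio(SELECTIVE_EXTRA, FULL_EXTRA)

print(f"share of attainable gain captured: {share:.1%}")
print(f"total calls relative to re-asking everything: {ratio:.1%}")
```

Under that assumption, selective re-asking keeps about 23.31 of the 23.63 attainable accuracy points (roughly 98.6%) while issuing about 80.5% of the total calls that re-asking every question would need.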

Original Content

arXiv:2602.08520v1 Announce Type: cross Abstract: Modern large language models (LLMs) are often evaluated and deployed under a one-shot, greedy inference protocol, especially in professional settings that require deterministic behavior. This regime can systematically under-estimate a fixed model's true capability: many errors arise not from missing knowledge, but from premature commitment under internal ambiguity. We introduce Reinforcement Inference, an entropy-aware inference-time control strategy that uses the model's own uncertainty to selectively invoke a second, more deliberate reasoning attempt, enabling stronger performance without any retraining. On 12,032 MMLU-Pro questions across 14 subjects, using DeepSeek-v3.2 with deterministic decoding in a zero-shot setting, Reinforcement Inference improves accuracy from 60.72% to 84.03%, while only incurring 61.06% additional inference calls. A 100% re-asking ablation reaches 84.35%, indicating that uncertainty-aware selection captures most of the attainable improvement with substantially less compute. Moreover, a prompt-only ablation underperforms the baseline, suggesting that the gains are not explained by generic "your output had high entropy, think step-by-step" prompting alone. Beyond providing a practical inference-time upgrade, our results suggest a broader entropy-aware paradigm for measuring and expanding model capability: because modern decoder-based models generate outputs autoregressively, entropy and related confidence measures arise naturally as first-class control signals during generation. The resulting gap between one-pass greedy inference and uncertainty-conditioned deliberation offers a diagnostic lens on an LLM's latent reasoning horizon and motivates future training objectives that explicitly constrain correctness-confidence alignment.
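
The control flow described in the abstract (answer once, measure uncertainty, and re-ask only when uncertainty is high) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how entropy is computed, what threshold is used, or how the second attempt is prompted. Here entropy is assumed to be taken over a probability distribution across the answer options, and `first_pass`, `second_pass`, `threshold`, and the stub models are all hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

# Hypothetical interfaces; neither is specified in the abstract.
# first_pass:  question -> (greedy answer, probabilities over answer options)
# second_pass: (question, draft answer, entropy) -> revised answer
FirstPass = Callable[[str], Tuple[str, Sequence[float]]]
SecondPass = Callable[[str, str, float], str]


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def answer_with_gate(
    question: str,
    first_pass: FirstPass,
    second_pass: SecondPass,
    threshold: float,
) -> Tuple[str, int]:
    """Return (answer, number of model calls) using an entropy gate."""
    answer, probs = first_pass(question)
    entropy = shannon_entropy(probs)
    if entropy <= threshold:
        # Confident first answer: keep it and spend no extra compute.
        return answer, 1
    # Uncertain first answer: invoke one more deliberate attempt.
    return second_pass(question, answer, entropy), 2


# Stub models for demonstration only; a real system would call an LLM here.
def stub_first_pass(question: str) -> Tuple[str, Sequence[float]]:
    if "easy" in question:
        return "A", [0.97, 0.01, 0.01, 0.01]  # peaked, low entropy
    return "B", [0.25, 0.25, 0.25, 0.25]      # uniform, maximal entropy


def stub_second_pass(question: str, draft: str, entropy: float) -> str:
    return "C"  # stands in for a revised, more deliberate answer


if __name__ == "__main__":
    for q in ("easy question", "hard question"):
        print(q, answer_with_gate(q, stub_first_pass, stub_second_pass, 0.5))
```

With a gate of this kind, the threshold sets the trade-off between extra calls and accuracy: a threshold below every observed entropy would re-ask every question, which corresponds to the 100% re-asking ablation, while a threshold above every observed entropy reduces to one-shot greedy inference.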