arxiv_cs_ai April 24, 2026

ReProbe: Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models

Translated: 2026/4/24 20:30:17

llm, test-time-scaling, reasoning, internal-states, reinforcement-learning

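Translation (rendered from Japanese)

arXiv:2511.06209v5 Announce Type: replace **Abstract** LLMs can solve complex tasks by generating long, multi-step reasoning chains. Test-time scaling (TTS) improves performance further by sampling diverse variants of intermediate reasoning steps, verifying their correctness, and selecting the best steps to continue from. However, existing verification approaches, such as Process Reward Models (PRMs), are computationally expensive and require large-scale human or model-generated annotations. This work proposes a lightweight alternative for step-level reasoning verification based on probing the internal states of LLMs. A transformer-based probe is trained on the internal states of a frozen LLM to estimate the credibility of its reasoning steps during generation. Annotations can be provided either by a larger LLM (e.g., DeepSeek-R1) or in a self-supervised manner by the original model itself. The probes are lightweight, with fewer than 10 million parameters, and across multiple domains, including mathematics, planning, and general knowledge question answering, they match or exceed the performance of PRMs up to 810x larger. These results suggest that LLM internal states encode confidence in the reasoning process and can serve as reliable signals for step verification, offering a path toward scalable, generalizable TTS and more introspective LLMs.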

Original Content

arXiv:2511.06209v5 Announce Type: replace Abstract: LLMs can solve complex tasks by generating long, multi-step reasoning chains. Test-time scaling (TTS) can further improve performance by sampling multiple variants of intermediate reasoning steps, verifying their correctness, and selecting the best steps for continuation. However, existing verification approaches, such as Process Reward Models (PRMs), are computationally expensive and require large-scale human or model-generated annotations. We propose a lightweight alternative for step-level reasoning verification based on probing the internal states of LLMs. We train a transformer-based probe that uses the internal states of a frozen LLM to estimate the credibility of its reasoning steps during generation. Annotation can be provided either by a larger LLM (e.g., DeepSeek-R1) or in a self-supervised manner by the original model itself. The probes are lightweight, containing fewer than 10M parameters. Across multiple domains, including mathematics, planning, and general knowledge question answering, our probes match or exceed the performance of PRMs that are up to 810x larger. These results suggest that LLM internal states encode confidence in their reasoning processes and can serve as reliable signals for step verification, offering a promising path toward scalable, generalizable TTS and more introspective LLMs.
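
The sample, verify, select loop that the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function `guided_reasoning`, the `[DONE]` stop marker, and the toy generator and scorer are all hypothetical names introduced here. In the described method the scorer would be a small probe reading the frozen LLM's internal states; in this sketch it is just a callable that returns a credibility score.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces (assumptions, not from the paper):
# a generator proposes n candidate next steps for a partial chain,
# and a scorer returns a credibility estimate for one candidate step.
StepGenerator = Callable[[Sequence[str], int], List[str]]
StepScorer = Callable[[Sequence[str], str], float]


def guided_reasoning(
    generate: StepGenerator,
    score: StepScorer,
    num_candidates: int = 4,
    max_steps: int = 8,
    stop_marker: str = "[DONE]",
) -> List[str]:
    """Build a reasoning chain by keeping the highest-scoring step each round."""
    chain: List[str] = []
    for _ in range(max_steps):
        # 1. Sample several variants of the next intermediate step.
        candidates = generate(chain, num_candidates)
        if not candidates:
            break
        # 2. Verify each candidate with the scorer and 3. keep the best one.
        best = max(candidates, key=lambda step: score(chain, step))
        chain.append(best)
        if stop_marker in best:
            break
    return chain


def toy_generate(chain: Sequence[str], n: int) -> List[str]:
    """Stand-in for LLM sampling: alternates 'bad' and 'good' candidates."""
    k = len(chain)
    suffix = " [DONE]" if k == 2 else ""
    pool = [f"bad step {k}{suffix}", f"good step {k}{suffix}"]
    return [pool[i % 2] for i in range(n)]


def toy_score(chain: Sequence[str], step: str) -> float:
    """Stand-in for a probe: prefers steps labelled 'good'."""
    return 0.9 if step.startswith("good") else 0.1


if __name__ == "__main__":
    print(guided_reasoning(toy_generate, toy_score))
    # prints ['good step 0', 'good step 1', 'good step 2 [DONE]']
```

Keeping the scorer as a plain callable means the same loop would work with a PRM, a probe, or any other step verifier, so the selection logic stays unchanged when the verifier is swapped.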