arxiv_cs_ai April 24, 2026

Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

Translated: 2026/4/24 20:26:11

test-time-reinforcement-learning, spurious-correlation, large-language-models, mathematical-reasoning, reinforcement-learning-baseline

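Japanese Translation (rendered in English)

arXiv:2604.21327v1 Announce Type: cross Abstract: Because test-time reinforcement learning (TTRL) adapts models at inference time through pseudo-labeling, it is vulnerable to spurious signals arising from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguous region and are the main source of reward noise. Importantly, we find that such spurious signals can be further amplified through group-relative advantage estimation. Based on these findings, we propose Debiased and Denoised test-time Reinforcement Learning (DDRL), a unified framework for reducing spurious signals. Specifically, DDRL first applies a frequency-based sampling strategy that excludes ambiguous samples while keeping positive and negative examples balanced. It then adopts debiased advantage estimation with fixed advantages to remove the bias introduced by group-relative policy optimization. Finally, it incorporates a consensus-based off-policy refinement stage that uses the rejection-sampled dataset to achieve efficient and stable model updates. In experiments with three large language models across multiple mathematical reasoning benchmarks, DDRL consistently outperforms existing TTRL baselines. The code will be released soon at https://github.com/yuyongcan/DDRL.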

Original Content

arXiv:2604.21327v1 Announce Type: cross Abstract: Test-time reinforcement learning (TTRL) adapts models at inference time via pseudo-labeling, leaving it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can even be amplified through group-relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test-time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency-based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group-relative policy optimization. Finally, DDRL incorporates a consensus-based off-policy refinement stage, which leverages the rejection-sampled dataset to enable efficient and stable model updates. Experiments on three large language models across multiple mathematical reasoning benchmarks demonstrate that DDRL consistently outperforms existing TTRL baselines. The code will soon be released at https://github.com/yuyongcan/DDRL.
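
The abstract does not give formulas, so the following is only an illustrative sketch of the ideas it names, not the authors' implementation. It assumes pseudo-labels come from a majority vote over sampled answers, that "consistency" means the vote share of the top answer, and that group-relative advantages are rewards standardized within the group using the population standard deviation. The function names, the `low`/`high` ambiguity thresholds, and the `+1`/`-1` fixed advantage values are all hypothetical choices made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote(answers):
    """Return (pseudo_label, consistency); consistency is the top answer's vote share."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def pseudo_rewards(answers, label):
    """Reward 1.0 for answers that match the pseudo-label, 0.0 otherwise."""
    return [1.0 if a == label else 0.0 for a in answers]


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within the group (assumed GRPO-style normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def fixed_advantages(rewards, pos=1.0, neg=-1.0):
    """Assign a constant advantage per class (pos/neg values are hypothetical)."""
    return [pos if r > 0 else neg for r in rewards]


def is_ambiguous(consistency, low=0.3, high=0.7):
    """Flag medium-consistency prompts; the thresholds here are illustrative only."""
    return low <= consistency <= high
```

Under these assumptions, a group of eight responses in which only two agree on the pseudo-label yields a consistency of 0.25. Standardization then gives the two "positive" responses an advantage of about 1.73 and the six others about -0.58, so the less reliable the pseudo-label, the larger the push toward it. The fixed scheme assigns +1 and -1 regardless of vote share, which is one way to read the abstract's claim that fixed advantages remove the bias from group-relative estimation.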