arxiv_cs_lg February 10, 2026

Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies for Execution-Grounded Code Generation

Translated: 2026/3/15 14:10:24

test-time-training ve-generative-ai reinforcement-learning gpu-kernel-optimization llm-inference

Original Content

arXiv:2602.07670v1 Announce Type: new
Abstract: Test-time training (TTT) adapts language models through gradient-based updates at inference. But is adaptation the right strategy? We study compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks: domains such as GPU kernel optimization, where a deterministic evaluator provides dense, continuous reward signals. Using KernelBench as our testbed and a 120B-parameter model (GPT-OSS-120B with LoRA adaptation), we find that search outperforms minimal adaptation (1-5 gradient steps). Best-of-N sampling achieves 90% task success (18/20 tasks) at K=64 across the full KernelBench L1 eval set, while TTT's best checkpoint reaches only 30.6% (3-seed mean). TTT's "equivalent K" falls below 1, meaning it is worse than single-sample inference. The failure mode is over-sharpening: gradient updates collapse diversity toward mediocre solutions rather than discovering optimal ones. Our main contribution is surprisal-guided selection: selecting the highest-surprisal (lowest-confidence) correct sample yields 80% success vs. 50% for most-confident selection, an improvement of 30 percentage points. Extending to surprisal-guided-top3 reaches 100%, matching oracle performance. This zero-cost strategy is validated through length-controlled analysis. For dense-reward VEG tasks, compute should be allocated to sample diversity and intelligent selection rather than gradient adaptation. The surprisal-guided selection principle may generalize to other execution-grounded domains where optimal solutions occupy the distribution tail.
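
The selection rule described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Sample` type, its field names, and the `length_normalize` switch are hypothetical, and the abstract does not say whether surprisal is summed over tokens or averaged per token, nor how the top-3 candidates are used afterwards. Here surprisal is taken to be the negative sum of per-token log-probabilities, with an optional per-token average as one possible way to control for length.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    """One generated candidate (hypothetical container, not from the paper)."""
    code: str
    token_logprobs: Sequence[float]  # per-token log-probabilities from the sampler
    correct: bool                    # True if the evaluator's correctness check passed


def surprisal(sample: Sample, length_normalize: bool = False) -> float:
    """Negative log-likelihood of the sample; higher means lower model confidence.

    Assumption: with length_normalize=True the value is averaged per token so
    that longer samples are not favored merely for being long.
    """
    total = -sum(sample.token_logprobs)
    if length_normalize:
        return total / max(len(sample.token_logprobs), 1)
    return total


def select_by_surprisal(
    samples: Sequence[Sample],
    top_k: int = 1,
    length_normalize: bool = False,
) -> List[Sample]:
    """Return the top_k correct samples, ordered from highest to lowest surprisal."""
    correct = [s for s in samples if s.correct]
    ranked = sorted(
        correct,
        key=lambda s: surprisal(s, length_normalize),
        reverse=True,
    )
    return ranked[:top_k]


def select_most_confident(samples: Sequence[Sample]) -> List[Sample]:
    """Baseline from the abstract: the lowest-surprisal correct sample."""
    correct = [s for s in samples if s.correct]
    return sorted(correct, key=surprisal)[:1]
```

Calling `select_by_surprisal(samples, top_k=3)` gives the candidates for a top-3 variant. Incorrect samples are filtered out before ranking, so the rule never prefers a high-surprisal sample that fails the evaluator.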