arxiv_cs_ai 2026/2/10

Learning to Self-Verify Makes Language Models Better Reasoners

Translated: 2026/3/7 12:37:46

language-models, self-verification, reinforcement-learning, generative-capacity

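Japanese Translation (rendered in English)

Recent large language models (LLMs) are strong at generating promising reasoning paths for complex tasks. However, LLMs remain weak at verifying their own answers, so there is a persistent capability asymmetry between generation and self-verification. The authors investigate this asymmetry in depth over the course of training and show that, even on the same task, improving generation did not improve self-verification. More interestingly, the reverse direction behaves differently: learning to self-verify effectively improves generation, reaching accuracy comparable to standard generation training while producing more efficient and effective reasoning traces. Building on this observation, they formulate a multi-task reinforcement learning framework that integrates self-verification into generation training and optimizes generation and self-verification as two independent but complementary objectives. Extensive experiments across benchmarks and models show gains over generation-only training in both capabilities.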

Original Content

arXiv:2602.07594v1 Announce Type: cross

Abstract: Recent large language models (LLMs) achieve strong performance in generating promising reasoning paths for complex tasks. However, despite powerful generation ability, LLMs remain weak at verifying their own answers, revealing a persistent capability asymmetry between generation and self-verification. In this work, we conduct an in-depth investigation of this asymmetry throughout training evolution and show that, even on the same task, improving generation does not lead to corresponding improvements in self-verification. Interestingly, we find that the reverse direction of this asymmetry behaves differently: learning to self-verify can effectively improve generation performance, achieving accuracy comparable to standard generation training while yielding more efficient and effective reasoning traces. Building on this observation, we further explore integrating self-verification into generation training by formulating a multi-task reinforcement learning framework, where generation and self-verification are optimized as two independent but complementary objectives. Extensive experiments across benchmarks and models demonstrate performance gains over generation-only training in both generation and verification capabilities.
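The abstract does not spell out how the two objectives are scored, so the following is only a minimal sketch of one way the idea could look, not the authors' implementation. It assumes binary rewards, an exact-match answer checker, and a hypothetical `Sample` record; the names `generation_reward`, `verification_reward`, and `multitask_rewards` are illustrative and do not come from the paper. The generation reward depends on whether the answer is right, the verification reward on whether the model's verdict about its own answer is right, and the two are reported separately rather than merged into one score.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    """One problem with the model's answer and its self-judgment (hypothetical fields)."""
    reference: str       # ground-truth answer
    model_answer: str    # answer produced by the generation task
    model_verdict: bool  # model's own claim that its answer is correct


def is_correct(sample: Sample) -> bool:
    # Assumption: exact match after trimming whitespace stands in for a real answer checker.
    return sample.model_answer.strip() == sample.reference.strip()


def generation_reward(sample: Sample) -> float:
    # Generation objective: 1.0 if the produced answer is correct, else 0.0.
    return 1.0 if is_correct(sample) else 0.0


def verification_reward(sample: Sample) -> float:
    # Self-verification objective: 1.0 if the verdict agrees with actual correctness.
    return 1.0 if sample.model_verdict == is_correct(sample) else 0.0


def multitask_rewards(batch: List[Sample]) -> Dict[str, float]:
    # Mean reward per task, kept as two separate values (independent objectives).
    if not batch:
        raise ValueError("batch must not be empty")
    n = len(batch)
    return {
        "generation": sum(generation_reward(s) for s in batch) / n,
        "verification": sum(verification_reward(s) for s in batch) / n,
    }


if __name__ == "__main__":
    batch = [
        Sample("42", "42", True),     # correct answer, correctly accepted
        Sample("7", "8", True),       # wrong answer, wrongly accepted
        Sample("3", "3", False),      # correct answer, wrongly rejected
        Sample("10", " 10 ", True),   # correct after trimming, correctly accepted
    ]
    print(multitask_rewards(batch))
```

In this toy batch the generation score is higher than the verification score, which is the kind of gap the abstract calls a capability asymmetry. Keeping the two rewards apart lets a trainer weight or schedule them independently; how the paper actually does that is not stated in the abstract.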