arxiv_cs_lg February 10, 2026

Think2SQL: Reinforce LLM Reasoning Capabilities for Text2SQL

Translated: 2026/3/15 9:04:48

think2sql, text2sql, llm, reinforcement-learning, rlvr

Original Content

arXiv:2504.15077v3 Announce Type: replace

Abstract: While Large Language Models (LLMs) have advanced the state-of-the-art in Text-to-SQL, robust reasoning in complex, multi-table environments remains a bottleneck for parameter-efficient models. This paper presents a systematic empirical study on injecting reasoning capabilities into Text-to-SQL through the lens of Reinforcement Learning with Verifiable Rewards (RLVR). We uncover a critical interplay between reward density, advantage scaling, and model capacity. Our analysis yields four primary insights. First, we propose a novel execution-guided dense reward function that significantly outperforms binary signals and existing state-of-the-art rewards by providing granular feedback at the instance level. Second, we analyze the mechanics of advantage calculation, demonstrating that while large models thrive on sparse signals with aggressive advantage scaling, smaller models require dense rewards and conservative scaling to improve Text-to-SQL performance. Third, we evaluate the impact of cold start, showing that distillation does not always improve RLVR performance and that supervised fine-tuned models are prone to distributional mimicry. Fourth, we map the Pareto frontier of training efficiency, providing insights for optimizing Text-to-SQL reasoning under computational constraints. Our findings culminate in the Think2SQL family: our 4B-parameter model demonstrates reasoning capabilities competitive with state-of-the-art models such as o3. We release our models, datasets, and code to create a blueprint for RLVR optimization in Text-to-SQL at https://anonymous.4open.science/r/Think2SQL-3B7F.
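The abstract contrasts binary and dense execution-based rewards, and different ways of scaling advantages, without defining either. The following is a minimal sketch of that contrast and not the paper's method. It rests on several assumptions: the dense reward is taken to be row-level F1 between the predicted and gold result sets, "advantage scaling" is taken to mean dividing mean-centered rewards by the group standard deviation (as in GRPO-style group normalization), and the `singer` table, the candidate queries, and all function names are hypothetical.

```python
import sqlite3
from collections import Counter
from statistics import mean, pstdev


def run_query(conn, sql):
    """Execute sql and return its rows as a multiset, or None if it fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def binary_reward(pred_rows, gold_rows):
    """Sparse signal: 1.0 only when the result multisets match exactly."""
    return 1.0 if pred_rows is not None and pred_rows == gold_rows else 0.0


def dense_reward(pred_rows, gold_rows):
    """Assumed dense signal: row-level F1 between result multisets."""
    if pred_rows is None:
        return 0.0  # the query did not execute
    if not pred_rows and not gold_rows:
        return 1.0  # both results are empty
    overlap = sum((pred_rows & gold_rows).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_rows.values())
    recall = overlap / sum(gold_rows.values())
    return 2 * precision * recall / (precision + recall)


def group_advantages(rewards, scale_by_std=True, eps=1e-6):
    """Center rewards on the group mean; optionally divide by the group std."""
    mu = mean(rewards)
    if not scale_by_std:
        return [r - mu for r in rewards]
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical toy database and a group of sampled candidate queries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ana', 31), (2, 'Bo', 45), (3, 'Cy', 52);
""")

gold = run_query(conn, "SELECT name FROM singer WHERE age > 40")
candidates = [
    "SELECT name FROM singer WHERE age > 40",  # exact match
    "SELECT name FROM singer WHERE age > 30",  # returns one extra row
    "SELECT name FROM singer WHERE age > 50",  # misses one row
    "SELECT nme FROM singer",                  # fails to execute
]
preds = [run_query(conn, q) for q in candidates]
binary = [binary_reward(p, gold) for p in preds]
dense = [dense_reward(p, gold) for p in preds]

print("binary:", binary)
print("dense: ", [round(r, 3) for r in dense])
print("scaled advantages (binary):  ", group_advantages(binary))
print("unscaled advantages (dense): ", group_advantages(dense, scale_by_std=False))
```

In this toy group the binary reward cannot distinguish the two near-miss queries from the one that fails to execute, while the F1-based reward ranks all four. Dividing by the group standard deviation also amplifies the single positive binary signal. This only illustrates the trade-off the abstract describes between sparse rewards with aggressive scaling and dense rewards with conservative scaling.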