arxiv_cs_lg April 24, 2026

R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling

reinforcement-learning, function-calling, llm, chain-of-thought, grpo

Original Content

arXiv:2604.20316v1
Announce Type: new

Abstract: Function calling empowers large language models (LLMs) to interface with external tools, yet existing RL-based approaches suffer from misalignment between reasoning processes and tool-call decisions. We propose R2IF, a reasoning-aware RL framework for interpretable function calling, adopting a composite reward integrating format/correctness constraints, Chain-of-Thought Effectiveness Reward (CER), and Specification-Modification-Value (SMV) reward, optimized via GRPO. Experiments on BFCL/ACEBench show R2IF outperforms baselines by up to 34.62% (Llama3.2-3B on BFCL) with positive Average CoT Effectiveness (0.05 for Llama3.2-3B), enhancing both function-calling accuracy and interpretability for reliable tool-augmented LLM deployment.
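
The abstract names the reward components but does not say how they are combined, so the following is only a minimal sketch under stated assumptions, not the paper's method. It assumes the format constraint acts as a gate, correctness contributes a 0/1 base score, and the CER and SMV terms are added with hypothetical weights `w_cer` and `w_smv`. The function names and all numeric values are illustrative. The second function shows the group-relative advantage normalization that GRPO is generally known for: each sampled response's reward is standardized against the other responses for the same prompt. Using the population standard deviation here is a choice made for the sketch.

```python
from statistics import mean, pstdev


def composite_reward(format_ok, correct, cer, smv, w_cer=0.5, w_smv=0.5):
    """Hypothetical composite reward for one sampled response.

    ASSUMPTION: the gating and the weighted sum are illustrative only;
    the abstract does not specify how the components are combined.
    """
    if not format_ok:
        # Assumed: a malformed response earns nothing, whatever its content.
        return 0.0
    base = 1.0 if correct else 0.0
    return base + w_cer * cer + w_smv * smv


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative values only: four sampled responses to a single prompt.
group = [
    composite_reward(True, True, cer=0.2, smv=0.1),
    composite_reward(True, True, cer=-0.1, smv=0.0),
    composite_reward(True, False, cer=0.0, smv=0.0),
    composite_reward(False, True, cer=0.3, smv=0.3),
]
advantages = group_relative_advantages(group)
```

In this sketch, two responses with the same correct tool call can still receive different advantages when their reasoning-related terms differ, which is one way a reward could tie the reasoning trace to the final decision.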