arxiv_cs_lg February 10, 2026

Toward Efficient Exploration by Large Language Model Agents

Translated: 2026/3/15 9:04:56

reinforcement-learning, large-language-models, artificial-intelligence, machine-learning, data-efficient-rl

Original Content

arXiv:2504.20997v2 Announce Type: replace

Abstract: A burgeoning area within reinforcement learning (RL) is the design of sequential decision-making agents centered around large language models (LLMs). While autonomous decision-making agents powered by modern LLMs could facilitate numerous real-world applications, such successes demand agents that are capable of data-efficient RL. One key obstacle to achieving data efficiency in RL is exploration, a challenge that we demonstrate many recent proposals for LLM agent designs struggle to contend with. Meanwhile, classic algorithms from the RL literature known to gracefully address exploration require technical machinery that can be challenging to operationalize in purely natural language settings. In this work, rather than relying on finetuning or in-context learning to coax LLMs into implicitly imitating an RL algorithm, we illustrate how LLMs can be used to explicitly implement an existing RL algorithm (Posterior Sampling for Reinforcement Learning) whose capacity for statistically efficient exploration is already well-studied. We offer empirical results demonstrating how our LLM-based implementation of a known, data-efficient RL algorithm can be considerably more effective in natural language tasks that demand prudent exploration.
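
For readers unfamiliar with the algorithm named in the abstract, the following is a minimal sketch of classical tabular Posterior Sampling for Reinforcement Learning (PSRL). It is not the paper's LLM-based implementation, which the abstract does not describe in detail. Each episode samples one plausible MDP from the posterior, plans optimally for that sample, acts on the plan, and updates the posterior with what was observed. The toy chain environment, the function names (`make_chain`, `plan`, `psrl`), the Dirichlet(1, ..., 1) transition prior, and the Beta(1, 1) prior on Bernoulli rewards are all assumptions chosen for illustration.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def plan(P, R, horizon):
    """Finite-horizon backward induction; returns policy[h][s] = greedy action."""
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states  # value after the final step
    policy = []
    for _ in range(horizon):
        Q = [[R[s][a] + sum(p * v for p, v in zip(P[s][a], V))
              for a in range(n_actions)] for s in range(n_states)]
        policy.append([max(range(n_actions), key=lambda a: Q[s][a])
                       for s in range(n_states)])
        V = [max(q) for q in Q]
    policy.reverse()  # built from the last step backward, so flip to h = 0..H-1
    return policy


def make_chain(n=4):
    """Hypothetical toy chain: action 0 steps left, action 1 tries to step right."""
    P = [[[0.0] * n for _ in range(2)] for _ in range(n)]
    R = [[0.0, 0.0] for _ in range(n)]  # mean of a Bernoulli reward
    for s in range(n):
        P[s][0][max(s - 1, 0)] = 1.0
        P[s][1][min(s + 1, n - 1)] += 0.8
        P[s][1][s] += 0.2
    R[0][0] = 0.1      # small, easy reward at the left end
    R[n - 1][1] = 1.0  # large reward only at the far right end
    return P, R


def psrl(true_P, true_R, horizon, episodes, seed=0):
    """Run tabular PSRL against a simulated MDP; returns the per-episode returns."""
    rng = random.Random(seed)
    n_states, n_actions = len(true_R), len(true_R[0])
    # Assumed priors: Dirichlet(1, ..., 1) on transitions, Beta(1, 1) on rewards.
    trans = [[[1.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    rew = [[[1.0, 1.0] for _ in range(n_actions)] for _ in range(n_states)]
    returns = []
    for _ in range(episodes):
        # 1) Sample one MDP from the current posterior.
        P = [[sample_dirichlet(trans[s][a], rng) for a in range(n_actions)]
             for s in range(n_states)]
        R = [[rng.betavariate(rew[s][a][0], rew[s][a][1]) for a in range(n_actions)]
             for s in range(n_states)]
        # 2) Plan as if the sampled MDP were the true one.
        policy = plan(P, R, horizon)
        # 3) Act in the real environment and update the posterior counts.
        s, total = 0, 0
        for h in range(horizon):
            a = policy[h][s]
            r = 1 if rng.random() < true_R[s][a] else 0
            s_next = rng.choices(range(n_states), weights=true_P[s][a])[0]
            trans[s][a][s_next] += 1.0
            rew[s][a][0 if r else 1] += 1.0
            total += r
            s = s_next
        returns.append(total)
    return returns


if __name__ == "__main__":
    P, R = make_chain(4)
    print(psrl(P, R, horizon=8, episodes=50, seed=0))
```

Exploration in this sketch comes entirely from the randomness of the posterior sample: while data are scarce, the sampled MDPs vary widely and so do the resulting plans, and as counts accumulate the samples concentrate around the observed dynamics. The abstract's point is that the ingredients used here (an explicit posterior, sampling from it, and planning) are what is hard to operationalize in purely natural language settings.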