arxiv_cs_ai April 24, 2026

Co-Evolving LLM Decision Agents and Skill Bank Agents for Long-Horizon Tasks

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

Translated: 2026/4/24 20:14:45

llm, agent, reinforcement-learning, game-ai, skill-bank

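Translation (English rendering)

arXiv:2604.20987v1 Announce Type: new Abstract: Long-horizon interactive environments are a testbed for evaluating agents' skill-usage abilities. These environments demand multi-step reasoning, the chaining of multiple skills over many timesteps, and robust decision-making under delayed rewards and partial observability. Games are a good testbed for evaluating agents' skill-usage abilities. Large Language Models (LLMs) offer a promising alternative as game-playing agents, but they tend to struggle with consistent long-horizon decision-making because they lack a mechanism to discover, retain, and reuse structured skills across episodes. COSPLAY is a co-evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action taking, while an agent-managed skill pipeline discovers reusable skills from the agent's unlabeled rollouts to form the skill bank. The framework lets the decision agent learn better skill retrieval and action generation, and lets the skill bank agent continually extract, refine, and update skills together with their contracts. In experiments across six game environments, COSPLAY achieved an average reward improvement of over 25.1% against four frontier LLM baselines on single-player game benchmarks, while remaining competitive on multi-player social reasoning games.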

Original Content

arXiv:2604.20987v1 Announce Type: new Abstract: Long-horizon interactive environments are a testbed for evaluating agents' skill usage abilities. These environments demand multi-step reasoning, the chaining of multiple skills over many timesteps, and robust decision-making under delayed rewards and partial observability. Games are a good testbed for evaluating agent skill usage in environments. Large Language Models (LLMs) offer a promising alternative as game-playing agents, but they often struggle with consistent long-horizon decision-making because they lack a mechanism to discover, retain, and reuse structured skills across episodes. We present COSPLAY, a co-evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action taking, while an agent-managed skill pipeline discovers reusable skills from the agent's unlabeled rollouts to form a skill bank. Our framework both trains the decision agent to learn better skill retrieval and action generation and has the skill bank agent continually extract, refine, and update skills together with their contracts. Experiments across six game environments show that COSPLAY with an 8B base model achieves over 25.1 percent average reward improvement against four frontier LLM baselines on single-player game benchmarks while remaining competitive on multi-player social reasoning games.
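
The retrieve-then-update loop described in the abstract can be illustrated with a toy sketch. This is not COSPLAY's implementation: the abstract does not specify how skills, contracts, or retrieval are represented. The names `Skill`, `SkillBank`, `upsert`, and `retrieve`, the precondition/postcondition reading of a "contract", and the word-overlap (Jaccard) scoring that stands in for learned retrieval are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    description: str
    # Hypothetical "contract": when the skill applies and what it should achieve.
    precondition: str = ""
    postcondition: str = ""
    uses: int = 0  # how often this skill has been retrieved


def _tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens; a crude stand-in for a learned encoder."""
    return set(text.lower().split())


@dataclass
class SkillBank:
    skills: dict[str, Skill] = field(default_factory=dict)

    def upsert(self, skill: Skill) -> None:
        """Add a new skill, or refine an existing one with the same name."""
        old = self.skills.get(skill.name)
        if old is not None:
            skill.uses = old.uses  # keep usage statistics across refinements
        self.skills[skill.name] = skill

    def retrieve(self, observation: str, k: int = 1) -> list[Skill]:
        """Return the k skills whose text overlaps most with the observation."""
        obs = _tokens(observation)

        def score(s: Skill) -> float:
            doc = _tokens(s.description + " " + s.precondition)
            union = obs | doc
            return len(obs & doc) / len(union) if union else 0.0

        # Sort by descending score; break ties by name for determinism.
        ranked = sorted(self.skills.values(), key=lambda s: (-score(s), s.name))
        top = ranked[:k]
        for s in top:
            s.uses += 1
        return top
```

In this sketch a decision agent would call `retrieve` with a text observation and place the returned skill descriptions in its prompt, while a separate process would call `upsert` whenever it extracts or refines a skill from past rollouts. A learned retriever and LLM-driven skill extraction would replace the word-overlap scoring and the manual `upsert` calls.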