arxiv_cs_ai April 24, 2026

AEL: Agents That Evolve and Learn in Open-Ended Environments

AEL: Agent Evolving Learning for Open-Ended Environments

Translated: 2026/4/24 20:28:25

llm-agents, reinforcement-learning, self-improvement, open-ended-environments, memory-retrieval

Translated Abstract

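arXiv:2604.21725v1 Announce Type: cross Abstract: Large language model (LLM) agents increasingly operate in open-ended environments spanning hundreds of sequential episodes, yet they remain largely stateless: every task is solved from scratch, and past experience is not converted into better future behavior. The central obstacle is not what to remember but how to use what has been remembered, including which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change. We introduce Agent Evolving Learning (AEL), a two-timescale framework that addresses this obstacle. At the fast timescale, a Thompson Sampling bandit learns which memory retrieval policy to apply at each episode. At the slow timescale, LLM-driven reflection diagnoses failure patterns and injects causal insights into the agent's decision prompt, giving the agent an interpretive frame for the evidence it retrieves. On a sequential portfolio benchmark (10 sector-diverse tickers, 208 episodes, 5 random seeds), AEL achieves a Sharpe ratio of 2.13 ± 0.47, outperforming five published self-improving methods and all non-LLM baselines while maintaining the lowest variance among the LLM-based approaches. A nine-variant ablation shows a "less is more" pattern: memory and reflection together yield a 58% cumulative improvement over the stateless baseline, yet every additional mechanism tested (planner evolution, per-tool selection, cold-start initialization, skill extraction, and three credit assignment methods) degrades performance. This indicates that the bottleneck in agent self-improvement lies in self-diagnosing how to use experience, not in adding architectural complexity. Code and data: https://github.com/WujiangXu/AEL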

Original Content

arXiv:2604.21725v1 Announce Type: cross Abstract: LLM agents increasingly operate in open-ended environments spanning hundreds of sequential episodes, yet they remain largely stateless: each task is solved from scratch without converting past experience into better future behavior. The central obstacle is not what to remember but how to use what has been remembered, including which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change. We introduce Agent Evolving Learning (AEL), a two-timescale framework that addresses this obstacle. At the fast timescale, a Thompson Sampling bandit learns which memory retrieval policy to apply at each episode; at the slow timescale, LLM-driven reflection diagnoses failure patterns and injects causal insights into the agent's decision prompt, giving it an interpretive frame for the evidence it retrieves. On a sequential portfolio benchmark (10 sector-diverse tickers, 208 episodes, 5 random seeds), AEL achieves a Sharpe ratio of 2.13 ± 0.47, outperforming five published self-improving methods and all non-LLM baselines while maintaining the lowest variance among all LLM-based approaches. A nine-variant ablation reveals a "less is more" pattern: memory and reflection together produce a 58% cumulative improvement over the stateless baseline, yet every additional mechanism we test (planner evolution, per-tool selection, cold-start initialization, skill extraction, and three credit assignment methods) degrades performance. This demonstrates that the bottleneck in agent self-improvement is self-diagnosing how to use experience rather than adding architectural complexity. Code and data: https://github.com/WujiangXu/AEL.
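
As a rough illustration of the fast-timescale component described in the abstract, the sketch below shows a generic Beta-Bernoulli Thompson Sampling bandit choosing among memory retrieval policies, one choice per episode. It is not taken from the AEL code. The policy names ("recency", "similarity", "outcome_weighted"), the binary success signal, and the simulated success rates are all assumptions made for illustration; the abstract does not say how AEL defines its arms or rewards.

```python
import random
from typing import Dict, List, Optional


class ThompsonPolicySelector:
    """Beta-Bernoulli Thompson Sampling over a fixed set of retrieval policies."""

    def __init__(self, policies: List[str], seed: Optional[int] = None) -> None:
        # One [alpha, beta] pair per policy, starting from a uniform Beta(1, 1) prior.
        self.params: Dict[str, List[float]] = {p: [1.0, 1.0] for p in policies}
        self.rng = random.Random(seed)

    def select(self) -> str:
        # Draw one sample from each policy's posterior and pick the largest draw.
        draws = {p: self.rng.betavariate(a, b) for p, (a, b) in self.params.items()}
        return max(draws, key=draws.get)

    def update(self, policy: str, success: bool) -> None:
        # A success increments alpha; a failure increments beta.
        self.params[policy][0 if success else 1] += 1.0

    def posterior_mean(self, policy: str) -> float:
        a, b = self.params[policy]
        return a / (a + b)


def simulate(episodes: int = 208, seed: int = 0) -> ThompsonPolicySelector:
    # Hypothetical per-policy success probabilities (assumed, not from the paper).
    true_rates = {"recency": 0.45, "similarity": 0.60, "outcome_weighted": 0.75}
    selector = ThompsonPolicySelector(list(true_rates), seed=seed)
    env_rng = random.Random(seed + 1)
    for _ in range(episodes):
        policy = selector.select()
        # Assumed binary reward: did the episode count as a success?
        success = env_rng.random() < true_rates[policy]
        selector.update(policy, success)
    return selector


if __name__ == "__main__":
    result = simulate()
    for name in result.params:
        print(name, round(result.posterior_mean(name), 3))
```

Under these assumptions, each episode costs one posterior draw per policy and one counter update, so the bandit adds almost no overhead to the agent loop. A real-valued reward such as a per-episode return would need a different posterior (for example a Gaussian one) or a mapping of the reward into [0, 1].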