arxiv_cs_lg February 10, 2026

AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering

Translated: 2026/3/15 14:49:16

grpo, reinforcement-learning, llm-agent, autonomous-learning, machine-learning

Original Content

arXiv:2602.07906v1 (Announce Type: new)

Abstract: Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons. While recent LLM-based agents show promise, current prompt-based agents for MLE suffer from behavioral stagnation due to frozen parameters. Although Reinforcement Learning (RL) offers a remedy, applying it to MLE is hindered by prohibitive execution latency and inefficient data selection. Recognizing these challenges, we propose AceGRPO with two core components: (1) Evolving Data Buffer that continuously repurposes execution traces into reusable training tasks, and (2) Adaptive Sampling guided by a Learnability Potential function, which dynamically prioritizes tasks at the agent's learning frontier to maximize learning efficiency. Leveraging AceGRPO, our trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines (e.g., DeepSeek-V3.2), demonstrating robust capability for sustained iterative optimization. Code is available at https://github.com/yuzhu-cai/AceGRPO.
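
The two components named in the abstract can be illustrated with a minimal sketch. The abstract does not define the Learnability Potential function or say how traces become tasks, so everything here is an assumption for illustration: the `Task` and `EvolvingBuffer` classes, the `add_from_trace` conversion, and the choice of `p * (1 - p)` as the potential (a score that peaks when the estimated success rate is near 0.5, one plausible reading of "learning frontier"). It is not the authors' implementation; the linked repository is the authoritative source.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical training task with running rollout statistics."""
    prompt: str
    attempts: int = 0
    successes: int = 0

    def success_rate(self) -> float:
        # Laplace-smoothed estimate, so an untried task starts at 0.5.
        return (self.successes + 1) / (self.attempts + 2)


def learnability(task: Task) -> float:
    """ASSUMED potential: p * (1 - p), largest when success rate is near 0.5."""
    p = task.success_rate()
    return p * (1.0 - p)


class EvolvingBuffer:
    """Hypothetical buffer that grows from execution traces."""

    def __init__(self, seed: int = 0):
        self.tasks: list[Task] = []
        self.rng = random.Random(seed)

    def add_from_trace(self, trace: str) -> Task:
        # Assumed conversion: reuse an execution trace as a new task prompt.
        task = Task(prompt=f"Improve on this previous attempt:\n{trace}")
        self.tasks.append(task)
        return task

    def record(self, task: Task, solved: bool) -> None:
        # Update the statistics that drive the learnability score.
        task.attempts += 1
        task.successes += int(solved)

    def sample(self, k: int) -> list[Task]:
        # Sample with replacement, weighted by the assumed learnability score.
        weights = [learnability(t) for t in self.tasks]
        return self.rng.choices(self.tasks, weights=weights, k=k)
```

Under this assumed potential, a task the agent almost always solves (or almost always fails) receives a small weight, so sampling concentrates on tasks whose outcome is still uncertain. Any other score that peaks at intermediate difficulty could be substituted in `learnability` without changing the rest of the sketch.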