arxiv_cs_ai February 10, 2026

TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents

Translated: 2026/3/7 8:21:13

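Summary (from the Japanese translation)

Executing complex terminal tasks remains a major challenge for open-weight LLMs, and two fundamental limitations hold back progress. First, high-fidelity, executable training environments are scarce: environments synthesized from real-world repositories are neither diverse nor scalable, and trajectories generated by LLMs contain hallucinations. Second, standard instruction tuning relies on expert trajectories, which rarely contain the simple mistakes that smaller models commonly make. The resulting distributional mismatch leaves student models poorly prepared to recover from their own runtime failures. To close both gaps, we introduce TermiGen, an end-to-end pipeline for synthesizing verifiable environments and resilient expert trajectories. TermiGen first generates functionally valid tasks and Docker containers through an iterative multi-agent refinement loop. We then apply a Generator-Critic protocol that actively injects errors during trajectory collection, so the resulting data is rich in error-correction cycles. Fine-tuned on this TermiGen-generated data, TermiGen-Qwen2.5-Coder-32B achieves a 31.3% pass rate on TerminalBench. This sets a new state of the art among open-weight models, outperforming existing baselines and surpassing capable proprietary models such as o4-mini. The dataset is available on GitHub: https://github.com/ucsb-mlsec/terminal-bench-env.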

Original Content

arXiv:2602.07274v1 Announce Type: new Abstract: Executing complex terminal tasks remains a significant challenge for open-weight LLMs, constrained by two fundamental limitations. First, high-fidelity, executable training environments are scarce: environments synthesized from real-world repositories are not diverse and scalable, while trajectories synthesized by LLMs suffer from hallucinations. Second, standard instruction tuning uses expert trajectories that rarely exhibit simple mistakes common to smaller models. This creates a distributional mismatch, leaving student models ill-equipped to recover from their own runtime failures. To bridge these gaps, we introduce TermiGen, an end-to-end pipeline for synthesizing verifiable environments and resilient expert trajectories. TermiGen first generates functionally valid tasks and Docker containers via an iterative multi-agent refinement loop. Subsequently, we employ a Generator-Critic protocol that actively injects errors during trajectory collection, synthesizing data rich in error-correction cycles. Fine-tuned on this TermiGen-generated dataset, our TermiGen-Qwen2.5-Coder-32B achieves a 31.3% pass rate on TerminalBench. This establishes a new open-weights state-of-the-art, outperforming existing baselines and notably surpassing capable proprietary models such as o4-mini. The dataset is available at https://github.com/ucsb-mlsec/terminal-bench-env.
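
The abstract does not spell out how errors are injected during trajectory collection, so the following is only a toy sketch of the general idea, not the TermiGen implementation. It assumes a fixed list of expert commands and a hypothetical table of faulty variants, and it replaces the Generator and Critic roles with a simple random draw. All names here (Step, collect_trajectory, count_correction_cycles, the example commands) are illustrative assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    """One terminal action in a trajectory (hypothetical structure)."""
    command: str
    succeeded: bool
    injected: bool = False  # True if this step is a deliberately injected mistake


def collect_trajectory(expert_commands, faulty_variants, inject_prob, rng):
    """Replay expert commands, sometimes emitting a faulty variant first.

    When a fault is injected, the correct expert command follows immediately,
    so the trajectory contains an explicit failure-then-recovery pair.
    """
    trajectory = []
    for command in expert_commands:
        faulty = faulty_variants.get(command)
        if faulty is not None and rng.random() < inject_prob:
            trajectory.append(Step(faulty, succeeded=False, injected=True))
        trajectory.append(Step(command, succeeded=True))
    return trajectory


def count_correction_cycles(trajectory):
    """Count adjacent (failed step, successful step) pairs."""
    return sum(
        1
        for prev, curr in zip(trajectory, trajectory[1:])
        if not prev.succeeded and curr.succeeded
    )


if __name__ == "__main__":
    # Assumed example task: install dependencies, then run tests.
    expert = ["cd /app", "pip install -r requirements.txt", "pytest -q"]
    faults = {"pip install -r requirements.txt": "pip install -r requirement.txt"}
    traj = collect_trajectory(expert, faults, inject_prob=1.0, rng=random.Random(0))
    for step in traj:
        print(step)
    print("error-correction cycles:", count_correction_cycles(traj))
```

In this sketch, raising inject_prob increases the share of failure-then-recovery pairs in the collected data, which is the property the abstract describes as "rich in error-correction cycles". A real pipeline would execute commands inside the task's container and let a model produce the recovery step rather than replaying a fixed one.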