arxiv_cs_ai April 24, 2026

HARBOR: Automated Harness Optimization

Translated: 2026/4/24 20:22:04

harness, bayesian-optimization, agents, configuration-search

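Japanese Translation (rendered in English)

arXiv:2604.20938v1 Announce Type: cross Abstract: Long-horizon language-model agents are dominated, in lines of code and in operational complexity, not by the underlying model but by the "harness" that wraps it: context compaction, tool caching, semantic memory, trajectory reuse, speculative tool prediction, and the glue that couples the model to a sandboxed execution environment. We argue that harness design is a first-class machine-learning problem and that automated configuration search outperforms manual stacking once the flag space exceeds a handful of bits. We justify this claim in two steps. First, we formalize automated harness optimization as constrained noisy Bayesian optimization over a mixed-variable, cost-heterogeneous configuration space, with cold-start-corrected rewards and a posterior chance-constrained safety check, and we provide HARBOR (Harness Axis-aligned Regularized Bayesian Optimization Routine), a reference solver built on a block-additive SAAS surrogate, multi-fidelity cost-aware acquisition, and TuRBO trust regions. Second, we instantiate the problem in a flag-gated harness over a production coding agent and report a controlled four-round manual-tuning case study against a fixed task suite, together with an end-to-end HARBOR run. The formulation itself is task-class agnostic: the configuration space, reward correction, acquisition, and safety check apply to any agent harness with a bounded flag space and a reproducible task suite.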

Original Content

arXiv:2604.20938v1 Announce Type: cross Abstract: Long-horizon language-model agents are dominated, in lines of code and in operational complexity, not by their underlying model but by the harness that wraps it: context compaction, tool caching, semantic memory, trajectory reuse, speculative tool prediction, and the glue that binds the model to a sandboxed execution environment. We argue that harness design is a first-class machine-learning problem and that automated configuration search dominates manual stacking once the flag space exceeds a handful of bits. We defend this claim in two steps. First, we formalize automated harness optimization as constrained noisy Bayesian optimization over a mixed-variable, cost-heterogeneous configuration space with cold-start-corrected rewards and a posterior chance-constrained safety check, and give a reference solver, HARBOR (Harness Axis-aligned Regularized Bayesian Optimization Routine), built from a block-additive SAAS surrogate, multi-fidelity cost-aware acquisition, and TuRBO trust regions. Second, we instantiate the problem in a flag-gated harness over a production coding agent and report a controlled four-round manual-tuning case study against a fixed task suite and an end-to-end HARBOR run. The formulation itself is task-class agnostic: the configuration space, reward correction, acquisition, and safety check apply to any agent harness with a bounded flag space and a reproducible task suite.
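
The abstract names two ingredients without giving details: a bounded flag space and a posterior chance-constrained safety check. The Python sketch below is one possible reading of them, not the paper's implementation. The flag names are hypothetical and simply borrowed from the harness components listed in the abstract. The Gaussian posterior, the threshold, and the risk level `delta` are all assumptions, and the helper names `enumerate_configs` and `is_probably_safe` are invented for illustration.

```python
import math
from itertools import product

# Hypothetical flag names, borrowed from the harness components the abstract lists.
FLAGS = (
    "context_compaction",
    "tool_caching",
    "semantic_memory",
    "trajectory_reuse",
    "speculative_tool_prediction",
)


def enumerate_configs(flags=FLAGS):
    """Return every on/off assignment over the given boolean flags."""
    return [
        dict(zip(flags, bits))
        for bits in product((False, True), repeat=len(flags))
    ]


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def is_probably_safe(mean, std, threshold, delta=0.05):
    """Accept a configuration if P(metric <= threshold) >= 1 - delta.

    Assumption: the surrogate's posterior over the safety metric is
    Gaussian with the given mean and standard deviation.
    """
    if std <= 0.0:
        # Degenerate posterior: fall back to a plain comparison.
        return mean <= threshold
    return normal_cdf((threshold - mean) / std) >= 1.0 - delta


configs = enumerate_configs()
print(len(configs))                          # 32, i.e. 2 ** 5
print(is_probably_safe(0.02, 0.01, 0.05))    # True: threshold is 3 std above the mean
print(is_probably_safe(0.04, 0.01, 0.05))    # False: threshold is only 1 std above the mean
```

Five boolean flags already give 32 configurations, and each further flag doubles that number, which is the growth the abstract points to when it says manual stacking stops scaling beyond a handful of bits. In this sketch the chance constraint rejects a configuration whose posterior mean is below the threshold but whose uncertainty is too large to meet the chosen confidence level.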