arxiv_cs_ai April 24, 2026

Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows

Translated: 2026/4/24 20:18:39

tool-attention, model-context-protocol, llm-agents, inference-optimization, scalable-workflows

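Translated Summary (from Japanese)

arXiv:2604.21816v1 Announce Type: new

Summary: The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection creates a hidden per-turn overhead that practitioners call the "MCP Tax" or "Tools Tax". In multi-server deployments this overhead typically falls between 10k and 60k tokens. The payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points (around 70%), and turns token budgets into a recurring operational cost.

We propose Tool Attention, which generalizes the "Attention Is All You Need" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score derived from sentence embeddings, (ii) a state-aware gating function that enforces preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for the top-k gated tools.

We evaluate on a simulated benchmark of 120 tools across six servers, with per-server token counts calibrated to public audits of real MCP deployments. In this simulation, Tool Attention reduces measured per-turn tool tokens by 95.0% (47.3k -> 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry. They are not measured on live LLM agents, and projected values are marked explicitly.

Taken together, these results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable agentic systems. The code for this work is available at https://github.com/asadani/tool-attention.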
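The headline reduction can be checked from the two token counts quoted in the summary. The short sketch below assumes the rounded values 47.3k and 2.4k; the exact counts behind the reported 95.0% are not given here, so a small rounding gap is expected.

```python
# Rounded per-turn tool-token counts quoted in the summary (assumed values;
# the unrounded counts are not stated in this text).
baseline_tokens = 47_300  # eager schema injection, "47.3k"
gated_tokens = 2_400      # with Tool Attention, "2.4k"

# Relative reduction in per-turn tool tokens.
reduction = (baseline_tokens - gated_tokens) / baseline_tokens
print(f"{reduction:.1%}")
```

With these rounded inputs the result is about 94.9%, which matches the reported 95.0% to within rounding of the quoted counts.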

Original Content

arXiv:2604.21816v1 Announce Type: new

Abstract: The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead (the MCP Tax or Tools Tax) that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost.

We introduce Tool Attention, a middleware-layer mechanism that generalizes the "Attention Is All You Need" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state-aware gating function enforcing preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools.

We evaluate on a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per-turn tool tokens by 95.0% (47.3k -> 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout.

Taken together, the results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable agentic systems. The code for this work is accessible at https://github.com/asadani/tool-attention
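The three components named in the abstract (ISO score, state-aware gate, two-phase lazy loader) can be illustrated with a minimal sketch. This is not the authors' implementation. Every name below (`Tool`, `embed`, `iso_score`, `gate_open`, `build_tool_context`, the example tools and scopes) is hypothetical, and a toy bag-of-words vector stands in for the sentence embeddings the abstract mentions.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    summary: str                              # compact text always kept in context
    schema: dict                              # full JSON schema, promoted lazily
    required_scopes: frozenset = frozenset()  # access scopes the caller must hold
    preconditions: frozenset = frozenset()    # state flags that must be set


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def iso_score(intent: str, tool: Tool) -> float:
    """Assumed form of the ISO score: intent-to-summary cosine similarity."""
    return cosine(embed(intent), embed(tool.summary))


def gate_open(tool: Tool, scopes: set, state: set) -> bool:
    """State-aware gate: all required scopes and preconditions must hold."""
    return tool.required_scopes <= scopes and tool.preconditions <= state


def build_tool_context(intent, tools, scopes, state, k=2):
    # Phase 1: every tool contributes only its compact summary.
    summaries = [f"{t.name}: {t.summary}" for t in tools]
    # Gate first, then rank the surviving tools by ISO score.
    candidates = [t for t in tools if gate_open(t, scopes, state)]
    ranked = sorted(candidates, key=lambda t: iso_score(intent, t), reverse=True)
    # Phase 2: promote full schemas for the top-k gated tools only.
    promoted = {t.name: t.schema for t in ranked[:k]}
    return summaries, promoted


def _schema(*fields):
    """Build a tiny illustrative JSON schema with string-typed properties."""
    return {"type": "object", "properties": {f: {"type": "string"} for f in fields}}


TOOLS = [
    Tool("search_issues", "search issues in the issue tracker by keyword",
         _schema("query"), frozenset({"issues:read"})),
    Tool("create_issue", "create a new issue in the issue tracker",
         _schema("title", "body"), frozenset({"issues:write"})),
    Tool("send_email", "send an email message to a recipient",
         _schema("to", "subject", "body"), frozenset({"mail:send"})),
    Tool("merge_pull_request", "merge an approved pull request",
         _schema("pr_id"), frozenset({"repo:write"}), frozenset({"pr_approved"})),
]

SCOPES = {"issues:read", "issues:write", "repo:write"}

summaries, promoted = build_tool_context(
    "create an issue about the login bug in the tracker",
    TOOLS, scopes=SCOPES, state=set(), k=1,
)
print(summaries)
print(sorted(promoted))
```

In this sketch the gate runs before ranking, so a tool whose scope or precondition is unmet never has its schema promoted, however well it matches the intent. In a real deployment `embed` would be replaced by an actual sentence-embedding model, and the summary pool would be sized against the token budget.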