arxiv_cs_ai April 24, 2026

Speculative Actions: A Lossless Framework for Faster Agentic Systems

Translated: 2026/4/24 20:30:07

speculative-execution, agent-systems, llm-inference, latency-reduction, microprocessor

Original Content

arXiv:2510.04371v2
Announce Type: replace

Abstract: AI agents are increasingly deployed in complex, interactive environments, yet their runtime remains a major bottleneck for training, evaluation, and real-world use. Typical agent behavior unfolds sequentially, with each action requiring an API call that can incur substantial latency. For example, a game of chess between two state-of-the-art agents can take hours. We introduce Speculative Actions, a lossless acceleration framework for general agentic systems. Inspired by speculative execution in microprocessors and speculative decoding in LLM inference, our method uses faster models to predict likely future actions and execute them in parallel, committing only when predictions match. We evaluate speculative actions across gaming, e-commerce, and web search environments, and additionally study a lossy extension in an operating systems setting. Across domains, we achieve up to 55% next-action prediction accuracy, translating into up to 20% latency reductions. Finally, we present a cost-latency analysis that formalizes the tradeoff between speculative breadth and time savings. This analysis enables principled tuning and selective branch launching to ensure that multi-branch speculation delivers practical speedups without prohibitive cost growth.
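The predict, execute in parallel, and commit-on-match loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `slow_policy`, `fast_guess`, `env_step`, the latency constants, and the toy integer state are all hypothetical stand-ins, and the sketch assumes the environment call is free of side effects so that a wrong guess can simply be discarded. It speculates on a single branch only.

```python
import asyncio

# Hypothetical stand-ins: none of these names or numbers come from the paper.
SLOW_S, FAST_S, ENV_S = 0.05, 0.005, 0.05  # assumed latencies in seconds


async def slow_policy(state: int) -> int:
    """Authoritative (slow) agent: its action is always the one committed."""
    await asyncio.sleep(SLOW_S)
    return state + 1


async def fast_guess(state: int) -> int:
    """Cheap predictor: right on even states, wrong on odd ones."""
    await asyncio.sleep(FAST_S)
    return state + 1 if state % 2 == 0 else state + 2


async def env_step(state: int, action: int) -> int:
    """Side-effect-free environment call, so a wrong guess can be discarded."""
    await asyncio.sleep(ENV_S)
    return action


async def run_sequential(state: int, steps: int) -> list:
    """Baseline: wait for the slow agent, then call the environment."""
    trajectory = []
    for _ in range(steps):
        action = await slow_policy(state)
        state = await env_step(state, action)
        trajectory.append(state)
    return trajectory


async def run_speculative(state: int, steps: int):
    """Same trajectory as the baseline, with the next call pre-launched."""
    trajectory, hits = [], 0
    for _ in range(steps):
        truth = asyncio.create_task(slow_policy(state))     # start the slow call
        guess = await fast_guess(state)                      # predict while it runs
        spec = asyncio.create_task(env_step(state, guess))   # pre-launch the next call
        action = await truth
        if action == guess:
            state = await spec   # commit: reuse the speculative result
            hits += 1
        else:
            spec.cancel()        # discard the guess and take the normal path
            state = await env_step(state, action)
        trajectory.append(state)
    return trajectory, hits


if __name__ == "__main__":
    print(asyncio.run(run_sequential(0, 4)))
    print(asyncio.run(run_speculative(0, 4)))
```

Because only the slow agent's action is ever committed, the speculative run yields the same trajectory as the sequential one, which is the sense in which the scheme is lossless. On a hit the environment call overlaps with the slow call. On a miss the only extra cost is the discarded speculative call, which is the cost side of the breadth-versus-savings tradeoff the abstract mentions.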