arxiv_cs_lg April 20, 2026

Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective

Translated: 2026/4/20 11:07:12

agentic-ai, llm-serving, cpu-gpu-systems, system-bottlenecks, latency-optimization

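Translation (from Japanese)

arXiv:2511.00739v3 Announce Type: replace-cross Abstract: Agentic AI serving turns monolithic LLM-based inference into autonomous problem-solvers that can plan, call tools, reason, and adapt on the fly. Because task execution requirements are diverse, such serving relies on heterogeneous CPU-GPU systems, in which most of the external tools responsible for agentic capability either run on the CPU or are orchestrated by it. To understand this role more deeply, this paper aims to identify and analyze the system bottlenecks introduced by agentic AI workloads from a largely overlooked CPU-centric perspective. We first present a compile-time characterization of agentic AI execution and select representative workloads that capture the algorithmic diversity. We then analyze end-to-end latency and throughput on two different hardware systems to isolate the architectural bottlenecks of each. Based on these insights, we finally present two scheduling optimizations: 1. CPU-Aware Overlapped Micro-Batching (COMB) for homogeneous agentic workloads and 2. Mixed Agentic Scheduling (MAS) for heterogeneous agentic workloads. Specifically, these methods improve concurrent CPU-GPU utilization while reducing skewed resource allocation in heterogeneous execution. Experimental evaluation on the two hardware systems demonstrates the efficacy of COMB, which yields up to 1.7x lower P50 latency in standalone homogeneous workload execution and up to 3.9x/1.8x lower service/total latency under homogeneous open-loop load. In addition, under heterogeneous open-loop load, MAS can reduce total latency for the minority request type by up to 2.37x/2.49x at the P50/P90 percentiles.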

Original Content

arXiv:2511.00739v3 Announce Type: replace-cross Abstract: Agentic AI serving converts monolithic LLM-based inference into autonomous problem-solvers that can plan, call tools, perform reasoning, and adapt on the fly. Due to diverse task execution needs, such serving relies heavily on heterogeneous CPU-GPU systems, where the majority of the external tools responsible for agentic capability either run on or are orchestrated by the CPU. To develop a deeper understanding of the CPU's role, this paper aims to characterize and analyze the system bottlenecks introduced by agentic AI workloads from a largely overlooked CPU-centric perspective. We first present a compile-time characterization of agentic AI execution and choose representative workloads to capture the algorithmic diversity. We then perform runtime characterization of the representative workloads, analyzing the end-to-end latency and throughput on two different hardware systems to isolate their respective architectural bottlenecks. Based on the insights into the bottlenecks, we finally present two scheduling optimizations, namely 1. CPU-Aware Overlapped Micro-Batching (COMB) and 2. Mixed Agentic Scheduling (MAS), for homogeneous and heterogeneous agentic workloads, respectively. Specifically, these methods optimize for improved concurrent CPU-GPU utilization while reducing skewed resource allocation for heterogeneous execution. Experimental evaluations on the two hardware systems demonstrate the efficacy of COMB in yielding up to 1.7x lower P50 latency in standalone homogeneous workload execution and up to 3.9x/1.8x lower service/total latency under homogeneous open-loop load. Additionally, for heterogeneous open-loop load, MAS can reduce the total latency for the minority request type by up to 2.37x/2.49x at the P50/P90 percentiles.
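
The abstract does not spell out how COMB works, so the following is only a rough timing model of the general idea its name suggests: split a batch into micro-batches and let the CPU stage of one micro-batch run while the GPU processes the previous one. It is not the authors' implementation. The stage order (CPU tool work first, then GPU inference), the function names, and the per-stage times are all assumptions made for illustration, and the numbers are not taken from the paper.

```python
from typing import List, Sequence


def split_into_micro_batches(requests: Sequence[int], size: int) -> List[List[int]]:
    """Split a batch of request IDs into consecutive micro-batches of at most `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(requests[i:i + size]) for i in range(0, len(requests), size)]


def serial_makespan(cpu_ms: Sequence[float], gpu_ms: Sequence[float]) -> float:
    """No overlap: each micro-batch finishes both stages before the next one starts."""
    return float(sum(cpu_ms) + sum(gpu_ms))


def overlapped_makespan(cpu_ms: Sequence[float], gpu_ms: Sequence[float]) -> float:
    """Two-stage pipeline: the CPU works on micro-batch i+1 while the GPU handles micro-batch i.

    Assumes one CPU worker and one GPU, each processing micro-batches in order.
    """
    cpu_done = 0.0  # time at which the CPU stage of the current micro-batch ends
    gpu_done = 0.0  # time at which the GPU stage of the current micro-batch ends
    for c, g in zip(cpu_ms, gpu_ms):
        cpu_done += c
        # The GPU stage starts once its input is ready and the GPU is free.
        gpu_done = max(gpu_done, cpu_done) + g
    return gpu_done


if __name__ == "__main__":
    micro_batches = split_into_micro_batches(list(range(8)), size=2)
    # Hypothetical per-micro-batch stage times in milliseconds.
    cpu_ms = [30.0] * len(micro_batches)
    gpu_ms = [20.0] * len(micro_batches)
    print(serial_makespan(cpu_ms, gpu_ms))      # 200.0
    print(overlapped_makespan(cpu_ms, gpu_ms))  # 140.0
```

In this toy setting the overlapped schedule hides all but the last GPU stage behind CPU work, so the total time is bounded by the slower stage plus one pass through the faster one. Real systems would also have to account for transfer costs, contention, and uneven micro-batch sizes, which this model ignores.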