arxiv_cs_lg April 24, 2026

Accelerating PayPal's Commerce Agent with Speculative Decoding: An Empirical Study on EAGLE3 with Fine-Tuned Nemotron Models

Translated: 2026/4/24 19:54:18

speculative-decoding eagle3 paypal nemotron llm-optimization

Original Content

arXiv:2604.19767v1 Announce Type: new

Abstract: We evaluate speculative decoding with EAGLE3 as an inference-time optimization for PayPal's Commerce Agent, powered by a fine-tuned llama3.1-nemotron-nano-8B-v1 model. Building on prior work (NEMO-4-PAYPAL) that reduced latency and cost through domain-specific fine-tuning, we benchmark EAGLE3 via vLLM against NVIDIA NIM on identical 2xH100 hardware across 40 configurations spanning speculative token counts (gamma=3, gamma=5), concurrency levels (1-32), and sampling temperatures (0, 0.5). Key findings: (1) gamma=3 achieves 22-49% throughput improvement and 18-33% latency reduction at zero additional hardware cost; (2) acceptance rates remain stable at approximately 35.5% for gamma=3 across all conditions; (3) gamma=5 yields diminishing returns (approximately 25% acceptance rate); (4) LLM-as-Judge evaluation confirms fully preserved output quality; and (5) speculative decoding on a single H100 matches or exceeds NIM on two H100s, enabling 50% GPU cost reduction.
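
To make the gamma=3 versus gamma=5 comparison concrete, the following is a minimal back-of-the-envelope sketch, not the paper's methodology. It assumes that "acceptance rate" means accepted draft tokens divided by proposed draft tokens, and that each verification step also emits one token from the target model itself; the abstract does not state either definition. The function names `expected_tokens_per_step` and `gpu_cost_reduction` are hypothetical helpers introduced only for illustration.

```python
def expected_tokens_per_step(gamma: int, acceptance_rate: float) -> float:
    """Mean tokens emitted per target-model verification step.

    ASSUMPTION (not stated in the abstract): acceptance_rate is
    accepted draft tokens / proposed draft tokens, and every step
    additionally yields one token sampled from the target model.
    """
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must lie in [0, 1]")
    # Average accepted drafts per step, plus the target model's own token.
    return gamma * acceptance_rate + 1.0


def gpu_cost_reduction(baseline_gpus: int, new_gpus: int) -> float:
    """Fractional GPU saving when moving from baseline_gpus to new_gpus."""
    if baseline_gpus <= 0 or new_gpus < 0:
        raise ValueError("GPU counts must be positive (baseline) and non-negative (new)")
    return 1.0 - new_gpus / baseline_gpus


if __name__ == "__main__":
    # Approximate acceptance rates reported in the abstract.
    for gamma, rate in [(3, 0.355), (5, 0.25)]:
        tokens = expected_tokens_per_step(gamma, rate)
        print(f"gamma={gamma}, acceptance={rate}: {tokens} tokens per step")

    # Finding (5): one H100 with speculative decoding versus two H100s with NIM.
    print(f"GPU cost reduction: {gpu_cost_reduction(2, 1):.0%}")
```

Under these assumptions, gamma=3 gives 3 × 0.355 + 1 = 2.065 tokens per step and gamma=5 gives 5 × 0.25 + 1 = 2.25, so the two extra draft tokens add fewer than 0.2 tokens per step on average. This is one way to read the "diminishing returns" finding; the sketch ignores draft-model overhead and batching effects, so it does not predict the measured throughput or latency figures.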