arxiv_cs_lg April 24, 2026

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

batch-llm, large-language-models, gpu-utilization, llm-inference, token-batching

Original Content

arXiv:2412.03594v3 Announce Type: replace-cross Abstract: Large language models (LLMs) play an increasingly important role in a wide range of information processing and management tasks in industry. Many of these tasks are performed in large batches or even offline, and their performance indicator is throughput. These tasks usually exhibit prefix sharing, where different prompt inputs partially share a common prefix. However, existing LLM inference engines tend to optimize for streaming requests and are limited in supporting large batched tasks with the prefix-sharing characteristic. Existing solutions use an LRU-based cache to reuse the KV context of common prefixes between requests. With this implicit cache management, KV context that is about to be reused may be evicted prematurely. Besides, streaming-oriented systems do not leverage request-batch information and cannot mix decoding tokens with prefill chunks in the way best suited to batched scenarios, and thus fail to saturate the GPU. We propose BatchLLM to address these problems. BatchLLM explicitly identifies the common prefixes globally. Requests sharing the same prefix are scheduled together to maximize reuse of the KV context. BatchLLM also reorders the requests, scheduling those with a larger ratio of decoding first to better mix the decoding tokens with the later prefill chunks, and applies memory-centric token batching to enlarge the token-batch sizes, which helps to increase GPU utilization. Extensive evaluation shows that BatchLLM outperforms vLLM and SGLang by $1.3\times$ to $10.8\times$ on a set of microbenchmarks and a typical industry workload under different hardware environments. Code is available at https://github.com/microsoft/MixLLM/tree/batchllm_vllm_064.
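
To make the scheduling idea in the abstract concrete, here is a minimal Python sketch of two of its steps: grouping requests that share a prefix so they can be scheduled together, and ordering the groups so that those with a larger decoding ratio go first. This is an illustration under stated assumptions, not the BatchLLM implementation. The names `Request`, `group_by_prefix`, `decode_ratio`, and `schedule` are hypothetical, as are the `expected_decode_len` estimate, the fixed-length bucketing used to find shared prefixes, and the exact definition of the ratio. Memory-centric token batching is not covered.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    req_id: str
    prompt: Tuple[int, ...]    # token ids
    expected_decode_len: int   # hypothetical estimate of the output length


def common_prefix_len(seqs) -> int:
    """Length of the longest prefix shared by every sequence in seqs."""
    n = 0
    for column in zip(*seqs):          # zip stops at the shortest sequence
        if len(set(column)) != 1:
            break
        n += 1
    return n


def group_by_prefix(requests: List[Request], min_prefix_len: int):
    """Bucket requests by their first min_prefix_len tokens.

    Assumption: this fixed-length bucketing is a simplification chosen for
    brevity; the abstract does not say how prefixes are identified.
    """
    buckets = defaultdict(list)
    for r in requests:
        buckets[r.prompt[:min_prefix_len]].append(r)
    groups = []
    for members in buckets.values():
        # A request on its own has nothing to share.
        shared = common_prefix_len([m.prompt for m in members]) if len(members) > 1 else 0
        groups.append((shared, members))
    return groups


def decode_ratio(group) -> float:
    """Share of decoding tokens in a group's total work (assumed definition)."""
    shared, members = group
    # The shared prefix is prefilled once; each request adds its own suffix.
    prefill = shared + sum(len(m.prompt) - shared for m in members)
    decode = sum(m.expected_decode_len for m in members)
    return decode / max(prefill + decode, 1)


def schedule(requests: List[Request], min_prefix_len: int = 2) -> List[str]:
    """Return request ids, keeping prefix groups together, decode-heavy groups first."""
    groups = group_by_prefix(requests, min_prefix_len)
    groups.sort(key=decode_ratio, reverse=True)
    return [m.req_id for _, members in groups for m in members]
```

For example, given two requests with prompts `(1, 2, 3, 4)` and `(1, 2, 3, 5)` that each expect 2 output tokens, and two with prompts `(9, 8, 7)` and `(9, 8, 6)` that each expect 30, `schedule` keeps each pair adjacent and places the second pair first because its decoding ratio is higher.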