cloudflare_blog April 20, 2026

Unweight: how we compressed an LLM 22% without sacrificing quality


llm, machine-learning, inference-optimization, cloudflare, gpu-memory

Running LLMs across Cloudflare’s network requires us to be smarter and more efficient about GPU memory bandwidth. That’s why we developed Unweight, a lossless inference-time compression system that achieves up to a 22% model footprint reduction, so that we can deliver faster and cheaper inference than ever before.
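To put the headline number in perspective, here is a rough back-of-the-envelope sketch in Python. It only does the arithmetic of a 22% footprint reduction and the resulting lower bound on the time needed to stream the weights from GPU memory once. The model size and memory bandwidth below are hypothetical values chosen for illustration, not figures from this post, and the helper names are ours. The sketch also ignores any decompression cost, so it should be read as a best case rather than a measured speedup.

```python
def compressed_footprint_gb(original_gb: float, reduction: float = 0.22) -> float:
    """Footprint after a fractional size reduction (0.22 means 22%)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return original_gb * (1.0 - reduction)


def min_weight_read_ms(footprint_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on the time to read all weights once from GPU memory."""
    if bandwidth_gb_per_s <= 0.0:
        raise ValueError("bandwidth must be positive")
    return footprint_gb / bandwidth_gb_per_s * 1000.0


# Hypothetical inputs, for illustration only (not taken from the post).
original_gb = 16.0           # assumed: roughly 8B parameters at 2 bytes each
bandwidth_gb_per_s = 2000.0  # assumed GPU memory bandwidth

compressed_gb = compressed_footprint_gb(original_gb)
before_ms = min_weight_read_ms(original_gb, bandwidth_gb_per_s)
after_ms = min_weight_read_ms(compressed_gb, bandwidth_gb_per_s)

print(f"footprint: {original_gb:.2f} GB -> {compressed_gb:.2f} GB")
print(f"weight read per pass (lower bound): {before_ms:.2f} ms -> {after_ms:.2f} ms")
```

Because the post describes the reduction as "up to" 22%, the default `reduction=0.22` is the most favorable case; passing a smaller value shows how the savings shrink for models that compress less well.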