arxiv_cs_cv April 20, 2026

Efficient Video Diffusion Models: Advancements and Challenges

Translated: 2026/4/20 10:43:53

video-diffusion, generative-ai, neural-networks, video-synthesis, deep-learning

Original Content

arXiv:2604.15911v1 Announce Type: new

Abstract: Video diffusion models have rapidly become the dominant paradigm for high-fidelity generative video synthesis, but their practical deployment remains constrained by severe inference costs. Compared with image generation, video synthesis compounds computation through both spatial-temporal token growth and iterative denoising, making attention and memory traffic major bottlenecks in real-world settings. This survey provides a systematic, deployment-oriented review of efficient video diffusion models. We propose a unified categorization that organizes existing methods into four main paradigms: step distillation, efficient attention, model compression, and cache/trajectory optimization. Building on this categorization, we analyze the algorithmic trends of each paradigm and examine how different design choices target two core objectives: reducing the number of function evaluations and minimizing per-step overhead. Finally, we discuss open challenges and future directions, including quality preservation under composite acceleration, hardware-software co-design, robust real-time long-horizon generation, and open infrastructure for standardized evaluation. To the best of our knowledge, our work is the first comprehensive survey on efficient video diffusion models, offering researchers and engineers a structured overview of the field and its emerging research directions.
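
To illustrate why token growth and iterative denoising compound, and how the two core objectives act on different factors of the cost, here is a rough back-of-the-envelope sketch. It is not taken from the survey. It assumes full (dense) self-attention, counts only the two attention matrix multiplications (QK^T and the attention-weighted sum over V), and ignores projections, MLP blocks, and softmax. The latent grid sizes, hidden dimension, layer count, and step counts are hypothetical values chosen only for illustration.

```python
def attention_flops(num_tokens: int, dim: int) -> int:
    """Approximate FLOPs of the two matmuls in one dense self-attention layer."""
    # QK^T: (N x d) @ (d x N) costs about 2 * N^2 * d FLOPs.
    # Weights @ V: (N x N) @ (N x d) costs about 2 * N^2 * d FLOPs.
    return 4 * num_tokens * num_tokens * dim


def sampling_attention_flops(frames: int, height: int, width: int,
                             dim: int, layers: int, steps: int) -> int:
    """Attention FLOPs for a full sampling run.

    `steps` stands in for the number of function evaluations (NFE);
    the per-step term is what efficient attention, compression, and
    caching methods try to shrink.
    """
    num_tokens = frames * height * width  # spatial-temporal token count
    per_step = layers * attention_flops(num_tokens, dim)
    return steps * per_step


# Hypothetical configurations (assumed values, not from the survey).
common = dict(height=32, width=32, dim=1024, layers=24)
image = sampling_attention_flops(frames=1, steps=50, **common)
video = sampling_attention_flops(frames=16, steps=50, **common)
few_step_video = sampling_attention_flops(frames=16, steps=4, **common)

print(f"video / image attention cost: {video / image:.0f}x")
print(f"50-step / 4-step video cost:  {video / few_step_video:.1f}x")
```

Under these assumptions, 16 times more tokens makes dense attention 256 times more expensive per step, because the cost is quadratic in the token count. Cutting the step count from 50 to 4 reduces the total linearly, by 12.5 times. This is why reducing function evaluations and reducing per-step overhead are complementary rather than interchangeable.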