arxiv_cs_cv April 24, 2026

Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

Translated: 2026/4/24 19:41:53

sparse-forcing, autoregressive-diffusion, video-generation, sparse-attention, gpu-acceleration

Original Content

arXiv:2604.21221v1 (announce type: new)

Abstract: We introduce Sparse Forcing, a training-and-inference paradigm for autoregressive video diffusion models that improves long-horizon generation quality while reducing decoding latency. Sparse Forcing is motivated by an empirical observation in autoregressive diffusion rollouts: attention concentrates on a persistent subset of salient visual blocks, forming an implicit spatiotemporal memory in the KV cache, and exhibits a locally structured block-sparse pattern within sliding windows. Building on this observation, we propose a trainable native sparsity mechanism that learns to compress, preserve, and update these persistent blocks while restricting computation within each local window to a dynamically selected local neighborhood. To make the approach practical at scale for both training and inference, we further propose Persistent Block-Sparse Attention (PBSA), an efficient GPU kernel that accelerates sparse attention and memory updates for low-latency, memory-efficient decoding. Experiments show that Sparse Forcing improves the VBench score by +0.26 over Self-Forcing on 5-second text-to-video generation while delivering a 1.11-1.17x decoding speedup and 42% lower peak KV-cache footprint. The gains are more pronounced on longer-horizon rollouts, delivering improved visual quality with +0.68 and +2.74 VBench improvements, and 1.22x and 1.27x speedups on 20-second and 1-minute generations, respectively.
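
The abstract describes attention restricted to a small set of persistent blocks plus a dynamically selected local neighborhood inside a sliding window. The pure-Python sketch below illustrates that general idea only. It is not the paper's PBSA kernel or its learned mechanism: the function names (`select_blocks`, `sparse_attention`), the mean-pooled block scoring, and the fixed `top_k` rule are all assumptions made for illustration, and the persistent set is passed in by hand rather than learned.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_mean(vectors):
    """Mean-pool a block of key vectors into one summary vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def select_blocks(q, keys, block_size, persistent, window, top_k):
    """Pick block indices: all persistent blocks plus the top-k scoring
    blocks among the last `window` blocks (hypothetical heuristic)."""
    n_blocks = len(keys) // block_size  # trailing partial block is ignored
    local = [b for b in range(max(0, n_blocks - window), n_blocks)
             if b not in persistent]
    # Score each local block by the query's similarity to its pooled key.
    scores = {b: dot(q, block_mean(keys[b * block_size:(b + 1) * block_size]))
              for b in local}
    chosen = sorted(local, key=lambda b: scores[b], reverse=True)[:top_k]
    return sorted(set(persistent) | set(chosen))

def sparse_attention(q, keys, values, block_size, blocks):
    """Softmax attention computed only over tokens in the selected blocks."""
    idx = [t for b in blocks
           for t in range(b * block_size, (b + 1) * block_size)]
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[t]) * scale for t in idx]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * values[t][d] for w, t in zip(weights, idx)) / z
            for d in range(dim)]
```

In this sketch each query attends to `(len(persistent) + top_k) * block_size` tokens at most, rather than the full cache, which is the kind of saving a block-sparse scheme relies on. A real implementation would fuse selection and attention in a GPU kernel and learn which blocks to keep.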