arxiv_cs_cv February 10, 2026

Rolling Sink: Bridging Finite-Horizon Training and Unbounded-Horizon Testing in Autoregressive Video Diffusion Models

Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

Translated: 2026/3/15 18:06:04

diffusion-models, video-generation, autoregressive, long-horizon, rollout-optimization

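Japanese Translation (rendered in English)

arXiv:2602.07775v1 Announce Type: new Abstract: Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, because their training durations are limited, a train-test gap arises when testing at long horizons, causing rapid visual degradation. Building on Self Forcing, which studied the train-test gap within the training duration, this paper studies the train-test gap beyond the training duration, that is, the gap between finite horizons during training and unbounded horizons during testing. Because open-ended testing can extend beyond any finite training window and long-video training is computationally expensive, we seek a training-free solution to bridge this gap. To that end, we conduct a systematic analysis of AR cache maintenance. These findings led to Rolling Sink. Built on Self Forcing (trained on clips of only 5 seconds), Rolling Sink enables the synthesis of ultra-long videos at test time (e.g., 5 to 30 minutes at 16 FPS), with consistent subjects, stable colors, coherent structure, and smooth motion. As extensive experiments demonstrate, Rolling Sink achieves better long-horizon visual fidelity and temporal consistency than state-of-the-art baselines. Project page: https://rolling-sink.github.io/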

Original Content

arXiv:2602.07775v1 Announce Type: new Abstract: Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradation. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons seen during training and the open-ended horizons encountered during testing. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To that end, we conduct a systematic analysis of AR cache maintenance. The resulting insights lead to Rolling Sink. Built on Self Forcing (trained on only 5s clips), Rolling Sink effectively scales AR video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) at test time, with consistent subjects, stable colors, coherent structures, and smooth motions. As demonstrated by extensive experiments, Rolling Sink achieves superior long-horizon visual fidelity and temporal consistency compared to state-of-the-art baselines. Project page: https://rolling-sink.github.io/
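
To give a sense of scale for the horizon gap described in the abstract, the short Python sketch below converts the stated durations (5 s training clips, 5-30 minute test videos, 16 FPS) into frame counts. It also includes a toy bounded cache to illustrate why some cache maintenance policy is needed once a rollout is open-ended. The `BoundedCache` class, its `num_sink` and `window` parameters, and the "keep the first few entries plus a recent window" policy are hypothetical and generic; the abstract does not specify how Rolling Sink maintains its cache, so this should not be read as the paper's method. The counts are output frames; a real model may cache compressed latent frames, which is not modeled here.

```python
from collections import deque

FPS = 16            # frame rate stated in the abstract
TRAIN_SECONDS = 5   # training clip length stated in the abstract


def frames(seconds: float, fps: int = FPS) -> int:
    """Number of output frames in a clip of the given duration."""
    return int(round(seconds * fps))


train_frames = frames(TRAIN_SECONDS)   # 5 s training clip
test_frames_min = frames(5 * 60)       # 5-minute test video
test_frames_max = frames(30 * 60)      # 30-minute test video

# How many times longer the test horizon is than the training horizon.
ratio_min = test_frames_min // train_frames
ratio_max = test_frames_max // train_frames


class BoundedCache:
    """Hypothetical fixed-size cache (NOT the paper's algorithm).

    Keeps the first `num_sink` entries permanently and only the most
    recent `window` entries after that, so memory stays constant no
    matter how long the rollout runs.
    """

    def __init__(self, num_sink: int, window: int) -> None:
        self.num_sink = num_sink
        self.sink: list = []
        self.recent: deque = deque(maxlen=window)  # old entries fall off

    def append(self, item) -> None:
        if len(self.sink) < self.num_sink:
            self.sink.append(item)
        else:
            self.recent.append(item)

    def contents(self) -> list:
        return self.sink + list(self.recent)


# Simulate a 30-minute rollout; integers stand in for per-frame cache entries.
cache = BoundedCache(num_sink=2, window=6)
for frame_index in range(test_frames_max):
    cache.append(frame_index)

print(train_frames, test_frames_min, test_frames_max, ratio_min, ratio_max)
print(len(cache.contents()), cache.contents())
```

Under these assumptions, the test horizon is 60 to 360 times longer than the training horizon, while the toy cache never holds more than `num_sink + window` entries. Which entries to keep and how to refresh them is exactly the kind of design question the abstract refers to as cache maintenance.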