arxiv_cs_cv 2026年4月24日

ImVideoEdit: 2D 空間差分注意ブロックを介した画像学習型動画編集

ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

Translated: 2026/4/24 19:51:47

imvideodatevideo-editingspatial-differenceattention-mechanismstext-guided-gating

Japanese Translation

arXiv:2604.07958v2 発表タイプ：置き換え要約：既存の動画編集モデルは高コストなペア付き動画データに依存しており、実用的な拡張性の制限があります。本質的に、ほとんどの動画編集タスクは、事前学習済みモデルの時間的ダイナミクスを維持しつつ空間的内容を選択的かつ精密に修正する、解耦された時空間プロセスとして形式化できます。この洞察に基づき、画像ペアのみから完全に動画編集能力を学習する効率的なフレームワーク ImVideoEdit を提案します。事前学習済み 3D 注意モジュールを固定し、画像を単一フレームの動画とみなすことで、元の時間的ダイナミクスを保持するために 2D 空間学習プロセスを解耦します。私たちのアプローチの核は、空間差を段階的に抽出して注入する「予測 - 更新空間差分注意モジュール」です。硬質な外部マスクに依存する代わりに、適応的かつ暗黙的なテキスト駆動の修正を実現するために、テキストに誘導された動的セマンティックゲート機構を統合します。たった 13K 件の画像ペアを 5 ヵ月（epocs）で訓練し、計算負荷が極めて低いにもかかわらず、ImVideoEdit は大規模な動画データセットで訓練された大型モデルと同等の編集忠実性と時間的整合性を達成しました。

Original Content

arXiv:2604.07958v2 Announce Type: replace Abstract: Current video editing models often rely on expensive paired video data, which limits their practical scalability. In essence, most video editing tasks can be formulated as a decoupled spatiotemporal process, where the temporal dynamics of the pretrained model are preserved while spatial content is selectively and precisely modified. Based on this insight, we propose ImVideoEdit, an efficient framework that learns video editing capabilities entirely from image pairs. By freezing the pre-trained 3D attention modules and treating images as single-frame videos, we decouple the 2D spatial learning process to help preserve the original temporal dynamics. The core of our approach is a Predict-Update Spatial Difference Attention module that progressively extracts and injects spatial differences. Rather than relying on rigid external masks, we incorporate a Text-Guided Dynamic Semantic Gating mechanism for adaptive and implicit text-driven modifications. Despite training on only 13K image pairs for 5 epochs with exceptionally low computational overhead, ImVideoEdit achieves editing fidelity and temporal consistency comparable to larger models trained on extensive video datasets.