arxiv_cs_cv February 10, 2026

PISCO: Precise Video Instance Insertion with Sparse Control

Translated: 2026/3/15 19:05:45

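Translation (rendered from Japanese)

arXiv:2602.08277v1 Announce Type: new [Abstract] The landscape of AI video generation is at a decisive turning point: it is moving beyond general-purpose generation, which depends on extensive prompt engineering and "cherry-picking" of results, toward fine-grained controllable generation and high-fidelity post-processing. In professional AI-assisted filmmaking, the ability to make precise, targeted modifications is essential. A pillar of this transition is video instance insertion, which inserts a specific instance into existing footage while preserving the integrity of the whole scene. Unlike conventional video editing, this task imposes several requirements: strict spatio-temporal placement, physically consistent interaction with the scene, and faithful preservation of the original dynamics, all of which must be achievable with minimal user effort. This paper proposes PISCO, a video diffusion model for precise video instance insertion that supports arbitrary sparse keyframe control. With PISCO, a user can specify a single keyframe, a start keyframe and an end keyframe, or sparse keyframes placed at arbitrary timestamps, and the object's appearance, motion, and interactions are propagated automatically. To address the severe distribution shift that sparse conditioning causes in pretrained video diffusion models, the authors introduce Variable-Information Guidance for robust conditioning and Distribution-Preserving Temporal Masking to stabilize temporal generation, together with geometry-aware conditioning for realistic scene adaptation. They also construct PISCO-Bench, a benchmark with verified instance annotations and paired clean background videos, and evaluate performance with both reference-based and reference-free perceptual metrics. The experiments show that PISCO consistently outperforms strong inpainting and video-editing baselines under sparse control, with clear, monotonic improvements in performance as additional control signals are provided. Project page: xiangbogaobarry.github.io/PISCO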

Original Content

arXiv:2602.08277v1
Announce Type: new

Abstract: The landscape of AI video generation is undergoing a pivotal shift: moving beyond general generation - which relies on exhaustive prompt-engineering and "cherry-picking" - towards fine-grained, controllable generation and high-fidelity post-processing. In professional AI-assisted filmmaking, it is crucial to perform precise, targeted modifications. A cornerstone of this transition is video instance insertion, which requires inserting a specific instance into existing footage while maintaining scene integrity. Unlike traditional video editing, this task demands several requirements: precise spatial-temporal placement, physically consistent scene interaction, and the faithful preservation of original dynamics - all achieved under minimal user effort. In this paper, we propose PISCO, a video diffusion model for precise video instance insertion with arbitrary sparse keyframe control. PISCO allows users to specify a single keyframe, start-and-end keyframes, or sparse keyframes at arbitrary timestamps, and automatically propagates object appearance, motion, and interaction. To address the severe distribution shift induced by sparse conditioning in pretrained video diffusion models, we introduce Variable-Information Guidance for robust conditioning and Distribution-Preserving Temporal Masking to stabilize temporal generation, together with geometry-aware conditioning for realistic scene adaptation. We further construct PISCO-Bench, a benchmark with verified instance annotations and paired clean background videos, and evaluate performance using both reference-based and reference-free perceptual metrics. Experiments demonstrate that PISCO consistently outperforms strong inpainting and video editing baselines under sparse control, and exhibits clear, monotonic performance improvements as additional control signals are provided. Project page: xiangbogaobarry.github.io/PISCO.
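
The abstract names three ways a user can supply sparse control: a single keyframe, start-and-end keyframes, or keyframes at arbitrary timestamps. It does not say how PISCO represents these internally, so the following is only an illustrative sketch, assuming each mode reduces to a per-frame binary mask that marks which frames carry a user-provided keyframe. The function names `sparse_keyframe_mask` and `control_density` and the 8-frame clip length are hypothetical and are not taken from the paper.

```python
from typing import Iterable, List


def sparse_keyframe_mask(num_frames: int, keyframes: Iterable[int]) -> List[int]:
    """Return a per-frame 0/1 mask: 1 where the user supplies a keyframe.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    mask = [0] * num_frames
    for t in keyframes:
        if not 0 <= t < num_frames:
            raise ValueError(f"keyframe index {t} is outside [0, {num_frames})")
        mask[t] = 1
    return mask


def control_density(mask: List[int]) -> float:
    """Fraction of frames that carry user-provided control."""
    return sum(mask) / len(mask)


num_frames = 8  # assumed clip length, chosen only for the example

# The three control modes listed in the abstract, expressed as masks.
single = sparse_keyframe_mask(num_frames, [0])
start_end = sparse_keyframe_mask(num_frames, [0, num_frames - 1])
arbitrary = sparse_keyframe_mask(num_frames, [1, 4, 6])

for name, mask in [("single", single), ("start_end", start_end), ("arbitrary", arbitrary)]:
    print(name, mask, control_density(mask))
```

Under this assumption, the "additional control signals" mentioned in the abstract correspond to a higher control density, which gives one simple axis along which the reported monotonic improvement could be plotted.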