arxiv_cs_cv February 10, 2026

VFace: A Training-Free Approach for Diffusion-Based Video Face Swapping

Translated: 2026/3/15 19:02:16

diffusion-models, face-swapping, video-generation, computer-vision, arxiv-ai

Original Content

arXiv:2602.07835v1, Announce Type: new

Abstract: We present VFace, a training-free, plug-and-play method for high-quality face swapping in videos. It can be seamlessly integrated with image-based face swapping approaches built on diffusion models. First, we introduce a Frequency Spectrum Attention Interpolation technique that facilitates generation while keeping key identity characteristics intact. Second, we achieve Target Structure Guidance via plug-and-play attention injection to better align the generation with the structural features of the target frame. Third, to reduce the temporal inconsistencies typically encountered in frame-wise generation, we present a Flow-Guided Attention Temporal Smoothening mechanism that enforces spatiotemporal coherence without modifying the underlying diffusion model. Our method requires no additional training or video-specific fine-tuning. Extensive experiments show that our method significantly enhances temporal consistency and visual fidelity, offering a practical and modular solution for video-based face swapping. Our code is available at https://github.com/Sanoojan/VFace.
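The abstract does not spell out how the flow-guided temporal smoothing operates, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a single-channel feature map per frame, an integer backward optical flow, nearest-neighbour warping, and a fixed blend weight; the names `warp_previous`, `smooth_temporally`, and `alpha` are hypothetical and do not come from the VFace repository.

```python
from typing import List, Tuple

Grid = List[List[float]]
# Assumption: flow[y][x] = (dy, dx) is an integer offset pointing from the
# current frame back to the matching location in the previous frame.
Flow = List[List[Tuple[int, int]]]


def warp_previous(prev: Grid, flow: Flow) -> Grid:
    """Backward-warp `prev` into the current frame with nearest-neighbour lookup."""
    h, w = len(prev), len(prev[0])
    warped = []
    for y in range(h):
        row = []
        for x in range(w):
            dy, dx = flow[y][x]
            # Clamp the source coordinate so out-of-frame motion reuses the border value.
            sy = min(max(y + dy, 0), h - 1)
            sx = min(max(x + dx, 0), w - 1)
            row.append(prev[sy][sx])
        warped.append(row)
    return warped


def smooth_temporally(curr: Grid, prev: Grid, flow: Flow, alpha: float = 0.5) -> Grid:
    """Blend current features with the flow-aligned previous features.

    alpha = 0 keeps the current frame unchanged; alpha = 1 copies the warped
    previous frame. The fixed weight is an illustrative assumption.
    """
    warped = warp_previous(prev, flow)
    return [
        [(1.0 - alpha) * c + alpha * p for c, p in zip(curr_row, warped_row)]
        for curr_row, warped_row in zip(curr, warped)
    ]


# Toy example: a bright pixel moves one step to the right between frames.
prev_frame = [[1.0, 0.0], [0.0, 0.0]]
curr_frame = [[0.0, 1.0], [0.0, 0.0]]
back_flow = [[(0, -1), (0, -1)], [(0, -1), (0, -1)]]
smoothed = smooth_temporally(curr_frame, prev_frame, back_flow, alpha=0.5)
```

In this toy case the moved pixel keeps its full value because the warped previous frame agrees with the current one. A real pipeline would apply the same idea to multi-channel attention features with sub-pixel flow and bilinear sampling.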