arxiv_cs_cv February 10, 2026

SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads

Translated: 2026/3/15 18:03:08

soulx-flashhead, audio-driven-portrait, real-time-streaming, bidirectional-distillation, face-animation

Japanese Translation

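arXiv:2602.07449v1 Announce Type: new Abstract: Balancing high-fidelity visual quality against low-latency streaming is a major challenge in audio-driven portrait generation. Existing large-scale models incur prohibitive computational costs, while lightweight alternatives sacrifice holistic facial representation and temporal stability. In this paper, we propose SoulX-FlashHead, a unified 1.3B-parameter framework designed for real-time, infinite-length, high-fidelity streaming video generation. To address the instability of audio features in streaming scenarios, we introduce Streaming-Aware Spatiotemporal Pre-training with a Temporal Audio Context Cache mechanism, which ensures robust feature extraction from short audio fragments. To mitigate the error accumulation and identity drift inherent in long-sequence autoregressive generation, we further propose Oracle-Guided Bidirectional Distillation, which uses ground-truth motion priors to provide precise physical guidance. We also present VividHead, a large-scale, high-quality dataset containing 782 hours of strictly aligned footage to support robust training. Extensive experiments show that SoulX-FlashHead achieves state-of-the-art performance on the HDTF and VFHQ benchmarks. Notably, our Lite variant reaches an inference speed of 96 FPS on a single NVIDIA RTX 4090, enabling ultra-fast interaction without sacrificing visual coherence.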

Original Content

arXiv:2602.07449v1 Announce Type: new Abstract: Achieving a balance between high-fidelity visual quality and low-latency streaming remains a formidable challenge in audio-driven portrait generation. Existing large-scale models often suffer from prohibitive computational costs, while lightweight alternatives typically compromise on holistic facial representations and temporal stability. In this paper, we propose SoulX-FlashHead, a unified 1.3B-parameter framework designed for real-time, infinite-length, and high-fidelity streaming video generation. To address the instability of audio features in streaming scenarios, we introduce Streaming-Aware Spatiotemporal Pre-training equipped with a Temporal Audio Context Cache mechanism, which ensures robust feature extraction from short audio fragments. Furthermore, to mitigate the error accumulation and identity drift inherent in long-sequence autoregressive generation, we propose Oracle-Guided Bidirectional Distillation, leveraging ground-truth motion priors to provide precise physical guidance. We also present VividHead, a large-scale, high-quality dataset containing 782 hours of strictly aligned footage to support robust training. Extensive experiments demonstrate that SoulX-FlashHead achieves state-of-the-art performance on HDTF and VFHQ benchmarks. Notably, our Lite variant achieves an inference speed of 96 FPS on a single NVIDIA RTX 4090, facilitating ultra-fast interaction without sacrificing visual coherence.
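
To put the reported 96 FPS figure in context, the short Python sketch below converts a frame rate into an average per-frame time budget and compares it with a playback rate. Only the 96 FPS value comes from the abstract. The 25 FPS playback rate and the helper names frame_budget_ms and realtime_headroom are illustrative assumptions, not details from the paper.

```python
def frame_budget_ms(fps: float) -> float:
    """Average time available per generated frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def realtime_headroom(generation_fps: float, playback_fps: float) -> float:
    """Ratio of generation speed to playback speed (>1 means faster than real time)."""
    if playback_fps <= 0:
        raise ValueError("playback_fps must be positive")
    return generation_fps / playback_fps


LITE_FPS = 96.0              # reported in the abstract (single NVIDIA RTX 4090)
ASSUMED_PLAYBACK_FPS = 25.0  # assumption: the abstract does not state a playback rate

budget = frame_budget_ms(LITE_FPS)
headroom = realtime_headroom(LITE_FPS, ASSUMED_PLAYBACK_FPS)
print(f"{budget:.2f} ms per frame, {headroom:.2f}x playback rate")
# prints "10.42 ms per frame, 3.84x playback rate"
```

Under the assumed 25 FPS playback rate, the Lite variant would generate frames nearly four times faster than they are consumed. In a streaming pipeline that margin would be available for audio feature extraction, encoding, and network transfer.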