arxiv_cs_cv April 20, 2026

Polyglot: Multilingual Speech-Driven Facial Animation That Preserves Language Style

Polyglot: Multilingual Style Preserving Speech-Driven Facial Animation

Translated: 2026/4/20 10:45:20

diffusion-models, speech-driven-facial-animation, multilingual-ai, generative-ai, face-animation

Japanese Translation

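arXiv:2604.16108v1 Announce Type: new Abstract: Speech-driven facial animation (SDFA) has attracted attention owing to applications such as film, video games, and virtual reality. However, many existing models are trained on single-language data, which limits their effectiveness in real-world multilingual scenarios. In this work, we address multilingual SDFA, which is essential for realistic generation because language affects pronunciation, rhythm, intonation, and facial expression. Speaking style is shaped not only by language but also by individual characteristics. Existing methods rely on either language-specific or speaker-specific conditioning, but not both, which limits their ability to model the interaction between the two. We introduce Polyglot, a unified diffusion-based architecture for personalized multilingual SDFA. Our method encodes language information with transcript embeddings and captures individual speaking characteristics with style embeddings extracted from reference facial sequences, so no predefined language or speaker labels are required. Self-supervised learning enables generalization across languages and speakers. By conditioning jointly on language and style, it captures expressive traits such as rhythm, articulation, and habitual facial movements, and generates temporally coherent, realistic animations. Experimental results show improved performance in both monolingual and multilingual settings, and the work provides a unified framework for modeling language and personal style in SDFA.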

Original Content

arXiv:2604.16108v1 Announce Type: new Abstract: Speech-Driven Facial Animation (SDFA) has gained significant attention due to its applications in movies, video games, and virtual reality. However, most existing models are trained on single-language data, limiting their effectiveness in real-world multilingual scenarios. In this work, we address multilingual SDFA, which is essential for realistic generation since language influences phonetics, rhythm, intonation, and facial expressions. Speaking style is also shaped by individual differences, not only by language. Existing methods typically rely on either language-specific or speaker-specific conditioning, but not both, limiting their ability to model their interaction. We introduce Polyglot, a unified diffusion-based architecture for personalized multilingual SDFA. Our method uses transcript embeddings to encode language information and style embeddings extracted from reference facial sequences to capture individual speaking characteristics. Polyglot does not require predefined language or speaker labels, enabling generalization across languages and speakers through self-supervised learning. By jointly conditioning on language and style, it captures expressive traits such as rhythm, articulation, and habitual facial movements, producing temporally coherent and realistic animations. Experiments show improved performance in both monolingual and multilingual settings, providing a unified framework for modeling language and personal style in SDFA.
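
The abstract describes joint conditioning on language and style but gives no architectural detail, so the following is only a toy sketch of what "jointly conditioning" a diffusion denoiser could look like. Everything in it is an assumption made for illustration: the concatenation of audio, transcript, and style vectors, the linear `ToyDenoiser`, the DDPM-style noise-prediction loss, and all dimensions and names. It should not be read as Polyglot's actual design.

```python
import math
import random


def forward_noise(x0, alpha_bar, rng):
    """DDPM-style forward step: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
           for x, e in zip(x0, eps)]
    return x_t, eps


def joint_condition(audio_feat, transcript_emb, style_emb):
    """Assumed fusion: concatenate the three conditioning signals."""
    return list(audio_feat) + list(transcript_emb) + list(style_emb)


class ToyDenoiser:
    """Hypothetical linear stand-in for a learned noise predictor."""

    def __init__(self, motion_dim, cond_dim, rng):
        in_dim = motion_dim + cond_dim + 1  # +1 for the timestep scalar
        self.w = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
                  for _ in range(motion_dim)]

    def predict_noise(self, x_t, t, cond):
        inp = list(x_t) + list(cond) + [t]
        return [sum(w * v for w, v in zip(row, inp)) for row in self.w]


def noise_prediction_loss(pred, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(pred, eps)) / len(eps)


rng = random.Random(0)
x0 = [0.2, -0.1, 0.4, 0.0]    # one frame of facial motion parameters (made up)
audio = [0.5, 0.1, -0.3]      # audio features (made up)
transcript = [1.0, 0.0]       # transcript embedding carrying language cues
style = [0.3, -0.7]           # style embedding from a reference face sequence

cond = joint_condition(audio, transcript, style)
model = ToyDenoiser(motion_dim=len(x0), cond_dim=len(cond), rng=rng)
x_t, eps = forward_noise(x0, alpha_bar=0.5, rng=rng)
loss = noise_prediction_loss(model.predict_noise(x_t, 0.5, cond), eps)
print(f"toy noise-prediction loss: {loss:.4f}")
```

In this sketch, swapping only `style` or only `transcript` changes the predicted noise for the same audio, which is the property the abstract attributes to conditioning on both signals rather than on one.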