arxiv_cs_cv April 20, 2026

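Cross-Modal Emotion Transfer Based on Modeling Emotion Semantic Vectors Between Speech and Visual Feature Spaces: An Approach to Emotion Editing in Talking Face Video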

Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video

Translated: 2026/4/20 10:51:49

face-video-generation, emotion-editing, cross-modal-transfer, semantic-vectors, tts-synthesis

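Translation (from the Japanese)

Paper ID: arXiv:2604.07786v2 Announce type: replace

Abstract: Talking face generation has attracted attention as a core application of generative models. Emotion editing in talking face video plays a crucial role in making synthesized videos more expressive and realistic. Existing approaches, however, often limit expressive flexibility and struggle to generate extended emotions (e.g., sarcasm). Label-based methods represent emotions as discrete categories and cannot capture a wide range of emotions. Audio-based methods can draw on emotionally rich speech signals, and can even benefit from expressive text-to-speech (TTS) synthesis, but they fail to express the target emotion because emotion and linguistic content are entangled in emotional speech. Image-based methods rely on target reference images to guide emotion transfer, but they require high-quality frontal views, and reference data for extended emotions (e.g., sarcasm) is hard to obtain. To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a new approach that models emotion semantic vectors between the speech and visual feature spaces in order to generate talking face expressions. C-MET uses a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two distinct emotion embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets show that our method improves emotion accuracy by 14% over state-of-the-art methods and generates expressive talking face videos, even for unseen extended emotions. Code, checkpoints, and a demo are available at https://chanhyeok-choi.github.io/C-MET/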

Original Content

arXiv:2604.07786v2 Announce Type: replace

Abstract: Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals, and even benefit from expressive text-to-speech (TTS) synthesis, but they fail to express the target emotions because emotion and linguistic content are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions from speech by modeling emotion semantic vectors between speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos, even for unseen extended emotions. Code, checkpoints, and a demo are available at https://chanhyeok-choi.github.io/C-MET/
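The abstract describes an emotion semantic vector as the difference between two emotional embeddings, carried from the speech feature space into the visual one. The toy sketch below illustrates only that idea, and it is not the authors' implementation. Every name in it (`emotion_semantic_vector`, `project`, `transfer_expression`, `audio_to_visual`) is hypothetical, the embeddings are hand-written numbers rather than encoder outputs, and the fixed linear map stands in for whatever learned cross-modal mapping C-MET actually uses, which the abstract does not specify.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def emotion_semantic_vector(source_emb: Sequence[float],
                            target_emb: Sequence[float]) -> Vector:
    """Direction from the source emotion to the target emotion in one space."""
    if len(source_emb) != len(target_emb):
        raise ValueError("embeddings must have the same dimension")
    return [t - s for s, t in zip(source_emb, target_emb)]


def project(matrix: Matrix, vector: Sequence[float]) -> Vector:
    """Plain matrix-vector product; a stand-in for a learned mapping."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def transfer_expression(visual_emb: Sequence[float],
                        audio_source: Sequence[float],
                        audio_target: Sequence[float],
                        audio_to_visual: Matrix) -> Vector:
    """Shift a visual expression embedding along an audio-derived direction."""
    # 1. Emotion direction measured in the (assumed) audio embedding space.
    audio_direction = emotion_semantic_vector(audio_source, audio_target)
    # 2. Carry that direction into the visual space (assumed linear here).
    visual_direction = project(audio_to_visual, audio_direction)
    if len(visual_direction) != len(visual_emb):
        raise ValueError("projected direction does not match visual dimension")
    # 3. Edit the face embedding by adding the projected direction.
    return [v + d for v, d in zip(visual_emb, visual_direction)]


# Hand-written toy values: 3-dimensional audio space, 2-dimensional visual space.
audio_neutral = [0.0, 1.0, 0.0]
audio_happy = [1.0, 1.0, 2.0]
audio_to_visual = [[1.0, 0.0, 0.0],
                   [0.0, 0.0, 0.5]]
visual_neutral = [0.5, -0.5]

edited = transfer_expression(visual_neutral, audio_neutral, audio_happy,
                             audio_to_visual)
```

Working with a difference of embeddings rather than a single target embedding is what lets the linguistic content shared by both utterances cancel out in this toy setting. Whether and how that cancellation holds for real encoder outputs is a question for the paper itself.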