arxiv_cs_cv February 10, 2026

Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

Translated: 2026/3/15 17:05:11

omni-modal, large-language-models, 3d-animation, speech, facial-recognition

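Translation (English rendering of the Japanese text)

arXiv:2602.07106v1 Announce Type: new Abstract: Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet combining speech with 3D facial animation remains largely unexplored despite its importance for natural interaction. The challenge stems from a representation mismatch between the discrete, token-level semantic reasoning of LLMs and the dense, fine-grained temporal dynamics required for 3D facial motion, which makes direct modeling hard to optimize under limited data. We propose Expressive Omni (Ex-Omni), an open-source omni-modal framework that augments OLLMs with speech-accompanied 3D facial animation. Ex-Omni reduces learning difficulty by decoupling semantic reasoning from temporal generation, using speech units as temporal scaffolding, and employing a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. We also introduce InstructEx, a dataset intended to facilitate augmenting OLLMs with speech-accompanied 3D facial animation. Extensive experiments show that Ex-Omni is competitive with existing open-source OLLMs while enabling stable, aligned speech and facial animation generation.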

Original Content

arXiv:2602.07106v1 Announce Type: new Abstract: Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet incorporating speech with 3D facial animation remains largely unexplored despite its importance for natural interaction. A key challenge arises from the representation mismatch between discrete, token-level semantic reasoning in LLMs and the dense, fine-grained temporal dynamics required for 3D facial motion, which makes direct modeling difficult to optimize under limited data. We propose Expressive Omni (Ex-Omni), an open-source omni-modal framework that augments OLLMs with speech-accompanied 3D facial animation. Ex-Omni reduces learning difficulty by decoupling semantic reasoning from temporal generation, leveraging speech units as temporal scaffolding and a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. We further introduce InstructEx, a dataset that aims to facilitate augmenting OLLMs with speech-accompanied 3D facial animation. Extensive experiments demonstrate that Ex-Omni performs competitively against existing open-source OLLMs while enabling stable, aligned speech and facial animation generation.
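
The abstract names a token-as-query gated fusion (TQGF) mechanism but does not spell out its equations, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes each token of the temporal stream (for example a speech-unit embedding) acts as an attention query over the LLM's semantic hidden states, and that a sigmoid gate controls how much of the attended context is added back. The function name `gated_fusion`, the weight matrices `Wq`, `Wk`, `Wv`, `Wg`, and all dimensions are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(temporal, semantic, Wq, Wk, Wv, Wg):
    """Hypothetical token-as-query gated fusion (an assumption, not the paper's code).

    temporal: T vectors of size d (e.g. speech-unit embeddings)
    semantic: S vectors of size d (e.g. LLM hidden states)
    Returns (fused, gates), each a list of T vectors of size d.
    """
    d = len(temporal[0])
    keys = [matvec(Wk, s) for s in semantic]
    values = [matvec(Wv, s) for s in semantic]
    fused, gates = [], []
    for t in temporal:
        # The temporal token is the query; semantic states are keys and values.
        q = matvec(Wq, t)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        attn = softmax(scores)
        context = [sum(a * v[i] for a, v in zip(attn, values)) for i in range(d)]
        # Per-dimension gate computed from the concatenation [t; context].
        g = [sigmoid(z) for z in matvec(Wg, t + context)]
        # Controlled injection: residual add of the gated semantic context.
        fused.append([ti + gi * ci for ti, gi, ci in zip(t, g, context)])
        gates.append(g)
    return fused, gates


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, T, S = 4, 6, 3  # toy sizes chosen for illustration only
temporal = random_matrix(T, d, rng)
semantic = random_matrix(S, d, rng)
Wq, Wk, Wv = (random_matrix(d, d, rng) for _ in range(3))
Wg = random_matrix(d, 2 * d, rng)

fused, gates = gated_fusion(temporal, semantic, Wq, Wk, Wv, Wg)
print(len(fused), len(fused[0]))  # prints: 6 4
```

In this sketch the output keeps the length of the temporal stream, which is one way to read "speech units as temporal scaffolding": the semantic states only modulate each time step and never change the number of steps. A gate near zero leaves a temporal token essentially untouched, while a gate near one injects the full attended context.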