arxiv_cs_cv February 10, 2026

TriC-Motion: Tri-Domain Causal Modeling Grounded Text-to-Motion Generation

Translated: 2026/3/16 14:05:16

text-to-motion, causal-modeling, diffusion, motion-synthesis, computer-vision

Original Content

arXiv:2602.08462v1 Announce Type: new

Abstract: Text-to-motion generation, a rapidly evolving field in computer vision, aims to produce realistic and text-aligned motion sequences. Current methods primarily focus on spatial-temporal modeling or independent frequency domain analysis, lacking a unified framework for joint optimization across spatial, temporal, and frequency domains. This limitation hinders the model's ability to leverage information from all domains simultaneously, leading to suboptimal generation quality. Additionally, in motion generation frameworks, motion-irrelevant cues caused by noise are often entangled with features that contribute positively to generation, thereby leading to motion distortion. To address these issues, we propose Tri-Domain Causal Text-to-Motion Generation (TriC-Motion), a novel diffusion-based framework integrating spatial-temporal-frequency-domain modeling with causal intervention. TriC-Motion includes three core modeling modules for domain-specific modeling, namely Temporal Motion Encoding, Spatial Topology Modeling, and Hybrid Frequency Analysis. After comprehensive modeling, a Score-guided Tri-domain Fusion module integrates valuable information from the triple domains, simultaneously ensuring temporal consistency, spatial topology, motion trends, and dynamics. Moreover, the Causality-based Counterfactual Motion Disentangler is meticulously designed to expose motion-irrelevant cues to eliminate noise, disentangling the real modeling contributions of each domain for superior generation. Extensive experimental results validate that TriC-Motion achieves superior performance compared to state-of-the-art methods, attaining an outstanding R@1 of 0.612 on the HumanML3D dataset. These results demonstrate its capability to generate high-fidelity, coherent, diverse, and text-aligned motion sequences. Code is available at: https://caoyiyang1105.github.io/TriC-Motion/.
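
The abstract reports R@1 without defining it. In text-to-motion evaluation, R@1 usually denotes top-1 retrieval precision: the fraction of text descriptions whose matched motion is the nearest candidate in a shared text/motion embedding space. The sketch below illustrates that idea on toy data. It is not the authors' evaluation code: the function name `r_at_k`, the hand-written 2-D embeddings, and the use of plain Euclidean distance over a single candidate pool are all assumptions made for illustration, and the abstract does not state the pool size or feature extractors used.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def r_at_k(text_feats, motion_feats, k=1):
    """Fraction of texts whose paired motion (same index) is among the k nearest.

    Assumption: text_feats[i] and motion_feats[i] embed a matched text/motion
    pair in a shared feature space, and the whole list is one candidate pool.
    """
    hits = 0
    for i, t in enumerate(text_feats):
        # Rank every candidate motion by its distance to this text embedding.
        ranked = sorted(range(len(motion_feats)),
                        key=lambda j: euclidean(t, motion_feats[j]))
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_feats)


# Toy pool of 4 pairs (illustrative values only, not real model features).
# The last two motion embeddings are deliberately swapped, so texts 2 and 3
# each retrieve the wrong motion at rank 1.
texts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
motions = [[0.1, 0.0], [0.9, 0.1], [0.9, 0.9], [0.1, 0.9]]

print(r_at_k(texts, motions, k=1))  # 0.5
print(r_at_k(texts, motions, k=3))  # 1.0
```

Under this reading, an R@1 of 0.612 would mean that about 61% of text prompts retrieve their own generated motion as the closest match, so higher values indicate tighter text-motion alignment.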