arxiv_cs_cv April 20, 2026

Motion-Adapter: A Diffusion Model Adapter for Text-to-Motion Generation of Compound Actions

Translated: 2026/4/20 10:45:34

diffusion-models, text-to-video, motion-synthesis, human-robot-interaction, computer-vision

Original Content

arXiv:2604.16135v1 Announce Type: new

Abstract: Recent advances in generative motion synthesis have enabled the production of realistic human motions from diverse input modalities. However, synthesizing compound actions from texts, which integrate multiple concurrent actions into coherent full-body sequences, remains a major challenge. We identify two key limitations in current text-to-motion diffusion models: (i) catastrophic neglect, where earlier actions are overwritten by later ones due to improper handling of temporal information, and (ii) attention collapse, which arises from excessive feature fusion in cross-attention mechanisms. As a result, existing approaches often depend on overly detailed textual descriptions (e.g., raising right hand), explicit body-part specifications (e.g., editing the upper body), or the use of large language models (LLMs) for body-part interpretation. These strategies lead to deficient semantic representations of physical structures and kinematic mechanisms, limiting the ability to incorporate natural behaviors such as greeting while walking. To address these issues, we propose the Motion-Adapter, a plug-and-play module that guides text-to-motion diffusion models in generating compound actions by computing decoupled cross-attention maps, which serve as structural masks during the denoising process. Extensive experiments demonstrate that our method consistently produces more faithful and coherent compound motions across diverse textual prompts, surpassing state-of-the-art approaches.
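
The abstract does not specify how the decoupled cross-attention maps are computed or applied, so the following is only a toy sketch of the general idea, not the paper's implementation. It assumes that each action phrase is scored against each body part using only that phrase's own tokens, that the scores are normalized across actions to form soft masks, and that the masks then weight per-action noise predictions at a denoising step. The names `action_masks` and `blend_predictions`, the two-dimensional embeddings, and the body-part grouping are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def action_masks(part_queries, action_token_keys):
    """Return masks[a][p]: soft weight of action a on body part p.

    Hypothetical scheme (an assumption, not taken from the paper):
    each action is scored against each body part using only its own
    tokens, then scores are normalized across actions per body part.
    """
    scale = 1.0 / math.sqrt(len(part_queries[0]))
    per_part = []
    for q in part_queries:
        scores = []
        for keys in action_token_keys:
            logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
            scores.append(sum(logits) / len(logits))
        per_part.append(softmax(scores))
    # Transpose from [part][action] to [action][part].
    return [list(col) for col in zip(*per_part)]


def blend_predictions(masks, per_action_eps):
    """Mask-weighted sum of per-action noise predictions, per body part."""
    n_parts = len(masks[0])
    n_feat = len(per_action_eps[0][0])
    blended = [[0.0] * n_feat for _ in range(n_parts)]
    for mask, eps in zip(masks, per_action_eps):
        for p in range(n_parts):
            for f in range(n_feat):
                blended[p][f] += mask[p] * eps[p][f]
    return blended


# Toy data: two body parts (legs, right arm) and two actions
# ("walking", "waving"), one token embedding per action.
part_queries = [[4.0, 0.0], [0.0, 4.0]]
action_token_keys = [[[1.0, 0.0]], [[0.0, 1.0]]]
masks = action_masks(part_queries, action_token_keys)

# Made-up per-action noise predictions, one feature per body part.
eps_walk = [[1.0], [1.0]]
eps_wave = [[-1.0], [-1.0]]
blended = blend_predictions(masks, [eps_walk, eps_wave])
print(masks)
print(blended)
```

In this toy setup the legs end up dominated by the "walking" prediction and the right arm by the "waving" prediction, which is the kind of body-part separation a structural mask is meant to provide for a prompt such as greeting while walking.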