arxiv_cs_cv April 20, 2026

LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

Translated: 2026/4/20 10:50:59

llama, multimodal-generation, motion-understanding, continuous-autoregressive, flow-matching

Original Content

arXiv:2602.12370v2 Announce Type: replace

Abstract: Recent progress in large models has led to significant advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored. Existing approaches often fine-tune large language models (LLMs) on paired motion-text data, which can result in catastrophic forgetting of linguistic capabilities due to the limited scale of available text-motion pairs. Furthermore, prior methods typically convert motion into discrete representations via quantization to integrate with language models, introducing substantial jitter artifacts from discrete tokenization. To address these challenges, we propose LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers (MoT) architecture. This design inherently preserves the language understanding of the base model while enabling scalable multimodal adaptation. We encode human motion into a causal continuous latent space and maintain the next-token prediction paradigm in the decoder-only backbone through a lightweight flow-matching head, allowing for streaming motion generation in real time (>30 FPS). Leveraging the comprehensive language understanding of pretrained LLMs and large-scale motion-text pretraining, our experiments demonstrate that LLaMo achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation, marking a significant step towards a general unified motion-language large model.
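The abstract describes next-token prediction over continuous motion latents, with a flow-matching head producing each latent. The sketch below is a toy illustration of that general idea, not LLaMo's implementation. Each new latent is drawn by integrating a velocity field from Gaussian noise (t = 0) to a sample (t = 1) with Euler steps, conditioned on what a causal backbone has seen so far. Every name here (`euler_sample`, `toy_velocity`, `toy_backbone`, `generate`) is hypothetical, and both the "backbone" and the "head" are hand-written stand-ins for learned networks.

```python
import random


def euler_sample(velocity_fn, cond, dim, steps=8, rng=None):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 (noise) to t=1."""
    if rng is None:
        rng = random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so toy_velocity never divides by 0
        v = velocity_fn(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t, cond):
    """Hypothetical stand-in for a learned head (assumption, not the paper's).

    It returns the exact velocity of the straight-line path from the noise
    sample to the target `cond`, so Euler integration lands on `cond`.
    """
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]


def toy_backbone(history, dim):
    """Hypothetical stand-in for a causal backbone (assumption).

    Maps the latents generated so far to a conditioning vector; here it is
    just the previous latent shifted by a fixed amount.
    """
    if not history:
        return [0.0] * dim
    return [h + 0.1 for h in history[-1]]


def generate(n_tokens, dim=4, steps=8, seed=0):
    """Autoregressive loop: each new latent is conditioned on earlier ones."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_tokens):
        cond = toy_backbone(history, dim)
        history.append(euler_sample(toy_velocity, cond, dim, steps, rng))
    return history


if __name__ == "__main__":
    for i, latent in enumerate(generate(3)):
        print(i, [round(v, 3) for v in latent])
```

Because each latent depends only on earlier ones, latents can be emitted one at a time as they are produced, which is the property that makes streaming generation possible. In a real system the velocity field would be a trained network, and the number of integration steps would trade sample quality against latency.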