arxiv_cs_cv 2026/2/10

Language-Guided Transformer Tokenizer for Human Motion Generation

Translated: 2026/3/16 14:04:23

human-motion-generation, language-guided-tokenization, motion-discretization, generative-ai, transformer-tokenizer

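Translated Abstract (from the Japanese)

arXiv:2602.08337v1 (Announce Type: new)

Abstract: This paper focuses on discrete motion tokenization, which converts raw motion data into the compact discrete tokens that efficient motion generation depends on. In this paradigm, the common way to improve motion reconstruction quality is to increase the number of tokens, but the more tokens there are, the harder the generative model becomes to train. To reduce generation complexity while maintaining high reconstruction quality, the paper proposes an efficient motion tokenization scheme that leverages language, called Language-Guided Tokenization (LG-Tok). LG-Tok aligns natural language with motion at the tokenization stage and produces compact, high-level semantic representations. This strengthens both tokenization and detokenization and simplifies the training of the generative model. In addition, existing tokenizers mainly adopt convolutional architectures, whose local receptive fields struggle to support global language guidance. The paper therefore proposes a Transformer-based tokenizer that uses attention mechanisms to align language and motion effectively. It also designs a language-drop scheme that randomly removes the language condition during training, so that the detokenizer can support language-free guidance at generation time. On the HumanML3D and Motion-X generation benchmarks, LG-Tok achieves Top-1 scores of 0.542 (HumanML3D) and 0.582 (Motion-X), exceeding recent state-of-the-art methods (MARDM: 0.500 and 0.528), with FID scores of 0.057 and 0.088, respectively, versus 0.114 and 0.147. LG-Tok-mini maintains competitive performance with only half the tokens (Top-1: 0.521/0.588, FID: 0.085/0.071), validating the efficiency of the semantic representations.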

Original Content

arXiv:2602.08337v1 (Announce Type: new)

Abstract: In this paper, we focus on motion discrete tokenization, which converts raw motion into compact discrete tokens--a process proven crucial for efficient motion generation. In this paradigm, increasing the number of tokens is a common approach to improving motion reconstruction quality, but more tokens make it more difficult for generative models to learn. To maintain high reconstruction quality while reducing generation complexity, we propose leveraging language to achieve efficient motion tokenization, which we term Language-Guided Tokenization (LG-Tok). LG-Tok aligns natural language with motion at the tokenization stage, yielding compact, high-level semantic representations. This approach not only strengthens both tokenization and detokenization but also simplifies the learning of generative models. Furthermore, existing tokenizers predominantly adopt convolutional architectures, whose local receptive fields struggle to support global language guidance. To this end, we propose a Transformer-based Tokenizer that leverages attention mechanisms to enable effective alignment between language and motion. Additionally, we design a language-drop scheme, in which language conditions are randomly removed during training, enabling the detokenizer to support language-free guidance during generation. On the HumanML3D and Motion-X generation benchmarks, LG-Tok achieves Top-1 scores of 0.542 and 0.582, outperforming state-of-the-art methods (MARDM: 0.500 and 0.528), and with FID scores of 0.057 and 0.088, respectively, versus 0.114 and 0.147. LG-Tok-mini uses only half the tokens while maintaining competitive performance (Top-1: 0.521/0.588, FID: 0.085/0.071), validating the efficiency of our semantic representations.
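
The language-drop scheme described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `language_drop` and `guided_output`, the use of a fixed null condition, and the classifier-free-guidance-style blend are all assumptions made for illustration, and the abstract does not say how LG-Tok actually combines language-conditioned and language-free outputs.

```python
import random
from typing import List, Optional, Sequence

Vector = List[float]


def language_drop(
    text_conditions: Sequence[Vector],
    drop_prob: float,
    null_condition: Vector,
    rng: Optional[random.Random] = None,
) -> List[Vector]:
    """Randomly replace language conditions with a null condition.

    Hypothetical helper: each sample's condition is dropped independently
    with probability drop_prob, so a model trained on the result sees both
    language-conditioned and language-free inputs.
    """
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must be in [0, 1]")
    if rng is None:
        rng = random.Random()
    result = []
    for cond in text_conditions:
        # One draw per sample: the whole condition is kept or dropped.
        if rng.random() < drop_prob:
            result.append(list(null_condition))
        else:
            result.append(list(cond))
    return result


def guided_output(conditional: Vector, unconditional: Vector, scale: float) -> Vector:
    """Blend two outputs in the style of classifier-free guidance.

    Assumption: the paper's "language-free guidance" may use a different
    rule; this is the standard u + scale * (c - u) form.
    """
    return [u + scale * (c - u) for c, u in zip(conditional, unconditional)]
```

With `drop_prob=0.0` every condition is kept, and with `drop_prob=1.0` every condition becomes the null condition. In `guided_output`, `scale=1.0` returns the conditional output unchanged and `scale=0.0` returns the language-free output.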