arxiv_cs_cv April 24, 2026

Encoder-Free Human Motion Understanding via Structured Motion Descriptions

motion-understanding, large-language-models, biomechanical-analysis, structured-motions, zero-shot-reasoning

Original Content

arXiv:2604.21668v1 (Announce Type: new)

Abstract: The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM's embedding space, and so remain constrained by cross-modal representation and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics have long served as a precise descriptive language for human movement, we propose Structured Motion Description (SMD), a rule-based, deterministic approach that converts joint position sequences into structured natural language descriptions of joint angles, body part movements, and global trajectory. By representing motion as text, SMD enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules. We show that this approach surpasses prior state-of-the-art results on both motion question answering (66.7% on BABEL-QA, 90.1% on HuMMan-QA) and motion captioning (R@1 of 0.584, CIDEr of 53.16 on HumanML3D). SMD additionally offers practical benefits: the same text input works across different LLMs with only lightweight LoRA adaptation (validated on 8 LLMs from 6 model families), and its human-readable representation enables interpretable attention analysis over motion descriptions. Code, data, and pretrained LoRA adapters are available at https://yaozhang182.github.io/motion-smd/.
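To make the idea of a rule-based, deterministic conversion from joint positions to text more concrete, the following is a minimal sketch of one plausible building block: computing a joint angle from three 3D joint positions and emitting a short textual description. It is not the authors' implementation. The function names (`joint_angle_deg`, `describe_joint`), the 5-degree "steady" threshold, the flexes/extends rule, and the output wording are all assumptions made for illustration; the actual SMD format is defined in the paper and its released code.

```python
import math


def joint_angle_deg(a, b, c):
    """Return the angle at joint b, in degrees, formed by 3D points a-b-c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    if n1 == 0 or n2 == 0:
        raise ValueError("degenerate limb segment (zero length)")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_t = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_t))


def describe_joint(name, frames, triplet, fps, steady_deg=5.0):
    """Summarize how one joint angle changes across a sequence of frames.

    frames  : list of dicts mapping joint name -> (x, y, z)
    triplet : (parent, joint, child) keys that define the angle
    steady_deg is a hypothetical threshold, not a value from the paper.
    """
    angles = [joint_angle_deg(*(f[k] for k in triplet)) for f in frames]
    start, end = angles[0], angles[-1]
    delta = end - start
    if abs(delta) < steady_deg:
        trend = "holds steady"
    elif delta < 0:
        trend = "flexes"      # angle closes
    else:
        trend = "extends"     # angle opens
    duration = (len(frames) - 1) / fps
    return f"{name}: {start:.0f} deg -> {end:.0f} deg ({trend}) over {duration:.1f} s"


# Toy two-frame sequence: a straight leg that bends to a right angle.
frames = [
    {"hip": (0.0, 1.0, 0.0), "knee": (0.0, 0.5, 0.0), "ankle": (0.0, 0.0, 0.0)},
    {"hip": (0.0, 1.0, 0.0), "knee": (0.0, 0.5, 0.0), "ankle": (0.5, 0.5, 0.0)},
]
print(describe_joint("right knee", frames, ("hip", "knee", "ankle"), fps=1.0))
# prints: right knee: 180 deg -> 90 deg (flexes) over 1.0 s
```

Because every step is a fixed rule with no learned parameters, the same input sequence always yields the same text, which is the property that lets the resulting description be passed to any text-only LLM.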