arxiv_cs_ai April 24, 2026

Efficient Emotion-Aware Iconic Gesture Prediction for Robot Co-Speech

robotics, gesture-recognition, transformer, emotion-detection, co-speech

Original Content

arXiv:2604.11417v2 Announce Type: replace-cross Abstract: Co-speech gestures increase engagement and improve speech understanding. Most data-driven robot systems generate rhythmic beat-like motion, yet few integrate semantic emphasis. To address this, we propose a lightweight transformer that derives iconic gesture placement and intensity from text and emotion alone, requiring no audio input at inference time. The model outperforms GPT-4o in both semantic gesture placement classification and intensity regression on the BEAT2 dataset, while remaining computationally compact and suitable for real-time deployment on embodied agents.
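
The abstract names two prediction tasks, gesture placement classification and intensity regression, but does not give the output format, the loss, or any thresholds. The following is a minimal sketch of how such per-token outputs could be represented and scored, not the authors' implementation. Every name here (`TokenPrediction`, `joint_loss`, `select_gestures`), the binary cross-entropy plus mean-squared-error objective, the [0, 1] intensity range, and the 0.5 threshold are illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class TokenPrediction:
    """Hypothetical per-token model output (format assumed, not from the paper)."""
    token: str
    placement_prob: float  # assumed: probability that an iconic gesture is placed on this token
    intensity: float       # assumed: predicted gesture intensity, here scaled to [0, 1]


def joint_loss(preds, placement_labels, intensity_targets, reg_weight=1.0):
    """Assumed multi-task objective: binary cross-entropy on placement
    plus mean squared error on intensity, the latter only over tokens
    whose label says a gesture is present."""
    if not preds:
        return 0.0
    eps = 1e-7
    bce = 0.0
    squared_error = 0.0
    n_gestured = 0
    for pred, label, target in zip(preds, placement_labels, intensity_targets):
        # Clamp the probability so log() never receives 0.
        q = min(max(pred.placement_prob, eps), 1.0 - eps)
        bce += -(label * math.log(q) + (1 - label) * math.log(1.0 - q))
        if label == 1:
            squared_error += (pred.intensity - target) ** 2
            n_gestured += 1
    bce /= len(preds)
    mse = squared_error / n_gestured if n_gestured else 0.0
    return bce + reg_weight * mse


def select_gestures(preds, threshold=0.5):
    """Keep (token, intensity) pairs whose placement probability reaches the threshold."""
    return [(p.token, p.intensity) for p in preds if p.placement_prob >= threshold]


# Toy example with made-up numbers, for illustration only.
example = [
    TokenPrediction("the", 0.1, 0.0),
    TokenPrediction("huge", 0.9, 0.8),
    TokenPrediction("wave", 0.3, 0.2),
]
gestures = select_gestures(example)
loss = joint_loss(example, [0, 1, 0], [0.0, 1.0, 0.0])
```

In this toy run only "huge" clears the assumed threshold, so a robot controller consuming `gestures` would schedule a single iconic gesture on that word. Because both values depend only on text and an emotion label upstream, no audio features appear anywhere in the interface, which matches the abstract's claim that audio is not needed at inference time.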