arxiv_cs_ai February 10, 2026

Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control

Translated: 2026/2/14 7:16:47

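Translation (from Japanese)

Teaching robots dexterous skills from human videos remains difficult because existing approaches rely on low-level trajectory imitation, which does not generalize across object types, spatial layouts, or manipulator configurations. We propose Graph-Fused Vision-Language-Action (GF-VLA), a framework that lets dual-arm robotic systems perform task-level reasoning and execution directly from RGB and Depth human demonstrations. GF-VLA first extracts Shannon-information-based cues to identify the hands and objects most relevant to the task, then encodes them into temporally ordered scene graphs that capture hand-object and object-object interactions. These graphs are fused with a language-conditioned transformer that generates hierarchical behavior trees and interpretable Cartesian motion commands. For bimanual settings, a cross-hand selection policy infers the optimal gripper assignment without explicit geometric reasoning. GF-VLA was evaluated on four structured dual-arm block assembly tasks. The information-theoretic scene representation reached over 95 percent graph accuracy and 93 percent subtask segmentation, which supports the LLM planner in generating reliable, human-readable task policies. Executed on the dual-arm robot, these policies achieved 94 percent grasp success, 89 percent placement accuracy, and 90 percent overall task success across stacking, letter-building, and geometric reconfiguration scenarios, indicating strong generalization and robustness across diverse spatial and semantic variations.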

Original Content

arXiv:2508.05342v2 Announce Type: replace-cross Abstract: Teaching robots dexterous skills from human videos remains challenging due to the reliance on low-level trajectory imitation, which fails to generalize across object types, spatial layouts, and manipulator configurations. We propose Graph-Fused Vision-Language-Action (GF-VLA), a framework that enables dual-arm robotic systems to perform task-level reasoning and execution directly from RGB and Depth human demonstrations. GF-VLA first extracts Shannon-information-based cues to identify hands and objects with the highest task relevance, then encodes these cues into temporally ordered scene graphs that capture both hand-object and object-object interactions. These graphs are fused with a language-conditioned transformer that generates hierarchical behavior trees and interpretable Cartesian motion commands. To improve execution efficiency in bimanual settings, we further introduce a cross-hand selection policy that infers optimal gripper assignment without explicit geometric reasoning. We evaluate GF-VLA on four structured dual-arm block assembly tasks involving symbolic shape construction and spatial generalization. Experimental results show that the information-theoretic scene representation achieves over 95 percent graph accuracy and 93 percent subtask segmentation, supporting the LLM planner in generating reliable and human-readable task policies. When executed by the dual-arm robot, these policies yield 94 percent grasp success, 89 percent placement accuracy, and 90 percent overall task success across stacking, letter-building, and geometric reconfiguration scenarios, demonstrating strong generalization and robustness across diverse spatial and semantic variations.
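
The abstract does not say how the Shannon-information-based cues are computed. As one possible reading, and not the authors' method, the sketch below scores each tracked entity by the Shannon entropy of its quantized frame-to-frame displacement, so that entities whose motion varies during a demonstration rank above static ones. The function names, the `tracks` layout (entity name mapped to a list of 3D positions), the centimeter units, and the `bin_size` parameter are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy, in bits, of the empirical distribution of a discrete sequence."""
    if not symbols:
        return 0.0
    total = len(symbols)
    counts = Counter(symbols)
    # H = sum_i p_i * log2(1 / p_i), with p_i = count_i / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def rank_by_motion_entropy(tracks, bin_size=1.0):
    """Rank entities by the entropy of their quantized per-frame displacement.

    tracks:   dict mapping an entity name to a list of (x, y, z) positions,
              one per frame (hypothetical layout, assumed to be in cm).
    bin_size: displacement quantization step, in the same units (assumption).
    Returns a list of (name, entropy_bits) pairs, highest entropy first.
    """
    scores = {}
    for name, positions in tracks.items():
        # Quantize the distance moved between consecutive frames into bins.
        steps = [
            round(math.dist(a, b) / bin_size)
            for a, b in zip(positions, positions[1:])
        ]
        scores[name] = shannon_entropy(steps)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy demonstration: one block is picked up and moved, the table never moves.
    tracks = {
        "red_block": [(0, 0, 0), (0, 0, 0), (2, 0, 0), (6, 0, 0), (6, 0, 0)],
        "table": [(10, 10, 0)] * 5,
    }
    for name, score in rank_by_motion_entropy(tracks):
        print(f"{name}: {score:.2f} bits")
```

In this toy setup the moved block outranks the static table, which mirrors the idea of selecting the hands and objects with the highest task relevance. A real pipeline would derive its cues from RGB and Depth observations and not from hand-written tracks.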