arxiv_cs_cv 2026/2/10

MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning

Translated: 2026/3/15 16:06:29

robotics, embodied-ai, scene-graph, vision-language-model, reinforcement-learning

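Translation

arXiv:2512.16909v2 Announce Type: replace

This paper addresses the need for a compact, semantically rich scene representation for household mobile manipulators, which must both navigate and manipulate. Such a representation has to capture where objects are, how they function, and which parts are actionable. Prior scene-graph approaches often separate spatial and functional relations, treat scenes as static snapshots without object states or temporal updates, and omit the information most relevant to the current task. To address these limitations, we introduce MomaGraph, a unified scene representation that integrates spatial-functional relationships and part-level interactive elements. Advancing such a representation requires suitable data and rigorous evaluation, both of which have been largely missing. We therefore contribute MomaGraph-Scenes, a large-scale dataset of richly annotated, task-driven scene graphs in household environments, and MomaGraph-Bench, a systematic evaluation suite covering six reasoning capabilities from high-level planning to fine-grained scene understanding. Building on this foundation, we develop MomaGraph-R1, a 7B-parameter vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments show that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.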

Original Content

arXiv:2512.16909v2 Announce Type: replace

Abstract: Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
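
The abstract describes a scene graph that combines spatial and functional relations, object states, and part-level interactive elements, then feeds a planner (Graph-then-Plan). The following is a minimal Python sketch of how such a structure might look. It is not the authors' schema or model: every class name, field, relation label, and the toy rule-based planner are assumptions made for illustration, since the abstract does not specify them. In the paper the graph is predicted by a vision-language model; here it is built by hand.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Part:
    """Part-level interactive element, e.g. a handle or toggle (hypothetical)."""
    name: str
    action: str  # how the part is operated, e.g. "press" or "pull"


@dataclass
class ObjectNode:
    """An object with a mutable state, so the graph is not a static snapshot."""
    name: str
    state: str = "unknown"
    parts: list[Part] = field(default_factory=list)


@dataclass
class Relation:
    """Directed edge; kind is "spatial" or "functional" (assumed labels)."""
    source: str
    target: str
    kind: str
    label: str


@dataclass
class SceneGraph:
    """Task-conditioned graph holding both relation kinds in one structure."""
    task: str
    nodes: dict[str, ObjectNode] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)

    def add_node(self, node: ObjectNode) -> None:
        self.nodes[node.name] = node

    def relate(self, source: str, target: str, kind: str, label: str) -> None:
        self.relations.append(Relation(source, target, kind, label))

    def update_state(self, name: str, state: str) -> None:
        self.nodes[name].state = state


def plan_from_graph(graph: SceneGraph, target: str) -> list[str]:
    """Toy stand-in for the planning stage: find an object that functionally
    controls the target, navigate to it, and actuate its first part."""
    controllers = [
        r.source for r in graph.relations
        if r.kind == "functional" and r.target == target
    ]
    actor = graph.nodes[controllers[0]] if controllers else graph.nodes[target]
    steps = [f"navigate_to({actor.name})"]
    if actor.parts:
        part = actor.parts[0]
        steps.append(f"{part.action}({actor.name}.{part.name})")
    return steps


# Hand-built example graph for the task "turn on the lamp".
graph = SceneGraph(task="turn on the lamp")
graph.add_node(ObjectNode("lamp", state="off"))
graph.add_node(ObjectNode("switch", state="off", parts=[Part("toggle", "press")]))
graph.add_node(ObjectNode("wall"))
graph.relate("switch", "wall", "spatial", "mounted_on")
graph.relate("switch", "lamp", "functional", "controls")

plan = plan_from_graph(graph, "lamp")
print(plan)  # ['navigate_to(switch)', 'press(switch.toggle)']

# After acting, the state is updated in place rather than rebuilding the graph.
graph.update_state("lamp", "on")
```

The point of the sketch is that the planner needs the functional edge (the switch controls the lamp) to know which object to approach, the spatial edge to locate it, and the part annotation to know what to actuate. Keeping all three in one graph is the property the abstract attributes to MomaGraph.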