arxiv_cs_cv April 20, 2026

GIST: Multimodal Knowledge Extraction and Spatial Grounding via Intelligent Semantic Topology

gist, multimodal, spatial-grounding, vision-language-models, human-ai-interaction

Original Content

arXiv:2604.15495v1
Announce Type: cross

Abstract: Navigating complex, densely packed environments like retail stores, warehouses, and hospitals poses a significant spatial grounding challenge for humans and embodied AI. In these spaces, dense visual features quickly become stale given the quasi-static nature of items, and long-tail semantic distributions challenge traditional computer vision. While Vision-Language Models (VLMs) help assistive systems navigate semantically-rich spaces, they still struggle with spatial grounding in cluttered environments. We present GIST (Grounded Intelligent Semantic Topology), a multimodal knowledge extraction pipeline that transforms a consumer-grade mobile point cloud into a semantically annotated navigation topology. Our architecture distills the scene into a 2D occupancy map, extracts its topological layout, and overlays a lightweight semantic layer via intelligent keyframe and semantic selection. We demonstrate the versatility of this structured spatial knowledge through critical downstream Human-AI interaction tasks: (1) an intent-driven Semantic Search engine that actively infers categorical alternatives and zones when exact matches fail; (2) a one-shot Semantic Localizer achieving a 1.04 m top-5 mean translation error; (3) a Zone Classification module that segments the walkable floor plan into high-level semantic regions; and (4) a Visually-Grounded Instruction Generator that synthesizes optimal paths into egocentric, landmark-rich natural language routing. In multi-criteria LLM evaluations, GIST outperforms sequence-based instruction generation baselines. Finally, an in-situ formative evaluation (N=5) yields an 80% navigation success rate relying solely on verbal cues, validating the system's capacity for universal design.
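
The abstract does not say how the point cloud is reduced to a 2D occupancy map, so the following is only a minimal sketch of one common approach, not GIST's actual method. It assumes points are (x, y, z) tuples in metres with z pointing up, and that any point inside a chosen height band counts as an obstacle. The function name `occupancy_grid` and the parameters `cell_size`, `z_min`, and `z_max` are hypothetical and chosen for illustration.

```python
import math


def occupancy_grid(points, cell_size=0.25, z_min=0.1, z_max=2.0):
    """Project 3D points onto a 2D occupancy grid (illustrative sketch).

    points    : iterable of (x, y, z) tuples in metres, z up (assumption).
    cell_size : edge length of one grid cell in metres (assumption).
    z_min/max : height band treated as obstacles; points outside it,
                such as floor or ceiling returns, are ignored (assumption).

    Returns (grid, origin): grid[row][col] is 1 where at least one
    obstacle point falls in the cell, and origin is the (x, y) of the
    grid's lower-left corner.
    """
    obstacles = [(x, y) for x, y, z in points if z_min <= z <= z_max]
    if not obstacles:
        return [], (0.0, 0.0)

    x0 = min(x for x, _ in obstacles)
    y0 = min(y for _, y in obstacles)

    def index(value, start):
        # Same formula for sizing and for filling, so indices never overflow.
        return int(math.floor((value - start) / cell_size))

    cols = index(max(x for x, _ in obstacles), x0) + 1
    rows = index(max(y for _, y in obstacles), y0) + 1

    grid = [[0] * cols for _ in range(rows)]
    for x, y in obstacles:
        grid[index(y, y0)][index(x, x0)] = 1
    return grid, (x0, y0)
```

Free cells in a grid like this are what a later stage could treat as walkable space when extracting a topological layout; how GIST performs that extraction is not described in the abstract.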