arxiv_cs_cv February 10, 2026

MosaicThinker: On-Device Visual Spatial Reasoning for Embodied AI via Iterative Construction of Space Representation

Translated: 2026/3/15 17:04:47

embodied-ai, visual-spatial-reasoning, mosaicthinker, vlm, on-device-ai

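Translation (from Japanese)

arXiv:2602.07082v1 Announce Type: new Abstract: As embodied AI expands from conventional object detection and recognition to more advanced robot manipulation and actuation planning, visual spatial reasoning over video input becomes essential for perceiving the spatial relationships among objects and guiding device actions. However, existing vision-language models (VLMs) lack knowledge of 3D spatial information, so their spatial reasoning is very weak, most notably on reasoning tasks that involve complex spatial relations spanning multiple video frames. This paper presents MosaicThinker, a new inference-time computing technique for on-device embodied AI that strengthens the spatial reasoning of small on-device VLMs on difficult cross-frame reasoning tasks. The basic idea is to integrate fragmentary spatial information from multiple frames into a unified spatial representation, a global semantic map, and then to guide the VLM's spatial reasoning over that map with a visual prompt. Experimental results show that the method substantially improves the accuracy of cross-frame spatial reasoning on resource-constrained embodied AI devices, across reasoning tasks of diverse types and complexity.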

Original Content

arXiv:2602.07082v1 Announce Type: new Abstract: As embodied AI expands from traditional object detection and recognition to more advanced tasks of robot manipulation and actuation planning, visual spatial reasoning from video inputs is necessary to perceive the spatial relationships of objects and guide device actions. However, existing visual language models (VLMs) have very weak spatial reasoning capabilities because they lack knowledge of 3D spatial information, especially when the reasoning task involves complex spatial relations across multiple video frames. In this paper, we present a new inference-time computing technique for on-device embodied AI, namely MosaicThinker, which enhances the on-device small VLM's spatial reasoning capabilities on difficult cross-frame reasoning tasks. Our basic idea is to integrate fragmented spatial information from multiple frames into a unified space representation, a global semantic map, and further guide the VLM's spatial reasoning over the semantic map via a visual prompt. Experimental results show that our technique can greatly enhance the accuracy of cross-frame spatial reasoning on resource-constrained embodied AI devices, over reasoning tasks with diverse types and complexities.
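
The abstract does not say how the global semantic map is built, so the following is only a minimal sketch of the general idea of fusing per-frame observations into one shared map, not the authors' implementation. It assumes each frame already comes with a known 2D camera pose and per-object positions in the camera frame, and it merges detections by label and distance. The names `Observation`, `Frame`, `build_semantic_map`, `describe_map`, and the `merge_radius` parameter are all hypothetical, and a text description stands in for the visual prompt.

```python
import math
from dataclasses import dataclass


@dataclass
class Observation:
    label: str   # object class seen in one frame
    x: float     # camera-frame position in metres, positive to the right
    z: float     # camera-frame position in metres, positive forward


@dataclass
class Frame:
    cam_x: float         # camera position in the world frame (metres)
    cam_y: float
    yaw: float           # heading in radians, clockwise from world +y
    observations: list   # Observation items detected in this frame


def to_world(frame, obs):
    """Rotate and translate a camera-frame point into the world frame."""
    c, s = math.cos(frame.yaw), math.sin(frame.yaw)
    wx = frame.cam_x + c * obs.x + s * obs.z
    wy = frame.cam_y - s * obs.x + c * obs.z
    return wx, wy


def build_semantic_map(frames, merge_radius=0.5):
    """Fuse per-frame detections into entries [label, x, y, count].

    Assumption: two detections are the same object if they share a
    label and lie within merge_radius metres of each other.
    """
    objects = []
    for frame in frames:
        for obs in frame.observations:
            wx, wy = to_world(frame, obs)
            for entry in objects:
                same_label = entry[0] == obs.label
                close = math.hypot(entry[1] - wx, entry[2] - wy) <= merge_radius
                if same_label and close:
                    n = entry[3]
                    # Running mean of the object's world position.
                    entry[1] = (entry[1] * n + wx) / (n + 1)
                    entry[2] = (entry[2] * n + wy) / (n + 1)
                    entry[3] = n + 1
                    break
            else:
                # No existing entry matched, so this is a new object.
                objects.append([obs.label, wx, wy, 1])
    return objects


def describe_map(objects):
    """Render the map as text, a stand-in for a visual prompt."""
    parts = [f"{label} at ({x:.1f}, {y:.1f})" for label, x, y, _ in objects]
    return "Top-down map (metres): " + "; ".join(parts)


# Two frames that each see only part of the scene.
frames = [
    Frame(0.0, 0.0, 0.0, [Observation("chair", 1.0, 2.0)]),
    Frame(0.0, 4.0, math.pi, [Observation("chair", -1.0, 2.0),
                              Observation("table", 1.0, 3.0)]),
]
semantic_map = build_semantic_map(frames)
print(describe_map(semantic_map))
```

In this toy scene the chair is seen from two opposite viewpoints and collapses into a single map entry, while the table, visible only in the second frame, is placed in the same coordinate system. That shared frame of reference is what would let a question about the chair and the table be answered even though no single frame needs to relate them.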