arxiv_cs_cv April 24, 2026

Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding

Translated: 2026/4/24 19:52:48

vlm, robotics, domain-shift, vision-language-models, scene-understanding

Original Content

arXiv:2506.19579v3 Announce Type: replace-cross

Abstract: Robotic scene understanding increasingly relies on Vision-Language Models (VLMs) to generate natural language descriptions of the environment. In this work, we systematically evaluate single-view object captioning for tabletop scenes captured by a robotic manipulator, introducing a controlled physical domain shift that contrasts real-world tools with geometrically similar 3D-printed counterparts that differ in texture, colour, and material. We benchmark a suite of state-of-the-art, locally deployable VLMs across multiple metrics to assess semantic alignment and factual grounding. Our results demonstrate that while VLMs describe common real-world objects effectively, performance degrades markedly on 3D-printed items despite their structurally familiar forms. We further expose critical vulnerabilities in standard evaluation metrics, showing that some fail to detect domain shifts entirely or reward fluent but factually incorrect captions. These findings highlight the limitations of deploying foundation models for embodied agents and the need for more robust architectures and evaluation protocols in physical robotic applications.
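The claim that some metrics "reward fluent but factually incorrect captions" can be illustrated with a toy example. The sketch below is not one of the metrics evaluated in the paper: `unigram_f1` is a hypothetical, deliberately simple token-overlap score, and the three captions are invented for illustration. It only shows how a surface-overlap measure might rank a caption that names the wrong object above a terse caption that names the right one.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy token-overlap F1 between a candidate and a reference caption.

    Hypothetical helper for illustration only; it is not a metric from the paper.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most as often
    # as it appears in both captions.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented captions: the reference describes a hammer.
reference = "a red hammer lying on a wooden table"
fluent_but_wrong = "a red wrench lying on a wooden table"  # wrong object
terse_but_right = "hammer on table"                        # right object

score_wrong = unigram_f1(fluent_but_wrong, reference)
score_right = unigram_f1(terse_but_right, reference)

print(f"fluent but wrong: {score_wrong:.3f}")
print(f"terse but right:  {score_right:.3f}")
```

Under these assumptions the misidentified object costs the fluent caption a single token, so it scores higher than the factually correct one. This is the kind of failure that an evaluation of factual grounding would need to catch.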