arxiv_cs_cv February 10, 2026

Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models

Translated: 2026/3/15 13:02:37

vision-language-models image-understanding spatial-perception token-compression rope-scaling

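Translation (from Japanese)

arXiv:2509.19191v2 Announce Type: replace Abstract: Vision-language models (VLMs) show remarkable performance on a wide range of real-world tasks. However, existing VLMs process image information by serializing it, which differs greatly from the parallel nature of human vision. In addition, their opaque internal mechanisms hinder deeper understanding and architectural innovation. Inspired by the dual-stream hypothesis of human vision, which distinguishes the "what" and "where" pathways, we decompose image processing in VLMs into object recognition and spatial perception and study each separately. For object recognition, we convert images into text token maps and find that the model's understanding of image content unfolds in two stages from shallow to deep layers, beginning with attribute recognition and ending with semantic disambiguation. For spatial perception, we theoretically derive and empirically verify the geometric structure behind the positional representation in VLMs. Building on these findings, we introduce an instruction-agnostic token compression algorithm based on a plug-and-play visual decoder to improve decoding efficiency, and use a RoPE scaling technique to strengthen spatial reasoning. Through rigorous experiments, this work validates these analyses, provides a deeper understanding of the internal structure of VLMs, and offers clear principles for designing more capable future architectures.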

Original Content

arXiv:2509.19191v2 Announce Type: replace Abstract: Vision-Language Models (VLMs) have demonstrated remarkable performance across a variety of real-world tasks. However, existing VLMs typically process visual information by serializing images, a method that diverges significantly from the parallel nature of human vision. Moreover, their opaque internal mechanisms hinder both deeper understanding and architectural innovation. Inspired by the dual-stream hypothesis of human vision, which distinguishes the "what" and "where" pathways, we deconstruct the visual processing in VLMs into object recognition and spatial perception for separate study. For object recognition, we convert images into text token maps and find that the model's perception of image content unfolds as a two-stage process from shallow to deep layers, beginning with attribute recognition and culminating in semantic disambiguation. For spatial perception, we theoretically derive and empirically verify the geometric structure underlying the positional representation in VLMs. Based on these findings, we introduce an instruction-agnostic token compression algorithm based on a plug-and-play visual decoder to improve decoding efficiency, and a RoPE scaling technique to enhance spatial reasoning. Through rigorous experiments, our work validates these analyses, offering a deeper understanding of VLM internals and providing clear principles for designing more capable future architectures.
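
The abstract mentions a RoPE scaling technique but does not describe it. As background only, the following is a minimal sketch of standard rotary position embedding (RoPE) in pure Python, with a hypothetical `scale` parameter that multiplies the position index. The function names, the `scale` parameter, and the example vectors are assumptions made for illustration; this is not the paper's method. It shows the property that makes the positional geometry analyzable: the dot product between a rotated query and key depends only on the offset between their positions.

```python
import math


def rope(x, pos, base=10000.0, scale=1.0):
    """Apply rotary position embedding to an even-length vector.

    Each consecutive pair (x[2i], x[2i+1]) is rotated by the angle
    scale * pos * base**(-2i/d). `scale` is a hypothetical knob for
    illustration, not the scaling rule from the paper.
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even dimension")
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)   # per-pair rotation frequency
        angle = scale * pos * theta
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])  # 2D rotation of the pair
    return out


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# Arbitrary example vectors (illustrative values only).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same offset (3) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))

# Scaling positions by 2 is equivalent to doubling every position index.
score_scaled = dot(rope(q, 5, scale=2.0), rope(k, 2, scale=2.0))
score_doubled = dot(rope(q, 10), rope(k, 4))
```

In this sketch `score_near` and `score_far` agree up to floating-point error, and `score_scaled` matches `score_doubled`, so a position scale factor simply stretches the effective distance between tokens.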