arxiv_cs_cv April 20, 2026

Where Do Vision-Language Models Fail? World Scale Analysis for Image Geolocalization

vision-language-models, image-geolocalization, zero-shot-reasoning, multimodal-learning, spatial-understanding

Original Content

arXiv:2604.16248v1
Announce Type: new

Abstract: Image geolocalization has traditionally been addressed through retrieval-based place recognition or geometry-based visual localization pipelines. Recent advances in Vision-Language Models (VLMs) have demonstrated strong zero-shot reasoning capabilities across multimodal tasks, yet their performance in geographic inference remains underexplored. In this work, we present a systematic evaluation of multiple state-of-the-art VLMs for country-level image geolocalization using ground-view imagery only. Instead of relying on image matching, GPS metadata, or task-specific training, we evaluate prompt-based country prediction in a zero-shot setting. The selected models are tested on three geographically diverse datasets to assess their robustness and generalization ability. Our results reveal substantial variation across models, highlighting the potential of semantic reasoning for coarse geolocalization and the limitations of current VLMs in capturing fine-grained geographic cues. This study provides the first focused comparison of modern VLMs for country-level geolocalization and establishes a foundation for future research at the intersection of multimodal reasoning and geographic understanding.
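As a rough illustration of what prompt-based country prediction in a zero-shot setting could look like, the sketch below scores free-text answers against country labels. It is not the paper's protocol: the prompt wording, the `ask_model` callable, and the exact-match scoring rule are all assumptions, since the abstract does not specify the prompts, models, or metrics used.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical prompt wording; the abstract does not give the actual prompt.
PROMPT = "In which country was this photo taken? Answer with the country name only."


def normalize(name: str) -> str:
    """Lowercase and strip surrounding whitespace and punctuation for comparison."""
    return name.strip().strip(".,!?\"'").lower()


def country_accuracy(
    samples: Iterable[Tuple[str, str]],
    ask_model: Callable[[str, str], str],
) -> float:
    """Return the fraction of images whose predicted country matches the label.

    samples:   (image_path, true_country) pairs.
    ask_model: hypothetical callable (image_path, prompt) -> free-text answer;
               stands in for whatever VLM interface is actually used.
    """
    total = 0
    correct = 0
    for image_path, true_country in samples:
        answer = ask_model(image_path, PROMPT)
        # Assumed scoring rule: exact match after light normalization.
        if normalize(answer) == normalize(true_country):
            correct += 1
        total += 1
    return correct / total if total else 0.0
```

Exact string matching is the simplest possible scoring choice; a real evaluation would likely also need to handle country-name aliases (for example "USA" versus "United States") and answers that contain extra words.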