arXiv cs.CV, February 10, 2026

UrbanGraphEmbeddings: Learning and Evaluating Spatially Grounded Multimodal Embeddings for Urban Science

urban science, multimodal embedding, spatial reasoning, VLM, graph encoding

Original Content

arXiv:2602.08342v1. Announce type: new.

Abstract: Learning transferable multimodal embeddings for urban environments is challenging because urban understanding is inherently spatial, yet existing datasets and benchmarks lack explicit alignment between street-view images and urban structure. We introduce UGData, a spatially grounded dataset that anchors street-view images to structured spatial graphs and provides graph-aligned supervision via spatial reasoning paths and spatial context captions, exposing distance, directionality, connectivity, and neighborhood context beyond image content. Building on UGData, we propose UGE, a two-stage training strategy that progressively and stably aligns images, text, and spatial structures by combining instruction-guided contrastive learning with graph-based spatial encoding. Finally, we introduce UGBench, a comprehensive benchmark that evaluates how well spatially grounded embeddings support diverse urban understanding tasks, including geolocation ranking, image retrieval, urban perception, and spatial grounding. We develop UGE on multiple state-of-the-art vision-language model (VLM) backbones, including Qwen2-VL, Qwen2.5-VL, Phi-3-Vision, and LLaVA1.6-Mistral, and train fixed-dimensional spatial embeddings with LoRA tuning. UGE built on the Qwen2.5-VL-7B backbone achieves up to 44% improvement in image retrieval and 30% in geolocation ranking on training cities, and gains of over 30% and 22%, respectively, on held-out cities, demonstrating the effectiveness of explicit spatial grounding for spatially intensive urban tasks.
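
The abstract names contrastive learning as one ingredient of UGE but does not give the objective. As a rough illustration only, the sketch below implements a generic InfoNCE loss over paired embeddings in plain Python. It is not the authors' loss: the function names, the temperature of 0.07, and the toy two-dimensional vectors are all assumptions chosen for the example, and inputs are assumed to be non-zero vectors.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss where queries[i] is the positive pair of keys[i].

    Every other key in the batch serves as a negative for query i.
    The temperature value is an illustrative assumption, not taken from the paper.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i against every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching key as the target class.
        total += log_denominator - logits[i]
    return total / len(q)


# Toy batch (hypothetical values): two "image" embeddings and two "caption" embeddings.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each caption close to its own image
text_emb_swapped = [[0.1, 0.9], [0.9, 0.1]]   # each caption close to the other image

loss_aligned = info_nce(image_emb, text_emb_aligned)
loss_swapped = info_nce(image_emb, text_emb_swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each image embedding sits closest to its own caption and large when the pairing is swapped, which is the behavior a contrastive alignment stage relies on. How UGE conditions this on instructions or combines it with the graph-based spatial encoder is not specified in the abstract and is not modeled here.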