arxiv_cs_cv February 10, 2026

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

arXiv:2602.07045v1 Announce Type: new

Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have enabled complex reasoning. However, existing remote sensing (RS) benchmarks remain heavily biased toward perception tasks, such as object recognition and scene classification. This limitation hinders the development of MLLMs for cognitively demanding RS applications. To address this, we propose a Vision Language ReaSoning Benchmark (VLRS-Bench), which is the first benchmark exclusively dedicated to complex RS reasoning. Structured across the three core dimensions of Cognition, Decision, and Prediction, VLRS-Bench comprises 2,000 question-answer pairs with an average length of 71 words, spanning 14 tasks and up to eight temporal phases. VLRS-Bench is constructed via a specialized pipeline that integrates RS-specific priors and expert knowledge to ensure geospatial realism and reasoning complexity. Experimental results reveal significant bottlenecks in existing state-of-the-art MLLMs, providing critical insights for advancing multimodal reasoning within the remote sensing community.