arxiv_cs_cv February 10, 2026

Geospatial-Reasoning-Driven Vocabulary-Agnostic Remote Sensing Semantic Segmentation

Translated: 2026/3/15 19:05:15

remote-sensing, semantic-segmentation, multimodal-large-language-models, geospatial-reasoning, open-vocabulary

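Japanese Translation (rendered in English)

arXiv:2602.08206v1 Announce Type: new Abstract: Open-vocabulary semantic segmentation has emerged as a promising research area in remote sensing, enabling recognition of diverse land-cover types beyond predefined category sets. Existing methods, however, rely mainly on passive mapping between visual features and text embeddings. This "appearance-based" paradigm lacks awareness of geographic context, which causes serious semantic ambiguity and misclassification when land-cover classes have similar spectral characteristics but different semantic attributes. To address this, we propose the Geospatial Reasoning Chain-of-Thought (GR-CoT) framework, which aims to improve the scene understanding of multimodal large language models (MLLMs) and steer open-vocabulary segmentation models toward accurate mapping. The framework consists of two cooperating components: an offline knowledge distillation stream and an online instance reasoning stream. The offline stream establishes fine-grained category interpretation standards to resolve semantic conflicts between similar land-cover types. During online inference, the framework runs a sequential reasoning process comprising macro-scenario anchoring, visual feature decoupling, and knowledge-driven decision synthesis. This process generates an image-adaptive vocabulary so that downstream models achieve pixel-level alignment with the correct geographic semantics. Extensive experiments on the LoveDA and GID5 benchmarks demonstrate the superiority of our approach.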

Original Content

arXiv:2602.08206v1 Announce Type: new Abstract: Open-vocabulary semantic segmentation has emerged as a promising research direction in remote sensing, enabling the recognition of diverse land-cover types beyond pre-defined category sets. However, existing methods predominantly rely on the passive mapping of visual features and textual embeddings. This "appearance-based" paradigm lacks geospatial contextual awareness, leading to severe semantic ambiguity and misclassification when encountering land-cover classes with similar spectral features but distinct semantic attributes. To address this, we propose a Geospatial Reasoning Chain-of-Thought (GR-CoT) framework designed to enhance the scene understanding capabilities of Multimodal Large Language Models (MLLMs), thereby guiding open-vocabulary segmentation models toward precise mapping. The framework comprises two collaborative components: an offline knowledge distillation stream and an online instance reasoning stream. The offline stream establishes fine-grained category interpretation standards to resolve semantic conflicts between similar land-cover types. During online inference, the framework executes a sequential reasoning process involving macro-scenario anchoring, visual feature decoupling, and knowledge-driven decision synthesis. This process generates an image-adaptive vocabulary that guides downstream models to achieve pixel-level alignment with correct geographical semantics. Extensive experiments on the LoveDA and GID5 benchmarks demonstrate the superiority of our approach.
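
The three online reasoning steps named in the abstract can be pictured as a chain of prompts in which each step sees the output of the previous one. The sketch below is illustrative only and is not the authors' implementation: the `Mllm` callable interface, the prompt wording, the `STANDARDS` entries (standing in for the offline stream's category interpretation standards), and the `stub_mllm` test double are all assumptions introduced here.

```python
from typing import Callable, Dict, List

# Hypothetical interface: an MLLM is modeled as a function (image_ref, prompt) -> text.
Mllm = Callable[[str, str], str]

# Stand-in for the offline stream's output: per-category interpretation standards.
# These entries are illustrative assumptions, not taken from the paper.
STANDARDS: Dict[str, str] = {
    "agricultural land": "regular parcels with row textures, typical of rural scenes",
    "forest": "irregular canopy texture without parcel boundaries",
    "water": "smooth dark surface with natural or engineered shorelines",
}


def image_adaptive_vocabulary(
    image_ref: str, mllm: Mllm, standards: Dict[str, str]
) -> List[str]:
    """Run three sequential reasoning steps and return a per-image vocabulary."""
    # Step 1: macro-scenario anchoring (what kind of scene is this?).
    scene = mllm(image_ref, "Describe the overall scene type (e.g. urban or rural).")

    # Step 2: visual feature decoupling, conditioned on the anchored scene.
    features = mllm(
        image_ref,
        f"Scene: {scene}\nList the distinct visual regions with texture, shape, color.",
    )

    # Step 3: knowledge-driven decision synthesis using the offline standards.
    rules = "\n".join(f"- {name}: {rule}" for name, rule in standards.items())
    answer = mllm(
        image_ref,
        f"Scene: {scene}\nFeatures: {features}\nStandards:\n{rules}\n"
        "Return the categories present, comma-separated.",
    )

    # Keep only categories covered by the standards; preserve order, drop duplicates.
    vocab: List[str] = []
    for token in answer.split(","):
        name = token.strip().lower()
        if name in standards and name not in vocab:
            vocab.append(name)
    return vocab


def stub_mllm(image_ref: str, prompt: str) -> str:
    """Deterministic test double so the sketch runs without a real model."""
    if prompt.startswith("Describe"):
        return "rural"
    if "Standards:" in prompt:
        return "Agricultural land, water, parking lot, water"
    return "striped parcels; dark smooth patch"


vocabulary = image_adaptive_vocabulary("tile_001.tif", stub_mllm, STANDARDS)
print(vocabulary)
```

In a real pipeline the returned list would be passed as the text vocabulary to an open-vocabulary segmentation model. Filtering the final answer against the known standards is a design choice made for this sketch, so that an unexpected model answer such as "parking lot" cannot introduce a category without an interpretation standard.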