arxiv_cs_cv April 24, 2026

Unlocking Multi-Spectral Data for Multi-Modal Models with Guided Inputs and Chain-of-Thought Reasoning

Translated: 2026/4/24 19:40:40

remote-sensing, multi-spectral-data, large-multi-modal-models, chain-of-thought, geospatial-ai

Original Content

arXiv:2604.21032v1 Announce Type: new

Abstract: Multi-spectral imagery is a valuable input signal for Remote Sensing applications, such as land-use and land-cover classification and environmental monitoring. However, generalist Large Multi-modal Models (LMMs) are typically trained on RGB images, limiting their applicability to the RGB domain. At the same time, training multi-spectral multi-modal models is expensive and produces uniquely specialized models. To address this, we propose a novel training-free approach that introduces multi-spectral data within the inference pipeline of standard RGB-only LMMs, allowing large gains in performance. Our approach leverages the LMMs' understanding of the visual space by adapting non-RGB inputs to that space and injecting domain-specific information and Chain-of-Thought reasoning as instructions. We demonstrate this with the Gemini 2.5 model and observe strong Zero-Shot performance gains on popular Remote Sensing benchmarks. These results highlight the potential for geospatial professionals to leverage powerful generalist models for specialized sensor inputs, benefiting from rich reasoning capabilities grounded in specialized data.
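The abstract does not say how non-RGB inputs are adapted or how the instructions are worded. As one possible reading, the sketch below maps three bands into a standard color-infrared composite (near-infrared shown as red), computes the well-known NDVI index, and builds an instruction text that explains the channel mapping and asks for step-by-step reasoning. The function names, the choice of bands, and the prompt wording are all assumptions made for illustration, not the authors' method. Sending the image and prompt to a model is left out.

```python
from typing import List, Sequence, Tuple

Band = List[List[float]]  # one spectral band: rows of reflectance values in [0, 1]


def to_uint8(band: Band) -> List[List[int]]:
    """Scale reflectance in [0, 1] to 0..255, clamping out-of-range values."""
    return [[max(0, min(255, int(v * 255 + 0.5))) for v in row] for row in band]


def false_color(nir: Band, red: Band, green: Band) -> List[List[Tuple[int, int, int]]]:
    """Color-infrared composite: NIR -> R channel, red -> G, green -> B."""
    n, r, g = to_uint8(nir), to_uint8(red), to_uint8(green)
    return [list(zip(n_row, r_row, g_row)) for n_row, r_row, g_row in zip(n, r, g)]


def ndvi(nir: Band, red: Band) -> Band:
    """NDVI = (NIR - Red) / (NIR + Red); 0.0 where the denominator is zero."""
    return [
        [(n - r) / (n + r) if (n + r) != 0 else 0.0 for n, r in zip(n_row, r_row)]
        for n_row, r_row in zip(nir, red)
    ]


def build_prompt(classes: Sequence[str]) -> str:
    """Hypothetical instruction text: channel semantics plus a step-by-step request."""
    return "\n".join([
        "The attached image is a false-color composite of satellite bands, not a natural photo.",
        "Channel mapping: red channel = near-infrared, green channel = red, blue channel = green.",
        "Healthy vegetation reflects strongly in near-infrared, so it appears bright red.",
        "Reason step by step about what each color pattern implies, then answer.",
        "Choose exactly one land-cover class from: " + ", ".join(classes) + ".",
    ])


# Toy 1x2 scene (assumed values): a vegetated pixel next to a water-like pixel.
nir = [[0.8, 0.05]]
red = [[0.2, 0.05]]
green = [[0.3, 0.10]]

image = false_color(nir, red, green)   # RGB-like pixels an RGB-only model can ingest
index = ndvi(nir, red)                 # high for vegetation, near zero for the second pixel
prompt = build_prompt(["Forest", "River", "Residential"])
```

The point of the sketch is the division of labor: the image stays in a form an RGB-trained model already understands, while the text carries the domain knowledge needed to interpret the unusual colors.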