arxiv_cs_cv April 20, 2026

PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

pixdlm, uav, reasoning-segmentation, multimodal-language-model, remote-sensing

Original Content

arXiv:2604.15670v1
Announce Type: new

Abstract: Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data poses distinct challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these issues, we formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV reasoning segmentation, providing a solid foundation for future research.