arxiv_cs_cv April 24, 2026

RefAerial: A Benchmark and Approach for Referring Detection in Aerial Images

Translated: 2026/4/24 19:52:33

referring-detection, aerial-imaging, machine-learning, computer-vision, benchmark

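Translation from Japanese

arXiv:2604.20543v2 Announce Type: replace Abstract: Referring detection is the task of detecting the target referred to by a natural-language expression, and it has rapidly attracted research interest in recent years. However, existing datasets are mostly limited to ground-level images in which a large object sits at the center of a relatively small scene. This paper proposes RefAerial, a large-scale and challenging dataset for referring detection in aerial images. Compared with conventional ground-level referring detection datasets, RefAerial is distinguished by four characteristics: (1) low but diverse object-to-scene ratios, (2) numerous targets and distractors, (3) complex and fine-grained referring descriptions, and (4) diverse and broad scenes seen from the aerial view. We also develop a human-in-the-loop referring expansion and annotation engine (REA-Engine) for efficient, semi-automated annotation of referring pairs. Furthermore, we observe that existing ground-level referring detection approaches suffer serious performance degradation on our aerial dataset because of the intrinsic scale-variety problem. We therefore propose a novel scale-comprehensive and sensitive (SCS) framework for referring detection in aerial images. The SCS framework consists of a mixture-of-granularity (MoG) attention mechanism and a two-stage comprehensive-to-sensitive (CtS) decoding strategy. Specifically, the mixture-of-granularity attention mechanism is developed for scale-comprehensive target understanding, and the two-stage comprehensive-to-sensitive decoding strategy is designed for coarse-to-fine decoding of the referred target. Finally, the proposed SCS framework achieves remarkable performance on our aerial referring detection dataset and also brings a promising performance gain on conventional ground-level referring detection datasets.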

Original Content

arXiv:2604.20543v2 Announce Type: replace Abstract: Referring detection aims to locate the target referred to by a natural-language expression, a task that has recently attracted growing research interest. However, existing datasets are limited to ground images with large objects centered in relatively small scenes. This paper introduces a large-scale, challenging dataset for referring detection in aerial images, termed RefAerial. It is distinguished from conventional ground referring detection datasets by four characteristics: (1) low but diverse object-to-scene ratios, (2) numerous targets and distractors, (3) complex and fine-grained referring descriptions, and (4) diverse and broad scenes in the aerial view. We also develop a human-in-the-loop referring expansion and annotation engine (REA-Engine) for efficient semi-automated referring pair annotation. Moreover, we observe that existing ground referring detection approaches exhibit serious performance degradation on our aerial dataset, owing to the intrinsic scale variety within and across aerial images. Therefore, we further propose a novel scale-comprehensive and sensitive (SCS) framework for referring detection in aerial images. It consists of a mixture-of-granularity (MoG) attention and a two-stage comprehensive-to-sensitive (CtS) decoding strategy. Specifically, the mixture-of-granularity attention is developed for scale-comprehensive target understanding. In addition, the two-stage comprehensive-to-sensitive decoding strategy is designed for coarse-to-fine referring target decoding. Finally, the proposed SCS framework achieves remarkable performance on our aerial referring detection dataset and even a promising performance boost on conventional ground referring detection datasets.
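
The abstract contrasts aerial and ground images by their object-to-scene ratios but does not define the measure. As a rough illustration only, the sketch below assumes the ratio is the bounding-box area divided by the image area. The function name `object_to_scene_ratio`, the `(x_min, y_min, x_max, y_max)` pixel box format, and the sample boxes and image sizes are all hypothetical and are not taken from the paper or the dataset.

```python
from typing import Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def object_to_scene_ratio(box: Box, image_width: int, image_height: int) -> float:
    """Return the fraction of the image area covered by the box.

    Assumption: "object-to-scene ratio" means box area / image area.
    The box is clipped to the image bounds before its area is measured.
    """
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")

    x_min, y_min, x_max, y_max = box
    # Clip each coordinate to the image so out-of-frame parts are ignored.
    x_min = min(max(x_min, 0.0), float(image_width))
    x_max = min(max(x_max, 0.0), float(image_width))
    y_min = min(max(y_min, 0.0), float(image_height))
    y_max = min(max(y_max, 0.0), float(image_height))

    # Degenerate or inverted boxes contribute zero area.
    box_width = max(0.0, x_max - x_min)
    box_height = max(0.0, y_max - y_min)
    return (box_width * box_height) / (image_width * image_height)


# Made-up examples: a centered object in a small ground photo versus
# a small vehicle-sized box in a large aerial frame.
ground_ratio = object_to_scene_ratio((160, 120, 480, 360), 640, 480)
aerial_ratio = object_to_scene_ratio((1000, 1000, 1040, 1020), 4000, 3000)
print(f"ground: {ground_ratio:.4f}, aerial: {aerial_ratio:.6f}")
```

Under this assumed definition, the made-up ground example covers a quarter of its image, while the made-up aerial example covers well under a hundredth of a percent. That gap is the kind of scale difference the abstract attributes to aerial scenes.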