arxiv_cs_cv April 24, 2026

UHR-DETR: Efficient End-to-End Small Object Detection for Ultra-High-Resolution Remote Sensing Imagery

Translated: 2026/4/24 19:43:56

det-object, remote-sensing, transformer, computer-vision, image-recognition

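Translation

arXiv:2604.21435v1 Announce Type: new Abstract: Ultra-high-resolution (UHR) imagery has become indispensable in modern remote sensing, providing unprecedented spatial coverage. However, detecting small objects in such vast scenes poses a fundamental dilemma: preserving the original resolution of small objects causes prohibitive memory bottlenecks. Conversely, conventional compromises such as image downsampling or patch cropping either erase small objects or destroy context. To resolve this dilemma, the paper proposes UHR-DETR, an efficient end-to-end transformer-based detector designed for UHR imagery. First, it introduces a Coverage-Maximizing Sparse Encoder, which dynamically allocates limited computational resources to informative high-resolution regions, achieving maximum object coverage with minimal spatial redundancy. Second, it designs a Global-Local Decoupled Decoder. This module integrates macroscopic scene awareness with microscopic object detail, resolving semantic ambiguities and preventing scene fragmentation. Extensive experiments on UHR imagery datasets (e.g., STAR and SODA-A) confirm the superiority of UHR-DETR under strict hardware constraints (e.g., a single 24GB RTX 3090). On the STAR dataset, it achieves a 2.8% mAP improvement while delivering a 10× inference speedup compared with standard sliding-window baselines. The code and models will be available at https://github.com/Li-JingFang/UHR-DETR.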

Original Content

arXiv:2604.21435v1 Announce Type: new Abstract: Ultra-High-Resolution (UHR) imagery has become essential for modern remote sensing, offering unprecedented spatial coverage. However, detecting small objects in such vast scenes presents a critical dilemma: retaining the original resolution for small objects causes prohibitive memory bottlenecks. Conversely, conventional compromises like image downsampling or patch cropping either erase small objects or destroy context. To break this dilemma, we propose UHR-DETR, an efficient end-to-end transformer-based detector designed for UHR imagery. First, we introduce a Coverage-Maximizing Sparse Encoder that dynamically allocates finite computational resources to informative high-resolution regions, ensuring maximum object coverage with minimal spatial redundancy. Second, we design a Global-Local Decoupled Decoder. By integrating macroscopic scene awareness with microscopic object details, this module resolves semantic ambiguities and prevents scene fragmentation. Extensive experiments on the UHR imagery datasets (e.g., STAR and SODA-A) demonstrate the superiority of UHR-DETR under strict hardware constraints (e.g., a single 24GB RTX 3090). It achieves a 2.8% mAP improvement while delivering a 10× inference speedup compared to standard sliding-window baselines on the STAR dataset. Our codes and models will be available at https://github.com/Li-JingFang/UHR-DETR.
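
The abstract does not say how the Coverage-Maximizing Sparse Encoder chooses its high-resolution regions. As a rough illustration of the general idea only (spending a fixed budget of high-resolution windows where a coarse score map suggests objects are), the sketch below uses a plain greedy maximum-coverage heuristic. This is not the authors' method: the function name `select_patches`, its parameters, and the toy score map are all hypothetical.

```python
import numpy as np


def select_patches(score_map, patch, budget):
    """Greedily choose up to `budget` square windows of side `patch`.

    Hypothetical illustration, not the UHR-DETR algorithm. `score_map`
    is a coarse 2-D array of non-negative "objectness" scores. Each
    round picks the window with the largest remaining score, then zeroes
    that window so later picks are not rewarded for overlapping it.
    """
    remaining = np.asarray(score_map, dtype=float).copy()
    height, width = remaining.shape
    chosen = []
    for _ in range(budget):
        best_score, best_pos = 0.0, None
        for y in range(height - patch + 1):
            for x in range(width - patch + 1):
                s = remaining[y:y + patch, x:x + patch].sum()
                if s > best_score:
                    best_score, best_pos = s, (y, x)
        if best_pos is None:
            # Nothing with positive score is left, so stop early.
            break
        y, x = best_pos
        chosen.append(best_pos)
        remaining[y:y + patch, x:x + patch] = 0.0
    return chosen


# Toy coarse score map with two isolated "objects".
scores = np.zeros((8, 8))
scores[1, 1] = 1.0
scores[6, 5] = 0.8
patches = select_patches(scores, patch=2, budget=3)
print(patches)
```

On this toy map the selection stops after two windows even though the budget allows three, because no uncovered score remains. A sliding-window baseline would instead process every window of the scene, whether or not it contains anything.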