arxiv_cs_cv April 24, 2026

VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection

Translated: 2026/4/24 19:44:37

Tags: sdgod, vision-foundation-models, domain-shift, object-detection, deep-learning

Abstract

arXiv:2604.21502v1. Announce Type: new.

In real-world scenarios, continual changes in weather, illumination, and imaging conditions cause significant domain shifts, so detectors trained on a single source domain degrade severely in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data augmentation or domain-invariant representation learning and pay limited attention to detector mechanisms, which leaves clear limitations under complex domain shifts. Through analytical experiments, we find that performance degradation is dominated by an increase in missed detections, which fundamentally arises from reduced cross-domain stability of the detector: object-background and inter-instance relations become less stable in the encoding stage, while the semantic-spatial alignment of query representations becomes harder to maintain in the decoding stage. To this end, we propose VFM$^{4}$SDG, a dual-prior learning framework for SDGOD that introduces a frozen vision foundation model (VFM) as a transferable cross-domain stability prior into detector representation learning and query modeling. In the encoding stage, we propose Cross-domain Stable Relational Prior Distillation to make object-background and inter-instance relational modeling more robust. In the decoding stage, we propose Semantic-Contextual Prior-based Query Enhancement, which injects category-level semantic prototypes and global visual context into queries to improve the stability of their semantic recognition and spatial localization in unseen domains. Extensive experiments show that the proposed method consistently outperforms existing state-of-the-art methods on standard SDGOD benchmarks and with two mainstream DETR-based detectors, demonstrating its effectiveness, robustness, and generality.
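The abstract names the two components but gives no formulas, so the following is only a minimal sketch of how such priors could be wired in, not the authors' implementation. It assumes that "relations" are pairwise cosine similarities between token features, that the distillation penalty is a mean squared error against the frozen VFM's relation matrix, and that query enhancement is an additive, softmax-weighted mix of class prototypes plus a global context vector. All function names (`cosine_similarity_matrix`, `relational_distillation_loss`, `enhance_query`) are hypothetical.

```python
import math


def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between token feature vectors."""
    # Zero vectors get norm 1.0 to avoid division by zero.
    norms = [math.sqrt(sum(x * x for x in f)) or 1.0 for f in features]
    return [
        [sum(a * b for a, b in zip(fi, fj)) / (ni * nj)
         for fj, nj in zip(features, norms)]
        for fi, ni in zip(features, norms)
    ]


def relational_distillation_loss(student_tokens, teacher_tokens):
    """Mean squared gap between student and teacher relation matrices.

    Assumption: cosine relations and an MSE penalty; the abstract
    specifies neither. The teacher stands in for the frozen VFM.
    """
    student_rel = cosine_similarity_matrix(student_tokens)
    teacher_rel = cosine_similarity_matrix(teacher_tokens)
    n = len(student_rel)
    total = sum(
        (student_rel[i][j] - teacher_rel[i][j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    return total / (n * n)


def enhance_query(query, prototypes, global_context):
    """Add a softmax-weighted prototype mix and global context to a query.

    Assumption: additive injection with dot-product attention weights.
    """
    scores = [sum(q * p for q, p in zip(query, proto)) for proto in prototypes]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    semantic = [
        sum(w * proto[d] for w, proto in zip(weights, prototypes))
        for d in range(len(query))
    ]
    return [q + s + g for q, s, g in zip(query, semantic, global_context)]


if __name__ == "__main__":
    # Two tokens; student and teacher may use different feature widths.
    student = [[3.0, 0.0], [0.0, 2.0]]
    teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    print(relational_distillation_loss(student, teacher))
    print(enhance_query([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5]))
```

Comparing relation matrices rather than raw features means the student and the VFM only need the same number of tokens, not the same feature width, which is one plausible reason to distill relations instead of features.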