arxiv_cs_cv February 10, 2026

Revealing the Semantic Selection Gap in Foundation Models through Training-Free Few-Shot Segmentation with DINOv3

Revealing the Semantic Selection Gap in DINOv3 through Training-Free Few-Shot Segmentation

Translated: 2026/3/15 18:04:10

dinov3, few-shot-segmentation, vision-transformer, self-supervised-learning, semantic-selection-gap

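Translation (rendered in English)

arXiv:2602.07550v1 Announce Type: new Abstract: Recent self-supervised Vision Transformers (ViTs) such as DINOv3 provide rich feature representations for dense vision tasks. This study investigates the intrinsic few-shot segmentation (FSS) capability of frozen DINOv3 features through FSSDINO, a training-free baseline that uses class-specific prototypes and Gram-matrix refinement. Our results on binary, multi-class, and cross-domain (CDFSS) benchmarks show that this minimal approach, applied to the final backbone layer, is highly competitive with specialized methods that rely on complex decoders or test-time adaptation. We conduct an Oracle-guided layer analysis and identify a significant performance gap between the standard last-layer features and globally optimal intermediate representations. We reveal a "Safest vs. Optimal" dilemma: although the Oracle shows that higher performance is attainable, matching the results of compute-intensive adaptation methods, current unsupervised and support-guided selection metrics consistently yield lower performance than the last-layer baseline. This defines a "Semantic Selection Gap" in Foundation Models, in which traditional heuristics cannot reliably identify high-fidelity features. Our work establishes the "Last-Layer" as a deceptively strong baseline and provides a rigorous diagnostic of the latent semantic potential of DINOv3. The code is publicly available at https://github.com/hussni0997/fssdino.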

Original Content

arXiv:2602.07550v1 Announce Type: new Abstract: Recent self-supervised Vision Transformers (ViTs), such as DINOv3, provide rich feature representations for dense vision tasks. This study investigates the intrinsic few-shot semantic segmentation (FSS) capabilities of frozen DINOv3 features through a training-free baseline, FSSDINO, utilizing class-specific prototypes and Gram-matrix refinement. Our results across binary, multi-class, and cross-domain (CDFSS) benchmarks demonstrate that this minimal approach, applied to the final backbone layer, is highly competitive with specialized methods involving complex decoders or test-time adaptation. Crucially, we conduct an Oracle-guided layer analysis, identifying a significant performance gap between the standard last-layer features and globally optimal intermediate representations. We reveal a "Safest vs. Optimal" dilemma: while the Oracle proves higher performance is attainable, matching the results of compute-intensive adaptation methods, current unsupervised and support-guided selection metrics consistently yield lower performance than the last-layer baseline. This characterizes a "Semantic Selection Gap" in Foundation Models, a disconnect where traditional heuristics fail to reliably identify high-fidelity features. Our work establishes the "Last-Layer" as a deceptively strong baseline and provides a rigorous diagnostic of the latent semantic potentials in DINOv3. The code is publicly available at https://github.com/hussni0997/fssdino.
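
To make the ideas in the abstract concrete, the following is a minimal sketch of prototype-based, training-free segmentation on patch features. It is not the FSSDINO implementation: the abstract does not specify how the prototypes or the Gram-matrix refinement are computed, so masked average pooling, the softmax-weighted Gram smoothing in `gram_refine`, the `temperature` and `threshold` values, and the ground-truth-based `oracle_layer` are all assumptions made for illustration. Synthetic vectors from the hypothetical helper `make_toy_patches` stand in for frozen DINOv3 patch features.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def masked_prototype(support_feats, support_mask):
    """Class prototype: mean of the support patch features inside the mask.

    support_feats: (N, D) patch features; support_mask: (N,) boolean.
    """
    return support_feats[support_mask].mean(axis=0)


def similarity_map(query_feats, prototype):
    """Cosine similarity between every query patch and the prototype, shape (N,)."""
    return l2_normalize(query_feats) @ l2_normalize(prototype)


def gram_refine(scores, query_feats, temperature=0.1):
    """ASSUMED refinement: smooth scores over patches with similar features.

    The Gram matrix of normalized query features is turned into row-wise
    softmax weights, and each patch score becomes a weighted average of the
    scores of its most similar patches.
    """
    q = l2_normalize(query_feats)
    gram = q @ q.T                                    # (N, N) self-similarity
    logits = (gram - gram.max(axis=1, keepdims=True)) / temperature
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ scores


def iou(pred, target):
    """Intersection over union of two boolean masks."""
    union = np.logical_or(pred, target).sum()
    return np.logical_and(pred, target).sum() / union if union else 1.0


def oracle_layer(layer_scores, target, threshold=0.5):
    """ASSUMED oracle: pick the layer whose thresholded scores best match ground truth."""
    ious = [iou(s > threshold, target) for s in layer_scores]
    return int(np.argmax(ious)), ious


def make_toy_patches(rng, n_fg=20, n_bg=44, dim=16, noise=0.05):
    """Hypothetical stand-in for backbone features: two noisy clusters."""
    centers = np.eye(dim)[:2]                         # foreground / background directions
    mask = np.arange(n_fg + n_bg) < n_fg
    feats = np.where(mask[:, None], centers[0], centers[1])
    return feats + noise * rng.standard_normal((n_fg + n_bg, dim)), mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    support_feats, support_mask = make_toy_patches(rng)
    query_feats, query_mask = make_toy_patches(rng)

    prototype = masked_prototype(support_feats, support_mask)
    raw = similarity_map(query_feats, prototype)
    refined = gram_refine(raw, query_feats)
    print("IoU raw:", iou(raw > 0.5, query_mask))
    print("IoU refined:", iou(refined > 0.5, query_mask))
```

The oracle in this sketch needs the query ground truth, so it is only a diagnostic upper bound. The "Semantic Selection Gap" described in the abstract is the observation that selection metrics without access to that ground truth do not recover the layer such an oracle would choose.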