arxiv_cs_cv February 10, 2026

PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification

Translated: 2026/3/15 18:05:59

vision-language-models, fine-grained-classification, knowledge-distillation, neural-network-architecture, prompt-learning

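Translated Abstract (from the Japanese)

arXiv:2602.07768v1 Announce Type: new Abstract: Distilling knowledge from large vision-language models (VLMs) into lightweight networks is important but difficult in fine-grained visual classification (FGVC), because of the reliance on fixed prompts and global alignment. To address this, we propose PAND (Prompt-Aware Neighborhood Distillation), a two-stage framework that separates semantic calibration from structural transfer. First, we introduce Prompt-Aware Semantic Calibration, which generates adaptive semantic anchors. Next, we introduce a neighborhood-aware structural distillation strategy that constrains the student's local decision structure. PAND consistently outperforms state-of-the-art methods on four FGVC benchmarks. In particular, our ResNet-18 student model reaches 76.09% accuracy on the CUB-200 dataset, exceeding the strong baseline VL2Lite by 3.4%. The code is available at https://github.com/LLLVTA/PAND.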

Original Content

arXiv:2602.07768v1 Announce Type: new Abstract: Distilling knowledge from large Vision-Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine-Grained Visual Classification (FGVC), due to the reliance on fixed prompts and global alignment. To address this, we propose PAND (Prompt-Aware Neighborhood Distillation), a two-stage framework that decouples semantic calibration from structural transfer. First, we incorporate Prompt-Aware Semantic Calibration to generate adaptive semantic anchors. Second, we introduce a neighborhood-aware structural distillation strategy to constrain the student's local decision structure. PAND consistently outperforms state-of-the-art methods on four FGVC benchmarks. Notably, our ResNet-18 student achieves 76.09% accuracy on CUB-200, surpassing the strong baseline VL2Lite by 3.4%. Code is available at https://github.com/LLLVTA/PAND.
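
The abstract does not give the form of the neighborhood-aware structural distillation loss, so the following is only an illustrative sketch of one generic way to constrain a student's local structure, not the authors' method. It assumes (hypothetically) that neighbors are picked by cosine similarity in the teacher's embedding space and that the student is penalized with a KL divergence between the teacher's and student's softmax-normalized similarities over those neighbors. The function names, `k`, and `temperature` are all illustrative choices.

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero vectors of equal length.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def softmax(xs, temperature=1.0):
    # Numerically stable softmax with a temperature.
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def neighborhood_kl(teacher, student, k=2, temperature=0.1):
    """Mean KL(teacher || student) over each sample's k nearest teacher neighbors.

    teacher, student: lists of embedding vectors, one per sample, same order.
    The two embedding spaces may have different dimensions, because
    similarities are computed within each space separately.
    """
    n = len(teacher)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Assumption: the neighborhood is defined by the teacher, not the student.
        t_sims = {j: cosine(teacher[i], teacher[j]) for j in others}
        nbrs = sorted(others, key=lambda j: t_sims[j], reverse=True)[:k]
        p = softmax([t_sims[j] for j in nbrs], temperature)
        q = softmax([cosine(student[i], student[j]) for j in nbrs], temperature)
        # KL divergence; terms with p == 0 contribute nothing, q is floored.
        total += sum(pi * math.log(pi / max(qi, 1e-12))
                     for pi, qi in zip(p, q) if pi > 0)
    return total / n

# Toy embeddings: two tight pairs in teacher space.
teacher = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
student_same = [row[:] for row in teacher]
student_bad = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.9, 0.1]]

loss_same = neighborhood_kl(teacher, student_same)
loss_bad = neighborhood_kl(teacher, student_bad)
```

In this toy setup a student that reproduces the teacher's local similarity pattern incurs zero loss, while one that scrambles which samples are close incurs a positive loss. In practice such a term would be computed on mini-batches with an autodiff framework and combined with a classification loss; those details are not specified in the abstract.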