arxiv_cs_cv April 24, 2026

Prototype-Based Test-Time Adaptation of Vision-Language Models

Translated: 2026/4/24 19:43:22

prototype-based-test-time-adaptation, vision-language-models, test-time-adaptation, clip, knowledge-distillation

Original Content

arXiv:2604.21360v1 Announce Type: new

Abstract: Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, performance is suboptimal when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. In particular, knowledge prototypes are adaptively weighted by the zero-shot class confidence of each test sample as the sample's visual features are incorporated into the corresponding class-specific prototype. Knowledge from past test samples is integrated and used solely through the prototypes, which eliminates the overhead of cache population and retrieval that limits the efficiency of existing TTA methods. As a result, PTA is highly efficient while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP's accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP's inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP's inference speed.
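The abstract does not give PTA's exact update rule, so the following is only a minimal sketch of the general idea it describes: one running prototype per class, updated with the test sample's visual feature weighted by its zero-shot confidence, with no per-sample cache. Everything in it is an assumption made for illustration, not the authors' implementation: the class name `PrototypeAdapter`, the parameters `logit_scale` and `alpha`, updating only the top zero-shot class, and blending each prototype with its text embedding.

```python
import math


def l2_normalize(v):
    """Scale v to unit Euclidean length; an all-zero vector is returned as zeros."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


class PrototypeAdapter:
    """Hypothetical prototype-based adapter (illustrative; not the paper's rule)."""

    def __init__(self, text_embeddings, logit_scale=100.0, alpha=1.0):
        # One text embedding per class, e.g. from a CLIP-style text encoder.
        self.text = [l2_normalize(t) for t in text_embeddings]
        dim = len(self.text[0])
        # One prototype per class: a confidence-weighted sum of visual features.
        self.prototypes = [[0.0] * dim for _ in self.text]
        self.logit_scale = logit_scale  # assumed softmax temperature
        self.alpha = alpha              # assumed weight of the prototype term

    def class_vectors(self):
        # Blend each text embedding with its normalized prototype.
        # A class with an empty prototype falls back to its text embedding.
        blended = []
        for t, p in zip(self.text, self.prototypes):
            p_hat = l2_normalize(p)
            blended.append(l2_normalize(
                [ti + self.alpha * pi for ti, pi in zip(t, p_hat)]))
        return blended

    def predict_and_update(self, image_feature):
        f = l2_normalize(image_feature)

        # Prediction uses the knowledge accumulated from earlier samples.
        logits = [self.logit_scale * dot(f, c) for c in self.class_vectors()]
        pred = argmax(logits)

        # Zero-shot confidence decides which prototype is updated and how much.
        zero_shot = softmax([self.logit_scale * dot(f, t) for t in self.text])
        k = argmax(zero_shot)
        w = zero_shot[k]
        self.prototypes[k] = [p + w * x for p, x in zip(self.prototypes[k], f)]
        return pred


if __name__ == "__main__":
    adapter = PrototypeAdapter([[1.0, 0.0], [0.0, 1.0]])
    for feature in ([0.9, 0.1], [0.2, 0.8], [0.8, 0.3]):
        print(adapter.predict_and_update(feature))
```

In this sketch the adapter's state is a single vector per class, so memory and per-sample compute depend on the number of classes and the feature dimension but not on how many test samples have been seen. That is the property the abstract contrasts with cache population and retrieval.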