arxiv_cs_cv February 10, 2026

Fair Context Learning for Evidence-Balanced Test-Time Adaptation in Vision-Language Models

Translated: 2026/2/11 13:41:29

Original Content

arXiv:2602.07027v1
Announce Type: new

Abstract: Vision-Language Models (VLMs) such as CLIP enable strong zero-shot recognition but suffer substantial degradation under distribution shifts. Test-Time Adaptation (TTA) aims to improve robustness using only unlabeled test samples, yet most prompt-based TTA methods rely on entropy minimization, an approach that can amplify spurious correlations and induce overconfident errors when classes share visual features. We propose Fair Context Learning (FCL), an episodic TTA framework that avoids entropy minimization by explicitly addressing shared-evidence bias. Motivated by our additive evidence decomposition assumption, FCL decouples adaptation into (i) augmentation-based exploration to identify plausible class candidates, and (ii) fairness-driven calibration that adapts text contexts to equalize sensitivity to common visual evidence. This fairness constraint mitigates partial feature obsession and enables effective calibration of text embeddings without relying on entropy reduction. Through extensive evaluation, we empirically validate our theoretical motivation and show that FCL achieves competitive adaptation performance relative to state-of-the-art TTA methods across diverse domain-shift and fine-grained benchmarks.
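
To make the contrast in the abstract concrete, the sketch below computes one common form of the entropy objective used in prompt-based TTA (the entropy of the prediction averaged over augmented views), next to a candidate-selection rule in the spirit of step (i). The abstract does not specify FCL's selection rule or its fairness loss, so `candidate_classes`, the "top-1 in any view" criterion, and the toy logits are all assumptions for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def marginal_entropy(view_logits):
    """Entropy of the class distribution averaged over augmented views.

    This is the kind of quantity entropy-minimization TTA drives down.
    """
    probs = [softmax(logits) for logits in view_logits]
    n_views = len(probs)
    n_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / n_views for c in range(n_classes)]
    return -sum(p * math.log(p) for p in mean if p > 0.0)


def candidate_classes(view_logits):
    """Hypothetical exploration rule: keep every class that is top-1
    in at least one augmented view (an assumption, not FCL's rule)."""
    top1 = {max(range(len(l)), key=l.__getitem__) for l in view_logits}
    return sorted(top1)


# Toy logits: 3 augmented views, 3 classes. Classes 0 and 1 share
# most of the visual evidence, so their scores stay close.
views = [
    [2.0, 1.9, -1.0],
    [1.8, 2.1, -1.2],
    [2.2, 2.0, -0.8],
]

print(candidate_classes(views))
print(round(marginal_entropy(views), 3))
```

In this toy case both class 0 and class 1 survive as candidates. Minimizing the entropy alone would push the model to commit to one of them even though the views disagree, which is the overconfidence failure mode the abstract attributes to shared visual features.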