arxiv_cs_cv February 10, 2026

Visual Prompt-Agnostic Evolution

visual-prompt-tuning, vision-transformer, machine-learning, deep-learning, neural-networks

Original Content

arXiv:2601.20232v2

Abstract: Visual Prompt Tuning (VPT) adapts a frozen Vision Transformer (ViT) to downstream tasks by inserting a small number of learnable prompt tokens into the token sequence at each layer. However, we observe that existing VPT variants often suffer from unstable training dynamics, characterized by gradient oscillations. A layer-wise analysis reveals that shallow-layer prompts tend to stagnate early, while deeper-layer prompts exhibit high-variance oscillations, leading to cross-layer mismatch. These issues slow convergence and degrade final performance. To address these challenges, we propose Prompt-Agnostic Evolution ($\mathtt{PAE}$), which strengthens visual prompt tuning by explicitly modeling prompt dynamics. From a frequency-domain perspective, we initialize prompts in a task-aware direction by uncovering and propagating frequency shortcut patterns that the backbone inherently exploits for recognition. To ensure coherent evolution across layers, we employ a shared Koopman operator that imposes a global linear transformation instead of uncoordinated, layer-specific updates. Finally, inspired by Lyapunov stability theory, we introduce a regularizer that constrains error amplification during evolution. Extensive experiments show that $\mathtt{PAE}$ accelerates convergence with an average $1.41\times$ speedup and improves accuracy by 1-3% on 25 datasets across multiple downstream tasks. Beyond performance, $\mathtt{PAE}$ is prompt-agnostic and lightweight, and it integrates seamlessly with diverse VPT variants without backbone modification or inference-time changes.
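The abstract names a shared linear operator across layers and a Lyapunov-inspired regularizer but gives no formulas. The toy sketch below is only one way to read those two ideas, not the paper's method. It assumes that the prompts of layer l+1 are obtained from those of layer l by right-multiplying with a single shared matrix K, and that the regularizer penalizes any increase of a simple energy function V(P) = ||P||_F^2 between consecutive layers. Both assumptions, along with every function name (`evolve_prompts`, `stability_penalty`, and so on), are hypothetical and chosen purely for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def energy(p):
    """Squared Frobenius norm, used here as an assumed Lyapunov-style function V(P)."""
    return sum(x * x for row in p for x in row)


def evolve_prompts(p0, k, num_layers):
    """Build one prompt matrix per layer by repeatedly applying the shared operator K.

    Assumption for illustration: P_{l+1} = P_l K with the same K at every layer.
    """
    prompts = [p0]
    for _ in range(num_layers - 1):
        prompts.append(matmul(prompts[-1], k))
    return prompts


def stability_penalty(prompts):
    """Sum of positive energy increases between consecutive layers.

    Zero when the energy never grows from one layer to the next.
    """
    return sum(
        max(0.0, energy(nxt) - energy(cur))
        for cur, nxt in zip(prompts, prompts[1:])
    )


# Two prompt tokens of dimension 2 (toy sizes, chosen only for illustration).
p0 = [[1.0, 0.0], [0.0, 2.0]]
contracting = [[0.5, 0.0], [0.0, 0.5]]  # shrinks the energy at every layer
expanding = [[2.0, 0.0], [0.0, 2.0]]    # amplifies the energy at every layer

stable = evolve_prompts(p0, contracting, num_layers=3)
unstable = evolve_prompts(p0, expanding, num_layers=3)

print(stability_penalty(stable))    # 0.0
print(stability_penalty(unstable))  # 75.0
```

In this reading, the contracting operator incurs no penalty, while the expanding operator is penalized for amplifying the prompts layer after layer. In an actual training setup such a term would presumably be added to the task loss with a weighting coefficient, which the abstract does not specify.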