arxiv_cs_cv February 10, 2026

Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models

Translated: 2026/3/15 16:07:02

adversarial-attacks, multimodal-learning, vision-language-models, gradient-refinement, universal-perturbation

Original Content

arXiv:2601.10313v2 Announce Type: replace

Abstract: Existing adversarial attacks on vision-language pre-training (VLP) models are mostly sample-specific, which incurs substantial computational overhead when they are scaled to large datasets or new scenarios. To overcome this limitation, we propose Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. For the image modality, we refine the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients to avoid local minima and stabilize universal perturbation learning. For the text modality, we hierarchically model textual importance by considering both intra- and inter-sentence contributions to identify globally influential words, which are then used as universal text perturbations. Extensive experiments across various downstream tasks, VLP models, and datasets demonstrate the superior transferability of the proposed universal multimodal attacks.
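The image-modality idea can be illustrated with a toy sketch. The abstract gives no update rule, so everything below is an assumption rather than the HRA algorithm: the linear scorer, the logistic loss, the Nesterov-style look-ahead used as the "estimated future gradient", and the names `attack_loss`, `loss_grad`, `universal_perturbation`, `eps`, `alpha`, and `mu` are all hypothetical. It shows only how a single shared perturbation can be updated from a blend of accumulated past gradients and a gradient evaluated at a look-ahead point.

```python
import math

def margin(w, x, y, delta):
    # Signed score of a toy linear model on the perturbed input x + delta.
    return y * sum(wi * (xi + di) for wi, xi, di in zip(w, x, delta))

def attack_loss(w, xs, ys, delta):
    # Mean logistic loss over all samples, each shifted by the same delta.
    return sum(math.log1p(math.exp(-margin(w, x, y, delta)))
               for x, y in zip(xs, ys)) / len(xs)

def loss_grad(w, xs, ys, delta):
    # Gradient of attack_loss with respect to the shared perturbation delta.
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        coeff = -y / (1.0 + math.exp(margin(w, x, y, delta)))
        for j, wj in enumerate(w):
            grad[j] += coeff * wj / len(xs)
    return grad

def clip(v, eps):
    # Project onto the L-infinity ball of radius eps.
    return [max(-eps, min(eps, vi)) for vi in v]

def l1_normalize(g):
    norm = sum(abs(gi) for gi in g) or 1.0
    return [gi / norm for gi in g]

def sign(v):
    return (v > 0) - (v < 0)

def universal_perturbation(w, xs, ys, eps=0.5, alpha=0.1, mu=0.9, steps=20):
    # Assumed update rule (not taken from the paper): momentum over past
    # gradients plus a gradient evaluated at a look-ahead point.
    delta = [0.0] * len(w)
    history = [0.0] * len(w)  # accumulated historical gradient
    for _ in range(steps):
        current = loss_grad(w, xs, ys, delta)
        # Step along the historical direction to estimate a future gradient.
        ahead = clip([d + alpha * mu * h for d, h in zip(delta, history)], eps)
        future = loss_grad(w, xs, ys, ahead)
        blended = l1_normalize([c + f for c, f in zip(current, future)])
        history = [mu * h + b for h, b in zip(history, blended)]
        # Signed ascent step on the loss, kept inside the eps budget.
        delta = clip([d + alpha * sign(h) for d, h in zip(delta, history)], eps)
    return delta
```

Calling `universal_perturbation(w, xs, ys)` on a small labelled set returns one `delta` that is added to every input. In this toy setting the result can be checked by comparing `attack_loss` with and without `delta`. A real VLP attack would replace the linear scorer with the model's image-text matching loss and iterate over mini-batches.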