arxiv_cs_cv February 10, 2026

Robustness of Vision Language Models Against Split-Image Harmful Input Attacks

vision-language-models, safety-alignment, adversarial-attacks, black-box-transfer, adversarial-knowledge-distillation

Original Content

arXiv:2602.08136v1 Announce Type: new

Abstract: Vision-Language Models (VLMs) are now a core part of modern AI. Recent work has proposed several visual jailbreak attacks that use single, holistic images. However, contemporary VLMs are strongly robust against such attacks because of extensive safety alignment through preference optimization (e.g., RLHF). In this work, we identify a new vulnerability: while VLM pretraining and instruction tuning generalize well to split-image inputs, safety alignment is typically performed only on holistic images and does not account for harmful semantics distributed across multiple image fragments. Consequently, VLMs often fail to detect and refuse harmful split-image inputs, in which unsafe cues emerge only after the images are combined. We introduce novel split-image visual jailbreak attacks (SIVA) that exploit this misalignment. Prior optimization-based attacks exhibit poor black-box transferability because of architectural and prior mismatches across models. In contrast, our attacks evolve in progressive phases, from naive splitting to an adaptive white-box attack, culminating in a black-box transfer attack. Our strongest strategy leverages a novel adversarial knowledge distillation (Adv-KD) algorithm to substantially improve cross-model transferability. Evaluations on three state-of-the-art VLMs and three jailbreak datasets demonstrate that our strongest attack achieves up to 60% higher transfer success than existing baselines. Lastly, we propose efficient ways to address this critical vulnerability in current VLM safety alignment.
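To make the notion of a split-image input concrete, the following is a minimal sketch of what the abstract calls naive splitting: one holistic image is cut into side-by-side fragments that recover the original only when recombined. It is not the paper's SIVA or Adv-KD procedure, which the abstract does not specify. The toy list-of-rows image and the names `split_vertical` and `merge_vertical` are assumptions made for illustration.

```python
from typing import List

Image = List[List[int]]  # toy grayscale image: rows of pixel values (assumption)


def split_vertical(image: Image, n_parts: int) -> List[Image]:
    """Cut an image into n_parts side-by-side fragments of near-equal width."""
    if not image or not image[0]:
        raise ValueError("image must be non-empty")
    width = len(image[0])
    if not 1 <= n_parts <= width:
        raise ValueError("n_parts must be between 1 and the image width")
    # Column boundaries; any remainder columns go to the leftmost fragments.
    base, extra = divmod(width, n_parts)
    bounds = [0]
    for i in range(n_parts):
        bounds.append(bounds[-1] + base + (1 if i < extra else 0))
    return [[row[lo:hi] for row in image] for lo, hi in zip(bounds, bounds[1:])]


def merge_vertical(fragments: List[Image]) -> Image:
    """Concatenate fragments left to right, recovering the holistic image."""
    n_rows = len(fragments[0])
    return [sum((frag[r] for frag in fragments), []) for r in range(n_rows)]


holistic = [[1, 2, 3, 4, 5],
            [6, 7, 8, 9, 10]]
fragments = split_vertical(holistic, 2)
```

Here `fragments` holds two partial images, and `merge_vertical(fragments)` reproduces `holistic`. A safety evaluation set built this way could pair each holistic image with its fragments to check whether a model's refusal behavior is consistent across both presentations.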