arxiv_cs_ai 2026年4月24日

自然スタイルトリガーに基づく LLM に対する静かなバックドア攻撃

Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

Translated: 2026/4/24 20:28:13

backdoor-attackllm-securityprompt-injectionstyle-transferpeft

Japanese Translation

arXiv:2604.21700v1 Announce Type: cross 要約：大規模言語モデル（LLM）が安全関連分野での応用が増えるにつれて、そのセキュリティに関する懸念が高まっています。近年の研究では、LLM に対するバックドア攻撃の実現可能性が示されています。しかし、既存の方法には以下の 3 つの主要な欠点があります：自然さを損なう明示的なトリガーパターン、長文生成における攻撃者が指定したペイロードの注入の不可靠さ、そして実践におけるバックドアの送達と活性化の態様が不明確な不完全な脅威モデルです。これらの課題に対処するため、私たちは「BadStyle」という完全なバックドア攻撃フレームワークとパイプラインを提案します。BadStyle は LLM を汚染サンプル生成器として活用し、可視化されないスタイルレベルのトリガーを帯びつつ、文脈と流暢さを保持した自然で静かな汚染サンプルを構築します。ファインチューニング中のペイロード注入を安定させるために、攻撃者が指定した目標コンテンツを汚染された入力への応答で強化し、無害な応答でのその出現を罰する補助的な目標損失を設計しました。さらに、現実的な脅威モデルに基づいて、BadStyle を誘発プロンプトおよび PEFT ベースの注入戦略の両方下で系統的に評価しました。LLaMA、Phi、DeepSeek、GPT シリーズを含める 7 つの被害 LLM における大規模実験において、BadStyle が高い攻撃成功率（ASR）と強い静かさを両立していることが示されました。提案された補助的な目標損失はバックドア活性化の安定性を大幅に改善し、スタイルレベルのトリガーにおいて平均 30% の ASR 向上を達成しました。注入時にも知られていない下流デプロイメントシナリオにおいて、埋め込まれたバックドアは依然として効果的です。さらに、BadStyle は代表例の入力レベルの防御を回避し、簡単なカモフラージュを通じて出力レベルの防御を迂回し続けることが確認されました。

Original Content

arXiv:2604.21700v1 Announce Type: cross Abstract: The growing application of large language models (LLMs) in safety-critical domains has raised urgent concerns about their security. Many recent studies have demonstrated the feasibility of backdoor attacks against LLMs. However, existing methods suffer from three key shortcomings: explicit trigger patterns that compromise naturalness, unreliable injection of attacker-specified payloads in long-form generation, and incompletely specified threat models that obscure how backdoors are delivered and activated in practice. To address these gaps, we present BadStyle, a complete backdoor attack framework and pipeline. BadStyle leverages an LLM as a poisoned sample generator to construct natural and stealthy poisoned samples that carry imperceptible style-level triggers while preserving semantics and fluency. To stabilize payload injection during fine-tuning, we design an auxiliary target loss that reinforces the attacker-specified target content in responses to poisoned inputs and penalizes its emergence in benign responses. We further ground the attack in a realistic threat model and systematically evaluate BadStyle under both prompt-induced and PEFT-based injection strategies. Extensive experiments across seven victim LLMs, including LLaMA, Phi, DeepSeek, and GPT series, demonstrate that BadStyle achieves high attack success rates (ASRs) while maintaining strong stealthiness. The proposed auxiliary target loss substantially improves the stability of backdoor activation, yielding an average ASR improvement of around 30% across style-level triggers. Even in downstream deployment scenarios unknown during injection, the implanted backdoor remains effective. Moreover, BadStyle consistently evades representative input-level defenses and bypasses output-level defenses through simple camouflage.