arxiv_cs_ai 2026年4月24日

SafeRedirect: フロントIERLLMにおける内部安全崩壊をタスク完了再導向によって破る

SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs

Translated: 2026/4/24 20:21:42

safe-redirectllm-safetyinternal-safety-collapseai-robustnessprompt-defense

Japanese Translation

arXiv:2604.20930v1 Announce Type: cross Abstract: 内部安全崩壊 (Internal Safety Collapse, ISC) は、正規のプロフェッショナルタスクにおいて、正しい完了が構造的に有害な内容の生成を必要としつつ、境界 LLM が自動的にその有害な内容を読み出して、失敗率が 95% を超える現象である。既存の入力レベルの防御は ISC に対して 100% の失敗を示し、標準的なプロンプト防御は部分的な対策に過ぎない。我々は、安全性を抑制するのではなく、モデルのタスク完了の駆動を再導向することにより ISC を破るシステムレベルのオーバーライド SafeRedirect を提案する。SafeRedirect は、タスク完了への失敗への明示的な許可を与え、確定的な停止出力を規定し、有害なプレースホルダーを未解決の状態にするようモデルに指示する。3 つの AI/ML 関連の ISC タスクタイプを跨ぐ 7 つの境界 LLM におけるシングルターン設定での評価において、SafeRedirect は平均的な非安全生成率を 71.2% から 8.0% へと低減させ、これに対する最も強力な有効なベースライン（55.0%）を大きく凌駕した。マルチモデルアブレーションにより、失敗の許可と条件の特定性は普遍的に重要であることが示され、その他の構成要素の重要性はモデル間で変化する。クロスアタック評価では、他のアタックファミリーに対してベースラインと同等以上の一般化性能を備えた ISC に対する state-of-the-art 防御が確認された。コードは https://github.com/fzjcdt/SafeRedirect に利用可能な。

Original Content

arXiv:2604.20930v1 Announce Type: cross Abstract: Internal Safety Collapse (ISC) is a failure mode in which frontier LLMs, when executing legitimate professional tasks whose correct completion structurally requires harmful content, spontaneously generate that content with safety failure rates exceeding 95%. Existing input-level defenses achieve a 100% failure rate against ISC, and standard system prompt defenses provide only partial mitigation. We propose SafeRedirect, a system-level override that defeats ISC by redirecting the model's task-completion drive rather than suppressing it. SafeRedirect grants explicit permission to fail the task, prescribes a deterministic hard-stop output, and instructs the model to preserve harmful placeholders unresolved. Evaluated on seven frontier LLMs across three AI/ML-related ISC task types in the single-turn setting, SafeRedirect reduces average unsafe generation rates from 71.2% to 8.0%, compared to 55.0% for the strongest viable baseline. Multi-model ablation reveals that failure permission and condition specificity are universally critical, while the importance of other components varies across models. Cross-attack evaluation confirms state-of-the-art defense against ISC with generalization performance at least on par with the baseline on other attack families. Code is available at https://github.com/fzjcdt/SafeRedirect.