arxiv_cs_ai February 10, 2026

ShallowJail: Steering Jailbreaks against Large Language Models

Translated: 2026/3/7 12:20:03

jailbreak, large-language-models, alignment-problems, shallow-alignment

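Summary

Large language models (LLMs) have succeeded in many fields, and alignment is usually applied to keep them from being used for harmful purposes. Aligned LLMs nevertheless remain vulnerable to jailbreak attacks, which deliberately mislead them into producing harmful outputs. Existing jailbreaks are either black-box attacks that rely on carefully crafted, conspicuous prompts, or white-box attacks that require resource-intensive computation. To address these limitations, the authors introduce ShallowJail, a new attack that exploits shallow alignment by manipulating the initial tokens of a response during inference. Their experiments show that ShallowJail substantially degrades the safety of responses from state-of-the-art LLMs.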

Original Content

arXiv:2602.07107v1 Announce Type: cross

Abstract: Large Language Models (LLMs) have been successful in numerous fields. Alignment has usually been applied to prevent them from serving harmful purposes. However, aligned LLMs remain vulnerable to jailbreak attacks that deliberately mislead them into producing harmful outputs. Existing jailbreaks are either black-box, using carefully crafted, unstealthy prompts, or white-box, requiring resource-intensive computation. In light of these challenges, we introduce ShallowJail, a novel attack that exploits shallow alignment in LLMs. ShallowJail can misguide LLMs' responses by manipulating the initial tokens during inference. Through extensive experiments, we demonstrate the effectiveness of ShallowJail, which substantially degrades the safety of state-of-the-art LLM responses.