arxiv_cs_lg February 10, 2026

AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint

Translated: 2026/3/15 9:06:03

activation-steering, llm-safety, jailbreak-attack, machine-learning, prompt-response

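Translation

As LLMs are increasingly deployed in real-world applications, ensuring that they can refuse malicious prompts, especially jailbreak attacks, is essential for safe and reliable use. Recently, activation steering, which adds a refusal direction vector to an LLM's internal activations during inference, has emerged as an effective approach to strengthening LLM safety. However, applying activation steering indiscriminately fundamentally suffers from a trade-off between safety and utility, because the same steering vector that induces refusal on malicious prompts can also cause over-refusal and degraded performance on benign prompts. Prior efforts such as vector calibration and conditional steering have tried to mitigate this trade-off, but their lack of theoretical grounding limits their robustness and effectiveness. To better resolve the trade-off, we propose AlphaSteer, a theoretically grounded and empirically effective activation steering method. Specifically, it treats activation steering as a learnable process with two principled learning objectives: utility preservation and safety enhancement. For utility preservation, it learns, under null-space constraints, to construct a nearly zero steering vector for benign data. For safety enhancement, it learns, with the help of linear regression, to construct a refusal direction vector for steering malicious data. Experiments across multiple jailbreak attacks and utility benchmarks demonstrate the effectiveness of AlphaSteer, which significantly improves LLM safety without compromising general capabilities. Our code is available at https://github.com/AlphaLab-USTC/AlphaSteer.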

Original Content

arXiv:2506.07022v2 Announce Type: replace

Abstract: As LLMs are increasingly deployed in real-world applications, ensuring their ability to refuse malicious prompts, especially jailbreak attacks, is essential for safe and reliable use. Recently, activation steering has emerged as an effective approach for enhancing LLM safety: it adds a refusal direction vector to the internal activations of an LLM during inference, which induces refusal behavior. However, indiscriminately applying activation steering fundamentally suffers from a trade-off between safety and utility, since the same steering vector can also lead to over-refusal and degraded performance on benign prompts. Although prior efforts, such as vector calibration and conditional steering, have attempted to mitigate this trade-off, their lack of theoretical grounding limits their robustness and effectiveness. To better address the trade-off between safety and utility, we present a theoretically grounded and empirically effective activation steering method called AlphaSteer. Specifically, it considers activation steering as a learnable process with two principled learning objectives: utility preservation and safety enhancement. For utility preservation, it learns to construct a nearly zero vector for steering benign data, under null-space constraints. For safety enhancement, it learns to construct a refusal direction vector for steering malicious data, with the help of linear regression. Experiments across multiple jailbreak attacks and utility benchmarks demonstrate the effectiveness of AlphaSteer, which significantly improves the safety of LLMs without compromising general capabilities. Our code is available at https://github.com/AlphaLab-USTC/AlphaSteer.
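The two learning objectives in the abstract can be illustrated with a small NumPy sketch on synthetic activations. This is not the authors' implementation (the linked repository is the reference for that). The function names `null_space_projector`, `fit_steering_map`, and `steer`, the eigenvalue threshold `rel_tol`, the `ridge` penalty, the additive form `h + strength * Delta @ P @ h`, and the random `refusal_dir` are all assumptions chosen for illustration. The idea shown is that a projector `P` onto the null space of benign activations maps benign inputs to a nearly zero steering vector, while a ridge regression on the projected malicious activations learns a map `Delta` that sends them toward the refusal direction.

```python
import numpy as np


def null_space_projector(H_benign, rel_tol=1e-8):
    """Projector onto the numerical null space of benign activations.

    H_benign has shape (d, n): one d-dimensional activation per column.
    """
    # eigh returns eigenvalues of the symmetric d x d matrix in ascending order.
    eigvals, eigvecs = np.linalg.eigh(H_benign @ H_benign.T)
    # Assumption: keep directions whose eigenvalue is negligible
    # relative to the largest one.
    U = eigvecs[:, eigvals <= rel_tol * eigvals[-1]]
    return U @ U.T


def fit_steering_map(H_malicious, refusal_dir, P, ridge=1e-3):
    """Ridge regression for Delta such that Delta @ P @ h is close to
    refusal_dir for every malicious activation h (columns of H_malicious)."""
    X = P @ H_malicious                              # projected inputs, (d, n)
    R = np.outer(refusal_dir, np.ones(X.shape[1]))   # regression targets, (d, n)
    A = X @ X.T + ridge * np.eye(X.shape[0])
    # Closed form: Delta = R X^T A^{-1}. A is symmetric, so solve for Delta^T.
    return np.linalg.solve(A, X @ R.T).T


def steer(H, Delta, P, strength=1.0):
    """Add the learned steering vector to each activation column of H."""
    return H + strength * (Delta @ (P @ H))


# Synthetic data (hypothetical): benign activations occupy 6 of 16 dimensions,
# malicious activations are shifted off that subspace.
rng = np.random.default_rng(0)
d = 16
H_benign = np.zeros((d, 200))
H_benign[:6] = rng.standard_normal((6, 200))
H_malicious = 0.1 * rng.standard_normal((d, 100)) + 1.0
refusal_dir = rng.standard_normal(d)
refusal_dir /= np.linalg.norm(refusal_dir)

P = null_space_projector(H_benign)
Delta = fit_steering_map(H_malicious, refusal_dir, P)

benign_shift = steer(H_benign, Delta, P) - H_benign
malicious_shift = steer(H_malicious, Delta, P) - H_malicious
malicious_error = np.linalg.norm(
    malicious_shift - refusal_dir[:, None], axis=0
).mean()

print("largest change to a benign activation:", np.abs(benign_shift).max())
print("mean distance of malicious shift from refusal_dir:", malicious_error)
```

On this toy data the benign activations should be left essentially unchanged, while the shift applied to malicious activations should lie close to `refusal_dir`. Real activations are not exactly low-rank, so a practical variant would probably need a softer rule for choosing the null-space directions than the strict threshold assumed here.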