arxiv_cs_lg February 10, 2026

Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

llm, alignment-tax, continual-learning, safety-training, gradient-projection

Original Content

arXiv:2602.07892v1 Announce Type: new

Abstract: Large Language Models (LLMs) often incur an alignment tax: safety post-training can reduce general utility (e.g., reasoning and coding). We argue that this tax primarily arises from continual-learning-style forgetting in sequential alignment, where distribution shift and conflicting objectives cause safety updates to overwrite pre-trained competencies. Accordingly, we cast safety alignment as a continual learning (CL) problem that must balance plasticity (acquiring safety constraints) and stability (preserving general abilities). We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight method that mitigates interference by constraining each safety update to be orthogonal (in a first-order sense) to a learned subspace capturing general capabilities. Specifically, OGPSA estimates a low-rank capability subspace from gradients on a small reference set and projects the safety gradient onto its orthogonal complement before updating. This produces safety-directed updates that minimally perturb prior knowledge while retaining capacity for alignment. OGPSA is plug-and-play and integrates into standard post-training pipelines without large-scale replay, auxiliary objectives, or retraining. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT→DPO settings, OGPSA consistently improves the safety-utility Pareto frontier over standard baselines. For instance, on Qwen2.5-7B-Instruct under SFT→DPO, OGPSA preserves strong safety while recovering general capability, improving SimpleQA from 0.53% to 3.03% and IFEval from 51.94% to 63.96%. Our source code is available at https://github.com/SunGL001/OGPSA.
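
The projection step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes the capability subspace is taken as the top singular directions of a matrix of flattened reference gradients, which the abstract does not specify. The names `capability_basis`, `project_out`, and `rank`, and the random vectors standing in for real gradients, are hypothetical.

```python
import numpy as np


def capability_basis(ref_grads: np.ndarray, rank: int) -> np.ndarray:
    """Return an orthonormal basis (d x rank) for the top-`rank` directions
    of the reference gradients (one flattened gradient per row).

    Assumption: the subspace is estimated with a truncated SVD.
    """
    # Right singular vectors of the (n_ref x d) matrix live in gradient space.
    _, _, vt = np.linalg.svd(ref_grads, full_matrices=False)
    return vt[:rank].T


def project_out(safety_grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project `safety_grad` onto the orthogonal complement of span(basis)."""
    return safety_grad - basis @ (basis.T @ safety_grad)


rng = np.random.default_rng(0)
d, n_ref, rank = 32, 8, 4
ref_grads = rng.normal(size=(n_ref, d))  # stand-in for gradients on a small reference set
safety_grad = rng.normal(size=d)         # stand-in for a safety-loss gradient

U = capability_basis(ref_grads, rank)
g_proj = project_out(safety_grad, U)

# Plain gradient step using the projected gradient (optimizer choice is an assumption).
lr = 0.1
params = rng.normal(size=d)
params_new = params - lr * g_proj
```

Because `g_proj` has no component along the columns of `U`, the step leaves the loss unchanged to first order along those directions, which is the "first-order sense" of orthogonality mentioned in the abstract. With `rank` smaller than the number of reference gradients, only the retained directions are protected.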