arxiv_cs_lg February 10, 2026

Safety Subspaces are Not Linearly Distinct: A Fine-Tuning Case Study

large-language-models, safety-alignment, fine-tuning, neural-subspaces, rlhf

Original Content

arXiv:2505.14185v3

Abstract: Large Language Models (LLMs) rely on safety alignment to produce socially acceptable responses. However, this behavior is known to be brittle: further fine-tuning, even on benign or lightly contaminated data, can degrade safety and reintroduce harmful behaviors. A growing body of work suggests that alignment may correspond to identifiable directions in weight space, forming subspaces that could, in principle, be isolated or preserved to defend against misalignment. In this work, we conduct a comprehensive empirical study of this perspective. We examine whether safety-relevant behavior is concentrated in specific linear subspaces, whether it can be separated from general-purpose learning, and whether harmfulness arises from distinguishable patterns in activations. Across both weight and activation spaces, our findings are consistent: subspaces that amplify safe behaviors also amplify useful ones, and prompts with different safety implications activate overlapping representations. Rather than residing in distinct directions, we show that safety is highly entangled with the general learning components of the model. This suggests that subspace-based defenses face fundamental limitations and underscores the need for alternative strategies to preserve safety under continued training. We corroborate these findings with multiple experiments on five open-source LLMs from the Llama and Qwen families. Our code is publicly available at: https://github.com/CERT-Lab/safety-subspaces.
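
The abstract's central question, whether a "safety" subspace in weight space is distinct from the directions used by general-purpose learning, can be made concrete with a small numerical sketch. The code below is not the authors' method or code (that lives in the linked repository). It is a minimal illustration under stated assumptions: the weight updates are synthetic random matrices, the subspace is taken to be the top-k left singular vectors of an update, and the helper names `top_k_basis` and `captured_energy` are hypothetical. It measures how much of one update's energy falls inside the subspace extracted from another.

```python
import numpy as np


def top_k_basis(delta_w: np.ndarray, k: int) -> np.ndarray:
    """Return an orthonormal basis (as columns) for the top-k left
    singular subspace of a weight update. Hypothetical helper."""
    u, _, _ = np.linalg.svd(delta_w, full_matrices=False)
    return u[:, :k]


def captured_energy(basis: np.ndarray, delta_w: np.ndarray) -> float:
    """Fraction of the squared Frobenius norm of delta_w that lies
    inside span(basis). Hypothetical helper."""
    projected = basis @ (basis.T @ delta_w)
    return float(np.linalg.norm(projected) ** 2 / np.linalg.norm(delta_w) ** 2)


rng = np.random.default_rng(0)
d_out, d_in, k = 64, 48, 8

# Synthetic stand-ins for fine-tuning updates (assumption: no real model
# weights are used). Two updates share one k-dimensional column space,
# which mimics the fully entangled case; the third is unstructured.
shared = rng.standard_normal((d_out, k))
safety_update = shared @ rng.standard_normal((k, d_in))
useful_update = shared @ rng.standard_normal((k, d_in))
random_update = rng.standard_normal((d_out, d_in))

basis = top_k_basis(safety_update, k)
entangled = captured_energy(basis, useful_update)
baseline = captured_energy(basis, random_update)

print(f"energy of useful update inside safety subspace: {entangled:.3f}")
print(f"energy of random update inside safety subspace: {baseline:.3f}")
```

In this toy setup the "useful" update lies almost entirely inside the subspace extracted from the "safety" update, while the unstructured update captures only a small fraction, on the order of k / d_out. If real safety and task updates behaved like the first case, a filter that preserved or removed that subspace would affect both behaviors together, which is the kind of entanglement the abstract reports.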