arxiv_cs_ai April 24, 2026

SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging

merging, fine-tuning, safety-alignment, llm, machine-learning

Original Content

arXiv:2503.17239v3

Abstract: Fine-tuning large language models (LLMs) is a common practice for adapting generalist models to specialized domains. However, recent studies show that fine-tuning can erode safety alignment, causing LLMs to comply with harmful or unethical prompts. Many methods for restoring safety alignment have been proposed, but they often introduce custom algorithms that are difficult to implement, or they compromise task utility. In this work, we propose SafeMERGE, a lightweight post-fine-tuning framework that restores safety while maintaining downstream performance. SafeMERGE selectively merges fine-tuned model layers with their safety-aligned counterparts, and only when the fine-tuned layers deviate from safe behavior as measured by a cosine similarity criterion. Across four LLMs and several tasks, SafeMERGE consistently reduces harmful outputs compared to other defenses, with negligible or even positive impact on utility. Our results demonstrate that selective, layer-wise merging offers a robust safeguard against the inadvertent loss of safety during fine-tuning, establishing SafeMERGE as a simple yet effective post-fine-tuning defense.
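
The selective, layer-wise merge described in the abstract can be illustrated with a minimal sketch. The abstract does not say which quantities the cosine similarity compares, what threshold is used, or how the merge is weighted, so everything below is an assumption for illustration: layers are represented as flat lists of floats, similarity is computed directly between the fine-tuned and safety-aligned weights, and the names `selective_merge`, `tau` (similarity threshold), and `alpha` (interpolation weight) are hypothetical. It is not the authors' implementation.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def selective_merge(
    fine_tuned: Dict[str, Vector],
    safe: Dict[str, Vector],
    tau: float = 0.9,    # hypothetical threshold; not taken from the abstract
    alpha: float = 0.5,  # hypothetical weight on the fine-tuned layer
) -> Tuple[Dict[str, Vector], List[str]]:
    """Merge only the layers whose similarity to the safe model falls below tau.

    Returns the resulting layers and the names of the layers that were merged.
    """
    merged: Dict[str, Vector] = {}
    merged_names: List[str] = []
    for name, ft_w in fine_tuned.items():
        safe_w = safe[name]
        if cosine_similarity(ft_w, safe_w) < tau:
            # Layer deviates: linearly interpolate toward the safety-aligned layer.
            merged[name] = [alpha * f + (1.0 - alpha) * s for f, s in zip(ft_w, safe_w)]
            merged_names.append(name)
        else:
            # Layer is close enough to the safe model: keep the fine-tuned weights.
            merged[name] = list(ft_w)
    return merged, merged_names


if __name__ == "__main__":
    # Toy two-layer "models"; the values are made up for the example.
    fine_tuned = {"layer0": [1.0, 0.0], "layer1": [0.0, 1.0]}
    safe = {"layer0": [1.0, 0.1], "layer1": [1.0, 0.0]}
    layers, changed = selective_merge(fine_tuned, safe)
    print(changed, layers)
```

Leaving untouched every layer that passes the similarity check is what would keep task utility largely intact in this sketch, since only the deviating layers are pulled back toward the safety-aligned model.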