arxiv_cs_ai April 20, 2026

Defending Models Against Unauthorized Distillation with Trace Rewriting: Techniques for Modifying Teacher-Generated Reasoning Traces

Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

Translated: 2026/4/20 11:16:23

knowledge-distillation, llm-security, api-watermarking, trace-rewriting, model-defense

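Japanese Translation (rendered in English)

arXiv:2602.15143v2 Announce Type: replace Abstract: Knowledge distillation is a widely adopted technique for transferring the capabilities of large language models (LLMs) to smaller, more efficient student models. However, unauthorized knowledge distillation takes unfair advantage of the enormous effort and cost invested in developing frontier models. This paper examines how to modify teacher-generated reasoning traces to achieve two objectives that deter such misuse: (1) anti-distillation, which lowers the training usefulness of responses, and (2) API watermarking, which embeds a verifiable signature in student models. It introduces several methods for dynamically rewriting a teacher's reasoning output while preserving answer correctness and semantic coherence. Two of them rely on the rewriting capabilities of LLMs, and the rest use gradient-based techniques. Experiments show that a simple instruction-based rewriting method yields a strong anti-distillation effect while maintaining or even improving teacher performance. The rewriting approach also makes it possible to embed watermarks that can be detected reliably with essentially no false alarms. The source code is available at https://github.com/xhOwenMa/trace-rewriting.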

Original Content

arXiv:2602.15143v2 Announce Type: replace Abstract: Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) anti-distillation, or degrading the training usefulness of query responses, and (2) API watermarking, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. Furthermore, we show that our rewriting approach also enables embedding watermarks that can be reliably detected with essentially no false alarms. Our code is available at https://github.com/xhOwenMa/trace-rewriting.
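
The abstract does not give the prompts, the watermark design, or the detection statistic, so the following is only a minimal sketch of the two ideas it names, not the authors' implementation. It assumes a hypothetical `rewriter` callable standing in for an LLM call, a hypothetical `REWRITE_INSTRUCTION`, an assumed convention that traces end with an `Answer:` line, and a simple one-sided binomial test as one possible way to keep false alarms rare when checking a suspect model for a watermark.

```python
import re
from math import comb
from typing import Callable, Optional

# Hypothetical instruction; the paper's actual prompt is not stated in the abstract.
REWRITE_INSTRUCTION = (
    "Rewrite the reasoning below so that it still reaches the same final answer."
)


def extract_answer(trace: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker (an assumed convention)."""
    matches = re.findall(r"Answer:\s*(.+)", trace)
    return matches[-1].strip() if matches else None


def rewrite_trace(trace: str, rewriter: Callable[[str], str]) -> str:
    """Rewrite a teacher trace, keeping the original if the final answer changes."""
    original = extract_answer(trace)
    candidate = rewriter(f"{REWRITE_INSTRUCTION}\n\n{trace}")
    # Only accept the rewrite when the final answer is provably preserved.
    if original is not None and extract_answer(candidate) == original:
        return candidate
    return trace


def watermark_p_value(hits: int, probes: int, base_rate: float) -> float:
    """One-sided binomial tail P(X >= hits) for X ~ Binomial(probes, base_rate).

    `hits` counts probe responses that show the (hypothetical) watermark
    signature; `base_rate` is how often an unwatermarked model would show it.
    """
    if not 0 <= hits <= probes:
        raise ValueError("hits must be between 0 and probes")
    return sum(
        comb(probes, k) * base_rate**k * (1 - base_rate) ** (probes - k)
        for k in range(hits, probes + 1)
    )


if __name__ == "__main__":
    def stub_rewriter(prompt: str) -> str:
        # Stand-in for an LLM call; returns a fixed rewritten trace.
        return "Compute 6 * 7 directly.\nAnswer: 42"

    print(rewrite_trace("6 times 7 is 42.\nAnswer: 42", stub_rewriter))
    print(watermark_p_value(hits=18, probes=20, base_rate=0.05))
```

Falling back to the original trace whenever the answer check fails is one way to honor the abstract's requirement that rewriting preserve answer correctness. For detection, a very small p-value would suggest that the suspect model was trained on watermarked traces, and choosing a strict threshold is what keeps the false alarm rate low in this sketch.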