arxiv_cs_lg February 10, 2026

Efficient Representations are Controllable Representations

Translated: 2026/3/15 14:48:22

llm, interpretability, gradient-descent, feature-geometry, fine-tuning

Original Content

arXiv:2602.07828v1 Announce Type: new

Abstract: What is the most brute-force way to install interpretable, controllable features into a model's activations? Controlling how LLMs internally represent concepts typically requires sophisticated methods to first identify, then intervene on the model's existing feature geometry. We bypass all of this. We finetune an LLM with a simple auxiliary loss, training 16 of its 3072 residual stream dimensions to be inert interpretability flags that simply indicate what concepts are required for generation. The model reorganizes around them anyway, learning to rely on these flags during actual generation tasks. As a result, these inert flags become genuine internal features: interpretable control switches that allow us to steer generation at inference time. Why does this work? When a feature is reliably supplied at a fixed location, gradient descent gradually eliminates redundant encodings elsewhere, and the model erodes its own alternative representations. A model's efficiency pressure is a lever, exploitable to induce interpretable, controllable representations.
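
The abstract gives only the outline of the method: an auxiliary loss on 16 of 3072 residual stream dimensions, and steering through those same dimensions at inference time. The following is a minimal sketch of what such a setup could look like, not the authors' implementation. The choice of the first 16 dimensions, the binary cross-entropy form of the loss, the `aux_weight` parameter, and the function names `flag_aux_loss`, `total_loss`, and `steer` are all assumptions made for illustration. Plain Python lists stand in for a residual stream vector at a single token position.

```python
import math

D_MODEL = 3072   # residual stream width stated in the abstract
N_FLAGS = 16     # number of reserved flag dimensions stated in the abstract

# ASSUMPTION: the abstract does not say which 16 dimensions are reserved;
# this sketch simply uses the first 16.
FLAG_DIMS = list(range(N_FLAGS))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def flag_aux_loss(residual, concept_labels):
    """Mean binary cross-entropy between the flag dims and 0/1 concept labels.

    ASSUMPTION: the abstract only says "a simple auxiliary loss"; BCE on the
    raw activations (treated as logits) is one plausible choice, not the
    confirmed one.
    """
    total = 0.0
    for dim, label in zip(FLAG_DIMS, concept_labels):
        p = sigmoid(residual[dim])
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / N_FLAGS


def total_loss(task_loss, residual, concept_labels, aux_weight=1.0):
    """Task loss plus the weighted flag loss (aux_weight is hypothetical)."""
    return task_loss + aux_weight * flag_aux_loss(residual, concept_labels)


def steer(residual, flag_index, value):
    """Return a copy of the residual vector with one flag dim overwritten.

    This mimics using a flag as a control switch at inference time; in a real
    model the write would happen inside a forward pass.
    """
    steered = list(residual)
    steered[FLAG_DIMS[flag_index]] = value
    return steered


# Example: concept 0 is required for generation, the other 15 are not.
labels = [1] + [0] * (N_FLAGS - 1)

# An uninformative residual (all zeros) gives p = 0.5 on every flag.
blank = [0.0] * D_MODEL
loss_blank = flag_aux_loss(blank, labels)

# A residual whose flag dims agree with the labels gives a much lower loss.
aligned = list(blank)
for i, label in enumerate(labels):
    aligned[FLAG_DIMS[i]] = 8.0 if label else -8.0
loss_aligned = flag_aux_loss(aligned, labels)

# Steering: switch flag 3 on without touching the other dimensions.
steered = steer(aligned, 3, 8.0)
```

In this sketch the flag loss only shapes the 16 reserved dimensions. The abstract's central claim, that the model then comes to rely on those dimensions and erodes redundant encodings elsewhere, is an effect of finetuning on the combined loss and is not reproduced by this toy code.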