arxiv_cs_ai April 24, 2026

Addressing divergent representations from causal interventions on neural networks

Translated: 2026/4/24 20:33:13

causal-interventions, neural-networks, mechanistic-interpretability, counterfactual-loss, distribution-shift

Original Content

arXiv:2511.04638v5 Announce Type: replace-cross

Abstract: A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful their resulting explanations are to the target model in its natural state. First, we demonstrate theoretically and empirically that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. Then, we provide a theoretical analysis of two cases of such divergences: "harmless" divergences that occur in the behavioral null-space of the layer(s) of interest, and "pernicious" divergences that activate hidden network pathways and cause dormant behavioral changes. Finally, in an effort to mitigate the pernicious cases, we apply and modify the Counterfactual Latent (CL) loss from Grant (2025), allowing representations from causal interventions to remain closer to the natural distribution, reducing the likelihood of harmful divergences while preserving the interpretive power of the interventions. Together, these results highlight a path towards more reliable interpretability methods.
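
The distinction between "harmless" and "pernicious" divergences can be made concrete with a toy sketch. This is not the paper's construction. The three-coordinate representations, the two readout functions, the gate threshold of 1.0, and the names `readout_harmless`, `readout_pernicious`, and `divergence` are all assumptions chosen for illustration. The idea is that the same off-distribution intervention either leaves behavior untouched (the changed coordinate sits in the readout's null-space) or switches on a pathway that never fires on natural inputs.

```python
import math

# Assumed "natural" representations: the third coordinate is always 0.
natural = [(0.5, -1.0, 0.0), (1.5, 0.25, 0.0), (-2.0, 0.75, 0.0)]

def divergence(h, reference):
    """Distance from h to the nearest natural representation."""
    return min(math.dist(h, r) for r in reference)

def readout_harmless(h):
    # Linear readout that ignores the third coordinate, so that
    # coordinate lies in the behavioral null-space of this layer.
    return (h[0], h[1])

def readout_pernicious(h):
    # Same readout plus a ReLU-style pathway gated on the third coordinate.
    # The gate is 0 for every natural representation (h[2] == 0 <= 1.0),
    # so the pathway is dormant unless an intervention pushes h[2] above 1.0.
    gate = max(h[2] - 1.0, 0.0)
    return (h[0] + gate, h[1] - gate)

h_natural = natural[0]
# Hypothetical intervention: overwrite the third coordinate with a value
# that never occurs in the natural set.
h_intervened = (h_natural[0], h_natural[1], 3.0)

# The intervened representation is off-distribution in both cases.
print("divergence:", divergence(h_intervened, natural))

# Harmless case: behavior is identical despite the divergence.
print("harmless:", readout_harmless(h_natural), readout_harmless(h_intervened))

# Pernicious case: the dormant pathway activates and behavior changes.
print("pernicious:", readout_pernicious(h_natural), readout_pernicious(h_intervened))
```

In this sketch both readouts agree on every natural representation, so observing natural behavior alone cannot tell them apart. Only the intervened input reveals the difference, which is why an explanation built from such interventions may not be faithful to the model in its natural state.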