arxiv_cs_lg April 24, 2026

Surrogate modeling for interpreting black-box LLMs in medical predictions

Translated: 2026/4/24 20:04:45

surrogate-modeling, llm-interpretability, black-box, medical-ai, prompt-engineering

Original Content

arXiv:2604.20331v2 (Announce Type: cross)

Abstract: Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge within their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, this framework approximates the latent LLM knowledge space using observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical predictions, we demonstrate our framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output. Particularly, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments using this framework quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions within LLM-encoded knowledge. By disclosing these issues, our framework can act as a red-flag indicator to support the safe and reliable application of these models.
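As a rough illustration of the workflow the abstract describes (enumerate simulated scenarios, collect the model's outputs, fit a simpler model to the input-output pairs), here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say which surrogate model they use, so ordinary least squares is an assumption here, and `mock_llm_risk`, the variables `age`, `smoker`, and `group`, and the numeric weights are all hypothetical stand-ins for a real prompted LLM.

```python
import itertools


def mock_llm_risk(age, smoker, group):
    """Hypothetical stand-in for an LLM call.

    A real study would build a prompt from the scenario, query the model,
    and parse a numeric prediction from its answer. The weights below are
    invented purely so the sketch runs offline.
    """
    return 0.01 * age + 0.2 * smoker + 0.05 * group


def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_surrogate(rows, outputs):
    """Fit an ordinary least squares surrogate via the normal equations.

    Assumption: a linear surrogate with an intercept; the source does not
    specify the surrogate's form.
    """
    x = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(x, outputs)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Full factorial grid of simulated scenarios: 6 ages x 2 x 2 = 24 inputs.
scenarios = list(itertools.product(range(30, 81, 10), (0, 1), (0, 1)))
# One "prompt" per scenario; here the mock replaces the actual LLM query.
outputs = [mock_llm_risk(*scenario) for scenario in scenarios]

coefficients = fit_linear_surrogate(scenarios, outputs)
for name, value in zip(("intercept", "age", "smoker", "group"), coefficients):
    print(f"{name}: {value:+.4f}")
```

Because the stand-in is itself linear, the fit recovers its weights; with a real LLM the coefficients would only approximate the model's behavior. A clearly nonzero coefficient on a variable that domain knowledge says should be irrelevant (here the hypothetical `group`) is the kind of red flag the abstract refers to.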