arxiv_cs_lg February 10, 2026

The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models

Translated: 2026/3/15 15:02:37

language-models, geometric-structure, machine-learning, neural-networks, representation-learning

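Translation

arXiv:2602.08159v1 Announce Type: new Abstract: When a language model asserts that "the capital of Australia is Sydney," does it know that this is wrong? We characterize the geometry of correctness representations in 9 models from 5 architecture families. The structure is simple: the discriminative signal occupies 3 to 8 dimensions, performance degrades as the number of dimensions increases, and nonlinear classifiers do not outperform linear separation. Centroid distance in the low-dimensional subspace matches trained probe performance (0.90 AUC), which enables few-shot detection: on GPT-2, 25 labeled samples achieve 89% of full-data accuracy. We perform causal validation through activation steering: the learned direction changes error rates by 10.9 percentage points, whereas random directions show no effect. Internal probes achieve 0.80 to 0.97 AUC, while output-based methods (P(True), semantic entropy) reach only 0.44 to 0.64 AUC. The correctness signal exists internally but is not expressed in outputs. Because centroid distance matches probe performance, class separation is a mean shift, and detection is geometric rather than learned.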

Original Content

arXiv:2602.08159v1 Announce Type: new Abstract: When a language model asserts that "the capital of Australia is Sydney," does it know this is wrong? We characterize the geometry of correctness representations across 9 models from 5 architecture families. The structure is simple: the discriminative signal occupies 3-8 dimensions, performance degrades with additional dimensions, and no nonlinear classifier improves over linear separation. Centroid distance in the low-dimensional subspace matches trained probe performance (0.90 AUC), enabling few-shot detection: on GPT-2, 25 labeled examples achieve 89% of full-data accuracy. We validate causally through activation steering: the learned direction produces 10.9 percentage point changes in error rates while random directions show no effect. Internal probes achieve 0.80-0.97 AUC; output-based methods (P(True), semantic entropy) achieve only 0.44-0.64 AUC. The correctness signal exists internally but is not expressed in outputs. That centroid distance matches probe performance indicates class separation is a mean shift, making detection geometric rather than learned.
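
The centroid-distance idea in the abstract can be illustrated with a minimal sketch. This is not the authors' code: synthetic Gaussian vectors stand in for model activations, the mean shift is placed in 4 of 16 dimensions by assumption, and the names `centroid_direction`, `centroid_score`, and `rank_auc` are hypothetical helpers introduced here. The sketch scores each example by its projection onto the line between the two class centroids, so nothing is trained beyond two class means, and it compares a 25-example estimate with a 1000-example one.

```python
import numpy as np


def centroid_direction(acts, labels):
    """Return the unit vector between class centroids and their midpoint."""
    mu_correct = acts[labels == 1].mean(axis=0)
    mu_wrong = acts[labels == 0].mean(axis=0)
    diff = mu_correct - mu_wrong
    return diff / np.linalg.norm(diff), (mu_correct + mu_wrong) / 2


def centroid_score(acts, direction, midpoint):
    """Signed position along the centroid axis; higher means 'more correct'."""
    return (acts - midpoint) @ direction


def rank_auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U); assumes continuous scores without ties."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Synthetic stand-in for hidden activations (assumption, not data from the paper):
# class 1 differs from class 0 only by a mean shift confined to 4 dimensions.
rng = np.random.default_rng(0)
dim, n = 16, 2000
labels = np.arange(n) % 2            # alternating 0/1 labels, so both classes appear early
shift = np.zeros(dim)
shift[:4] = 1.5
acts = rng.normal(size=(n, dim)) + labels[:, None] * shift

test_acts, test_labels = acts[1000:], labels[1000:]

# Few-shot estimate: centroids from only the first 25 labeled examples.
few_dir, few_mid = centroid_direction(acts[:25], labels[:25])
few_auc = rank_auc(centroid_score(test_acts, few_dir, few_mid), test_labels)

# Full-data estimate: centroids from 1000 labeled examples.
full_dir, full_mid = centroid_direction(acts[:1000], labels[:1000])
full_auc = rank_auc(centroid_score(test_acts, full_dir, full_mid), test_labels)

print(f"few-shot AUC: {few_auc:.3f}, full-data AUC: {full_auc:.3f}")
```

When the two classes really differ only by a mean shift, as in this toy setup, the centroid axis is already close to the best linear separator, which is one way to read the abstract's claim that detection is geometric rather than learned. On real activations the low-dimensional subspace would have to be identified first; that step is omitted here.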