arxiv_cs_gr April 17, 2026

Counterfactual Peptide Editing for Causal TCR--pMHC Binding Inference

Translated: 2026/4/17 3:27:06

immune-system, machine-learning, bioinformatics, tcr, causal-inference

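Translated Abstract

arXiv:2604.13256v1 Announce Type: cross Abstract: Neural models for TCR-pMHC binding prediction are prone to shortcut learning: they exploit spurious correlations in the training data (for example, peptide length bias or V-gene co-occurrence) and do not learn the actual physical binding interface. As a result, predictions become brittle under family-held-out and distance-aware evaluation, where these shortcuts do not apply. We propose Counterfactual Invariant Prediction (CIP), a training framework that generates biologically constrained counterfactual peptide edits, enforces invariance to edits at non-anchor positions, and at the same time amplifies sensitivity to MHC anchor residues. CIP adds two auxiliary objectives to the base classifier: (1) an invariance loss that penalizes prediction changes under conservative non-anchor substitutions, and (2) a contrastive loss that encourages prediction changes under anchor-position disruptions. Evaluated on a curated VDJdb-IEDB benchmark under family-held-out, distance-aware, and random splits, CIP reaches AUROC 0.831 and counterfactual consistency (CFC) 0.724 under the challenging family-held-out protocol, and its shortcut index is 39.7% lower than that of the unconstrained baseline. Ablation studies confirm that anchor-aware edit generation is the main driver of the OOD (out-of-distribution) gains, providing a practical recipe for causally grounded modeling of TCR specificity.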
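The two kinds of counterfactual edit described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the residue groups in `CONSERVATIVE_GROUPS`, the helper names, and the choice of position 2 and the C-terminus as anchors (a common pattern for MHC class I alleles) are all hypothetical, since the abstract does not specify them.

```python
import random

# Hypothetical physicochemical groups used to define "conservative" swaps.
# The abstract does not state the actual constraints used by CIP.
CONSERVATIVE_GROUPS = [
    "AVLIM",  # aliphatic / hydrophobic
    "FWY",    # aromatic
    "ST",     # small polar
    "NQ",     # amide
    "DE",     # acidic
    "KRH",    # basic
]
GROUP_OF = {aa: group for group in CONSERVATIVE_GROUPS for aa in group}
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def anchor_positions(peptide):
    """Assumed anchors: position 2 and the C-terminus (0-based 1 and len-1)."""
    return {1, len(peptide) - 1}


def non_anchor_edit(peptide, rng):
    """Replace one non-anchor residue with another member of its own group."""
    anchors = anchor_positions(peptide)
    candidates = [
        i for i, aa in enumerate(peptide)
        if i not in anchors and len(GROUP_OF.get(aa, aa)) > 1
    ]
    if not candidates:
        return peptide  # no conservative swap available
    i = rng.choice(candidates)
    new = rng.choice([a for a in GROUP_OF[peptide[i]] if a != peptide[i]])
    return peptide[:i] + new + peptide[i + 1:]


def anchor_disruption(peptide, rng):
    """Replace one anchor residue with a residue from a different group."""
    i = rng.choice(sorted(anchor_positions(peptide)))
    own_group = GROUP_OF.get(peptide[i], peptide[i])
    new = rng.choice([a for a in AMINO_ACIDS if a not in own_group])
    return peptide[:i] + new + peptide[i + 1:]
```

Calling `non_anchor_edit("GILGFVFTL", random.Random(0))` returns a peptide that differs from the input at exactly one non-anchor position, while `anchor_disruption` changes exactly one of the two assumed anchor positions. Passing an explicit `random.Random` instance keeps the generated edits reproducible.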

Original Content

arXiv:2604.13256v1 Announce Type: cross Abstract: Neural models for TCR-pMHC binding prediction are susceptible to shortcut learning: they exploit spurious correlations in training data, such as peptide length bias or V-gene co-occurrence, rather than the physical binding interface. This renders predictions brittle under family-held-out and distance-aware evaluation, where such shortcuts do not transfer. We introduce Counterfactual Invariant Prediction (CIP), a training framework that generates biologically constrained counterfactual peptide edits and enforces invariance to edits at non-anchor positions while amplifying sensitivity at MHC anchor residues. CIP augments the base classifier with two auxiliary objectives: (1) an invariance loss penalizing prediction changes under conservative non-anchor substitutions, and (2) a contrastive loss encouraging large prediction changes under anchor-position disruptions. Evaluated on a curated VDJdb-IEDB benchmark under family-held-out, distance-aware, and random splits, CIP achieves AUROC 0.831 and counterfactual consistency (CFC) 0.724 under the challenging family-held-out protocol, with a 39.7% reduction in shortcut index relative to the unconstrained baseline. Ablations confirm that anchor-aware edit generation is the dominant driver of OOD gains, providing a practical recipe for causally-grounded TCR specificity modeling.
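One way the combined objective could look is sketched below in plain Python. The abstract names the two auxiliary losses but gives no formulas, so everything here is an assumption: the squared-difference form of the invariance term, the hinge form of the contrastive term, the `margin` value, the weights `lam_inv` and `lam_con`, and all function names are hypothetical.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def invariance_loss(p_orig, p_edit):
    """Assumed form: squared change under a conservative non-anchor edit."""
    return (p_orig - p_edit) ** 2


def contrastive_loss(p_orig, p_disrupt, margin=0.5):
    """Assumed form: hinge that is zero once the anchor edit moves the
    prediction by at least `margin`."""
    return max(0.0, margin - abs(p_orig - p_disrupt))


def cip_loss(batch, lam_inv=1.0, lam_con=1.0, margin=0.5):
    """Mean loss over a batch of
    (p_orig, p_non_anchor_edit, p_anchor_disrupt, label) tuples."""
    total = 0.0
    for p, p_inv, p_dis, y in batch:
        total += (
            bce(p, y)
            + lam_inv * invariance_loss(p, p_inv)
            + lam_con * contrastive_loss(p, p_dis, margin)
        )
    return total / len(batch)
```

Under this sketch, a model whose score is unchanged by a non-anchor edit and drops sharply after an anchor disruption, for example `(0.9, 0.9, 0.1, 1)`, pays only the classification term. A shortcut-like model that reacts to the non-anchor edit but ignores the anchor disruption, for example `(0.9, 0.5, 0.9, 1)`, is penalized by both auxiliary terms.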