arxiv_cs_cv 2026/2/10

LLM-Guided Diagnostic Evidence Alignment for Medical Vision-Language Pretraining under Limited Pairing

Translated: 2026/3/15 18:04:00

llm, vision-language-pretraining, medical-ai, clip, self-supervised-learning

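Translation

arXiv:2602.07540v1 Announce Type: new Abstract (translated from the Japanese): Existing CLIP-style medical vision-language pretraining methods use global or local alignment that depends on large amounts of paired data. However, global alignment is easily dominated by non-diagnostic information, and local alignment fails to integrate key diagnostic evidence. As a result, learning reliable diagnostic representations becomes difficult, which limits the applicability of these methods in medical scenarios where paired data is limited. To address this problem, we propose the LLM-Guided Diagnostic Evidence Alignment method (LGDEA), which shifts the pretraining objective toward evidence-level alignment that is more consistent with the medical diagnostic process. Specifically, we use LLMs to extract key diagnostic evidence from radiology reports and build a shared diagnostic evidence space. This enables evidence-aware cross-modal alignment and lets LGDEA make effective use of abundant unpaired medical images and reports, substantially reducing the dependence on paired data. Extensive experiments show that the method achieves consistent and significant improvements on phrase grounding, image-text retrieval, and zero-shot classification, and that it rivals pretraining methods that rely on large amounts of paired data.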

Original Content

arXiv:2602.07540v1 Announce Type: new Abstract: Most existing CLIP-style medical vision-language pretraining methods rely on global or local alignment with substantial paired data. However, global alignment is easily dominated by non-diagnostic information, while local alignment fails to integrate key diagnostic evidence. As a result, learning reliable diagnostic representations becomes difficult, which limits their applicability in medical scenarios with limited paired data. To address this issue, we propose an LLM-Guided Diagnostic Evidence Alignment method (LGDEA), which shifts the pretraining objective toward evidence-level alignment that is more consistent with the medical diagnostic process. Specifically, we leverage LLMs to extract key diagnostic evidence from radiology reports and construct a shared diagnostic evidence space, enabling evidence-aware cross-modal alignment and allowing LGDEA to effectively exploit abundant unpaired medical images and reports, thereby substantially alleviating the reliance on paired data. Extensive experimental results demonstrate that our method achieves consistent and significant improvements on phrase grounding, image-text retrieval, and zero-shot classification, and even rivals pretraining methods that rely on substantial paired data.
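
The abstract does not spell out the LGDEA objective, so the following is only an illustrative sketch of what evidence-level alignment over unpaired data could look like, not the authors' implementation. It assumes that every image and every report has already been mapped to a set of evidence labels in a shared vocabulary (for reports, by an LLM; how images obtain their labels is left open here). The function names, the Jaccard-overlap targets, and the temperature of 0.07 are all hypothetical choices made for this example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def evidence_targets(image_evidence, report_evidence):
    """Soft image-to-report targets from Jaccard overlap of evidence sets.

    Hypothetical stand-in for a "shared diagnostic evidence space": each row
    is normalised to sum to 1, or left all-zero if nothing overlaps.
    """
    targets = []
    for img_ev in image_evidence:
        row = []
        for rep_ev in report_evidence:
            union = img_ev | rep_ev
            row.append(len(img_ev & rep_ev) / len(union) if union else 0.0)
        total = sum(row)
        targets.append([x / total for x in row] if total > 0 else row)
    return targets


def soft_alignment_loss(image_emb, report_emb, targets, temperature=0.07):
    """Cross-entropy between evidence targets and softmax cosine similarities.

    Only the image-to-report direction is computed. Rows whose targets are
    all zero (no shared evidence with any report) are skipped.
    """
    image_emb = [l2_normalize(v) for v in image_emb]
    report_emb = [l2_normalize(v) for v in report_emb]
    loss, used = 0.0, 0
    for img, target_row in zip(image_emb, targets):
        if sum(target_row) == 0:
            continue
        logits = [sum(a * b for a, b in zip(img, rep)) / temperature
                  for rep in report_emb]
        probs = softmax(logits)
        loss -= sum(t * math.log(p + 1e-12) for t, p in zip(target_row, probs))
        used += 1
    return loss / used if used else 0.0


if __name__ == "__main__":
    # Toy, made-up evidence sets; images and reports are NOT paired.
    image_evidence = [{"pleural effusion", "cardiomegaly"}, {"pneumothorax"}]
    report_evidence = [{"pneumothorax"}, {"pleural effusion"},
                       {"cardiomegaly", "pleural effusion"}]
    targets = evidence_targets(image_evidence, report_evidence)
    image_emb = [[1.0, 0.0], [0.0, 1.0]]
    report_emb = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
    print(soft_alignment_loss(image_emb, report_emb, targets))
```

When the batch is paired and the targets are one-hot on the diagonal, this loss reduces to the image-to-text half of the standard CLIP contrastive objective. The soft targets are what would let unpaired images and reports contribute a training signal in this sketch.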