arxiv_cs_cv February 10, 2026

XAI-CLIP: ROI-Guided Perturbation Framework for Explainable Medical Image Segmentation in Multimodal Vision-Language Models

Original Content

arXiv:2602.07017v1 Announce Type: new Abstract: Medical image segmentation is a critical component of clinical workflows, enabling accurate diagnosis, treatment planning, and disease monitoring. However, despite the superior performance of transformer-based models over convolutional architectures, their limited interpretability remains a major obstacle to clinical trust and deployment. Existing explainable artificial intelligence (XAI) techniques, including gradient-based saliency methods and perturbation-based approaches, are often computationally expensive, require numerous forward passes, and frequently produce noisy or anatomically irrelevant explanations. To address these limitations, we propose XAI-CLIP, an ROI-guided perturbation framework that leverages multimodal vision-language model embeddings to localize clinically meaningful anatomical regions and guide the explanation process. By integrating language-informed region localization with medical image segmentation and applying targeted, region-aware perturbations, the proposed method generates clearer, boundary-aware saliency maps while substantially reducing computational overhead. Experiments conducted on the FLARE22 and CHAOS datasets demonstrate that XAI-CLIP achieves up to a 60% reduction in runtime, a 44.6% improvement in Dice score, and a 96.7% increase in Intersection-over-Union for occlusion-based explanations compared to conventional perturbation methods. Qualitative results further confirm cleaner and more anatomically consistent attribution maps with fewer artifacts, highlighting that the incorporation of multimodal vision-language representations into perturbation-based XAI frameworks significantly enhances both interpretability and efficiency, thereby enabling transparent and clinically deployable medical image segmentation systems.
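
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of region-restricted occlusion, not the authors' method. It assumes a toy threshold "segmenter" in place of a transformer model and a hand-supplied binary ROI mask in place of the vision-language localization step. All names (`toy_segmenter`, `occlusion_saliency`, `roi`) are hypothetical, and plain Python lists are used to keep the example dependency-free. It illustrates why skipping patches outside the ROI reduces the number of forward passes, and how Dice and Intersection-over-Union can be computed between two binary masks.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[float]]
Mask = List[List[int]]


def dice(a: Mask, b: Mask) -> float:
    """Dice coefficient between two binary masks (1.0 if both are empty)."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    total = sum(map(sum, a)) + sum(map(sum, b))
    return 2.0 * inter / total if total else 1.0


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-Union between two binary masks (1.0 if both are empty)."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 1.0


def toy_segmenter(image: Grid) -> Mask:
    """Hypothetical stand-in for a segmentation model: threshold at 0.5."""
    return [[1 if v > 0.5 else 0 for v in row] for row in image]


def occlusion_saliency(
    image: Grid,
    model: Callable[[Grid], Mask],
    patch: int,
    roi: Optional[Mask] = None,
) -> Tuple[Grid, int]:
    """Occlude non-overlapping patches and record the Dice drop per patch.

    If `roi` is given, patches that do not overlap it are skipped entirely,
    so no forward pass is spent on them. Returns (saliency map, pass count).
    """
    h, w = len(image), len(image[0])
    baseline = model(image)
    passes = 1  # the unperturbed baseline prediction
    saliency = [[0.0] * w for _ in range(h)]
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            cells = [
                (r, c)
                for r in range(top, min(top + patch, h))
                for c in range(left, min(left + patch, w))
            ]
            if roi is not None and not any(roi[r][c] for r, c in cells):
                continue  # patch lies outside the ROI
            occluded = [row[:] for row in image]
            for r, c in cells:
                occluded[r][c] = 0.0  # zero-fill occlusion (an assumption)
            drop = 1.0 - dice(baseline, model(occluded))
            passes += 1
            for r, c in cells:
                saliency[r][c] = drop
    return saliency, passes


# Toy 8x8 "scan" with a 4x4 bright structure in the centre.
image = [[1.0 if 2 <= r < 6 and 2 <= c < 6 else 0.0 for c in range(8)] for r in range(8)]
organ = toy_segmenter(image)
roi = organ  # assumed ROI; in the paper it would come from the vision-language model

full_map, full_passes = occlusion_saliency(image, toy_segmenter, patch=2)
roi_map, roi_passes = occlusion_saliency(image, toy_segmenter, patch=2, roi=roi)

# Binarise the saliency map and compare it with the reference mask.
sal_mask = [[1 if v > 0 else 0 for v in row] for row in roi_map]
print(full_passes, roi_passes)  # 17 vs 5 forward passes in this toy setup
print(dice(sal_mask, organ), iou(sal_mask, organ))
```

In this toy setup the ROI-restricted run yields the same saliency map as the exhaustive one while evaluating far fewer patches. With a real model and a coarser or imperfect ROI the two maps would generally differ, and the saving would depend on how much of the image the ROI covers.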