arxiv_cs_cv 2026/4/24

Fine-Grained Lumbar Spine Segmentation via Anatomy-Aware Text-Visual Fusion and Dual-Perspective Prompts

Anatomy-Aware Text-Visual Fusion with Dual-Perspective Prompts for Fine-Grained Lumbar Spine Segmentation

Translated: 2026/4/24 19:49:17

medical-imaging spine-segmentation multimodal-fusion contrastive-learning deep-vision

Japanese Translation

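arXiv:2504.03476v2
Announce Type: replace

Abstract: Accurate lumbar spine segmentation is essential for diagnosing spinal disorders. Existing methods use coarse-grained segmentation strategies that lack the fine detail needed for accurate diagnosis, and because they rely on visual information alone, they struggle to capture anatomical semantics. This leads to classification errors and degraded segmentation detail. To address these limitations, we propose ATM-Net, a novel framework for detailed segmentation of lumbar substructures, namely vertebrae (VBs), intervertebral discs (IDs), and the spinal canal (SC). ATM-Net employs an anatomy-aware, text-guided multimodal fusion mechanism. It uses the Anatomy-aware Text Prompt Generator (ATPG) to adaptively convert image annotations into anatomy-aware prompts in different views. These insights are integrated with image features through the Holistic Anatomy-aware Semantic Fusion (HASF) module to build a comprehensive anatomical context. In addition, the Channel-wise Contrastive Anatomy-Aware Enhancement (CCAE) module improves class discrimination and refines segmentation through class-wise, channel-level multimodal contrastive learning. Extensive experiments on the MRSpineSeg and SPIDER datasets show that ATM-Net significantly outperforms state-of-the-art methods, with consistent improvements in class discrimination and segmentation detail. For example, ATM-Net achieves a Dice score of 79.39% and an HD95 of 9.91 pixels on the SPIDER dataset, outperforming the competitive SpineParseNet by 8.31% and 4.14 pixels, respectively.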

Original Content

arXiv:2504.03476v2
Announce Type: replace

Abstract: Accurate lumbar spine segmentation is crucial for diagnosing spinal disorders. Existing methods typically use coarse-grained segmentation strategies that lack the fine detail needed for precise diagnosis. Additionally, their reliance on visual-only models hinders the capture of anatomical semantics, leading to misclassified categories and poor segmentation details. To address these limitations, we present ATM-Net, an innovative framework that employs an anatomy-aware, text-guided, multi-modal fusion mechanism for fine-grained segmentation of lumbar substructures, i.e., vertebrae (VBs), intervertebral discs (IDs), and spinal canal (SC). ATM-Net adopts the Anatomy-aware Text Prompt Generator (ATPG) to adaptively convert image annotations into anatomy-aware prompts in different views. These insights are further integrated with image features via the Holistic Anatomy-aware Semantic Fusion (HASF) module, building a comprehensive anatomical context. The Channel-wise Contrastive Anatomy-Aware Enhancement (CCAE) module further enhances class discrimination and refines segmentation through class-wise channel-level multi-modal contrastive learning. Extensive experiments on the MRSpineSeg and SPIDER datasets demonstrate that ATM-Net significantly outperforms state-of-the-art methods, with consistent improvements regarding class discrimination and segmentation details. For example, ATM-Net achieves Dice of 79.39% and HD95 of 9.91 pixels on SPIDER, outperforming the competitive SpineParseNet by 8.31% and 4.14 pixels, respectively.
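
The abstract reports results as a Dice score over fine-grained classes. As a rough illustration of how such a per-class overlap score is typically computed, here is a minimal pure-Python sketch. It is not the authors' evaluation code: the function name `dice_per_class`, the integer label scheme, the toy label maps, and the convention for classes absent from both maps are all assumptions made for this example.

```python
from typing import Dict, Iterable, Sequence


def dice_per_class(
    pred: Sequence[int], target: Sequence[int], classes: Iterable[int]
) -> Dict[int, float]:
    """Dice = 2|P ∩ T| / (|P| + |T|) per class, over flattened label maps."""
    if len(pred) != len(target):
        raise ValueError("pred and target must contain the same number of pixels")
    scores: Dict[int, float] = {}
    for c in classes:
        n_pred = sum(1 for v in pred if v == c)      # pixels predicted as class c
        n_true = sum(1 for v in target if v == c)    # pixels annotated as class c
        n_both = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Assumed convention: a class absent from both maps counts as a perfect match.
        scores[c] = 1.0 if n_pred + n_true == 0 else 2.0 * n_both / (n_pred + n_true)
    return scores


# Hypothetical label scheme: 0 = background, 1 = vertebra, 2 = disc, 3 = spinal canal.
target = [0, 1, 1, 2, 2, 3, 3, 0]
pred = [0, 1, 1, 2, 3, 3, 3, 0]

scores = dice_per_class(pred, target, classes=[1, 2, 3])
mean_dice = sum(scores.values()) / len(scores)
print(scores, round(mean_dice, 4))
```

In this toy case one disc pixel is mislabeled as spinal canal, which lowers the Dice of both classes while the vertebra class stays at 1.0. The abstract does not say how per-class scores are aggregated into the reported 79.39%, so the unweighted mean used here is only one possible choice.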