arxiv_cs_cv April 24, 2026

Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos

capsule-endoscopy, video-summarization, medical-imaging, computer-vision, diagnosis

Original Content

arXiv:2604.21814v1 Announce Type: new Abstract: Capsule endoscopy (CE) enables non-invasive gastrointestinal screening, but current CE research remains largely limited to frame-level classification and detection, leaving video-level analysis underexplored. To bridge this gap, we introduce and formally define a new task, diagnosis-driven CE video summarization, which requires extracting key evidence frames that cover clinically meaningful findings and making accurate diagnoses from those evidence frames. This setting is challenging because diagnostically relevant events are extremely sparse and can be overwhelmed by tens of thousands of redundant normal frames, while individual observations are often ambiguous due to motion blur, debris, specular highlights, and rapid viewpoint changes. To facilitate research in this direction, we introduce VideoCAP, the first CE dataset with diagnosis-driven annotations derived from real clinical reports. VideoCAP comprises 240 full-length videos and provides realistic supervision for both key evidence frame extraction and diagnosis. To address this task, we further propose DiCE, a clinician-inspired framework that mirrors the standard CE reading workflow. DiCE first performs efficient candidate screening over the raw video, then uses a Context Weaver to organize candidates into coherent diagnostic contexts that preserve distinct lesion events, and an Evidence Converger to aggregate multi-frame evidence within each context into robust clip-level judgments. Experiments show that DiCE consistently outperforms state-of-the-art methods, producing concise and clinically reliable diagnostic summaries. These results highlight diagnosis-driven contextual reasoning as a promising paradigm for ultra-long CE video summarization.
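The three-stage flow described in the abstract (candidate screening, grouping into contexts, clip-level aggregation) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the Context Weaver or Evidence Converger work, so the score threshold, the temporal-gap grouping rule, and the probability-summing vote below are all stand-in assumptions. The function names, parameters, and the labels "ulcer" and "polyp" are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Context:
    """A group of candidate frames assumed to show one lesion event."""
    frame_ids: List[int]


def screen_candidates(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Stage 1 (assumed rule): keep frames whose abnormality score reaches the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


def weave_contexts(candidates: Sequence[int], max_gap: int = 30) -> List[Context]:
    """Stage 2 (assumed stand-in): merge candidates at most max_gap frames apart."""
    contexts: List[Context] = []
    for idx in sorted(candidates):
        if contexts and idx - contexts[-1].frame_ids[-1] <= max_gap:
            contexts[-1].frame_ids.append(idx)  # same event continues
        else:
            contexts.append(Context([idx]))  # gap too large: start a new event
    return contexts


def converge_evidence(context: Context, frame_probs: Dict[int, Dict[str, float]]) -> str:
    """Stage 3 (assumed stand-in): sum per-frame class probabilities, return the top label."""
    totals: Dict[str, float] = {}
    for fid in context.frame_ids:
        for label, p in frame_probs[fid].items():
            totals[label] = totals.get(label, 0.0) + p
    return max(totals, key=totals.get)


# Toy input: one abnormality score per frame, plus hypothetical per-frame class
# probabilities for the frames that pass screening.
scores = [0.1, 0.9, 0.8, 0.2, 0.1, 0.7]
probs = {
    1: {"ulcer": 0.7, "polyp": 0.3},
    2: {"ulcer": 0.4, "polyp": 0.6},
    5: {"ulcer": 0.2, "polyp": 0.8},
}

candidates = screen_candidates(scores)
contexts = weave_contexts(candidates, max_gap=2)
summary = [(c.frame_ids, converge_evidence(c, probs)) for c in contexts]
print(summary)
```

In this toy run, frames 1 and 2 fall into one context and frame 5 into another, so two separate events are kept. Frame 2 alone would favor "polyp", but the pooled evidence for its context favors "ulcer", which is the kind of multi-frame smoothing the abstract attributes to clip-level judgments.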