arxiv_cs_cv April 20, 2026

DualTrack: Sensorless 3D Ultrasound needs Local and Global Context

Translated: 2026/4/20 10:50:04

ultrasound, computer-vision, deep-learning, medical-imaging, 3d-reconstruction

Original Content

arXiv:2509.09530v2, Announce Type: replace

Abstract: Three-dimensional ultrasound (US) offers many clinical advantages over conventional 2D imaging, yet its widespread adoption is limited by the cost and complexity of traditional 3D systems. Sensorless 3D US, which uses deep learning to estimate a 3D probe trajectory from a sequence of 2D US images, is a promising alternative. Local features, such as speckle patterns, can help predict frame-to-frame motion, while global features, such as coarse shapes and anatomical structures, can situate the scan relative to anatomy and help predict its general shape. In prior approaches, global features are either ignored or tightly coupled with local feature extraction, restricting the ability to robustly model these two complementary aspects. We propose DualTrack, a novel dual-encoder architecture that leverages decoupled local and global encoders specialized for their respective scales of feature extraction. The local encoder uses dense spatiotemporal convolutions to capture fine-grained features, while the global encoder utilizes an image backbone (e.g., a 2D CNN or foundation model) and temporal attention layers to embed high-level anatomical features and long-range dependencies. A lightweight fusion module then combines these features to estimate the trajectory. Experimental results on a large public benchmark show that DualTrack achieves state-of-the-art accuracy and globally consistent 3D reconstructions, outperforming previous methods and yielding an average reconstruction error below 5 mm.
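
The dual-encoder layout described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: every function name, layer size, and the random weights are hypothetical, the strided downsampling is only a crude stand-in for an image backbone, and the 6-value per-frame output (3 translations plus 3 rotations) is an assumption about how a trajectory might be parameterized. The sketch only shows the data flow: a local branch built on a dense spatiotemporal convolution, a global branch built on per-frame embeddings with temporal self-attention, and a small fusion step that combines the two.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_encoder(frames, kernel):
    """Local branch: one dense spatiotemporal convolution + ReLU + spatial pooling.
    frames: (T, H, W), kernel: (C, kt, kh, kw) with odd kt. Returns (T, C)."""
    pad = kernel.shape[1] // 2
    # Pad only the time axis so that every frame gets a feature vector.
    padded = np.pad(frames, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    # windows: (T, H-kh+1, W-kw+1, kt, kh, kw)
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel.shape[1:])
    feats = np.einsum("thwijk,cijk->thwc", windows, kernel)
    feats = np.maximum(feats, 0.0)       # ReLU
    return feats.mean(axis=(1, 2))       # average over spatial positions

def global_encoder(frames, w_embed, w_q, w_k, w_v):
    """Global branch: coarse per-frame embedding + single-head temporal attention.
    Returns (T, D)."""
    T = frames.shape[0]
    coarse = frames[:, ::4, ::4].reshape(T, -1)  # stand-in for an image backbone
    tokens = np.tanh(coarse @ w_embed)           # (T, D)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[1])       # (T, T): every frame attends to all frames
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability for softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return tokens + attn @ v                     # residual connection

def fuse(local_feats, global_feats, w_out):
    """Fusion: concatenate both feature sets and project to 6 motion values per frame."""
    return np.concatenate([local_feats, global_feats], axis=1) @ w_out

# Toy input: 8 frames of 32x32 noise standing in for a 2D US sweep.
T, H, W, C, D = 8, 32, 32, 4, 16
frames = rng.standard_normal((T, H, W))
kernel = rng.standard_normal((C, 3, 5, 5)) * 0.1
w_embed = rng.standard_normal(((H // 4) * (W // 4), D)) * 0.1
w_q, w_k, w_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
w_out = rng.standard_normal((C + D, 6)) * 0.1

local_feats = local_encoder(frames, kernel)
global_feats = global_encoder(frames, w_embed, w_q, w_k, w_v)
motion = fuse(local_feats, global_feats, w_out)
print(local_feats.shape, global_feats.shape, motion.shape)
```

The point of keeping the two branches separate is that the local branch sees only a short temporal window at full resolution, while the global branch sees the whole sequence at coarse resolution, so neither has to compromise for the other's scale. A real system would train these weights and would likely use a deep learning framework rather than plain NumPy.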