arxiv_cs_cv April 20, 2026

SurgMotion: A Video-Native Foundation Model for General-Purpose Surgical Video Understanding

SurgMotion: A Video-Native Foundation Model for Universal Understanding of Surgical Videos

Translated: 2026/4/20 10:50:53

foundation-model, surgery, video-analysis, v-jepa, self-distillation

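Japanese Translation (rendered in English)

arXiv:2602.05638v3 Announce Type: replace Abstract: Although foundation models have substantially advanced surgical video analysis, current methods rely mainly on pixel-level reconstruction objectives, spending model capacity on low-level visual details such as smoke, specular reflections, and fluid motion while neglecting the semantic structures essential to surgical understanding. We present SurgMotion, a video-native foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. Built on the Video Joint Embedding Predictive Architecture (V-JEPA), SurgMotion introduces three key technical innovations specialized for surgical video: (1) motion-guided latent masked prediction that prioritizes meaningful regions, (2) spatiotemporal affinity self-distillation that enforces relational consistency, and (3) spatiotemporal feature diversity regularization (SFDR) that prevents representation collapse in texture-sparse surgical scenes. To enable large-scale pretraining, we curated SurgMotion-15M, the largest surgical video dataset to date, containing 3,658 hours of video from 50 sources spanning 13 anatomical regions. Extensive experiments on 17 benchmarks show that SurgMotion significantly outperforms state-of-the-art methods in surgical workflow recognition. A 14.6% improvement in F1 score on EgoSurgery and 10.3% on PitVis, 39.54% mAP-IVT for action triplet recognition on CholecT50, and results in skill assessment, polyp segmentation, and depth estimation establish SurgMotion as a new standard for universal, motion-oriented surgical video understanding.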

Original Content

arXiv:2602.05638v3 Announce Type: replace Abstract: While foundation models have advanced surgical video analysis, current approaches rely predominantly on pixel-level reconstruction objectives that waste model capacity on low-level visual details, such as smoke, specular reflections, and fluid motion, rather than semantic structures essential for surgical understanding. We present SurgMotion, a video-native foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. Built on the Video Joint Embedding Predictive Architecture (V-JEPA), SurgMotion introduces three key technical innovations tailored to surgical videos: (1) motion-guided latent masked prediction to prioritize semantically meaningful regions, (2) spatiotemporal affinity self-distillation to enforce relational consistency, and (3) spatiotemporal feature diversity regularization (SFDR) to prevent representation collapse in texture-sparse surgical scenes. To enable large-scale pretraining, we curate SurgMotion-15M, the largest surgical video dataset to date, comprising 3,658 hours of video from 50 sources across 13 anatomical regions. Extensive experiments across 17 benchmarks demonstrate that SurgMotion significantly outperforms state-of-the-art methods on surgical workflow recognition, achieving 14.6 percent improvement in F1 score on EgoSurgery and 10.3 percent on PitVis; on action triplet recognition with 39.54 percent mAP-IVT on CholecT50; as well as on skill assessment, polyp segmentation, and depth estimation. These results establish SurgMotion as a new standard for universal, motion-oriented surgical video understanding.
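
The abstract names motion-guided latent masked prediction but does not say how motion is measured or how masks are drawn. As a rough illustration only, the sketch below assumes motion is scored by mean absolute frame differencing per patch, and that patches to mask are sampled without replacement with probability increasing in that score. The function names, the `mask_ratio` and `temperature` parameters, and the frame-differencing score are all hypothetical choices for this sketch, not details taken from the paper.

```python
import math
import random


def patch_motion_scores(frames, patch):
    """Mean absolute frame-to-frame difference per spatial patch.

    frames: list of T grayscale frames (T >= 2), each an H x W nested list
    of floats. Pixels beyond the last full patch are ignored.
    ASSUMPTION: frame differencing stands in for whatever motion cue the
    paper actually uses.
    """
    rows, cols = len(frames[0]) // patch, len(frames[0][0]) // patch
    totals = [[0.0] * cols for _ in range(rows)]
    for prev, cur in zip(frames, frames[1:]):
        for y in range(rows * patch):
            for x in range(cols * patch):
                totals[y // patch][x // patch] += abs(cur[y][x] - prev[y][x])
    norm = (len(frames) - 1) * patch * patch
    return [[t / norm for t in row] for row in totals]


def motion_guided_mask(scores, mask_ratio=0.5, temperature=1.0, rng=None):
    """Pick patches to mask, favouring patches with high motion scores.

    ASSUMPTION: softmax weighting and the sampling scheme are illustrative.
    Returns a nested list of booleans with the same shape as `scores`.
    """
    rng = rng or random.Random()
    cols = len(scores[0])
    flat = [s for row in scores for s in row]
    top = max(flat)
    # Softmax-style weights; the floor avoids division by zero on underflow.
    weights = [max(math.exp((s - top) / temperature), 1e-12) for s in flat]
    n_mask = round(mask_ratio * len(flat))
    # Weighted sampling without replacement (Efraimidis-Spirakis):
    # draw key = log(u) / w per patch and keep the n_mask largest keys.
    keys = [math.log(1.0 - rng.random()) / w for w in weights]
    order = sorted(range(len(flat)), key=keys.__getitem__, reverse=True)
    chosen = set(order[:n_mask])
    return [[(r * cols + c) in chosen for c in range(cols)]
            for r in range(len(scores))]


# Usage: three 8x8 frames where only the top-left 4x4 patch changes.
frames = [[[0.0] * 8 for _ in range(8)] for _ in range(3)]
for y in range(4):
    for x in range(4):
        frames[1][y][x] = 1.0
scores = patch_motion_scores(frames, patch=4)
mask = motion_guided_mask(scores, mask_ratio=0.5, temperature=0.1,
                          rng=random.Random(0))
```

In a V-JEPA-style setup the masked patches would be the ones whose latent representations the predictor must infer from the visible context. Lowering `temperature` concentrates the mask on moving regions such as instruments, while raising it approaches uniform random masking.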