arxiv_cs_cv February 10, 2026

Scalable Adaptation of 3D Geometric Foundation Models via Weak Supervision from Internet Video

Translated: 2026/3/15 19:02:42

3d-reconstruction, foundation-models, weak-supervision, gaussian-splatting, self-supervised-learning

Original Content

arXiv:2602.07891v1 Announce Type: new

Abstract: Geometric foundation models show promise in 3D reconstruction, yet their progress is severely constrained by the scarcity of diverse, large-scale 3D annotations. While Internet videos offer virtually unlimited raw data, utilizing them as a scaling source for geometric learning is challenging due to the absence of ground-truth geometry and the presence of observational noise. To address this, we propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams. SAGE leverages a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision: (1) Informative training trajectory selection; (2) Sparse Geometric Anchoring via SfM point clouds for global structural guidance; and (3) Dense Differentiable Consistency via 3D Gaussian rendering for multi-view constraints. To prevent catastrophic forgetting, we introduce a regularization strategy using anchor data. Extensive experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20-42% on unseen benchmarks (7Scenes, TUM-RGBD, Matterport3D) compared to state-of-the-art baselines. To our knowledge, SAGE pioneers the adaptation of geometric foundation models via Internet video, establishing a scalable paradigm for general-purpose 3D learning.
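The headline result is stated as a relative reduction in Chamfer Distance, so a small illustration of that metric may help. The abstract does not say which variant the authors use (squared or unsquared distances, summed or averaged directions), so the sketch below assumes one common convention: the average of the two directional mean nearest-neighbour Euclidean distances. The names `chamfer_distance` and `relative_reduction` are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence

Point = Sequence[float]


def _mean_nearest_distance(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Average Euclidean distance from each point in src to its nearest point in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(pred: Sequence[Point], ref: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two point sets.

    Assumed convention (not confirmed by the abstract): unsquared distances,
    with the two directional terms averaged.
    """
    if not pred or not ref:
        raise ValueError("both point sets must be non-empty")
    return 0.5 * (_mean_nearest_distance(pred, ref) + _mean_nearest_distance(ref, pred))


def relative_reduction(baseline: float, adapted: float) -> float:
    """Fractional improvement of `adapted` over `baseline` (0.2 means 20% lower)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - adapted) / baseline
```

Under this convention, a reported 20-42% reduction would correspond to `relative_reduction(baseline_cd, adapted_cd)` falling between 0.20 and 0.42 on a given benchmark. The brute-force nearest-neighbour search here is quadratic in the number of points; real evaluations on dense reconstructions would normally use a spatial index such as a k-d tree.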