arxiv_cs_cv April 24, 2026

Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation

Translated: 2026/4/24 19:46:03

visual-geometry-estimation, 3d-reconstruction, machine-learning, depth-estimation, camerapose

Japanese Translation

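arXiv:2604.21713v1 Announce Type: new Abstract: Feed-forward visual geometry estimation has made rapid progress recently. However, an important challenge remains: multi-frame models usually produce cross-frame consistency, but they often fall behind strong single-frame methods in single-frame accuracy. This observation motivates our systematic exploration, through rigorous ablation studies, of the critical factors that drive model performance, which reveals several important insights: 1) Scaling data diversity and quality yields further performance gains even for state-of-the-art visual geometry estimation methods; 2) The commonly adopted confidence-aware loss and gradient-based loss mechanisms can unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, whereas local region alignment, surprisingly, degrades performance. Furthermore, to combine the strengths of optimization-based methods and high-resolution inputs, we propose a consistency loss function that enforces alignment among depth maps, camera parameters, and point maps, and an efficient architectural design that makes use of high-resolution information. We integrate these designs into CARVE, a resolution-enhanced model for feed-forward visual geometry estimation. Experiments on point cloud reconstruction, video depth estimation, and camera pose and intrinsic parameter estimation demonstrate that CARVE delivers strong and stable performance across diverse benchmarks.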

Original Content

arXiv:2604.21713v1 Announce Type: new Abstract: Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals several key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that leverages high-resolution information. We integrate these designs into CARVE, a resolution-enhanced model for feed-forward visual geometry estimation. Experiments on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE achieves strong and robust performance across diverse benchmarks.
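
The abstract names a consistency loss that enforces alignment between depth maps, camera parameters, and point maps, but does not give its form. One plausible reading, sketched below under stated assumptions, is to unproject the predicted depth through the predicted camera and penalize its distance to the predicted point map. The pinhole model, the camera-to-world pose convention, the L1 penalty, and the function names `unproject_depth` and `consistency_loss` are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def unproject_depth(depth, K, cam_to_world):
    """Lift a depth map (H, W) to world-space points (H, W, 3).

    Assumes a pinhole camera with 3x3 intrinsics K and a 4x4
    camera-to-world pose (both conventions are assumptions here).
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)
    # Back-project pixels to camera-frame rays with z = 1.
    rays = pix @ np.linalg.inv(K).T
    pts_cam = rays * depth[..., None]
    # Move the points from the camera frame to the world frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t


def consistency_loss(depth, K, cam_to_world, point_map):
    """Mean L1 gap between unprojected depth and a predicted point map."""
    pts = unproject_depth(depth, K, cam_to_world)
    return float(np.mean(np.abs(pts - point_map)))
```

In this sketch the loss is zero exactly when the three predictions describe the same geometry, so any disagreement among the depth, camera, and point-map outputs shows up as a nonzero penalty. A training setup would use a differentiable framework rather than NumPy; the arithmetic would be the same.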