arxiv_cs_cv April 24, 2026

DepthMaster: Improving Monocular Depth Estimation by Taming Diffusion Models

DepthMaster: Taming Diffusion Models for Monocular Depth Estimation

Translated: 2026/4/24 19:49:01

depth-estimation, diffusion-models, monocular-depth, computer-vision, neural-networks

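Translation (from Japanese)

arXiv:2501.02576v2 Announce Type: replace

Abstract: Monocular depth estimation within the diffusion-denoising paradigm shows remarkable generalization ability but suffers from slow inference. Recent methods adopt a single-step deterministic paradigm that improves inference efficiency while maintaining comparable performance, but they ignore the gap between generative and discriminative features, which leads to suboptimal results. In this paper, we propose DepthMaster, a single-step diffusion model designed to adapt generative features to the monocular depth estimation task. First, to mitigate the overfitting to texture details introduced by generative features, we propose a Feature Alignment module that incorporates high-quality semantic features to improve the representation capability of the denoising network. Second, to address the lack of fine-grained details in the single-step deterministic framework, we propose a Fourier Enhancement module that adaptively balances low-frequency structure and high-frequency details. To make full use of the two modules, we adopt a two-stage training strategy. The first stage focuses on learning the global scene structure with the Feature Alignment module, and the second stage uses the Fourier Enhancement module to improve visual quality. Through these efforts, our model achieves state-of-the-art performance in generalization and detail preservation, outperforming other diffusion-based methods on various datasets. The project page is at https://indu1ge.github.io/DepthMaster_page.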

Original Content

arXiv:2501.02576v2 Announce Type: replace

Abstract: Monocular depth estimation within the diffusion-denoising paradigm demonstrates impressive generalization ability but suffers from low inference speed. Recent methods adopt a single-step deterministic paradigm to improve inference efficiency while maintaining comparable performance. However, they overlook the gap between generative and discriminative features, leading to suboptimal results. In this work, we propose DepthMaster, a single-step diffusion model designed to adapt generative features for the discriminative depth estimation task. First, to mitigate overfitting to texture details introduced by generative features, we propose a Feature Alignment module, which incorporates high-quality semantic features to enhance the denoising network's representation capability. Second, to address the lack of fine-grained details in the single-step deterministic framework, we propose a Fourier Enhancement module to adaptively balance low-frequency structure and high-frequency details. We adopt a two-stage training strategy to fully leverage the potential of the two modules. In the first stage, we focus on learning the global scene structure with the Feature Alignment module, while in the second stage, we exploit the Fourier Enhancement module to improve the visual quality. Through these efforts, our model achieves state-of-the-art performance in terms of generalization and detail preservation, outperforming other diffusion-based methods across various datasets. Our project page can be found at https://indu1ge.github.io/DepthMaster_page.
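
The abstract does not say how the Fourier Enhancement module is built, so the following is not the authors' implementation. It is only a minimal sketch of what "balancing low-frequency structure and high-frequency details" can mean: one scanline of a toy depth map is transformed with a naive DFT, its low and high bands are scaled separately, and the result is transformed back. The function names and the `cutoff`, `low_gain`, and `high_gain` parameters are hypothetical, and the fixed gains stand in for weights that a learned module would presumably predict.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns complex samples."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def reweight_bands(signal, cutoff, low_gain, high_gain):
    """Scale the low- and high-frequency components of a 1D signal separately.

    cutoff, low_gain and high_gain are hypothetical fixed parameters used for
    illustration only; they are not taken from the paper.
    """
    n = len(signal)
    spectrum = dft(signal)
    weighted = []
    for k, coeff in enumerate(spectrum):
        # Bins k and n - k hold the same frequency, so treat them alike.
        freq = min(k, n - k)
        gain = low_gain if freq <= cutoff else high_gain
        weighted.append(coeff * gain)
    # Symmetric gains keep the output real up to rounding error.
    return [x.real for x in idft(weighted)]


# One scanline of a toy depth map: a smooth ramp plus a fine alternating ripple.
scanline = [0.1 * i + (0.05 if i % 2 else -0.05) for i in range(8)]

# Suppress the high band entirely, keeping only coarse structure.
smoothed = reweight_bands(scanline, cutoff=1, low_gain=1.0, high_gain=0.0)

# Unit gains on both bands reproduce the input.
unchanged = reweight_bands(scanline, cutoff=1, low_gain=1.0, high_gain=1.0)
```

Setting `high_gain` between 0 and 1 trades fine detail against smooth structure; a real depth model would work on 2D feature maps with a fast FFT rather than this quadratic-time 1D transform.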