arxiv_cs_cv April 20, 2026

A Single Image and Multimodality Is All You Need for Novel View Synthesis

Translated: 2026/4/20 10:51:04

novel-view-synthesis, diffusion-models, multi-modal-sensing, depth-estimation, computer-vision

Original Content

arXiv:2602.17909v2
Announce Type: replace

Abstract: Diffusion-based approaches have recently demonstrated strong performance for single-image novel view synthesis by conditioning generative models on geometry inferred from monocular depth estimation. However, in practice, the quality and consistency of the synthesized views are fundamentally limited by the reliability of the underlying depth estimates, which are often fragile under low-texture, adverse weather, and occlusion-heavy real-world conditions. In this work, we show that incorporating sparse multimodal range measurements provides a simple yet effective way to overcome these limitations. We introduce a multimodal depth reconstruction framework that leverages extremely sparse range sensing data, such as automotive radar or LiDAR, to produce dense depth maps that serve as robust geometric conditioning for diffusion-based novel view synthesis. Our approach models depth in an angular domain using a localized Gaussian Process formulation, enabling computationally efficient inference while explicitly quantifying uncertainty in regions with limited observations. The reconstructed depth and uncertainty are used as a drop-in replacement for monocular depth estimators in existing diffusion-based rendering pipelines, without modifying the generative model itself. Experiments on real-world multimodal driving scenes demonstrate that replacing vision-only depth with our sparse range-based reconstruction substantially improves both geometric consistency and visual quality in single-image novel-view video generation. These results highlight the importance of reliable geometric priors for diffusion-based view synthesis and demonstrate the practical benefits of multimodal sensing even at extreme levels of sparsity.

Code is publicly available at: https://github.com/importAmir/MultiModalNVS
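To make the idea of a localized Gaussian Process over angular coordinates more concrete, the following is a minimal sketch and not the authors' implementation (see their repository for that). It assumes each sparse range return is given as an (azimuth, elevation) pair in radians with a depth in meters, uses a squared-exponential kernel, and fits an independent small GP on the k nearest returns around each query direction. The function names (`rbf_kernel`, `local_gp_depth`) and all hyperparameter values are hypothetical choices made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, length_scale, signal_var):
    """Squared-exponential kernel between two sets of 2D angles (radians)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)


def local_gp_depth(query_angles, obs_angles, obs_depth, k=8,
                   length_scale=0.05, signal_var=100.0, noise_var=0.01):
    """Posterior depth mean and variance at each query angle.

    Each query uses only its k nearest observations (in angle), so the
    linear system solved per query is at most k x k. All hyperparameters
    here are illustrative assumptions, not values from the paper.
    """
    prior_mean = obs_depth.mean()  # constant prior mean (an assumption)
    means = np.empty(len(query_angles))
    variances = np.empty(len(query_angles))

    for i, q in enumerate(query_angles):
        # Select the local neighborhood of range returns around this direction.
        d2 = ((obs_angles - q) ** 2).sum(axis=1)
        idx = np.argsort(d2)[:k]
        X = obs_angles[idx]
        y = obs_depth[idx] - prior_mean

        # Standard GP regression equations, solved via a Cholesky factor.
        K = rbf_kernel(X, X, length_scale, signal_var)
        K += noise_var * np.eye(len(idx))
        k_star = rbf_kernel(q[None, :], X, length_scale, signal_var)[0]
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        v = np.linalg.solve(L, k_star)

        means[i] = prior_mean + k_star @ alpha
        # Variance stays near signal_var where no observation is close by.
        variances[i] = signal_var - v @ v

    return means, variances


# Four hypothetical range returns and two query directions.
obs_angles = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
obs_depth = np.array([10.0, 12.0, 11.0, 13.0])
query_angles = np.array([[0.0, 0.0], [1.0, 1.0]])
mean, var = local_gp_depth(query_angles, obs_angles, obs_depth, k=3)
```

In this sketch, a query that coincides with a range return recovers roughly the measured depth with low variance, while a query far from every return falls back to the prior mean with variance close to `signal_var`. A per-pixel variance map of this kind is the sort of uncertainty signal the abstract describes passing alongside the dense depth map to the rendering pipeline.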