arxiv_cs_cv February 10, 2026

MambaFusion: Adaptive State-Space Fusion for Multimodal 3D Object Detection

Translated: 2026/3/15 19:04:46

mambafusion, 3d-object-detection, multimodal-fusion, state-space-models, autonomous-driving

Japanese Translation

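arXiv:2602.08126v1 Announce Type: new

Abstract (rendered in English from the Japanese translation): Reliable 3D object detection is a foundation of autonomous driving, yet multimodal fusion of camera and LiDAR data remains a major challenge. Cameras supply dense visual information, but their depth estimates are highly uncertain; LiDAR supplies accurate 3D structure, but its coverage is sparse. Existing BEV-based fusion frameworks have made progress, but they still struggle with inefficient context modeling, spatially invariant fusion, and reasoning under uncertainty. This work proposes MambaFusion, a unified multimodal detection framework for efficient, adaptive, and physically consistent 3D perception. MambaFusion combines selective state-space models (SSMs) with windowed transformers to propagate global context in linear time while maintaining local geometric fidelity. A multi-modal token alignment (MTA) module and reliability-aware fusion gates dynamically re-weight camera and LiDAR features according to spatial confidence and calibration consistency. Finally, a structure-conditioned diffusion head integrates graph-based reasoning with uncertainty-aware denoising, enforcing physical plausibility and producing calibrated confidence. MambaFusion sets a new state of the art on the nuScenes benchmarks while retaining linear-time complexity. The framework shows that combining SSM-based efficiency with reliability-driven fusion yields robust, temporally stable, and interpretable 3D perception applicable to real-world autonomous driving systems.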

Original Content

arXiv:2602.08126v1 Announce Type: new

Abstract: Reliable 3D object detection is fundamental to autonomous driving, yet multimodal fusion of camera and LiDAR data remains a persistent challenge. Cameras provide dense visual cues but ill-posed depth; LiDAR provides precise 3D structure but sparse coverage. Existing BEV-based fusion frameworks have made good progress, but they still struggle with inefficient context modeling, spatially invariant fusion, and reasoning under uncertainty. We introduce MambaFusion, a unified multi-modal detection framework that achieves efficient, adaptive, and physically grounded 3D perception. MambaFusion interleaves selective state-space models (SSMs) with windowed transformers to propagate global context in linear time while preserving local geometric fidelity. A multi-modal token alignment (MTA) module and reliability-aware fusion gates dynamically re-weight camera and LiDAR features based on spatial confidence and calibration consistency. Finally, a structure-conditioned diffusion head integrates graph-based reasoning with uncertainty-aware denoising, enforcing physical plausibility and producing calibrated confidence. MambaFusion establishes new state-of-the-art performance on nuScenes benchmarks while operating with linear-time complexity. The framework demonstrates that coupling SSM-based efficiency with reliability-driven fusion yields robust, temporally stable, and interpretable 3D perception for real-world autonomous driving systems.
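
To make the idea of reliability-aware fusion concrete, here is a minimal sketch of per-cell gating between camera and LiDAR BEV features. The abstract does not specify how MambaFusion's gates are built, so everything here is an assumption for illustration: the gate is a plain softmax over two confidence scores, and the names `reliability_gate`, `fuse_bev`, and `temperature` are hypothetical rather than taken from the paper.

```python
import math


def reliability_gate(cam_conf, lidar_conf, temperature=1.0):
    """Turn two per-cell confidence scores into weights (w_cam, w_lidar).

    Assumption: a two-way softmax; the paper's actual gate may differ.
    """
    # Subtract the larger score first so exp() cannot overflow.
    m = max(cam_conf, lidar_conf)
    e_cam = math.exp((cam_conf - m) / temperature)
    e_lidar = math.exp((lidar_conf - m) / temperature)
    total = e_cam + e_lidar
    return e_cam / total, e_lidar / total


def fuse_bev(cam_feats, lidar_feats, cam_conf, lidar_conf):
    """Fuse two flattened BEV grids cell by cell.

    cam_feats, lidar_feats: list of cells, each a list of C floats.
    cam_conf, lidar_conf:   one confidence score per cell.
    """
    fused = []
    for f_c, f_l, c_c, c_l in zip(cam_feats, lidar_feats, cam_conf, lidar_conf):
        w_c, w_l = reliability_gate(c_c, c_l)
        # Convex combination: the weights differ from cell to cell, so the
        # fusion is spatially varying rather than spatially invariant.
        fused.append([w_c * a + w_l * b for a, b in zip(f_c, f_l)])
    return fused
```

For example, `fuse_bev([[1.0, 0.0]], [[0.0, 1.0]], [0.0], [0.0])` returns `[[0.5, 0.5]]`, because equal confidence gives each modality half the weight. Raising the LiDAR confidence for a cell moves that cell's fused feature toward the LiDAR feature.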