arxiv_cs_cv April 20, 2026

Robust Multispectral Semantic Segmentation under Missing or Full Modalities via Structured Latent Projection

Translated: 2026/4/20 10:43:17

Tags: multispectral, semantic-segmentation, remote-sensing, multimodal, deep-learning

Original Content

arXiv:2604.15856v1
Announce Type: new

Abstract: Multimodal remote sensing data provide complementary information for semantic segmentation, but in real-world deployments some modalities may be unavailable due to sensor failures, acquisition issues, or challenging atmospheric conditions. Existing multimodal segmentation models typically address missing modalities by learning a shared representation across inputs. However, this approach can introduce a trade-off: it compromises modality-specific complementary information and reduces performance when all modalities are available. In this paper, we tackle this limitation with CBC-SLP, a multimodal semantic segmentation model designed to preserve both modality-invariant and modality-specific information. Inspired by theoretical results on modality alignment, which state that perfectly aligned multimodal representations can lead to sub-optimal performance in downstream prediction tasks, we propose a novel structured latent projection approach as an architectural inductive bias. Rather than enforcing this strategy through a loss term, we incorporate it directly into the architecture. In particular, to use the complementary information effectively while maintaining robustness under random modality dropout, we structure the latent representations into shared and modality-specific components and adaptively transfer them to the decoder according to the random modality availability mask. Extensive experiments on three multimodal remote sensing image sets demonstrate that CBC-SLP consistently outperforms state-of-the-art multimodal models across full and missing modality scenarios. In addition, we empirically demonstrate that the proposed strategy can recover complementary information that may not be preserved in a shared representation. The code is available at https://github.com/iremulku/Multispectral-Semantic-Segmentation-via-Structured-Latent-Projection-CBC-SLP-.
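The abstract describes splitting latent representations into shared and modality-specific components and passing them to the decoder according to a random modality availability mask. The NumPy sketch below illustrates one way such gating could look. It is not taken from the CBC-SLP repository: the function names, the linear projections, the averaging of shared components over available modalities, and the zeroing of modality-specific components for missing modalities are all assumptions made purely for illustration.

```python
import numpy as np


def sample_availability_mask(rng, num_modalities):
    """Hypothetical random modality dropout: keep each modality with
    probability 0.5, but never drop all of them."""
    mask = rng.integers(0, 2, size=num_modalities).astype(np.float64)
    if mask.sum() == 0:
        mask[rng.integers(num_modalities)] = 1.0
    return mask


def structured_latent(features, w_shared, w_specific, mask):
    """Build a decoder input from shared and modality-specific parts.

    features:   (M, N, D) latent vectors for M modalities, N positions
    w_shared:   (D, Ks)   one projection reused by every modality (assumed)
    w_specific: (M, D, Kp) one projection per modality (assumed)
    mask:       (M,) with 1.0 = modality available, 0.0 = missing
    returns:    (N, Ks + M * Kp)
    """
    assert mask.sum() > 0, "at least one modality must be available"
    gate = mask[:, None, None]

    # Shared part: project each modality with the same matrix, then
    # average over the modalities that are actually present.
    shared_each = features @ w_shared                      # (M, N, Ks)
    shared = (gate * shared_each).sum(axis=0) / mask.sum() # (N, Ks)

    # Specific part: per-modality projection, zeroed when missing.
    specific = gate * np.einsum("mnd,mdk->mnk", features, w_specific)
    specific_flat = np.concatenate(list(specific), axis=1) # (N, M * Kp)

    return np.concatenate([shared, specific_flat], axis=1)


rng = np.random.default_rng(0)
M, N, D, Ks, Kp = 3, 4, 8, 5, 2
features = rng.normal(size=(M, N, D))
w_shared = rng.normal(size=(D, Ks))
w_specific = rng.normal(size=(M, D, Kp))

full = structured_latent(features, w_shared, w_specific, np.ones(M))
partial = structured_latent(features, w_shared, w_specific,
                            np.array([1.0, 0.0, 1.0]))
```

In this sketch the decoder input keeps the same shape whether or not a modality is missing: the slot of a missing modality is filled with zeros, while the shared slot is still populated from whatever modalities remain. Whether CBC-SLP uses averaging, zeroing, or another transfer rule is not stated in the abstract.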