arxiv_cs_cv 2026/2/10

Seeing Roads Through Words: A Language-Guided Framework for RGB-T Driving Scene Segmentation

Translated: 2026/3/15 18:02:44

rgb-t, segmentation, self-driving, vision-language-model, fusion-strategy

Original Content

arXiv:2602.07343v1 Announce Type: new Abstract: Robust semantic segmentation of road scenes under adverse illumination, lighting, and shadow conditions remains a core challenge for autonomous driving applications. RGB-Thermal fusion is a standard approach, yet existing methods apply static fusion strategies uniformly across all conditions, allowing modality-specific noise to propagate throughout the network. Hence, we propose CLARITY, which dynamically adapts its fusion strategy to the detected scene condition. Guided by vision-language model (VLM) priors, the network learns to modulate each modality's contribution based on the illumination state while leveraging object embeddings for segmentation, rather than applying a fixed fusion policy. We further introduce two mechanisms: one that preserves valid dark-object semantics that prior noise-suppression methods incorrectly discard, and a hierarchical decoder that enforces structural consistency across scales to sharpen boundaries on thin objects. Experiments on the MFNet dataset demonstrate that CLARITY establishes a new state-of-the-art (SOTA), achieving 62.3% mIoU and 77.5% mAcc.
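
To make the contrast between a static and a condition-dependent fusion policy concrete, here is a minimal sketch in plain Python. It is not the CLARITY architecture, which the abstract does not specify in detail. The condition labels (`day`, `night`), the `GATE_LOGITS` table, and the `fuse` function are all hypothetical, and the gate values are hand-picked stand-ins for what a trained network would learn from VLM priors.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical gate logits per scene condition, ordered (rgb, thermal).
# In a real model these would be learned, not hard-coded.
GATE_LOGITS = {
    "day": (1.0, 0.0),    # assumption: favor RGB when well lit
    "night": (0.0, 1.0),  # assumption: favor thermal in the dark
}


def fuse(rgb_feat, thermal_feat, condition_probs):
    """Blend two feature vectors with weights that depend on the scene.

    condition_probs maps each condition label to its estimated probability
    (an assumed output of some upstream condition detector).
    """
    # Expected gate logits under the condition distribution.
    rgb_logit = sum(p * GATE_LOGITS[c][0] for c, p in condition_probs.items())
    th_logit = sum(p * GATE_LOGITS[c][1] for c, p in condition_probs.items())
    # Normalize so the two modality weights sum to 1.
    w_rgb, w_th = softmax([rgb_logit, th_logit])
    fused = [w_rgb * r + w_th * t for r, t in zip(rgb_feat, thermal_feat)]
    return fused, (w_rgb, w_th)


if __name__ == "__main__":
    rgb, thermal = [1.0, 0.0], [0.0, 1.0]
    for probs in ({"day": 1.0, "night": 0.0}, {"day": 0.0, "night": 1.0}):
        fused, weights = fuse(rgb, thermal, probs)
        print(probs, weights, fused)
```

A static policy corresponds to fixing the weights at one value for every input, for example (0.5, 0.5). In the sketch the weights shift toward whichever modality is assumed more reliable for the detected condition, which is the general idea the abstract describes.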