arxiv_cs_cv April 20, 2026

Information Router for Mitigating Modality Dominance in Vision-Language Models

Translated: 2026/4/20 10:46:38

vision-language-models, multi-modal-attention, information-robotics, modal-dominance, arxiv

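Translation (from the Japanese)

arXiv:2604.16264v1 Announce Type: new Abstract: Vision-language models (VLMs) achieve strong performance across a wide range of benchmarks, yet they suffer from modality dominance, in which predictions rely excessively on a single modality. Prior approaches address this mainly by adjusting the model's attention allocation, on the assumption that every modality supplies sufficient information. However, attention only determines where the model focuses; it cannot enrich information that is missing or ambiguous. In the real world, input modalities differ in information density and signal-to-noise ratio. In such cases, merely adjusting the model's attention cannot resolve the underlying lack of information. This paper proposes MoIR (Multi-modal Information Router), an information-level fusion method that explicitly reduces information disparity before fusion. MoIR identifies tokens that carry too little information, routes complementary information to them from the stronger modality, and builds information-dense token representations before they are processed by the large language model. By changing what information is available, MoIR makes it possible to shift modality dominance reliably even when one modality is degraded. We evaluate MoIR on three widely used multi-modal benchmarks across multiple model backbones. The experimental results show that MoIR consistently yields more balanced modality contributions and improves robustness and downstream performance, particularly under modality degradation. These findings indicate that explicitly modifying cross-modal information is an effective and complementary strategy for mitigating modality dominance in multi-modal reasoning models.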

Original Content

arXiv:2604.16264v1 Announce Type: new Abstract: Vision-language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, where predictions rely disproportionately on a single modality. Prior approaches primarily address this issue by steering the model's attention allocation, implicitly assuming that all modalities provide sufficient information. However, attention only determines where the model focuses, and cannot enrich information that is missing or ambiguous. In the real world, input modalities often differ in information density and signal-to-noise ratio. In such cases, simply adjusting the model's attention does not resolve the underlying lack of information. In this paper, we propose MoIR (Multi-modal Information Router), an information-level fusion method that explicitly reduces information disparity prior to fusion. MoIR identifies less informative tokens and routes complementary information to them from a stronger modality, constructing information-dense token representations before they are processed by a large language model. By modifying information availability, MoIR enables reliable shifts in modality dominance, even when one modality is degraded. We evaluate MoIR on three widely used multi-modal benchmarks across multiple model backbones. Experimental results show that MoIR consistently demonstrates more balanced modality contribution and improves robustness and downstream performance, particularly under modality degradation. These findings demonstrate that explicitly modifying cross-modal information is an effective and complementary strategy for mitigating modality dominance in multi-modal reasoning models.
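
Illustrative sketch (not from the paper): the abstract does not say how MoIR scores tokens or routes information, so the snippet below is only a toy reading of the "identify less informative tokens, then route complementary information from the stronger modality" idea. Everything in it is an assumption made for illustration: the function name `route_information`, the use of the L2 norm as an informativeness proxy, the dot-product softmax weighting, the `threshold` and `mix` parameters, and the requirement that both modalities share one embedding dimension.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def route_information(weak_tokens, strong_tokens, threshold=0.5, mix=0.5):
    """Hypothetical router, NOT the paper's implementation.

    Tokens of the weaker modality whose L2 norm falls below `threshold`
    (an assumed informativeness proxy) are blended with a softmax-weighted
    summary of the stronger modality's tokens. Other tokens pass through.
    Returns the routed tokens and the indices that were enriched.
    """
    routed, enriched = [], []
    for i, tok in enumerate(weak_tokens):
        if norm(tok) >= threshold:
            routed.append(list(tok))  # informative enough: keep unchanged
            continue
        # Weight each strong-modality token by its similarity to this token.
        weights = softmax([dot(tok, s) for s in strong_tokens])
        context = [
            sum(w * s[d] for w, s in zip(weights, strong_tokens))
            for d in range(len(tok))
        ]
        # Blend the original token with the routed context.
        routed.append([(1 - mix) * t + mix * c for t, c in zip(tok, context)])
        enriched.append(i)
    return routed, enriched


# Toy example: the second "image" token is nearly empty (degraded input).
image_tokens = [[0.9, 0.1], [0.05, 0.02]]
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
routed_tokens, enriched_indices = route_information(image_tokens, text_tokens)
print(enriched_indices)
print(routed_tokens)
```

In this toy setup only the degraded token is modified, and the routing happens before any downstream model sees the tokens, which mirrors the abstract's point that information availability is changed prior to fusion instead of attention being reweighted afterwards.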