arxiv_cs_cv April 20, 2026

Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow

Translated: 2026/4/20 10:42:47

vision-language-models, information-flow, visual-question-answering, token-dynamics, machine-learning

Original Content

arXiv:2604.15809v1 Announce Type: new

Abstract: Vision-Language Models (VLMs) have demonstrated strong capabilities in a wide range of tasks such as visual recognition, document parsing, and visual grounding. Nevertheless, recent work shows that while VLMs often manage to capture the correct image region corresponding to the question, they do not necessarily produce the correct answer. In this work, we demonstrate that this misalignment could be attributed to suboptimal information flow within VLMs, where text tokens distribute too much attention to irrelevant visual tokens, leading to incorrect answers. Based on this observation, we show that modulating the information flow during inference can improve the perception capability of VLMs. The idea is that text tokens should be associated only with important visual tokens during decoding, eliminating interference from irrelevant regions. To achieve this, we propose a token dynamics-based method to determine the importance of visual tokens, where visual tokens that exhibit distinct activation patterns during different decoding stages are viewed as important. We apply our approach to representative open-source VLMs and evaluate on various datasets, including visual question answering, visual grounding and counting, optical character recognition, and object hallucination. The results show that our approach significantly improves the performance of baselines. Project page: https://cxliu0.github.io/AIF/.
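
The abstract describes restricting a text token's attention to important visual tokens at decoding time. The sketch below is one possible illustration of that idea, not the authors' implementation. The importance score (variance of a token's activation across decoding stages), the top-k selection rule, and all function and parameter names (`importance_from_dynamics`, `modulated_attention`, `keep`) are assumptions introduced here; the abstract does not specify them.

```python
import math


def softmax(logits):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def importance_from_dynamics(activations):
    """ASSUMPTION (not from the paper): score a visual token by the variance
    of its activation across decoding stages.

    activations[s][i] is the activation of visual token i at stage s.
    """
    n_stages = len(activations)
    scores = []
    for i in range(len(activations[0])):
        values = [stage[i] for stage in activations]
        mean = sum(values) / n_stages
        scores.append(sum((v - mean) ** 2 for v in values) / n_stages)
    return scores


def modulated_attention(logits, visual_positions, scores, keep):
    """Mask all but the `keep` highest-scoring visual tokens, then renormalize.

    logits: pre-softmax attention scores of one text token over all tokens.
    visual_positions: indices in `logits` that belong to visual tokens.
    scores: one importance score per entry of `visual_positions`.
    """
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    kept = {visual_positions[j] for j in ranked[:keep]}
    dropped = set(visual_positions) - kept
    # Setting a logit to -inf removes that token from the softmax entirely.
    masked = [-math.inf if i in dropped else x for i, x in enumerate(logits)]
    return softmax(masked)


# Toy input: 4 visual tokens (positions 0-3) followed by 2 text tokens.
activations = [
    [0.1, 0.9, 0.5, 0.2],  # early decoding stage
    [0.1, 0.2, 0.5, 0.8],  # later decoding stage
]
logits = [2.0, 1.0, 2.0, 1.0, 0.5, 0.5]
scores = importance_from_dynamics(activations)
weights = modulated_attention(logits, [0, 1, 2, 3], scores, keep=2)
```

In this toy input, visual tokens 0 and 2 have the largest raw logits but constant activations across the two stages, so they are masked and the attention mass is redistributed over visual tokens 1 and 3 and the text tokens. The sketch assumes at least one position stays unmasked; otherwise the softmax is undefined.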