arxiv_cs_cv April 24, 2026

Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

Translated: 2026/4/24 19:44:46

vision-language-models, evaluator-benchmarks, hallucination-detection, vision-question-answering, image-generation

Original Content

arXiv:2604.21523v1 (Announce Type: new)

Abstract: Large Vision-Language Models (VLMs) are increasingly used to evaluate the outputs of other models, both for image-to-text (I2T) tasks such as visual question answering and for text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains underexplored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs reliably account for such quality-degrading errors in their evaluations. Using a comprehensive benchmark of over 4,000 perturbed instances spanning 40 perturbation dimensions, we evaluate four prominent VLMs under the single-answer scoring, pairwise comparison, and reference-guided paradigms. Our findings reveal that current VLM evaluators exhibit substantial blind spots: they often fail to detect perturbed outputs, with failure rates exceeding 50% in some cases; they struggle particularly with fine-grained compositional and spatial errors; and they are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, though failures persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution in their deployment for benchmarking and development decisions. Code and data have been made publicly available.
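To make the idea of a per-dimension failure rate concrete, here is a minimal Python sketch. It is not the authors' released code, and the abstract does not state the exact failure criterion. The sketch assumes that, under single-answer scoring, the evaluator fails whenever it does not score the perturbed output strictly lower than the original, and that, under pairwise comparison, it fails whenever it does not prefer the original. The names `ScoredPair`, `failure_rate_by_dimension`, and `pairwise_failure_rate`, the dimension labels, and all scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScoredPair:
    """Evaluator scores for one original/perturbed output pair (hypothetical schema)."""
    dimension: str          # perturbation dimension label, e.g. "object_hallucination"
    original_score: float   # score the evaluator gave the unperturbed output
    perturbed_score: float  # score the evaluator gave the degraded output


def failure_rate_by_dimension(pairs):
    """Fraction of pairs per dimension where the perturbed output was NOT scored lower.

    Assumed criterion: a tie or a higher score for the perturbed output is a failure.
    """
    totals, failures = {}, {}
    for p in pairs:
        totals[p.dimension] = totals.get(p.dimension, 0) + 1
        if p.perturbed_score >= p.original_score:
            failures[p.dimension] = failures.get(p.dimension, 0) + 1
    return {d: failures.get(d, 0) / totals[d] for d in totals}


def pairwise_failure_rate(choices):
    """Fraction of pairwise verdicts that did not prefer the original output.

    Each verdict is one of "original", "perturbed", or "tie" (assumed labels).
    """
    if not choices:
        return 0.0
    return sum(1 for c in choices if c != "original") / len(choices)


# Toy data for illustration only; these are not results from the paper.
pairs = [
    ScoredPair("object_hallucination", 4.0, 4.0),  # tie: perturbation missed
    ScoredPair("object_hallucination", 5.0, 2.0),  # perturbation detected
    ScoredPair("spatial_relation", 4.0, 5.0),      # perturbed output scored higher
    ScoredPair("spatial_relation", 3.0, 3.0),      # tie: perturbation missed
]
rates = failure_rate_by_dimension(pairs)
pairwise_rate = pairwise_failure_rate(["original", "perturbed", "tie", "original"])
print(rates)
print(pairwise_rate)
```

Counting ties as failures is a deliberately strict choice. A more lenient criterion, such as requiring only that the perturbed output not score higher, would produce lower failure rates on the same data.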