arxiv_cs_cv April 24, 2026

VidHal: Benchmarking Temporal Hallucinations in Vision LLMs

vision-language-models, hallucination, video-benchmarking, artificial-intelligence, temporal-dynamics

Original Content

arXiv:2411.16771v3 Announce Type: replace

Abstract: Vision Large Language Models (VLLMs) are widely acknowledged to be prone to hallucinations. Existing research addressing this problem has primarily been confined to image inputs, with limited exploration of video-based hallucinations. Furthermore, current evaluation methods fail to capture nuanced errors in generated responses, which are often exacerbated by the rich spatiotemporal dynamics of videos. To address this, we introduce VidHal, a benchmark specially designed to evaluate video-based hallucinations in VLLMs. VidHal is constructed by bootstrapping video instances across a wide range of common temporal aspects. A defining feature of our benchmark lies in the careful creation of captions which represent varying levels of hallucination associated with each video. To enable fine-grained evaluation, we propose a novel caption ordering task requiring VLLMs to rank captions by hallucinatory extent. We conduct extensive experiments on VidHal and comprehensively evaluate a broad selection of models. Our results uncover significant limitations in existing VLLMs regarding hallucination generation. Through our benchmark, we aim to inspire further research on 1) holistic understanding of VLLM capabilities, particularly regarding hallucination, and 2) extensive development of advanced VLLMs to alleviate this problem.
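As a rough illustration of the caption ordering task, the sketch below scores a model's predicted ranking of captions against a reference ranking from least to most hallucinated. The abstract does not specify the benchmark's scoring metric. The pairwise-agreement score, the function name `pairwise_order_agreement`, and the caption ids `"A"`, `"B"`, `"C"` are all assumptions made for illustration, not VidHal's actual method.

```python
from itertools import combinations


def pairwise_order_agreement(predicted, reference):
    """Fraction of caption pairs that `predicted` orders the same way as `reference`.

    Both arguments are lists of unique caption ids, ordered from least to most
    hallucinated. This is a hypothetical metric, not the one used by VidHal.
    """
    if sorted(predicted) != sorted(reference):
        raise ValueError("predicted and reference must contain the same caption ids")
    if len(set(reference)) != len(reference):
        raise ValueError("caption ids must be unique")
    if len(reference) < 2:
        raise ValueError("need at least two captions to compare orderings")

    # Rank of each caption in the model's predicted ordering.
    position = {caption_id: rank for rank, caption_id in enumerate(predicted)}

    # Every pair (first, second) where `first` precedes `second` in the reference.
    pairs = list(combinations(reference, 2))
    agreeing = sum(1 for first, second in pairs if position[first] < position[second])
    return agreeing / len(pairs)


# Hypothetical example: three captions with increasing hallucination, A < B < C.
reference = ["A", "B", "C"]
print(pairwise_order_agreement(["A", "B", "C"], reference))  # identical ordering
print(pairwise_order_agreement(["C", "B", "A"], reference))  # fully reversed ordering
print(pairwise_order_agreement(["A", "C", "B"], reference))  # one pair swapped
```

A pairwise score of this kind gives partial credit when a model confuses only adjacent captions, which is one way a ranking task can be more fine-grained than a single right-or-wrong answer.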