arxiv_cs_lg April 24, 2026

DistortBench: Benchmarking Vision Language Models on Image Distortion Identification

Translated: 2026/4/24 20:03:03

benchmark, vision-language-models, image-distortion, perception-evaluation, model-assessment

Original Content

arXiv:2604.19966v1 Announce Type: cross

Abstract: Vision-language models (VLMs) are increasingly used in settings where sensitivity to low-level image degradations matters, including content moderation, image restoration, and quality monitoring. Yet their ability to recognize distortion type and severity remains poorly understood. We present DistortBench, a diagnostic benchmark for no-reference distortion perception in VLMs. DistortBench contains 13,500 four-choice questions covering 27 distortion types, six perceptual categories, and five severity levels: 25 distortions inherit KADID-10k calibrations, while two added rotation distortions use monotonic angle-based levels. We evaluate 18 VLMs, including 17 open-weight models from five families and one proprietary model. Despite strong performance on high-level vision-language tasks, the best model reaches only 61.9% accuracy, just below the human majority-vote baseline of 65.7% (average individual: 60.2%), indicating that low-level perceptual understanding remains a major weakness of current VLMs. Our analysis further reveals weak and non-monotonic scaling with model size, performance drops in most base-thinking pairs, and distinct severity-response patterns across model families. We hope DistortBench will serve as a useful benchmark for measuring and improving low-level visual perception in VLMs.
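
The abstract reports two human reference points: a majority-vote baseline and an average individual accuracy. The sketch below illustrates how those two numbers can differ on four-choice questions. It is a minimal illustration on toy data, not the authors' evaluation code: the function names, the choice labels, and the tie-breaking rule are all assumptions, since the abstract does not describe them.

```python
from collections import Counter

# Four answer options per question, as stated in the abstract.
# The labels themselves are hypothetical.
CHOICES = ("A", "B", "C", "D")

# Expected accuracy of uniform random guessing on four options.
CHANCE_LEVEL = 1 / len(CHOICES)


def accuracy(predictions, answers):
    """Fraction of questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have equal length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def majority_vote(votes):
    """Most common choice among raters for one question.

    Ties are broken by the order of CHOICES. This rule is an assumption;
    the abstract does not say how ties were resolved.
    """
    counts = Counter(votes)
    best = max(counts.values())
    return next(c for c in CHOICES if counts.get(c, 0) == best)


def human_baselines(rater_answers, answers):
    """Return (mean individual accuracy, majority-vote accuracy).

    rater_answers holds one answer list per rater, each aligned with answers.
    """
    individual = [accuracy(r, answers) for r in rater_answers]
    mean_individual = sum(individual) / len(individual)
    voted = [majority_vote(question_votes) for question_votes in zip(*rater_answers)]
    return mean_individual, accuracy(voted, answers)


# Toy data, invented for illustration only.
answers = ["A", "B", "C", "D"]
raters = [
    ["A", "B", "C", "A"],
    ["A", "B", "D", "D"],
    ["A", "C", "C", "B"],
]
mean_individual, majority = human_baselines(raters, answers)
print(f"chance={CHANCE_LEVEL:.2f} individual={mean_individual:.3f} majority={majority:.3f}")
```

On this toy data the majority vote scores higher than the average rater, because individual errors on different questions are outvoted. This is the same direction as the gap reported in the abstract (65.7% versus 60.2%).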