arxiv_cs_cv April 20, 2026

UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

unisub-bench, multimodal-mlm, visual-editing, model-distillation, arxiv-paper

Original Content

arXiv:2604.15871v1
Announce Type: new

Abstract: The evaluation of visual editing models remains fragmented across methods and modalities. Existing benchmarks are often tailored to specific paradigms, making fair cross-paradigm comparisons difficult, while video editing lacks reliable evaluation benchmarks. Furthermore, common automatic metrics often misalign with human preference, yet directly deploying multimodal large language models (MLLMs) as evaluators incurs prohibitive computational and financial costs. We present UniEditBench, a unified benchmark for image and video editing that supports reconstruction-based and instruction-driven methods under a shared protocol. UniEditBench includes a structured taxonomy of nine image operations (Add, Remove, Replace, Change, Stroke-based, Extract, Adjust, Count, Reorder) and eight video operations, with coverage of challenging compositional tasks such as counting and spatial reordering. To enable scalable evaluation, we distill a high-capacity MLLM judge (Qwen3-VL-235B-A22B Instruct) into lightweight 4B/8B evaluators that provide multi-dimensional scoring over structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency (for videos). Experiments show that the distilled evaluators maintain strong agreement with human judgments and substantially reduce deployment cost relative to the teacher model. UniEditBench provides a practical and reproducible protocol for benchmarking modern visual editing methods. Our benchmark and the associated reward models are publicly available at https://github.com/wesar1/UniEditBench.