arxiv_cs_cv April 20, 2026

OSCBench: Benchmarking Object State Change in Text-to-Video Generation

Translated: 2026/4/20 10:51:28

text-to-video, object-state-change, benchmark-evaluation, deep-learning, computer-vision

Original Content

arXiv:2603.11698v2 (announce type: replace)

Abstract: Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both a human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle to produce accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.