arxiv_cs_cv April 24, 2026

Grounding Video Reasoning in Physical Signals

Translated: 2026/4/24 19:47:00

video-understanding, physical-simulation, benchmark-design, computer-vision, visual-reasoning

Original Content

arXiv:2604.21873v1 Announce Type: new Abstract: Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities while still failing to localize the event in time or space. We introduce a grounded benchmark for physical video understanding that extends the what-when-where evaluation structure of V-STaR to four video sources, six physics domains, three prompt families (physics, vstar_like, and neutral_rstr), and four input conditions (original, shuffled, ablated, and frame-masked). The benchmark contains 1,560 base video clips from SSV2, YouCook2, HoloAssist, and Roundabout-TAU. Each clip is first converted into a shared grounded event record, and the three query families are derived from that record. Temporal and spatial targets are shared across prompt families, while the non-physics families use deterministic, family-appropriate semantic a_what targets derived from the same record. Across models and prompt families, physics remains the strongest regime overall, vstar_like is the clearest non-physics semantic comparison, and neutral_rstr behaves as a harder templated control. Prompt-family robustness is selective rather than universal, perturbation gains cluster in weak original cases, and spatial grounding is the weakest component across settings. These results suggest that video Q&A reasoning benchmarks should report physically grounded, prompt-aware, and perturbation-aware diagnostics alongside aggregate accuracy.
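
The record-then-derive design described in the abstract can be illustrated with a minimal Python sketch. Only the prompt-family and input-condition names come from the abstract. The `EventRecord` fields, the `derive_a_what` mapping, and the use of intersection-over-union (IoU) for the "when" and "where" targets are assumptions made for illustration: the abstract does not specify the record schema, how the non-physics a_what targets are built, or which metrics are reported.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterable, Iterator

# Names taken from the abstract.
PROMPT_FAMILIES = ("physics", "vstar_like", "neutral_rstr")
INPUT_CONDITIONS = ("original", "shuffled", "ablated", "frame-masked")

Span = tuple[float, float]               # (start_s, end_s), assumed format
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


@dataclass(frozen=True)
class EventRecord:
    """Hypothetical shared grounded event record (field names are assumptions)."""
    clip_id: str
    source: str   # e.g. "SSV2"
    event: str    # e.g. "pouring"
    when: Span    # temporal target
    where: Box    # spatial target


@dataclass(frozen=True)
class Query:
    family: str
    a_what: str   # family-specific semantic target
    when: Span    # shared across families
    where: Box    # shared across families


def derive_a_what(record: EventRecord, family: str) -> str:
    """Deterministic per-family 'what' target; this mapping is a placeholder."""
    if family == "physics":
        return record.event
    # The real non-physics targets are not described in the abstract.
    return f"{family}:{record.event}"


def derive_queries(record: EventRecord) -> list[Query]:
    """Derive one query per prompt family from the same record."""
    return [
        Query(family, derive_a_what(record, family), record.when, record.where)
        for family in PROMPT_FAMILIES
    ]


def evaluation_grid(records: Iterable[EventRecord]) -> Iterator[tuple[str, str, str]]:
    """Enumerate (clip_id, family, condition) cells, assuming a full cross product."""
    for record, family, condition in product(records, PROMPT_FAMILIES, INPUT_CONDITIONS):
        yield record.clip_id, family, condition


def temporal_iou(pred: Span, target: Span) -> float:
    """IoU of two time intervals; a common (assumed) score for 'when'."""
    inter = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(pred: Box, target: Box) -> float:
    """IoU of two axis-aligned boxes; a common (assumed) score for 'where'."""
    w = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    h = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = w * h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_target = (target[2] - target[0]) * (target[3] - target[1])
    union = area_pred + area_target - inter
    return inter / union if union > 0 else 0.0
```

Because every query carries the same `when` and `where` targets, localization scores stay comparable across prompt families even though the semantic target changes. If every clip were evaluated under every family and condition, which the abstract does not confirm, each clip would contribute 3 x 4 = 12 cells to the grid.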