arxiv_cs_cv February 10, 2026

Interpreting Physics in Video World Models

arXiv:2602.07050v1 Announce Type: new

Abstract: A long-standing question in physical reasoning is whether video-based models need to rely on factorized representations of physical variables in order to make physically accurate predictions, or whether they can implicitly represent such variables in a task-specific, distributed manner. While modern video world models achieve strong performance on intuitive physics benchmarks, it remains unclear which of these representational regimes they implement internally. Here, we present the first interpretability study to directly examine physical representations inside large-scale video encoders. Using layerwise probing, subspace geometry, patch-level decoding, and targeted attention ablations, we characterize where physical information becomes accessible and how it is organized within encoder-based video transformers. Across architectures, we identify a sharp intermediate-depth transition -- which we call the Physics Emergence Zone -- at which physical variables become accessible. Physics-related representations peak shortly after this transition and degrade toward the output layers. Decomposing motion into explicit variables, we find that scalar quantities such as speed and acceleration are available from early layers onwards, whereas motion direction becomes accessible only at the Physics Emergence Zone. Notably, we find that direction is encoded through a high-dimensional population structure with circular geometry, requiring coordinated multi-feature intervention to control. These findings suggest that modern video models do not use factorized representations of physical variables like a classical physics engine. Instead, they use a distributed representation that is nonetheless sufficient for making physical predictions.
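To make the idea of layerwise probing concrete, the following is a minimal sketch on purely synthetic data, not the authors' code or results. It assumes a hypothetical six-layer encoder whose pooled activations are simulated by `synthetic_layer_activations`. Speed is written into every layer, while direction is written as a cosine-tuned population code only from the fourth layer on. The per-layer gains, the feature count, and the ridge probe are all illustrative assumptions. Direction is probed through the pair (cos θ, sin θ) because it is a circular variable.

```python
import numpy as np


def ridge_probe_r2(x_train, y_train, x_test, y_test, alpha=1.0):
    """Fit a closed-form ridge probe; return held-out R^2 averaged over targets."""
    x_mean, y_mean = x_train.mean(axis=0), y_train.mean(axis=0)
    xc = x_train - x_mean
    gram = xc.T @ xc + alpha * np.eye(xc.shape[1])
    weights = np.linalg.solve(gram, xc.T @ (y_train - y_mean))
    pred = (x_test - x_mean) @ weights + y_mean
    ss_res = ((y_test - pred) ** 2).sum(axis=0)
    ss_tot = ((y_test - y_test.mean(axis=0)) ** 2).sum(axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))


def synthetic_layer_activations(theta, speed, direction_gain, rng, n_features=64):
    """Hypothetical stand-in for one layer's pooled activations (not real model data)."""
    # Each feature has a preferred direction; together they form a circular population code.
    preferred = np.linspace(0.0, 2.0 * np.pi, n_features, endpoint=False)
    direction_code = direction_gain * np.cos(theta[:, None] - preferred[None, :])
    # Speed is written along one random direction in feature space.
    speed_code = np.outer(speed, rng.standard_normal(n_features))
    noise = rng.standard_normal((theta.size, n_features))
    return direction_code + speed_code + noise


def run_layerwise_probe(n_samples=2000, seed=0):
    """Return a list of (direction_r2, speed_r2) tuples, one per simulated layer."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n_samples)
    speed = rng.uniform(0.5, 2.0, n_samples)
    direction_targets = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    speed_targets = speed[:, None]

    # Assumed gains: direction is absent in early layers, appears mid-depth, then weakens.
    direction_gains = [0.0, 0.0, 0.0, 1.0, 1.0, 0.5]
    split = n_samples // 2
    results = []
    for gain in direction_gains:
        acts = synthetic_layer_activations(theta, speed, gain, rng)
        r2_dir = ridge_probe_r2(acts[:split], direction_targets[:split],
                                acts[split:], direction_targets[split:])
        r2_speed = ridge_probe_r2(acts[:split], speed_targets[:split],
                                  acts[split:], speed_targets[split:])
        results.append((r2_dir, r2_speed))
    return results


if __name__ == "__main__":
    for layer, (r2_dir, r2_speed) in enumerate(run_layerwise_probe()):
        print(f"layer {layer}: direction R^2 = {r2_dir:.2f}, speed R^2 = {r2_speed:.2f}")
```

In this toy setup the speed probe succeeds at every layer, while the direction probe only succeeds once the population code is present. This mirrors the qualitative pattern the abstract describes, but only because the simulation was built that way. Applying the same probe to a real encoder would mean replacing `synthetic_layer_activations` with activations extracted from each layer of the model.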