arxiv_cs_cv February 10, 2026

MIND: Benchmarking Memory Consistency and Action Control in World Models

Original Content

arXiv:2602.08025v1, Announce Type: new

Abstract: World models aim to understand, remember, and predict dynamic visual environments, yet a unified benchmark for evaluating their fundamental abilities remains lacking. To address this gap, we introduce MIND, the first open-domain closed-loop revisited benchmark for evaluating Memory consIstency and action coNtrol in worlD models. MIND contains 250 high-quality videos at 1080p and 24 FPS, including 100 (first-person) + 100 (third-person) video clips under a shared action space and 25 + 25 clips across varied action spaces covering eight diverse scenes. We design an efficient evaluation framework to measure two core abilities: memory consistency and action control, capturing temporal stability and contextual coherence across viewpoints. Furthermore, we design various action spaces, including different character movement speeds and camera rotation angles, to evaluate the action generalization capability across different action spaces under shared scenes. To facilitate future performance benchmarking on MIND, we introduce MIND-World, a novel interactive Video-to-World baseline. Extensive experiments demonstrate the completeness of MIND and reveal key challenges in current world models, including the difficulty of maintaining long-term memory consistency and generalizing across action spaces.

Project page: https://csu-jpg.github.io/MIND.github.io/
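The clip counts stated in the abstract can be tallied as a quick sanity check. The sketch below is illustrative only: the split labels are hypothetical names, not identifiers from the MIND release, and it assumes the "25 + 25" clips follow the same first-person / third-person split as the "100 + 100" clips, which the abstract implies but does not state outright.

```python
from collections import defaultdict

# Clip counts taken from the abstract. The key names are illustrative labels
# (assumptions), not identifiers from the MIND dataset itself.
# ASSUMPTION: "25 + 25" is first-person + third-person, mirroring "100 + 100".
MIND_SPLITS = {
    ("shared_action_space", "first_person"): 100,
    ("shared_action_space", "third_person"): 100,
    ("varied_action_spaces", "first_person"): 25,
    ("varied_action_spaces", "third_person"): 25,
}


def total_clips(splits):
    """Return the total number of clips across all splits."""
    return sum(splits.values())


def clips_by(splits, index):
    """Aggregate clip counts by one component of the (action_space, viewpoint) key."""
    totals = defaultdict(int)
    for key, count in splits.items():
        totals[key[index]] += count
    return dict(totals)


if __name__ == "__main__":
    print("total:", total_clips(MIND_SPLITS))
    print("by action space:", clips_by(MIND_SPLITS, 0))
    print("by viewpoint:", clips_by(MIND_SPLITS, 1))
```

Under these assumptions the four splits sum to the 250 videos the abstract reports, with 200 clips under the shared action space and 50 across the varied action spaces.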