arxiv_cs_cv April 24, 2026

WorldMark: A Unified Benchmark Suite for Interactive Video World Models

interactive-video-generation, world-models, benchmarking, image-to-video, artificial-intelligence

Original Content

arXiv:2604.21686v1 Announce Type: new

Abstract: Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions -- identical scenes, identical action sequences, and a unified control interface -- needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20-60s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (warena.ai), an online platform where anyone can pit leading world models against each other in side-by-side battles and watch the live leaderboard.
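The idea of a unified action-mapping layer can be illustrated with a minimal sketch. This is not WorldMark's actual code or API: the `Action` enum, the three adapter functions, the model names in `ADAPTERS`, and the step size are all hypothetical, chosen only to show how one shared WASD-style trajectory might be translated into several different native control formats.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Action(Enum):
    """Shared WASD-style vocabulary (the exact action set is an assumption)."""
    FORWARD = "W"
    LEFT = "A"
    BACKWARD = "S"
    RIGHT = "D"


def to_keyboard_tokens(actions: List[Action]) -> List[str]:
    """Adapter for a hypothetical model that consumes one key token per step."""
    return [a.value.lower() for a in actions]


def to_camera_deltas(actions: List[Action], step: float = 0.1) -> List[Tuple[float, float]]:
    """Adapter for a hypothetical model that consumes (dx, dz) translations per step."""
    deltas = {
        Action.FORWARD: (0.0, step),
        Action.BACKWARD: (0.0, -step),
        Action.LEFT: (-step, 0.0),
        Action.RIGHT: (step, 0.0),
    }
    return [deltas[a] for a in actions]


def to_text_prompt(actions: List[Action]) -> str:
    """Adapter for a hypothetical model conditioned on a text instruction."""
    words = {
        Action.FORWARD: "move forward",
        Action.BACKWARD: "move backward",
        Action.LEFT: "move left",
        Action.RIGHT: "move right",
    }
    return ", then ".join(words[a] for a in actions)


# Hypothetical registry: one adapter per model family.
ADAPTERS: Dict[str, Callable] = {
    "keyboard_model": to_keyboard_tokens,
    "camera_model": to_camera_deltas,
    "text_model": to_text_prompt,
}


def parse_trajectory(keys: str) -> List[Action]:
    """Parse a string such as 'WWAD' into shared actions; unknown keys raise ValueError."""
    return [Action(k) for k in keys.upper()]


def map_actions(model_name: str, keys: str):
    """Translate one shared trajectory into the named model's control format."""
    return ADAPTERS[model_name](parse_trajectory(keys))
```

Under these assumptions, the same trajectory string (for example `"WWAD"`) can be passed to `map_actions` once per model, so every model receives an equivalent action sequence expressed in its own input format.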