arxiv_cs_cv February 10, 2026

Exploring the Emergence of Physical Intelligence through an Omni-Modal Architecture and a Physical Data Engine

Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine

Translated: 2026/3/15 17:04:33

omni-modal, physical-data-engine, video-understanding, instruction-tuning, flow-matching

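Japanese Translation (English rendering)

arXiv:2602.07064v1 Announce Type: new Abstract: Physical understanding remains brittle in omni-modal models because key physical attributes are visually ambiguous and sparsely represented in web-scale data. We present OmniFysics, a compact omni-modal model that jointly understands images, audio, video, and text and has built-in speech and image generation. To inject explicit physical knowledge, we build a physical data engine with two components. FysicsAny provides physics-grounded instruction-image supervision: it maps salient objects to verified physical attributes via hierarchical retrieval over a curated prototype database, then applies physics-law-constrained verification and caption rewriting. FysicsOmniCap distills web videos through audio-visual consistency filtering to produce high-fidelity video-instruction pairs that emphasize cross-modal physical cues. We train OmniFysics with staged multimodal alignment and instruction tuning, adopt latent-space flow matching for text-to-image generation, and use an intent router that activates generation only when needed. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.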

Original Content

arXiv:2602.07064v1 Announce Type: new Abstract: Physical understanding remains brittle in omni-modal models because key physical attributes are visually ambiguous and sparsely represented in web-scale data. We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text, with integrated speech and image generation. To inject explicit physical knowledge, we build a physical data engine with two components. FysicsAny produces physics-grounded instruction--image supervision by mapping salient objects to verified physical attributes through hierarchical retrieval over a curated prototype database, followed by physics-law--constrained verification and caption rewriting. FysicsOmniCap distills web videos via audio--visual consistency filtering to generate high-fidelity video--instruction pairs emphasizing cross-modal physical cues. We train OmniFysics with staged multimodal alignment and instruction tuning, adopt latent-space flow matching for text-to-image generation, and use an intent router to activate generation only when needed. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.
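
The abstract mentions latent-space flow matching for text-to-image generation but gives no details. As a rough illustration only, here is a minimal sketch of the commonly used linear-path form of flow matching. It assumes a straight interpolation path, a standard Gaussian noise prior, and plain Python lists standing in for latent vectors. The names `interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`, and `velocity_fn` are hypothetical and are not taken from the paper, which may use a different path, parameterization, or solver.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data latent x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t for linear interpolation."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(velocity_fn, x1, rng):
    """One-sample Monte Carlo estimate of the flow matching regression loss.

    velocity_fn(x_t, t) is a hypothetical stand-in for a learned network
    (assumption: it returns a list with the same length as x_t).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample from the prior
    t = rng.random()                         # time drawn uniformly from [0, 1)
    x_t = interpolate(x0, x1, t)
    u = target_velocity(x0, x1)
    pred = velocity_fn(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(pred, u)) / len(x1)

def euler_sample(velocity_fn, x0, num_steps=50):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy check: when the data is a single point, the ideal velocity field is
# (target - x) / (1 - t), and Euler integration carries any start point to it.
target = [1.0, -1.0]
oracle = lambda x, t: [(b - a) / (1.0 - t) for a, b in zip(x, target)]
sample = euler_sample(oracle, [0.3, 0.7])
```

In a real system the latents would come from an image autoencoder and `velocity_fn` would be a text-conditioned network trained by minimizing this loss over many samples. Neither detail is specified in the abstract, so both are assumptions here.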