arxiv_cs_cv April 24, 2026

ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes

Translated: 2026/4/24 19:52:01

video-generation, pose-space, kinematic-rig, neural-diffusion, 3d-animation

Original Content

arXiv:2604.17623v2 (Announce Type: replace)

Abstract: Kinematic rigs provide a structured interface for articulating 3D meshes, but they lack an inherent representation of the plausible manifold of joint configurations for a given asset. Without such a pose space, stochastic sampling or manual manipulation of raw rig parameters often leads to semantic or geometric violations, such as anatomical hyperextension and non-physical self-intersections. We propose Video-informed Pose Spaces (ViPS), a feed-forward framework that discovers the latent distribution of valid articulations for auto-rigged meshes by distilling motion priors from a pretrained video diffusion model. Unlike existing methods that rely on scarce artist-authored 4D datasets, ViPS transfers generative video priors into a universal distribution over a given rig parameterization. Differentiable geometric validators applied to the skinned mesh enforce asset-specific validity without requiring manual regularizers. Our model learns a smooth, compact, and controllable pose space that supports diverse sampling, manifold projection for inverse kinematics, and temporally coherent trajectories for keyframing. Furthermore, the distilled 3D pose samples serve as precise semantic proxies for guiding video diffusion, effectively closing the loop between generative 2D priors and structured 3D kinematic control. Our evaluations show that ViPS, trained solely on video priors, matches the performance of state-of-the-art methods trained on synthetic artist-created 4D data in both plausibility and diversity. Most importantly, as a universal model, ViPS demonstrates robust zero-shot generalization to out-of-distribution species and unseen skeletal topologies.
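
The abstract mentions manifold projection for inverse kinematics without giving details. The toy sketch below illustrates only the general idea of solving IK in the latent space of a pose decoder, so that every candidate solution stays inside the valid pose set. It is not the ViPS method: the two-link planar arm, the joint limits, the sigmoid `decode` function (a stand-in for a learned decoder), and the finite-difference optimizer are all assumptions made for illustration.

```python
import math

# Hypothetical joint limits in radians for a toy 2-link planar arm
# (shoulder, elbow). These values are illustrative, not from the paper.
LIMITS = [(-math.pi / 2, math.pi / 2), (0.0, 2.5)]
LINK_LENGTHS = [1.0, 1.0]


def decode(z):
    """Toy stand-in for a learned pose decoder (assumption).

    Maps an unconstrained latent vector z to joint angles via a sigmoid,
    so every decoded pose lies inside LIMITS by construction.
    """
    angles = []
    for zi, (lo, hi) in zip(z, LIMITS):
        s = 1.0 / (1.0 + math.exp(-zi))
        angles.append(lo + (hi - lo) * s)
    return angles


def end_effector(angles):
    """Forward kinematics of the planar chain: returns the tip position."""
    x = y = total = 0.0
    for angle, length in zip(angles, LINK_LENGTHS):
        total += angle  # joint angles accumulate along the chain
        x += length * math.cos(total)
        y += length * math.sin(total)
    return x, y


def ik_loss(z, target):
    """Squared distance between the decoded pose's tip and the target."""
    x, y = end_effector(decode(z))
    return (x - target[0]) ** 2 + (y - target[1]) ** 2


def project_ik(target, z0=(0.0, 0.0), steps=500, lr=0.5, eps=1e-6):
    """Gradient descent on the latent z using central finite differences.

    Because the search runs over z rather than raw joint angles, the
    returned pose respects LIMITS even when the target is unreachable.
    """
    z = list(z0)
    for _ in range(steps):
        grad = []
        for i in range(len(z)):
            z_plus = list(z)
            z_minus = list(z)
            z_plus[i] += eps
            z_minus[i] -= eps
            grad.append(
                (ik_loss(z_plus, target) - ik_loss(z_minus, target)) / (2 * eps)
            )
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return decode(z)


if __name__ == "__main__":
    pose = project_ik((1.0, 1.0))
    print("joint angles:", pose)
    print("end effector:", end_effector(pose))
```

Optimizing over the latent rather than the raw rig parameters is the design choice worth noting: validity is enforced by the decoder's range, so no separate joint-limit penalty is needed in the IK objective. A learned pose space would replace the hand-written sigmoid with a trained network and the finite differences with automatic differentiation.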