arxiv_cs_cv February 10, 2026

IM-Animation: Identity-Decoupled Character Animation Using an Implicit Motion Representation

IM-Animation: An Implicit Motion Representation for Identity-decoupled Character Animation

Translated: 2026/3/15 18:03:31

im-animation, character-animation, video-diffusion, motion-synthesis, deep-learning

Japanese Translation

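arXiv:2602.07498v1 Announce Type: new Abstract: Recent advances in video diffusion models have significantly advanced character animation, which generates moving videos by animating a still image according to a driving video. Explicit approaches represent motion with skeletons, DWPose, or other explicit structured signals, but have difficulty handling spatial misalignment and varying body scales. Implicit approaches, by contrast, capture high-level implicit motion semantics directly from the driving video, but suffer from identity leakage and entanglement of motion and appearance. To address these challenges, we propose a new implicit motion representation that compresses per-frame motion information into compact 1D motion tokens. This design relaxes the strict spatial constraints inherent in 2D representations and effectively prevents identity information from leaking from the motion video. In addition, we design a temporally consistent mask-token-based retargeting module that enforces a temporal training bottleneck, reducing interference from the motion of the source image and improving retargeting consistency. Our method adopts a three-stage training strategy to improve training efficiency and ensure high fidelity. Extensive experiments demonstrate that our implicit motion representation and the generative capability of the proposed IM-Animation achieve performance superior or comparable to state-of-the-art methods.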

Original Content

arXiv:2602.07498v1 Announce Type: new Abstract: Recent progress in video diffusion models has markedly advanced character animation, which synthesizes motion videos by animating a static identity image according to a driving video. Explicit methods represent motion using skeleton, DWPose or other explicit structured signals, but struggle to handle spatial mismatches and varying body scales. Implicit methods, on the other hand, capture high-level implicit motion semantics directly from the driving video, but suffer from identity leakage and entanglement between motion and appearance. To address the above challenges, we propose a novel implicit motion representation that compresses per-frame motion into compact 1D motion tokens. This design relaxes strict spatial constraints inherent in 2D representations and effectively prevents identity information leakage from the motion video. Furthermore, we design a temporally consistent mask token-based retargeting module that enforces a temporal training bottleneck, mitigating interference from the source images' motion and improving retargeting consistency. Our methodology employs a three-stage training strategy to enhance the training efficiency and ensure high fidelity. Extensive experiments demonstrate that our implicit motion representation and the proposed IM-Animation's generative capabilities achieve superior or competitive performance compared with state-of-the-art methods.
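
The abstract does not say how per-frame motion is compressed into 1D tokens. As a rough illustration only, the sketch below uses one common pattern, attention pooling with a fixed set of query vectors, to turn a flattened 2D grid of patch features into a fixed number of tokens. The function names (`compress_frame`, `random_vectors`), the random "queries", and all sizes are hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_frame(patches, queries):
    """Pool N patch features into K tokens via dot-product attention.

    patches: list of N vectors (length D) from a flattened H x W feature grid.
    queries: list of K vectors (length D); stand-ins for learned parameters
             (an assumption, not the paper's design).
    Returns K tokens of length D, regardless of N.
    """
    dim = len(queries[0])
    tokens = []
    for q in queries:
        # Scaled dot-product score between this query and every patch.
        scores = [sum(qi * pi for qi, pi in zip(q, p)) / math.sqrt(dim)
                  for p in patches]
        weights = softmax(scores)
        # Each token is a weighted average of patch features; no grid
        # coordinates are used, so the 2D layout is not preserved.
        token = [sum(w * p[d] for w, p in zip(weights, patches))
                 for d in range(dim)]
        tokens.append(token)
    return tokens


def random_vectors(rng, count, dim):
    """Hypothetical stand-in for real features or learned parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
K, D = 4, 8
queries = random_vectors(rng, K, D)
small_frame = random_vectors(rng, 6 * 6, D)    # features from a 6 x 6 grid
large_frame = random_vectors(rng, 12 * 9, D)   # features from a 12 x 9 grid

tokens_small = compress_frame(small_frame, queries)
tokens_large = compress_frame(large_frame, queries)
print(len(tokens_small), len(tokens_large))  # prints "4 4"
```

In this toy version the token count depends only on the number of queries, not on the grid size, and the pooling ignores where each patch sat in the grid. That loosely mirrors the abstract's point that a 1D representation relaxes the spatial constraints of 2D representations; whether the actual model works this way is not stated in the source.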