arxiv_cs_cv April 24, 2026

Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation

Translated: 2026/4/24 19:53:30

video-generation, music-creation, diffusion-models, autoregressive-planning, multimodal-learning

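Translation (from Japanese)

arXiv:2604.17656v2 Announce Type: replace-cross Abstract: Video-to-music (V2M) generation is the fundamental task of creating background music for an input video. Recent V2M models typically achieve audiovisual alignment by relying on visual conditioning alone, and they offer end users limited semantic and stylistic controllability. In this paper, we propose Video-Robin, a text-conditioned model that enables fast, high-quality, semantically aligned music generation for video content. To balance musical fidelity and semantic understanding, Video-Robin integrates autoregressive planning with diffusion-based synthesis. Specifically, an autoregressive module models global structure by semantically aligning visual and textual inputs, producing high-level music latents. These latents are then refined into coherent, high-fidelity music by local Diffusion Transformers. By factoring semantically driven planning into diffusion-based synthesis, Video-Robin lets creators exercise fine-grained control without sacrificing audio realism. The proposed model runs inference 2.21x faster than SOTA and outperforms both video-only baselines and baselines conditioned on additional features on in-distribution and out-of-distribution benchmarks. We will open-source all code upon paper acceptance.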

Original Content

arXiv:2604.17656v2 Announce Type: replace-cross Abstract: Video-to-music (V2M) is the fundamental task of creating background music for an input video. Recent V2M models achieve audiovisual alignment by typically relying on visual conditioning alone and provide limited semantic and stylistic controllability to the end user. In this paper, we present Video-Robin, a novel text-conditioned video-to-music generation model that enables fast, high-quality, semantically aligned music generation for video content. To balance musical fidelity and semantic understanding, Video-Robin integrates autoregressive planning with diffusion-based synthesis. Specifically, an autoregressive module models global structure by semantically aligning visual and textual inputs to produce high-level music latents. These latents are subsequently refined into coherent, high-fidelity music using local Diffusion Transformers. By factoring semantically driven planning into diffusion-based synthesis, Video-Robin enables fine-grained creator control without sacrificing audio realism. Our proposed model outperforms baselines that solely accept video input and additional feature conditioned baselines on both in-distribution and out-of-distribution benchmarks with a 2.21x speed in inference compared to SOTA. We will open-source everything upon paper acceptance.
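
The two-stage data flow described in the abstract (sequential planning of high-level latents, then local refinement of each latent) can be pictured with a toy sketch. This is not the authors' implementation: the function names `plan_latents`, `refine_chunk`, and `generate`, the averaging rule used for conditioning, and the fixed-step "denoising" loop are all hypothetical stand-ins chosen for illustration, and no real diffusion model or Transformer is involved.

```python
import random


def plan_latents(video_feats, text_feats):
    """Toy autoregressive planner (hypothetical).

    Each high-level latent depends on the conditioning for the current
    video segment (video and text features averaged) and on the latent
    emitted for the previous segment.
    """
    prev = [0.0] * len(text_feats)
    latents = []
    for v in video_feats:
        cond = [(a + b) / 2 for a, b in zip(v, text_feats)]
        prev = [0.5 * p + 0.5 * c for p, c in zip(prev, cond)]
        latents.append(prev)
    return latents


def refine_chunk(latent, steps=20, seed=0):
    """Toy local refinement (a stand-in, not real diffusion).

    Starts from Gaussian noise and moves halfway toward the planned
    latent at every step, mimicking iterative denoising of one chunk.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in latent]
    for _ in range(steps):
        x = [xi + 0.5 * (li - xi) for xi, li in zip(x, latent)]
    return x


def generate(video_feats, text_feats):
    """Plan globally first, then refine every chunk independently."""
    plan = plan_latents(video_feats, text_feats)
    return [refine_chunk(z, seed=i) for i, z in enumerate(plan)]


if __name__ == "__main__":
    video = [[1.0, 1.0], [1.0, 1.0]]  # two segments, 2-d features (made up)
    text = [1.0, 1.0]                 # one text embedding (made up)
    print(generate(video, text))
```

The point of the sketch is only the separation of concerns: the planner is the sole place where video and text conditioning interact, while refinement works chunk by chunk on the planned latents.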