arxiv_cs_cv 2026/4/20

MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation

Translated: 2026/4/20 10:52:12

mmaudiosep, video-separation, text-to-sound, generative-model, arxiv-2510

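Japanese Translation (rendered in English)

arXiv:2510.09065v2 Announce Type: replace-cross Abstract: MMAudioSep is a generative model for video- and text-queried sound separation, built on a pretrained video-to-audio generation model. By drawing on that model's knowledge of how video and text relate to audio, it can be trained efficiently, with no need to train from scratch. In the evaluation, MMAudioSep was compared with existing sound separation models, both deterministic and generative, and outperformed these baselines. In addition, even after fine-tuning for sound separation, the model retains its original video-to-audio generation ability. This highlights the potential of adopting foundational sound generation models for sound-related downstream tasks. The code is available at https://github.com/sony/mmaudiosep.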

Original Content

arXiv:2510.09065v2 Announce Type: replace-cross Abstract: We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine-tuning, the model retains the ability for original video-to-audio generation. This highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks. Our code is available at https://github.com/sony/mmaudiosep.