arxiv_cs_ai 2026/2/10

SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis

Translated: 2026/3/7 13:13:16

singing-voice-synthesis, zero-shot-learning, open-source-system, multilingual-support

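Translation (rendered in English from the Japanese)

Speech synthesis has advanced rapidly in recent years, but open-source singing voice synthesis (SVS) systems still face significant barriers to practical deployment. This report introduces SoulX-Singer, a high-quality open-source SVS system designed with real-world deployment in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, giving flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese and achieves state-of-the-art synthesis quality under diverse musical conditions. In addition, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, the authors construct SoulX-Singer-Eval, a dedicated benchmark with strict separation of training and test data, which makes systematic assessment in zero-shot settings possible.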

Original Content

arXiv:2602.07803v1 Announce Type: cross Abstract: While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in terms of robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed with practical deployment considerations in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese and consistently achieves state-of-the-art synthesis quality across languages under diverse musical conditions. Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict training-test disentanglement, facilitating systematic assessment in zero-shot settings.