arxiv_cs_ai, February 10, 2026

MENASpeechBank: A Reference Voice Bank with Persona-Conditioned Multi-Turn Conversations for AudioLLMs

Translated: 2026/3/7 11:31:05

audio-language-models, reference-speech-bank, persona-oriented-interactions, synthetic-multimodal-data-generation

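Translation

Audio large language models (AudioLLMs) can follow instructions over speech and general audio, but progress is increasingly limited by the lack of diverse, conversational, instruction-aligned speech-text data. The bottleneck is most acute for persona-grounded interactions and dialectal coverage, where collecting and releasing real multi-speaker recordings is costly and slow. We introduce MENASpeechBank, a reference speech bank of about 18K high-quality utterances from 124 speakers across multiple MENA countries, covering English, Modern Standard Arabic (MSA), and regional Arabic varieties. Building on this resource, we develop a controllable synthetic data pipeline that:

(i) constructs persona profiles enriched with attributes inspired by the World Values Survey;
(ii) defines a taxonomy of about 5K conversational scenarios;
(iii) matches personas to scenarios via semantic similarity;
(iv) generates about 417K role-play conversations with an LLM, in which the user speaks as the persona and the assistant behaves as a helpful agent;
(v) synthesizes the user turns by conditioning on reference speaker audio, preserving speaker identity and diversity.

We evaluate both synthetic and human-recorded conversations and provide detailed analysis. We will release MENASpeechBank and the generated conversations publicly for the community.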

Original Content

arXiv:2602.07036v1 Announce Type: cross

Abstract: Audio large language models (AudioLLMs) enable instruction-following over speech and general audio, but progress is increasingly limited by the lack of diverse, conversational, instruction-aligned speech-text data. This bottleneck is especially acute for persona-grounded interactions and dialectal coverage, where collecting and releasing real multi-speaker recordings is costly and slow. We introduce MENASpeechBank, a reference speech bank comprising about 18K high-quality utterances from 124 speakers spanning multiple MENA countries, covering English, Modern Standard Arabic (MSA), and regional Arabic varieties. Building on this resource, we develop a controllable synthetic data pipeline that: (i) constructs persona profiles enriched with World Values Survey-inspired attributes, (ii) defines a taxonomy of about 5K conversational scenarios, (iii) matches personas to scenarios via semantic similarity, (iv) generates about 417K role-play conversations with an LLM where the user speaks as the persona and the assistant behaves as a helpful agent, and (v) synthesizes the user turns by conditioning on reference speaker audio to preserve speaker identity and diversity. We evaluate both synthetic and human-recorded conversations and provide detailed analysis. We will release MENASpeechBank and the generated conversations publicly for the community.
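
Illustrative note: step (iii) of the pipeline, matching personas to scenarios via semantic similarity, can be sketched in a few lines. The abstract does not state which embedding model or similarity measure the authors use, so the sketch below substitutes a toy bag-of-words cosine similarity as a stand-in for a real sentence embedding. The function names (`bow_vector`, `cosine`, `match_scenarios`) and the persona and scenario strings are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: lowercase word counts.
    (Assumption: the real pipeline would likely use a learned embedding.)"""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_scenarios(persona: str, scenarios: list[str], top_k: int = 1) -> list[tuple[str, float]]:
    """Rank scenarios by similarity to the persona text and keep the top_k."""
    p = bow_vector(persona)
    scored = [(s, cosine(p, bow_vector(s))) for s in scenarios]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical persona and scenarios, for illustration only.
persona = "university student in Cairo who enjoys cooking and football"
scenarios = [
    "asking for a recipe and cooking tips for a family dinner",
    "planning a budget for a home renovation",
    "discussing football match results with a friend",
]

best = match_scenarios(persona, scenarios, top_k=2)
for scenario, score in best:
    print(f"{score:.3f}  {scenario}")
```

With a real embedding model in place of `bow_vector`, the same ranking loop would capture meaning beyond exact word overlap, which is presumably what "semantic similarity" refers to in the abstract.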