arxiv_cs_ai February 10, 2026

Fed-PISA: Federated Voice Cloning with Personalized Style Adaptation

Fed-PISA: Federated Voice Cloning via Personalized Identity-Style Adaptation

Translated: 2026/2/14 7:18:20

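Translation (from Japanese)

Voice cloning for text-to-speech (TTS) aims to generate expressive, personalized speech from text even when only limited data from the target speaker is available. Federated learning (FL) provides a collaborative, privacy-preserving framework for this task. Existing approaches, however, incur high communication costs and tend to suppress stylistic heterogeneity, which leaves personalization insufficient. Fed-PISA, short for Federated Personalized Identity-Style Adaptation, is proposed to address both problems. To reduce communication costs, it uses a disentangled Low-Rank Adaptation (LoRA) mechanism: each speaker's timbre stays on the client in a private ID-LoRA, and only a lightweight style-LoRA is sent to the server, which keeps parameter exchange to a minimum. To exploit heterogeneity, the aggregation method draws on collaborative filtering and builds a customized model for each client by learning from stylistically similar peers. Experiments show that Fed-PISA improves style expressivity, naturalness, and speaker similarity, and outperforms standard federated baselines while keeping communication costs minimal.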

Original Content

arXiv:2509.16010v2 Announce Type: replace-cross Abstract: Voice cloning for Text-to-Speech (TTS) aims to generate expressive and personalized speech from text using limited data from a target speaker. Federated Learning (FL) offers a collaborative and privacy-preserving framework for this task, but existing approaches suffer from high communication costs and tend to suppress stylistic heterogeneity, resulting in insufficient personalization. To address these issues, we propose Fed-PISA, which stands for Federated Personalized Identity-Style Adaptation. To minimize communication costs, Fed-PISA introduces a disentangled Low-Rank Adaptation (LoRA) mechanism: the speaker's timbre is retained locally through a private ID-LoRA, while only a lightweight style-LoRA is transmitted to the server, thereby minimizing parameter exchange. To harness heterogeneity, our aggregation method, inspired by collaborative filtering, is introduced to create custom models for each client by learning from stylistically similar peers. Experiments show that Fed-PISA improves style expressivity, naturalness, and speaker similarity, outperforming standard federated baselines with minimal communication costs.
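
The abstract describes two mechanisms: a private ID-LoRA that never leaves the client alongside an uploaded style-LoRA, and a per-client aggregation that learns from stylistically similar peers. The following is a minimal Python sketch of how such a scheme could look, not the authors' implementation. The abstract does not specify the similarity measure or the weighting rule, so the cosine similarity, the softmax weighting, the `temperature` parameter, and the names `Client`, `upload`, and `personalized_aggregate` are all assumptions made for illustration. LoRA parameters are represented as flat lists of floats.

```python
import math
from dataclasses import dataclass


@dataclass
class Client:
    """Hypothetical client state holding two flattened LoRA parameter vectors."""
    id_lora: list      # timbre adapter, kept on the device (assumed flat vector)
    style_lora: list   # style adapter, the only part that is uploaded

    def upload(self):
        # Only the style adapter leaves the client; id_lora is never sent.
        return list(self.style_lora)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def personalized_aggregate(uploads, temperature=0.1):
    """Build one aggregated style vector per client.

    uploads: dict mapping client id -> uploaded style-LoRA vector.
    Assumption: each client's aggregate is a softmax-weighted average of all
    uploads, weighted by cosine similarity to that client's own upload.
    """
    ids = list(uploads)
    personalized = {}
    for target in ids:
        sims = [cosine(uploads[target], uploads[peer]) for peer in ids]
        top = max(sims)  # subtract the max for numerical stability
        exps = [math.exp((s - top) / temperature) for s in sims]
        total = sum(exps)
        weights = [e / total for e in exps]
        aggregate = [0.0] * len(uploads[target])
        for weight, peer in zip(weights, ids):
            for k, value in enumerate(uploads[peer]):
                aggregate[k] += weight * value
        personalized[target] = aggregate
    return personalized


# Toy example: clients "a" and "b" have similar styles, "c" differs.
clients = {
    "a": Client(id_lora=[0.3, 0.7], style_lora=[1.0, 0.0]),
    "b": Client(id_lora=[0.1, 0.2], style_lora=[0.9, 0.1]),
    "c": Client(id_lora=[0.5, 0.5], style_lora=[-1.0, 0.0]),
}
uploads = {cid: client.upload() for cid, client in clients.items()}
personalized = personalized_aggregate(uploads)
```

In this toy setup the aggregate for client "a" is dominated by "a" and "b", while "c" receives a model that stays close to its own style. A lower `temperature` concentrates the weights on the most similar peers, and a higher one moves the result toward a plain average of all uploads.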