arxiv_cs_lg April 20, 2026

On Optimal Hyperparameters for Differentially Private Deep Transfer Learning

Translated: 2026/4/20 11:04:21

differentially-private-deep-learning, transfer-learning, hyperparameter-tuning, gradient-clipping, privacy-preserving-machine-learning

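Translation

arXiv:2510.20616v2 Announce Type: replace Abstract: Differentially private (DP) transfer learning, in which a pretrained model is fine-tuned on private data, is the current state-of-the-art approach for training large models under privacy constraints. This paper focuses on two key hyperparameters in this setting: the clipping bound $C$ and the batch size $B$. There is a clear mismatch between the current theoretical understanding (stronger privacy calls for a smaller $C$) and empirical results (a larger $C$ works better under strong privacy), and this mismatch is caused by changes in the gradient distributions. Under a limited compute budget (a fixed number of epochs), existing heuristics for tuning $B$ do not work, whereas cumulative DP noise better explains whether smaller or larger batches perform better. The paper also points out that the common practice of using a single $(C, B)$ setting across tasks can lead to suboptimal performance. Performance drops most noticeably when moving between loose and tight privacy and between plentiful and limited compute, which is explained by analyzing clipping as a form of gradient re-weighting and examining cumulative DP noise.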

Original Content

arXiv:2510.20616v2 Announce Type: replace Abstract: Differentially private (DP) transfer learning, i.e., fine-tuning a pretrained model on private data, is the current state-of-the-art approach for training large models under privacy constraints. We focus on two key hyperparameters in this setting: the clipping bound $C$ and batch size $B$. We show a clear mismatch between the current theoretical understanding of how to choose an optimal $C$ (stronger privacy requires smaller $C$) and empirical outcomes (larger $C$ performs better under strong privacy), caused by changes in the gradient distributions. Assuming a limited compute budget (fixed epochs), we demonstrate that the existing heuristics for tuning $B$ do not work, while cumulative DP noise better explains whether smaller or larger batches perform better. We also highlight how the common practice of using a single $(C,B)$ setting across tasks can lead to suboptimal performance. We find that performance drops especially when moving between loose and tight privacy and between plentiful and limited compute, which we explain by analyzing clipping as a form of gradient re-weighting and examining cumulative DP noise.
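
To make the roles of $C$ and $B$ concrete, the following is a minimal sketch of one gradient computation in the standard DP-SGD recipe (per-example norm clipping followed by Gaussian noise). It is not the paper's code: the function names `clip_weights` and `dp_sgd_gradient` and the `noise_multiplier` parameter (often written $\sigma$) are assumptions introduced here for illustration. The sketch shows why clipping can be read as gradient re-weighting: each per-example gradient $g_i$ is multiplied by $\min(1, C / \lVert g_i \rVert)$.

```python
import math
import random


def clip_weights(per_example_grads, clip_bound):
    """Return the scaling factor min(1, C / ||g_i||) for each example."""
    weights = []
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        # Gradients already inside the clipping ball are left unchanged.
        weights.append(1.0 if norm <= clip_bound else clip_bound / norm)
    return weights


def dp_sgd_gradient(per_example_grads, clip_bound, noise_multiplier, rng):
    """Noisy average gradient for one batch (standard DP-SGD sketch)."""
    batch_size = len(per_example_grads)
    dim = len(per_example_grads[0])
    weights = clip_weights(per_example_grads, clip_bound)
    noisy_sum = []
    for j in range(dim):
        # Sum of re-weighted (clipped) gradients in coordinate j.
        clipped_sum = sum(w * g[j] for w, g in zip(weights, per_example_grads))
        # Gaussian noise with standard deviation sigma * C, added to the sum.
        noise = rng.gauss(0.0, noise_multiplier * clip_bound)
        noisy_sum.append(clipped_sum + noise)
    return [s / batch_size for s in noisy_sum]


if __name__ == "__main__":
    # Toy per-example gradients with norms 5, 0.5, and 0 (illustrative values).
    grads = [[3.0, 4.0], [0.3, 0.4], [0.0, 0.0]]
    print(clip_weights(grads, clip_bound=1.0))
    print(dp_sgd_gradient(grads, 1.0, noise_multiplier=1.0, rng=random.Random(0)))
```

In this formulation the noise on the averaged gradient has standard deviation $\sigma C / B$ per step, and with a fixed number of epochs a larger $B$ means fewer steps. This is one way to see why both $C$ and $B$ interact with the injected noise, although the abstract does not spell out how the paper defines cumulative DP noise.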