arxiv_cs_lg April 24, 2026

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

Translated: 2026/4/24 19:54:28

mixture-of-experts, llm-scaling, compute-efficiency, expert-upcycling, continual-pretraining

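Translation (rendered from the Japanese)

arXiv:2604.19835v1 Announce Type: new
Abstract: Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models. State-of-the-art models separate total parameter count from per-token computation through sparse expert routing. According to scaling laws, under a fixed active compute budget, model quality scales predictably with total parameter count, and MoE achieves this by increasing the number of experts. Training large MoE models is expensive, however, because both memory requirements and inter-device communication scale with total parameter count. We propose "Expert Upcycling", a method that expands MoE capacity in stages by increasing the number of experts during continued pre-training (CPT). Starting from a trained E-expert model, the upcycling operator builds an mE-expert model by duplicating experts and extending the router. Top-K routing stays fixed throughout, so the per-token inference cost is preserved. Duplication provides a warm initialization: the expanded model inherits the representations learned by the source checkpoint and starts from a far lower loss than random initialization would give. The subsequent CPT breaks the symmetry among the duplicated experts and drives them to specialize. We formalize the upcycling operator and develop a theoretical framework that decomposes the quality gap into a capacity term and an initialization term. We also introduce utility-based expert selection, which uses gradient-based importance scores to guide non-uniform duplication; when CPT is limited, this more than triples gap closure. In experiments at 7B to 13B total parameters, the upcycled model matches the validation loss of the fixed-size baseline while saving 32% of GPU hours. Comprehensive ablations over model scale, activation ratio, MoE architecture, and training budget yield a practical recipe for deploying Expert Upcycling, establishing it as a principled, compute-efficient alternative to training large MoE models from scratch.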

Original Content

arXiv:2604.19835v1 Announce Type: new
Abstract: Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models: frontier models routinely decouple total parameters from per-token computation through sparse expert routing. Scaling laws show that under fixed active computation, model quality scales predictably with total parameters, and MoEs realize this by increasing expert count. However, training large MoEs is expensive, as memory requirements and inter-device communication both scale with total parameter count. We propose expert upcycling, a method for progressively expanding MoE capacity by increasing the number of experts during continued pre-training (CPT). Given a trained E-expert model, the upcycling operator constructs an mE-expert model through expert duplication and router extension while holding top-K routing fixed, preserving per-token inference cost. Duplication provides a warm initialization: the expanded model inherits the source checkpoint's learned representations, starting from a substantially lower loss than random initialization. Subsequent CPT then breaks the symmetry among duplicated experts to drive specialization. We formalize the upcycling operator and develop a theoretical framework decomposing the quality gap into a capacity term and an initialization term. We further introduce utility-based expert selection, which uses gradient-based importance scores to guide non-uniform duplication, more than tripling gap closure when CPT is limited. In our 7B-13B total parameter experiments, the upcycled model matches the fixed-size baseline on validation loss while saving 32% of GPU hours. Comprehensive ablations across model scales, activation ratios, MoE architectures, and training budgets yield a practical recipe for deploying expert upcycling, establishing it as a principled, compute-efficient alternative to training large MoE models from scratch.
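
The abstract describes the upcycling operator only in words (duplicate experts, extend the router, keep top-K fixed), so the following is a minimal NumPy sketch of one way that could look, not the authors' implementation. Everything named here is an assumption made for illustration: each expert is reduced to a single toy weight matrix, the router is one weight row per expert, and `upcycle`, `allocate_copies`, and `moe_forward` are hypothetical helpers. In particular, `allocate_copies` uses a made-up rule (extra copies go to the highest-scoring experts) as a stand-in for utility-based selection, and the scores are invented numbers rather than real gradient-based importance scores.

```python
import numpy as np


def upcycle(experts, router, copies):
    """Duplicate experts and extend the router to match.

    experts: (E, d, d) array, one toy weight matrix per expert.
    router:  (E, d) array, one row of routing weights per expert.
    copies:  length-E integers >= 1; copies[i] is how many times
             expert i appears in the expanded model.
    """
    copies = np.asarray(copies, dtype=int)
    if copies.shape != (experts.shape[0],) or np.any(copies < 1):
        raise ValueError("copies needs one integer >= 1 per source expert")
    # Source index of every expanded expert, e.g. copies=[2, 1] -> [0, 0, 1].
    src = np.repeat(np.arange(experts.shape[0]), copies)
    # Fancy indexing returns new arrays, so duplicates can diverge in training.
    return experts[src], router[src], src


def allocate_copies(scores, target_experts):
    """Hypothetical allocation rule (not from the paper): keep every expert
    once, then hand out the remaining slots in descending score order."""
    scores = np.asarray(scores, dtype=float)
    num_experts = scores.size
    if target_experts < num_experts:
        raise ValueError("target_experts must be at least the source count")
    copies = np.ones(num_experts, dtype=int)
    order = np.argsort(-scores, kind="stable")
    for j in range(target_experts - num_experts):
        copies[order[j % num_experts]] += 1
    return copies


def moe_forward(x, experts, router, k):
    """Toy top-k MoE layer: only k experts run, whatever the expert count."""
    logits = router @ x
    top = np.argsort(-logits, kind="stable")[:k]
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()  # softmax over the selected experts only
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))


rng = np.random.default_rng(0)
E, d, k = 4, 8, 1
experts = rng.normal(size=(E, d, d))
router = rng.normal(size=(E, d))
x = rng.normal(size=d)

# Uniform duplication with m = 2: every expert appears twice, k is unchanged.
big_experts, big_router, src = upcycle(experts, router, copies=[2] * E)

# Non-uniform duplication: grow from 4 to 6 experts using made-up scores.
scores = [0.1, 0.7, 0.2, 0.4]
copies = allocate_copies(scores, target_experts=6)
mixed_experts, mixed_router, mixed_src = upcycle(experts, router, copies)
```

In this toy setup with `k = 1`, the expanded layer returns the same output as the source layer, because the selected expert is an exact copy of the original winner. With larger `k`, duplicated experts tie on their router logits and can displace other experts from the top-k set, so the outputs can differ; that is a property of this sketch and says nothing about how the paper handles ties or initialization.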