arxiv_cs_ai April 24, 2026

Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis

Translated: 2026/4/24 20:31:44

llm, mixture-of-experts, ffn, inference-optimization, model-restructuring

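Translated Abstract

arXiv:2502.04416v3 Announce Type: replace-cross Abstract: Scaling up large language models (LLMs) improves performance but substantially raises inference cost, with feed-forward networks (FFNs) consuming most of the compute. Mixture-of-Experts (MoE) architectures can cut this cost through sparse activation, yet restructuring an existing dense model into an MoE typically requires retraining on hundreds of billions of tokens. We propose an analytical post-training framework that can quickly restructure FFNs into a sparse MoE architecture using only a small calibration dataset. The method analyzes neuron activation patterns, sorts neurons into always-active shared experts and conditionally activated routed experts, and builds the router analytically from the statistics of representative neurons, so the model can be deployed immediately or, if desired, lightly fine-tuned. The approach applies not only to dense models but also recursively to existing MoE models, yielding hierarchical sparsity. Experiments show speedups of up to $1.17\times$ in compute-bound scenarios with only a few minutes of processing and fine-tuning on 2,000 samples, outperforming methods that require orders of magnitude more resources.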

Original Content

arXiv:2502.04416v3 Announce Type: replace-cross Abstract: Scaling large language models (LLMs) improves performance but significantly increases inference costs, with feed-forward networks (FFNs) consuming the majority of computational resources. While Mixture-of-Experts (MoE) architectures can reduce this cost through sparse activation, restructuring existing dense models into MoEs typically requires extensive retraining on hundreds of billions of tokens. We propose an analytical post-training framework that rapidly restructures FFNs into sparse MoE architectures using only a small calibration dataset. The method analyzes neuron activation patterns to partition neurons into always-active shared experts and conditionally activated routed experts, then constructs a router analytically from representative neuron statistics, enabling immediate deployment or optional lightweight fine-tuning. This approach applies both to dense models and recursively to existing MoE models for hierarchical sparsity. Experiments demonstrate up to $1.17\times$ speedup in compute-bound scenarios with only minutes of processing and 2k-sample fine-tuning, outperforming methods requiring orders of magnitude more resources.
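The abstract does not spell out the algorithm, so the following is only a toy NumPy sketch of the general idea (partition neurons by activation pattern, then derive a router from representative neurons), not the authors' method. Everything specific here is an assumption made for illustration: the ReLU FFN (production LLMs typically use gated activations), the rule "most frequently active neurons become the shared expert", the co-activation-based grouping, the reuse of representative neurons' input weights as router weights, and the hypothetical names `partition_neurons` and `moe_forward`.

```python
import numpy as np

def partition_neurons(H, n_shared, n_experts):
    """Split hidden neurons into one shared group and n_experts routed groups.

    H: (n_tokens, d_hidden) hidden activations on calibration data.
    All selection rules below are illustrative assumptions.
    """
    active = H > 0
    rate = active.mean(axis=0)                      # activation rate per neuron
    order = np.argsort(-rate, kind="stable")
    shared = np.sort(order[:n_shared])              # most frequently active neurons
    routed = np.sort(order[n_shared:])              # everything else is routed

    # Assumption: the highest-rate routed neurons act as expert representatives.
    reps = routed[np.argsort(-rate[routed], kind="stable")[:n_experts]]

    # Assign each routed neuron to the representative it co-activates with most.
    coact = active[:, routed].astype(float).T @ active[:, reps].astype(float)
    assign = coact.argmax(axis=1)
    # Each representative always belongs to its own expert.
    assign[np.searchsorted(routed, reps)] = np.arange(n_experts)

    experts = [routed[assign == e] for e in range(n_experts)]
    return shared, experts, reps

def moe_forward(x, W_in, W_out, shared, experts, reps, top_k):
    """Forward pass for one token x of shape (d_model,) with top-k routing."""
    # Router built directly from the representatives' input weights (no training).
    scores = x @ W_in[:, reps]
    chosen = np.argsort(-scores)[:top_k]
    idx = np.concatenate([shared] + [experts[e] for e in chosen])
    h = np.maximum(x @ W_in[:, idx], 0.0)           # only the selected neurons are computed
    return h @ W_out[idx, :]

# Toy dense FFN and calibration data (random, for illustration only).
rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64
W_in = rng.standard_normal((d_model, d_hidden))
W_out = rng.standard_normal((d_hidden, d_model))
X = rng.standard_normal((256, d_model))

H = np.maximum(X @ W_in, 0.0)
shared, experts, reps = partition_neurons(H, n_shared=8, n_experts=4)

y_dense = np.maximum(X[0] @ W_in, 0.0) @ W_out
y_sparse = moe_forward(X[0], W_in, W_out, shared, experts, reps, top_k=2)
y_full = moe_forward(X[0], W_in, W_out, shared, experts, reps, top_k=4)
```

Because the shared and routed groups together cover every hidden neuron exactly once, selecting all experts (`top_k=4` here) reproduces the dense FFN output up to floating-point error; smaller `top_k` values trade accuracy for fewer computed neurons.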