arxiv_cs_lg April 24, 2026

Temporally Extended Mixture-of-Experts Models

Translated: 2026/4/24 19:56:54

mixture-of-experts, reinforcement-learning, gpt-oss, llm-optimization, gpu-memory

Original Content

arXiv:2604.20156v1 Announce Type: new

Abstract: Mixture-of-Experts models, now popular for scaling capacity at fixed inference speed, switch experts at nearly every token. Once a model outgrows available GPU memory, this churn can render optimizations like offloading and pre-fetching ineffective. We make the case that the options framework in reinforcement learning is a perfect match to tackle this problem, and argue for temporally extended mixture-of-experts layers. Building on the option-critic framework with deliberation costs, we add a controller to each layer that learns when to switch expert sets and which to load. By applying this to gpt-oss-20b with low-rank adapters and a self-distillation reward, our method reduces switch rates from over 50% to below 5% while retaining up to 90% of base-model accuracy on MATH, MMLU, and MMMLU. This shows that even existing pre-trained models can be converted to temporally extended MoEs with lightweight training, with the deliberation cost allowing model trainers to trade off switching rates against capability. We hope this opens a principled path, grounded in the options framework, for memory-efficient serving and continual learning in ever-growing MoE models.
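
To make the notion of a "switch rate" concrete, the following is a minimal toy sketch, not the paper's implementation. It compares ordinary per-token top-k routing with a routing loop that holds the active expert set until a termination rule fires. In the abstract the termination decision is learned by a per-layer controller trained with option-critic; here it is replaced by a hand-written margin threshold, which is purely an assumption for illustration. All names (`top_k`, `switch_rate`, `extended_routing`, `margin_termination`, `shaped_reward`) and the router scores are hypothetical.

```python
from typing import Callable, FrozenSet, List, Optional, Sequence

ExpertSet = FrozenSet[int]
Terminate = Callable[[Sequence[float], ExpertSet], bool]


def top_k(scores: Sequence[float], k: int) -> ExpertSet:
    """Indices of the k highest router scores (ties go to the lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return frozenset(order[:k])


def switch_rate(sets: Sequence[ExpertSet]) -> float:
    """Fraction of token-to-token transitions where the active set changes."""
    if len(sets) < 2:
        return 0.0
    changes = sum(a != b for a, b in zip(sets, sets[1:]))
    return changes / (len(sets) - 1)


def per_token_routing(router_scores: Sequence[Sequence[float]], k: int) -> List[ExpertSet]:
    """Baseline: re-select the top-k experts independently at every token."""
    return [top_k(scores, k) for scores in router_scores]


def extended_routing(
    router_scores: Sequence[Sequence[float]], k: int, terminate: Terminate
) -> List[ExpertSet]:
    """Hold the current expert set (the 'option') until `terminate` fires."""
    current: Optional[ExpertSet] = None
    active: List[ExpertSet] = []
    for scores in router_scores:
        if current is None or terminate(scores, current):
            current = top_k(scores, k)  # load a new expert set
        active.append(current)
    return active


def margin_termination(threshold: float) -> Terminate:
    """Assumed stand-in for a learned controller: switch only when the best
    expert outside the active set beats the worst one inside by > threshold."""
    def terminate(scores: Sequence[float], current: ExpertSet) -> bool:
        worst_in = min(scores[i] for i in current)
        outside = [s for i, s in enumerate(scores) if i not in current]
        best_out = max(outside) if outside else float("-inf")
        return best_out - worst_in > threshold
    return terminate


def shaped_reward(task_reward: float, switched: bool, deliberation_cost: float) -> float:
    """Deliberation cost as a per-switch penalty subtracted from the reward."""
    return task_reward - deliberation_cost * (1.0 if switched else 0.0)


# Made-up router scores for 6 tokens over 4 experts, with k = 2.
scores = [
    [0.9, 0.8, 0.1, 0.0],
    [0.8, 0.1, 0.9, 0.0],
    [0.1, 0.8, 0.9, 0.0],
    [0.9, 0.8, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.9],
    [0.0, 0.1, 0.9, 0.8],
]
baseline = per_token_routing(scores, k=2)
extended = extended_routing(scores, k=2, terminate=margin_termination(0.85))
print(switch_rate(baseline), switch_rate(extended))  # prints: 0.8 0.2
```

Raising the threshold (or, in the shaped reward, the deliberation cost) makes switching rarer at the price of running tokens through a less well-matched expert set, which mirrors the trade-off between switch rate and capability described in the abstract.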