arxiv_cs_lg April 20, 2026

Olmo Hybrid: From Theory to Practice and Back

Translated: 2026/4/20 11:05:21

olmo, hybrid-models, rnn, transformer, language-modeling

Original Content

arXiv:2604.03444v3
Announce Type: replace

Abstract: Recent work has demonstrated the potential of non-transformer language models, especially linear recurrent neural networks (RNNs) and hybrid models that mix recurrence and attention. Yet there is no consensus on whether the potential benefits of these new architectures justify the risk and effort of scaling them up. To address this, we provide evidence for the advantages of hybrid models over pure transformers on several fronts. First, theoretically, we show that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. Putting this theory into practice, we train Olmo Hybrid, a 7B-parameter model largely comparable to Olmo 3 7B but with the sliding window layers replaced by Gated DeltaNet layers. We show that Olmo Hybrid outperforms Olmo 3 across standard pretraining and mid-training evaluations, demonstrating the benefit of hybrid models in a controlled, large-scale setting. We find that the hybrid model scales significantly more efficiently than the transformer, explaining its higher performance. However, it is unclear why greater expressivity on specific formal problems should result in better scaling or superior performance on downstream tasks unrelated to those problems. To explain this apparent gap, we return to theory and argue why increased expressivity should translate to better scaling efficiency, completing the loop. Overall, our results suggest that hybrid models mixing attention and recurrent layers are a powerful extension to the language modeling paradigm: not merely to reduce memory during inference, but as a fundamental way to obtain more expressive models that scale better during pretraining.
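
The abstract does not describe the internals of the recurrent layers. As a rough illustration of the kind of layer it refers to, the sketch below implements a single-head gated delta-rule state update, the recurrence usually associated with Gated DeltaNet in the literature: S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T, with readout o_t = S_t q_t. This form is an assumption brought in from outside the abstract, not Olmo Hybrid's actual implementation. The function names, the scalar gates, and the toy inputs are all hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def gated_delta_step(S: Matrix, q: Vector, k: Vector, v: Vector,
                     alpha: float, beta: float) -> Tuple[Matrix, Vector]:
    """One step of a gated delta-rule memory update (illustrative only).

    S has shape (d_v, d_k) and acts as a key -> value memory.
    alpha in [0, 1] decays the old state; beta in [0, 1] is the write strength.
    Update: S_new = alpha * S @ (I - beta * k k^T) + beta * v k^T
    """
    d_v, d_k = len(S), len(k)
    # Value currently stored under key k, i.e. S @ k.
    old_v = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # S @ (I - beta k k^T) equals S - beta * (S k) k^T, computed elementwise.
    S_new = [
        [alpha * (S[i][j] - beta * old_v[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read out with the query: o = S_new @ q.
    o = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, o


def run_sequence(qs, ks, vs, alphas, betas, d_v: int, d_k: int):
    """Scan a sequence; the state keeps shape (d_v, d_k) at every step."""
    S = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v, a, b in zip(qs, ks, vs, alphas, betas):
        S, o = gated_delta_step(S, q, k, v, a, b)
        outputs.append(o)
    return S, outputs


if __name__ == "__main__":
    key = [1.0, 0.0]
    # Write [2, 3] under `key`, then overwrite it with [5, 7].
    S, outs = run_sequence(
        qs=[key, key], ks=[key, key],
        vs=[[2.0, 3.0], [5.0, 7.0]],
        alphas=[1.0, 1.0], betas=[1.0, 1.0],
        d_v=2, d_k=2,
    )
    print(outs[0])  # [2.0, 3.0]
    print(outs[1])  # [5.0, 7.0]
```

In this toy run the second write replaces the value stored under the same key instead of adding to it, which is the behaviour that separates a delta rule from plain additive linear attention. The state stays at a fixed size of d_v × d_k however long the sequence is, whereas an attention layer's key-value cache grows with sequence length. That fixed size is consistent with the abstract's remark about reduced memory during inference.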