arxiv_cs_lg April 20, 2026

ConFu: Contemplate the Future for Better Speculative Sampling

Translated: 2026/4/20 11:08:22

speculative-decoding, large-language-models, draft-models, inference-acceleration, llm-optimization

Original Content

arXiv:2603.08899v2 Announce Type: replace-cross Abstract: Speculative decoding has emerged as a powerful approach to accelerate large language model (LLM) inference by employing lightweight draft models to propose candidate tokens that are subsequently verified by the target model. The effectiveness of this paradigm critically depends on the quality of the draft model. While recent advances such as the EAGLE series achieve state-of-the-art speedup, existing draft models remain limited by error accumulation: they condition only on the current prefix, causing their predictions to drift from the target model over steps. In this work, we propose ConFu (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) contemplate tokens and soft prompts that allow the draft model to leverage future-oriented signals from the target model at negligible cost, (ii) a dynamic contemplate token mechanism with MoE to enable context-aware future prediction, and (iii) a training framework with anchor token sampling and future prediction replication that learns robust future prediction. ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8-11% on Llama-3 3B/8B and by approximately 20% on Qwen-3 4B across downstream tasks. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.
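
The draft-then-verify loop that the abstract builds on can be illustrated with a minimal sketch. This is generic greedy speculative decoding, not ConFu or EAGLE-3: the toy next-token functions, the names `speculative_step` and `generate`, and the draft length `k` are all hypothetical and chosen only for illustration. In a real system the target model checks all proposed tokens in a single parallel forward pass; the sequential loop here is only for readability.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function (assumed interface)


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> Tuple[List[Token], int]:
    """One draft-then-verify round. Returns (tokens to append, accepted draft tokens)."""
    # 1) The draft model proposes k tokens, conditioning only on the prefix
    #    plus its own earlier guesses (this is where errors can accumulate).
    proposed: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The target model verifies: keep the longest run of proposals that
    #    match its own greedy choice, then emit its token at the first mismatch.
    out: List[Token] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            out.append(expected)      # target's correction ends the round
            return out, len(out) - 1
        out.append(tok)
        ctx.append(tok)
    out.append(target_next(ctx))      # all accepted: target adds one bonus token
    return out, k


def generate(prefix: List[Token], draft_next: NextToken, target_next: NextToken,
             k: int, max_new: int) -> Tuple[List[Token], float]:
    """Run rounds until max_new tokens exist; also report the acceptance rate."""
    out = list(prefix)
    accepted = proposed = 0
    while len(out) - len(prefix) < max_new:
        new, n_ok = speculative_step(out, draft_next, target_next, k)
        out.extend(new)
        accepted += n_ok
        proposed += k
    return out[:len(prefix) + max_new], accepted / proposed


# Toy models (assumptions): the target counts modulo 10; the draft agrees
# except right after token 3, where it guesses wrong.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rate = generate([0], toy_draft, toy_target, k=4, max_new=12)
    print(tokens, rate)
```

Because every kept token is either confirmed or supplied by the target, the output matches plain greedy decoding with the target alone; the draft quality only changes how many tokens are accepted per round, which is the quantity the abstract reports as the token acceptance rate.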