arxiv_cs_lg February 10, 2026

Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression

Translated: 2026/3/15 16:05:21

large-language-models, chain-of-thought, reasoning, llm-compression, supervised-fine-tuning

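Translation (from Japanese)

arXiv:2602.08324v1 Announce Type: new Abstract: Chain-of-Thought (CoT) reasoning has substantially improved the reasoning capabilities of large language models (LLMs), but the computational overhead it adds at inference time is enormous. Existing CoT compression methods often lose critical logical consistency at high compression ratios, which leads to marked performance degradation. To achieve high-fidelity, fast reasoning, we propose Extra-CoT, a new EXTreme-RAtio Chain-of-Thought Compression framework that aggressively cuts the token budget while maintaining answer accuracy. To provide reliable, high-fidelity supervision, we first train a dedicated semantics-preserving compressor on mathematical CoT data with fine-grained annotations. Using the resulting compressed pairs, we train an LLM with mixed-ratio supervised fine-tuning (SFT), which teaches it to follow a range of compression budgets and provides a stable initialization for reinforcement learning (RL). We further propose Constrained and Hierarchical Ratio Policy Optimization (CHRPO), which uses a hierarchical reward to explicitly encourage problem-solving ability under low budgets. We run experiments on three mathematical reasoning benchmarks to demonstrate the advantages of Extra-CoT. For example, on MATH-500 with Qwen3-1.7B, Extra-CoT achieves a token reduction of more than 73% while improving accuracy by 0.6%, substantially outperforming state-of-the-art (SOTA) methods.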

Original Content

arXiv:2602.08324v1 Announce Type: new Abstract: Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs), yet it incurs substantial computational overhead for inference. Existing CoT compression methods often suffer from a critical loss of logical fidelity at high compression ratios, resulting in significant performance degradation. To achieve high-fidelity, fast reasoning, we propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy. To generate reliable, high-fidelity supervision, we first train a dedicated semantically-preserved compressor on mathematical CoT data with fine-grained annotations. An LLM is then fine-tuned on these compressed pairs via a mixed-ratio supervised fine-tuning (SFT), teaching it to follow a spectrum of compression budgets and providing a stable initialization for reinforcement learning (RL). We further propose Constrained and Hierarchical Ratio Policy Optimization (CHRPO) to explicitly incentivize question-solving ability under lower budgets by a hierarchical reward. Experiments on three mathematical reasoning benchmarks show the superiority of Extra-CoT. For example, on MATH-500 using Qwen3-1.7B, Extra-CoT achieves over 73% token reduction with an accuracy improvement of 0.6%, significantly outperforming state-of-the-art (SOTA) methods.
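
As a rough illustration of what the headline figure means, the sketch below computes token reduction as one minus the ratio of compressed to original chain-of-thought length. The function name and the token counts are hypothetical and are not taken from the paper, which may define or average the metric differently.

```python
def token_reduction(original_tokens: int, compressed_tokens: int) -> float:
    """Fraction of tokens removed relative to the uncompressed chain of thought.

    Assumption: reduction = 1 - compressed / original. The paper may use a
    different definition (for example, an average over benchmark problems).
    """
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    if compressed_tokens < 0:
        raise ValueError("compressed_tokens must be non-negative")
    return 1.0 - compressed_tokens / original_tokens


if __name__ == "__main__":
    # Hypothetical lengths: a 1000-token reasoning trace compressed to 270 tokens.
    reduction = token_reduction(1000, 270)
    print(f"token reduction: {reduction:.1%}")
```

Under this definition, a reduction of more than 73% means that a hypothetical 1000-token reasoning trace would shrink to fewer than 270 tokens.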