arxiv_cs_lg April 24, 2026

WISCA: A Lightweight Model Transition Method to Improve LLM Training via Weight Scaling

llm, transformer, machine-learning, neural-network, optimization

Original Content

arXiv:2508.16676v2
Announce Type: replace

Abstract: The Transformer architecture has gradually come to dominate the LLM field. Recent advances in training optimization for Transformer-based large language models (LLMs) primarily focus on architectural modifications or optimizer adjustments. However, these approaches lack systematic optimization of weight patterns during training. A weight pattern refers to the distribution and relative magnitudes of the weight parameters in a neural network. To address this issue, we propose a weight scaling method called WISCA, which enhances training efficiency and model quality by strategically improving neural network weight patterns without changing the network structure. By rescaling weights while preserving model outputs, WISCA indirectly optimizes the model's training trajectory. Experiments demonstrate that WISCA significantly improves convergence quality (measured by generalization capability and loss reduction), particularly for LLMs with Grouped Query Attention (GQA) architectures and in LoRA fine-tuning tasks. Empirical results show an average improvement of 5.6% on zero-shot validation tasks and an average reduction of 2.12% in training perplexity across multiple architectures.
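
The abstract states that WISCA rescales weights while preserving model outputs, but it does not say which weights are rescaled or how the scales are chosen. The sketch below is only an illustration of that general idea, not the authors' procedure. It assumes a single attention head without biases or positional encodings, and uses a hypothetical scale factor `s`: multiplying the query projection by `s` and dividing the key projection by `s` leaves the attention logit unchanged while changing the relative magnitudes of the two weight matrices. All function and variable names are made up for this example.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attention_logit(W_q, W_k, x_i, x_j):
    """Scaled dot-product logit between token i (query) and token j (key)."""
    q = matvec(W_q, x_i)
    k = matvec(W_k, x_j)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def rescale_pair(W_q, W_k, s):
    """Return (s * W_q, W_k / s). `s` is a hypothetical positive scale factor."""
    if s <= 0:
        raise ValueError("s must be positive")
    scaled_q = [[s * w for w in row] for row in W_q]
    scaled_k = [[w / s for w in row] for row in W_k]
    return scaled_q, scaled_k


def frobenius_norm(W):
    """Square root of the sum of squared entries of W."""
    return math.sqrt(sum(w * w for row in W for w in row))


def random_matrix(rng, rows, cols):
    """Matrix with independent standard normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy dimensions and random inputs, chosen only for illustration.
rng = random.Random(0)
d_model, d_head = 8, 4
W_q = random_matrix(rng, d_head, d_model)
W_k = random_matrix(rng, d_head, d_model)
x_i = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
x_j = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

W_q2, W_k2 = rescale_pair(W_q, W_k, s=2.0)

before = attention_logit(W_q, W_k, x_i, x_j)
after = attention_logit(W_q2, W_k2, x_i, x_j)

# The logit is preserved, while the weight norms (the "pattern") change.
print("logit preserved:", math.isclose(before, after, rel_tol=1e-9, abs_tol=1e-12))
print("||W_q|| before/after:", frobenius_norm(W_q), frobenius_norm(W_q2))
print("||W_k|| before/after:", frobenius_norm(W_k), frobenius_norm(W_k2))
```

The invariance holds because (s W_q x_i) . (W_k x_j / s) = (W_q x_i) . (W_k x_j). Although the forward pass is identical, the gradients with respect to the two matrices are scaled in opposite directions, which is one plausible reading of how an output-preserving rescaling could alter the training trajectory.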