arxiv_cs_lg February 10, 2026

Deriving Neural Scaling Laws from the Statistics of Natural Language

Translated: 2026/3/15 14:07:45

scaling-laws, neural-networks, natural-language, llm, statistical-theory

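Summary (translated from Japanese)

arXiv:2602.07488v1 Announce Type: new **Summary**: Although empirical neural scaling laws have substantially guided progress in large-scale machine learning, no existing theory can quantitatively predict the exponents of these laws for any modern LLM (large language model) trained on any natural language dataset. We provide the first such theory for the data-limited scaling regime. We identify two key statistical properties of language that by themselves predict neural scaling exponents: (i) how correlations between pairs of tokens decay as the time separation between them grows, and (ii) how the conditional entropy of the next token decays as the conditioning context gets longer. From these statistics we derive a simple formula that predicts data-limited neural scaling exponents from first principles, with no free parameters and no synthetic data model. The theory agrees remarkably well with neural scaling laws measured experimentally by training GPT-2 and LLaMA style models from scratch on two qualitatively different benchmarks, TinyStories and WikiText.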

Original Content

arXiv:2602.07488v1 Announce Type: new

Abstract: Despite the fact that experimental neural scaling laws have substantially guided empirical progress in large-scale machine learning, no existing theory can quantitatively predict the exponents of these important laws for any modern LLM trained on any natural language dataset. We provide the first such theory in the case of data-limited scaling laws. We isolate two key statistical properties of language that alone can predict neural scaling exponents: (i) the decay of pairwise token correlations with time separation between token pairs, and (ii) the decay of the next-token conditional entropy with the length of the conditioning context. We further derive a simple formula in terms of these statistics that predicts data-limited neural scaling exponents from first principles without any free parameters or synthetic data models. Our theory exhibits a remarkable match with experimentally measured neural scaling laws obtained from training GPT-2 and LLaMA style models from scratch on two qualitatively different benchmarks, TinyStories and WikiText.
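
To make the two statistics named in the abstract concrete, here is a minimal Python sketch of how one might estimate them from a tokenized corpus. It is an illustration under stated assumptions, not the authors' estimator: the abstract does not say how the correlations or entropies are measured, and it does not give the formula that maps them to a scaling exponent, so none is reproduced here. The function names `conditional_entropy` and `pair_correlation` are hypothetical, the correlation measure (Frobenius norm of the joint distribution minus the product of its marginals) is one assumed choice among several, and the toy corpus is made up.

```python
import math
from collections import Counter


def conditional_entropy(tokens, context_len):
    """Plug-in estimate, in bits, of the entropy of the next token given
    the previous `context_len` tokens (0 gives the unigram entropy)."""
    ctx_counts = Counter()
    joint_counts = Counter()
    for i in range(context_len, len(tokens)):
        ctx = tuple(tokens[i - context_len:i])
        ctx_counts[ctx] += 1
        joint_counts[(ctx, tokens[i])] += 1
    total = sum(ctx_counts.values())
    h = 0.0
    for (ctx, _tok), c in joint_counts.items():
        # p(ctx, tok) * log2 p(tok | ctx)
        h -= (c / total) * math.log2(c / ctx_counts[ctx])
    return h


def pair_correlation(tokens, sep):
    """Assumed correlation measure at separation `sep` >= 1: Frobenius norm
    of P(x_i = a, x_{i+sep} = b) - P(x_i = a) * P(x_{i+sep} = b)."""
    pairs = Counter(zip(tokens, tokens[sep:]))
    total = sum(pairs.values())
    left, right = Counter(), Counter()
    for (a, b), c in pairs.items():
        left[a] += c
        right[b] += c
    sq = 0.0
    for a, ca in left.items():
        for b, cb in right.items():
            joint = pairs.get((a, b), 0) / total
            sq += (joint - (ca / total) * (cb / total)) ** 2
    return math.sqrt(sq)


if __name__ == "__main__":
    # Toy corpus (made up for illustration); real use would pass token IDs
    # from a dataset such as TinyStories or WikiText.
    text = "the cat sat on the mat . the dog sat on the rug . " * 20
    toks = text.split()
    for n in range(4):
        print("context length", n, "entropy (bits)", conditional_entropy(toks, n))
    for t in (1, 2, 4, 8):
        print("separation", t, "correlation", pair_correlation(toks, t))
```

Plotting either quantity against its argument on log-log axes is the usual way to inspect a decay law. Note that plug-in entropy estimates are biased low once contexts become long relative to the corpus size, so counts from a small corpus should be read with care.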