arxiv_cs_ai February 10, 2026

Data Science and Technology Towards AGI Part I: Tiered Data Management

Translated: 2026/3/7 11:23:16

artificial-intelligence, languaged-based-logic-models, ai-studies

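Translated Summary

The development of artificial intelligence can be seen as a succession of shifts in data-driven learning paradigms, in which better organization and use of data has steadily raised model capability. Current LLM research relies heavily on one-directional scaling of data size and is running into new barriers such as data availability and acquisition cost. This paper argues that AGI development is entering a new stage in which data and models co-evolve, and that models themselves should take on a role in data management by actively guiding it. To that end, it proposes a tiered data management framework, labeled L0-L4, that supports the full LLM training lifecycle across diverse learning objectives and cost constraints. Each tier is defined by its own data properties, management strategies, and training role, so that data can be allocated to the appropriate LLM training stage: pre-training, mid-training, or alignment. The framework balances data quality, acquisition cost, and marginal training benefit. The authors validate it through empirical studies that apply tiered datasets across the pre-training, mid-training, and alignment stages, and report that tier-aware data utilization significantly improves training efficiency and model performance. To support further research, the tiered datasets and processing tools are being released to the community.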

Original Content

arXiv:2602.09003v1 (Announce Type: new)

Abstract: The development of artificial intelligence can be viewed as an evolution of data-driven learning paradigms, with successive shifts in data organization and utilization continuously driving advances in model capability. Current LLM research is dominated by a paradigm that relies heavily on unidirectional scaling of data size, increasingly encountering bottlenecks in data availability, acquisition cost, and training efficiency. In this work, we argue that the development of AGI is entering a new phase of data-model co-evolution, in which models actively guide data management while high-quality data, in turn, amplifies model capabilities. To implement this vision, we propose a tiered data management framework, designed to support the full LLM training lifecycle across heterogeneous learning objectives and cost constraints. Specifically, we introduce an L0-L4 tiered data management framework, ranging from raw uncurated resources to organized and verifiable knowledge. Importantly, LLMs are fully used in data management processes, such as quality scoring and content editing, to refine data across tiers. Each tier is characterized by distinct data properties, management strategies, and training roles, enabling data to be strategically allocated across LLM training stages, including pre-training, mid-training, and alignment. The framework balances data quality, acquisition cost, and marginal training benefit, providing a systematic approach to scalable and sustainable data management. We validate the effectiveness of the proposed framework through empirical studies, in which tiered datasets are constructed from raw corpora and used across multiple training phases. Experimental results demonstrate that tier-aware data utilization significantly improves training efficiency and model performance. To facilitate further research, we release our tiered datasets and processing tools to the community.
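
The tiering idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract names the tiers L0-L4 and the three training stages, but the score cut-offs, the stage-to-tier mapping, and the names `Document`, `assign_tier`, and `allocate` are all hypothetical, chosen only to show how a quality score might route data to a stage.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """L0 (raw, uncurated) through L4 (organized, verifiable), as named in the abstract."""
    L0 = 0
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4


# Hypothetical cut-offs on a quality score in [0, 1]; not taken from the paper.
# Ordered from highest to lowest so the first match wins.
TIER_THRESHOLDS = [(0.9, Tier.L4), (0.7, Tier.L3), (0.5, Tier.L2), (0.2, Tier.L1)]

# Hypothetical stage-to-tier mapping. The abstract only states that tiers are
# allocated across pre-training, mid-training, and alignment.
STAGE_TIERS = {
    "pre-training": {Tier.L1, Tier.L2},
    "mid-training": {Tier.L3},
    "alignment": {Tier.L4},
}


@dataclass
class Document:
    text: str
    quality: float  # assumed to come from an LLM-based scorer, range [0, 1]


def assign_tier(doc: Document) -> Tier:
    """Return the highest tier whose cut-off the document's score meets."""
    for cutoff, tier in TIER_THRESHOLDS:
        if doc.quality >= cutoff:
            return tier
    return Tier.L0


def allocate(docs):
    """Group documents by training stage; L0 documents are left unallocated."""
    plan = {stage: [] for stage in STAGE_TIERS}
    for doc in docs:
        tier = assign_tier(doc)
        for stage, tiers in STAGE_TIERS.items():
            if tier in tiers:
                plan[stage].append(doc)
    return plan
```

In this sketch a document scored 0.95 lands in the alignment pool and one scored 0.1 stays in L0 and is used nowhere. A real pipeline would also need to account for acquisition cost and marginal training benefit, which the abstract lists as factors the framework balances.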