arxiv_cs_ai 10 February 2026

How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI

Original Content

arXiv:2602.00056v2 Announce Type: replace-cross

Abstract: Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through a sustainability lens. We argue that the field is shifting from building models from data to actively creating data for building models. We characterise this transition as hyper-datafication, which marks a critical juncture for the future of frontier AI and its societal impacts. To quantify and contextualise data-related costs, we analyse approximately 550,000 datasets from the Hugging Face Hub, focusing on dataset growth, storage-related energy consumption and carbon footprint, and societal representation using language data. We complement this analysis with qualitative responses from data workers in Kenya to examine the labour involved, including direct employment by big tech corporations and exposure to graphic content. We further draw on external data sources to substantiate our findings by illustrating the global disparity in data centre infrastructure. Our analyses reveal that hyper-datafication does not merely increase resource consumption but systematically redistributes environmental burdens, labour risks, and representational harms toward the Global South, precarious data workers, and under-represented cultures. Thus, we propose Data PROOFS recommendations spanning provenance, resource awareness, ownership, openness, frugality, and standards to mitigate these costs. Our work aims to make visible the often-overlooked costs of data that underpin frontier AI and to stimulate broader debate within the research community and beyond.
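
The abstract refers to storage-related energy consumption and carbon footprint but does not state how they are computed. As a rough illustration only, the sketch below shows one common back-of-the-envelope way such figures could be derived from a dataset's size. It is not the authors' method: the function names, the `StorageAssumptions` fields, and every numeric value are hypothetical placeholders chosen for the arithmetic, not measurements.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


@dataclass
class StorageAssumptions:
    """Hypothetical parameters; none of these values come from the paper."""
    watts_per_tb: float          # assumed average power draw per stored terabyte
    pue: float                   # assumed data-centre power usage effectiveness
    kg_co2e_per_kwh: float       # assumed grid carbon intensity


def annual_storage_energy_kwh(size_tb: float, a: StorageAssumptions) -> float:
    """Energy to keep `size_tb` terabytes online for one year, in kWh."""
    watts = size_tb * a.watts_per_tb * a.pue
    return watts * HOURS_PER_YEAR / 1000.0


def annual_storage_carbon_kg(size_tb: float, a: StorageAssumptions) -> float:
    """Carbon footprint of that storage energy, in kg CO2-equivalent."""
    return annual_storage_energy_kwh(size_tb, a) * a.kg_co2e_per_kwh


# Placeholder inputs purely to demonstrate the calculation.
example = StorageAssumptions(watts_per_tb=1.0, pue=1.5, kg_co2e_per_kwh=0.4)
energy_kwh = annual_storage_energy_kwh(100.0, example)
carbon_kg = annual_storage_carbon_kg(100.0, example)
print(f"{energy_kwh:.1f} kWh/year, {carbon_kg:.1f} kg CO2e/year")
```

Because the result scales linearly in each parameter, the same dataset hosted on a more carbon-intensive grid yields a proportionally larger footprint under this model, which is one simple way to see how where data is stored can matter as much as how much is stored.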