arxiv_cs_lg 2026/4/24

The Costs of Pretending That There Are Data-Generating Probability Distributions in the Social World

Translated: 2026/4/24 20:07:05

machine-learning, probability-distribution, fair-algorithms, social-sciences, arxiv

Original Content

arXiv:2407.17395v5 (Announce Type: replace)

Abstract: Machine Learning research, including work promoting fair or equitable algorithms, often relies on the concept of a data-generating probability distribution. The standard presumption is that since data points are 'sampled from' such a distribution, one can learn from observed data about this distribution and, thus, predict future data points which are also drawn from it. We argue, however, that such true probability distributions do not exist and that the rhetoric around them is harmful in social settings. We show that alternative frameworks focusing directly on relevant populations rather than abstract distributions are available and leave classical learning theory almost unchanged. Furthermore, we argue that the assumption of true probabilities or data-generating distributions can be misleading and obscure both the choices made and the goals pursued in machine learning practice. Based on these considerations, we suggest avoiding the assumption of data-generating probability distributions in the social world.