arxiv_cs_ai April 24, 2026

Unbiased Prevalence Estimation with Multicalibrated LLMs

llm, multicalibration, covariate-shift, measurement-error, bias-estimation

Original Content

arXiv:2604.21549v1 Announce Type: new

Abstract: Estimating the prevalence of a category in a population using imperfect measurement devices (diagnostic tests, classifiers, or large language models) is fundamental to science, public health, and online trust and safety. Standard approaches correct for known device error rates but assume these rates remain stable across populations. We show this assumption fails under covariate shift and that multicalibration, which enforces calibration conditional on the input features rather than just on average, is sufficient for unbiased prevalence estimation under such shift. Standard calibration and quantification methods fail to provide this guarantee. Our work connects recent theoretical work on fairness to a longstanding measurement problem spanning nearly all academic disciplines. A simulation confirms that standard methods exhibit bias growing with shift magnitude, while a multicalibrated estimator maintains near-zero bias. While we focus the discussion mostly on LLMs, our theoretical results apply to any classification model. Two empirical applications -- estimating employment prevalence across U.S. states using the American Community Survey, and classifying political texts across four countries using an LLM -- demonstrate that multicalibration substantially reduces bias in practice, while highlighting that calibration data should cover the key feature dimensions along which target populations may differ.
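The contrast the abstract draws can be made concrete with a small exact calculation. The sketch below is not the paper's simulation or its multicalibration algorithm; it is a toy under stated assumptions: two hypothetical subgroups `A` and `B`, invented prevalences and error rates, and a binary device output `d`. It compares three estimators when the subgroup mix shifts between calibration and target data: a correction using pooled error rates (the Rogan-Gladen formula), a predictor calibrated only on average, and a predictor calibrated within each subgroup, which is the simplest special case of calibrating conditional on features.

```python
# Toy illustration. All numbers and names are invented for this sketch
# and do not come from the paper.
GROUPS = {
    # group: (true prevalence P(y=1|g), sensitivity, specificity)
    "A": (0.10, 0.95, 0.98),
    "B": (0.60, 0.60, 0.70),
}


def joint(mix):
    """Exact joint P(g, y, d) for a population with subgroup shares `mix`."""
    table = {}
    for g, share in mix.items():
        prev, sens, spec = GROUPS[g]
        table[g, 1, 1] = share * prev * sens
        table[g, 1, 0] = share * prev * (1 - sens)
        table[g, 0, 1] = share * (1 - prev) * (1 - spec)
        table[g, 0, 0] = share * (1 - prev) * spec
    return table


def prob(table, **fixed):
    """Sum joint probabilities over cells matching the fixed g / y / d values."""
    return sum(
        p
        for (g, y, d), p in table.items()
        if all({"g": g, "y": y, "d": d}[k] == v for k, v in fixed.items())
    )


def true_prevalence(mix):
    return prob(joint(mix), y=1)


def rogan_gladen(cal_mix, tgt_mix):
    """Correct the target positive rate with error rates pooled over calibration data."""
    cal, tgt = joint(cal_mix), joint(tgt_mix)
    sens = prob(cal, y=1, d=1) / prob(cal, y=1)
    spec = prob(cal, y=0, d=0) / prob(cal, y=0)
    return (prob(tgt, d=1) + spec - 1) / (sens + spec - 1)


def average_calibrated(cal_mix, tgt_mix):
    """Average P_cal(y=1 | d) over the target: calibrated only marginally."""
    cal, tgt = joint(cal_mix), joint(tgt_mix)
    return sum(
        prob(tgt, d=d) * prob(cal, y=1, d=d) / prob(cal, d=d) for d in (0, 1)
    )


def group_calibrated(cal_mix, tgt_mix):
    """Average P_cal(y=1 | g, d) over the target: calibrated within each subgroup."""
    cal, tgt = joint(cal_mix), joint(tgt_mix)
    return sum(
        prob(tgt, g=g, d=d) * prob(cal, g=g, y=1, d=d) / prob(cal, g=g, d=d)
        for g in GROUPS
        for d in (0, 1)
    )


if __name__ == "__main__":
    cal_mix = {"A": 0.8, "B": 0.2}  # calibration data is mostly subgroup A
    tgt_mix = {"A": 0.3, "B": 0.7}  # target population is mostly subgroup B
    print("true prevalence    ", true_prevalence(tgt_mix))
    print("pooled error rates ", rogan_gladen(cal_mix, tgt_mix))
    print("average-calibrated ", average_calibrated(cal_mix, tgt_mix))
    print("group-calibrated   ", group_calibrated(cal_mix, tgt_mix))
```

With these invented numbers the true target prevalence is 0.45; the pooled-error-rate correction gives about 0.465, the average-calibrated predictor about 0.334, and the subgroup-calibrated predictor recovers 0.45. When the two mixes are equal, all three agree with the truth. Setting a subgroup's share to zero in `cal_mix` makes `group_calibrated` fail with a division by zero, a crude reflection of the abstract's point that calibration data should cover the dimensions along which target populations may differ.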