arxiv_cs_ai 2026/4/24

OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data

Translated: 2026/4/24 20:30:12

openestimate, llm, reasoning-under-uncertainty, benchmark, machine-learning

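Translation

arXiv:2510.15096v2 Announce Type: replace Abstract: In the real-world settings where language models (LMs) are deployed, across domains such as healthcare, finance, and other knowledge work, models must handle incomplete information and reason under uncertainty. Yet most LM evaluations focus on problems with well-defined answers and success criteria. Part of the reason for this gap is that natural problems involving uncertainty are hard to construct: because LMs have access to most of the same knowledge as humans, it is non-trivial to design questions that LMs struggle to answer correctly but that humans can answer reliably. As a result, LM performance on reasoning under uncertainty remains poorly characterized. To address this gap, we introduce OpenEstimate, an extensible, multi-domain benchmark that evaluates LMs on numerical estimation tasks requiring them to synthesize large amounts of background information and express their predictions as probabilistic priors. We assess these priors for accuracy and calibration, and quantify their usefulness relative to samples from the true distribution of interest. Across six frontier LMs, we find that LM-elicited priors are often inaccurate and overconfident. Performance improves modestly depending on how uncertainty is elicited from the model, but is largely unaffected by changes in sampling strategy, reasoning effort, or prompt design. The OpenEstimate benchmark thus provides a challenging evaluation for frontier LMs and a platform for developing models that are better at probabilistic estimation and reasoning under uncertainty.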

Original Content

arXiv:2510.15096v2 Announce Type: replace Abstract: Real-world settings where language models (LMs) are deployed -- in domains spanning healthcare, finance, and other forms of knowledge work -- require models to grapple with incomplete information and reason under uncertainty. Yet most LM evaluations focus on problems with well-defined answers and success criteria. This gap exists in part because natural problems involving uncertainty are difficult to construct: given that LMs have access to most of the same knowledge as humans, it is non-trivial to design questions for which LMs will struggle to produce correct answers, but which humans can answer reliably. As a result, LM performance on reasoning under uncertainty remains poorly characterized. To address this gap, we introduce OpenEstimate, an extensible, multi-domain benchmark for evaluating LMs on numerical estimation tasks that require models to synthesize significant amounts of background information and express predictions as probabilistic priors. We assess these priors for accuracy and calibration, quantifying their usefulness relative to samples from the true distribution of interest. Across six frontier LMs, we find that LM-elicited priors are often inaccurate and overconfident. Performance improves modestly depending on how uncertainty is elicited from the model, but is largely unaffected by changes in sampling strategy, reasoning effort, or prompt design. The OpenEstimate benchmark thus offers a challenging evaluation for frontier LMs and a platform for developing models that are better at probabilistic estimation and reasoning under uncertainty.
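
The abstract does not spell out how accuracy and calibration are scored, so the following is only a minimal sketch of one common way to check them, not the benchmark's actual metrics. It assumes each elicited prior is a Gaussian given as a (mean, standard deviation) pair; the function names and all numbers are hypothetical. Calibration is approximated by how often the true value falls inside the prior's central 90% interval, and accuracy by the mean absolute error of the prior mean.

```python
from statistics import NormalDist, mean


def interval_coverage(priors, truths, level=0.9):
    """Fraction of true values inside each prior's central credible interval."""
    lo_q = (1 - level) / 2
    hi_q = 1 - lo_q
    hits = 0
    for (mu, sigma), truth in zip(priors, truths):
        dist = NormalDist(mu, sigma)
        if dist.inv_cdf(lo_q) <= truth <= dist.inv_cdf(hi_q):
            hits += 1
    return hits / len(truths)


def mean_abs_error(priors, truths):
    """Average distance between each prior mean and the true value."""
    return mean(abs(mu - truth) for (mu, _), truth in zip(priors, truths))


# Hypothetical elicited priors (mean, std) and ground-truth values;
# these are illustrative numbers, not data from the benchmark.
priors = [(10.0, 1.0), (50.0, 2.0), (3.0, 0.5), (100.0, 5.0)]
truths = [10.5, 58.0, 4.5, 101.0]

coverage = interval_coverage(priors, truths, level=0.9)
error = mean_abs_error(priors, truths)
print(f"90% interval coverage: {coverage:.2f}")
print(f"mean absolute error:   {error:.2f}")
```

With these made-up numbers only half of the true values land inside the nominal 90% intervals. Coverage well below the nominal level is the kind of pattern the abstract describes as overconfident priors.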