arxiv_cs_lg April 24, 2026

Quality Evaluation of Quantified Uncertainty for (Re)Calibration - On Data-Driven Regression Models

Evaluating the Quality of the Quantified Uncertainty for (Re)Calibration of Data-Driven Regression Models

Translated: 2026/4/24 20:08:26

quantum-calibration, data-driven-models, uncertainty-estimation, regression-analysis, calibration-metrics

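Translation (from Japanese)

arXiv:2508.17761v3 Announce Type: replace Abstract: In safety-critical applications, data-driven models must not only be highly accurate but also provide reliable uncertainty estimates. This property, usually called calibration, is indispensable for risk-aware decision-making. In regression, a wide variety of calibration metrics and recalibration methods have emerged. However, these metrics differ in their definitions, assumptions, and scales, which makes it difficult to interpret and compare results across studies. Moreover, many recalibration methods have been evaluated using only a limited subset of metrics, so it is unclear whether improvements generalize across different notions of calibration. In this work, we systematically extract and categorize regression calibration metrics from the literature and benchmark them independently of any specific modelling method or recalibration approach. Through controlled experiments with real-world, synthetic, and artificially miscalibrated data, we show that calibration metrics frequently produce conflicting results. Our analysis shows that many metrics disagree in their evaluation of the same recalibration result, and some even lead to contradictory conclusions. This inconsistency is particularly concerning because it could allow cherry-picking of metrics to create a misleading impression of success. We identify the Expected Normalized Calibration Error (ENCE) and the Coverage Width-based Criterion (CWC) as the most dependable metrics in our tests. Our findings underline the decisive role of metric selection in calibration research.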

Original Content

arXiv:2508.17761v3 Announce Type: replace Abstract: In safety-critical applications data-driven models must not only be accurate but also provide reliable uncertainty estimates. This property, commonly referred to as calibration, is essential for risk-aware decision-making. In regression a wide variety of calibration metrics and recalibration methods have emerged. However, these metrics differ significantly in their definitions, assumptions and scales, making it difficult to interpret and compare results across studies. Moreover, most recalibration methods have been evaluated using only a small subset of metrics, leaving it unclear whether improvements generalize across different notions of calibration. In this work, we systematically extract and categorize regression calibration metrics from the literature and benchmark these metrics independently of specific modelling methods or recalibration approaches. Through controlled experiments with real-world, synthetic and artificially miscalibrated data, we demonstrate that calibration metrics frequently produce conflicting results. Our analysis reveals substantial inconsistencies: many metrics disagree in their evaluation of the same recalibration result, and some even indicate contradictory conclusions. This inconsistency is particularly concerning as it potentially allows cherry-picking of metrics to create misleading impressions of success. We identify the Expected Normalized Calibration Error (ENCE) and the Coverage Width-based Criterion (CWC) as the most dependable metrics in our tests. Our findings highlight the critical role of metric selection in calibration research.
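
For readers unfamiliar with ENCE, the following is a minimal sketch of how such a metric can be computed. The abstract does not define it, so the sketch assumes the commonly used binned form: samples are sorted by predicted standard deviation, split into bins, and within each bin the root mean variance (RMV) is compared against the empirical RMSE. The function name `ence`, the equal-count binning, the bin count, and the synthetic data are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def ence(y_true, y_pred, y_std, n_bins=10):
    """Binned ENCE sketch: mean over bins of |RMV - RMSE| / RMV.

    Assumes y_std > 0 everywhere. Equal-count binning by predicted
    standard deviation is an assumption of this sketch.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_std = np.asarray(y_std, dtype=float)

    # Sort samples by predicted uncertainty, then split into bins.
    order = np.argsort(y_std)
    bins = np.array_split(order, n_bins)

    terms = []
    for idx in bins:
        if idx.size == 0:
            continue
        # Root mean variance: what the model claims its error scale is.
        rmv = np.sqrt(np.mean(y_std[idx] ** 2))
        # Empirical RMSE: what the error scale actually is in this bin.
        rmse = np.sqrt(np.mean((y_true[idx] - y_pred[idx]) ** 2))
        terms.append(abs(rmv - rmse) / rmv)
    return float(np.mean(terms))


# Synthetic check (hypothetical data): heteroscedastic Gaussian noise
# whose true standard deviation is known.
rng = np.random.default_rng(0)
n = 20000
sigma = rng.uniform(0.5, 2.0, size=n)
mu = rng.normal(size=n)
y = mu + sigma * rng.normal(size=n)

ence_calibrated = ence(y, mu, sigma)            # predicted std matches truth
ence_overconfident = ence(y, mu, 0.5 * sigma)   # predicted std is half the truth
print(ence_calibrated, ence_overconfident)
```

With the predicted standard deviation halved, each bin's RMV is roughly half its RMSE, so the normalized error should come out much larger than in the calibrated case. The value depends on the binning scheme and the bin count, which is one reason scores from different studies are hard to compare.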