arxiv_cs_lg February 10, 2026

Fault-Tolerant Evaluation for Sample-Efficient Model Performance Estimators

Translated: 2026/3/15 13:05:17

fault-tolerant, sample-efficient, model-performance, bias-variance, estimator-calibration

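Translation

arXiv:2602.07226v1 Announce Type: new Abstract: In the era of Model-as-a-Service (MaaS), organizations increasingly rely on third-party AI models for rapid deployment. However, the dynamic nature of emerging AI applications, the continual introduction of new datasets, and the growing number of models that claim outstanding performance make efficient and reliable validation of model services increasingly difficult. This has motivated the development of sample-efficient performance estimators, which aim to estimate model performance while reducing annotation cost by selectively choosing instances for labeling. Yet existing evaluation methods often fail in low-variance settings: RMSE conflates bias and variance and masks persistent bias when variance is small, while p-value-based tests become hypersensitive and reject adequate estimators over minute deviations. To address this, we propose a fault-tolerant evaluation framework that jointly accounts for bias and variance within an adjustable tolerance $\varepsilon$, making it possible to evaluate performance estimators within a practically acceptable margin of error. We show theoretically that properly calibrating $\varepsilon$ ensures reliable evaluation across different variance regimes, and we further propose an algorithm that automatically optimizes and selects $\varepsilon$. Experiments on real-world datasets show that our framework provides comprehensive and actionable insights into estimator behavior.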

Original Content

arXiv:2602.07226v1
Announce Type: new

Abstract: In the era of Model-as-a-Service, organizations increasingly rely on third-party AI models for rapid deployment. However, the dynamic nature of emerging AI applications, the continual introduction of new datasets, and the growing number of models claiming superior performance make efficient and reliable validation of model services increasingly challenging. This motivates the development of sample-efficient performance estimators, which aim to estimate model performance by strategically selecting instances for labeling, thereby reducing annotation cost. Yet existing evaluation approaches often fail in low-variance settings: RMSE conflates bias and variance, masking persistent bias when variance is small, while p-value based tests become hypersensitive, rejecting adequate estimators for negligible deviations. To address this, we propose a fault-tolerant evaluation framework that integrates bias and variance considerations within an adjustable tolerance level $\varepsilon$, enabling the evaluation of performance estimators within practically acceptable error margins. We theoretically show that proper calibration of $\varepsilon$ ensures reliable evaluation across different variance regimes, and we further propose an algorithm that automatically optimizes and selects $\varepsilon$. Experiments on real-world datasets demonstrate that our framework provides comprehensive and actionable insights into estimator behavior.
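The two failure modes named in the abstract can be illustrated with a small simulation. The sketch below is not the paper's framework or algorithm: the Gaussian estimators, the true accuracy `TRUTH`, the z-test standing in for "p-value based tests", and the `within_tolerance` rule (accept when an approximate 95% interval for the bias lies inside $[-\varepsilon, +\varepsilon]$) are all assumptions chosen for illustration. It relies only on the identity $\mathrm{RMSE}^2 = \mathrm{bias}^2 + \mathrm{variance}$.

```python
import math
import random
import statistics

TRUTH = 0.90  # hypothetical true accuracy of the model under evaluation


def simulate(bias, std, n=2000, seed=0):
    """Draw n repeated estimates from a hypothetical Gaussian estimator."""
    rng = random.Random(seed)
    return [rng.gauss(TRUTH + bias, std) for _ in range(n)]


def summarize(estimates, truth=TRUTH):
    """Return (bias, std, rmse); rmse**2 equals bias**2 + std**2."""
    bias = statistics.fmean(estimates) - truth
    std = statistics.pstdev(estimates)  # population form makes the identity exact
    rmse = math.sqrt(statistics.fmean([(e - truth) ** 2 for e in estimates]))
    return bias, std, rmse


def p_value(estimates, truth=TRUTH):
    """Two-sided z-test p-value for H0: the estimator is exactly unbiased."""
    se = statistics.stdev(estimates) / math.sqrt(len(estimates))
    z = (statistics.fmean(estimates) - truth) / se
    return math.erfc(abs(z) / math.sqrt(2))


def within_tolerance(estimates, eps, truth=TRUTH, z_crit=1.96):
    """Illustrative rule (an assumption, not the paper's): accept when the
    approximate 95% interval for the bias lies inside [-eps, +eps]."""
    se = statistics.stdev(estimates) / math.sqrt(len(estimates))
    bias = statistics.fmean(estimates) - truth
    return abs(bias) + z_crit * se <= eps


# Three hypothetical estimators with different bias/variance profiles.
estimators = {
    "noisy, unbiased": simulate(bias=0.0, std=0.02, seed=1),
    "tight, biased": simulate(bias=0.02, std=0.001, seed=2),
    "tight, negligible bias": simulate(bias=0.002, std=0.001, seed=3),
}

for name, est in estimators.items():
    bias, std, rmse = summarize(est)
    print(f"{name:24s} bias={bias:+.4f} std={std:.4f} rmse={rmse:.4f} "
          f"p={p_value(est):.3g} within_eps={within_tolerance(est, eps=0.01)}")
```

Under these assumptions, the first two estimators have nearly the same RMSE even though one is unbiased and the other is systematically off by 0.02, so RMSE alone cannot separate them. The third estimator is rejected by the z-test because its tiny standard error makes a 0.002 deviation highly significant, yet it sits comfortably inside a tolerance of 0.01.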