arxiv_cs_lg February 10, 2026

Who Takes Responsibility? The Challenge of Attribution in Modern AI Systems

Who Gets Credit or Blame? Attributing Accountability in Modern AI Systems

Translated: 2026/3/15 9:05:49

artificial-intelligence, accountability-attribution, deep-learning, machine-learning, model-analysis

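Japanese Translation (rendered in English)

arXiv:2506.00175v4 Announce Type: replace

Abstract: Modern AI systems are developed through multiple stages (pretraining, rounds of fine-tuning, and subsequent adaptation or alignment), with each stage building on the previous ones and updating the model in a different way. This raises a critical question of accountability: when a deployed model succeeds or fails, which stage is responsible, and to what extent? We pose the problem of accountability attribution, which traces model behavior back to specific stages of the model development process. To address this challenge, we propose a general framework that answers counterfactual questions about stage effects: how would the model's behavior have changed if the updates from a particular stage had not been made? Within this framework, we introduce estimators that efficiently quantify stage effects without retraining the model, taking into account both the data and key aspects of model optimization, namely learning rate schedules, momentum, and weight decay. We show that our approach successfully quantifies the accountability of each stage for the model's behavior. Based on the attribution results, our method can identify and remove spurious correlations learned in image classification and text toxicity detection tasks developed across multiple stages. Our approach provides a practical tool for model analysis and is an important step toward more accountable AI development.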

Original Content

arXiv:2506.00175v4 Announce Type: replace

Abstract: Modern AI systems are typically developed through multiple stages (pretraining, fine-tuning rounds, and subsequent adaptation or alignment), where each stage builds on the previous ones and updates the model in distinct ways. This raises a critical question of accountability: when a deployed model succeeds or fails, which stage is responsible, and to what extent? We pose the accountability attribution problem for tracing model behavior back to specific stages of the model development process. To address this challenge, we propose a general framework that answers counterfactual questions about stage effects: how would the model's behavior have changed if the updates from a particular stage had not occurred? Within this framework, we introduce estimators that efficiently quantify stage effects without retraining the model, accounting for both the data and key aspects of model optimization dynamics, including learning rate schedules, momentum, and weight decay. We demonstrate that our approach successfully quantifies the accountability of each stage to the model's behavior. Based on the attribution results, our method can identify and remove spurious correlations learned during image classification and text toxicity detection tasks that were developed across multiple stages. Our approach provides a practical tool for model analysis and represents a significant step toward more accountable AI development.