arxiv_cs_lg February 10, 2026

Fairness Aware Reward Optimization

llm, fairness, reinforcement-learning, alignment, arxiv

Original Content

arXiv:2602.07799v1 Announce Type: new

Abstract: Demographic skews in human preference data propagate systematic unfairness through reward models into aligned LLMs. We introduce Fairness Aware Reward Optimization (Faro), an in-processing framework that trains reward models under demographic parity, equalized odds, or counterfactual fairness constraints. We provide the first theoretical analysis of reward-level fairness in LLM alignment, establishing: (i) provable fairness certificates for Faro-trained rewards with controllable slack; (ii) a formal characterization of the accuracy-fairness trade-off induced by KL-regularized fine-tuning, proving that fairness transfers from reward to policy; and (iii) the existence of a non-empty Pareto frontier. Unlike pre- and post-processing methods, Faro ensures reward models are simultaneously ordinal (ranking correctly), cardinal (calibrated), and fair. Across multiple LLMs and benchmarks, Faro significantly reduces bias and harmful generations while maintaining or improving model quality.
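
The abstract does not give Faro's objective, so the following is only a rough sketch of what an in-processing fairness constraint on a reward model could look like, not the paper's formulation. It assumes a Bradley-Terry pairwise preference loss, a binary group label per preference pair, a demographic parity gap measured on the mean predicted preference probability, and a hinge penalty with a slack threshold. The names `fair_reward_loss`, `lam`, and `slack` are hypothetical.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softplus(x):
    # Stable log(1 + exp(x)); note that -log(sigmoid(d)) == softplus(-d).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bradley_terry_loss(r_chosen, r_rejected):
    """Mean negative log-likelihood that each chosen response outranks its rejected pair."""
    losses = [softplus(-(c - r)) for c, r in zip(r_chosen, r_rejected)]
    return sum(losses) / len(losses)


def demographic_parity_gap(r_chosen, r_rejected, groups):
    """Absolute gap in mean predicted preference probability between groups 0 and 1.

    Assumption: this is one possible reward-level analogue of demographic parity,
    chosen for illustration only.
    """
    probs = {0: [], 1: []}
    for c, r, g in zip(r_chosen, r_rejected, groups):
        probs[g].append(sigmoid(c - r))
    if not probs[0] or not probs[1]:
        raise ValueError("both groups need at least one preference pair")
    mean0 = sum(probs[0]) / len(probs[0])
    mean1 = sum(probs[1]) / len(probs[1])
    return abs(mean0 - mean1)


def fair_reward_loss(r_chosen, r_rejected, groups, lam=1.0, slack=0.0):
    """Preference loss plus a hinge penalty on the parity gap above `slack`.

    Assumption: `lam` (penalty weight) and `slack` (tolerated gap) are
    illustrative hyperparameters, not values or names from the paper.
    """
    bt = bradley_terry_loss(r_chosen, r_rejected)
    gap = demographic_parity_gap(r_chosen, r_rejected, groups)
    return bt + lam * max(0.0, gap - slack)


if __name__ == "__main__":
    # Toy reward scores: group 0 pairs receive larger margins than group 1 pairs.
    r_chosen = [2.0, 1.0, 0.5, 0.0]
    r_rejected = [0.0, 0.0, 0.0, 0.0]
    groups = [0, 0, 1, 1]
    print("parity gap:", demographic_parity_gap(r_chosen, r_rejected, groups))
    print("penalized loss:", fair_reward_loss(r_chosen, r_rejected, groups))
```

In this sketch, raising `slack` loosens the constraint until the penalty vanishes, and raising `lam` trades ranking accuracy for a smaller gap. This is one simple way to picture the accuracy-fairness trade-off the abstract refers to.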