arxiv_cs_ai February 10, 2026

Objective Decoupling in Social Reinforcement Learning: Recovering Ground Truth from Sycophantic Majorities

Translated: 2026/3/7 9:50:46

reinforcement-learning, objective-decoupling, social-reinforcement-learning, epistemic-source-alignment

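Translated Summary

Contemporary AI alignment strategies rest on a very fragile assumption: that human feedback, although noisy, is fundamentally a truthful signal. This paper identifies that assumption as "Dogma 4" of reinforcement learning (RL). The dogma holds in static environments, but in social settings evaluators may be sycophantic, lazy, or adversarial. Under it, agents trained with standard RL run into a structural failure the authors call "Objective Decoupling": the agent's learned objective permanently separates from the latent ground truth, so training converges to misalignment. The proposed remedy is Epistemic Source Alignment (ESA). Rather than trusting the majority through statistical consensus, ESA uses sparse safety axioms to judge the source of the feedback instead of the signal itself. The authors prove that this guarantees convergence to the true objective even when most evaluators are biased, and they show empirically that consensus methods fail under majority collusion while their approach recovers the optimal policy.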

Original Content

arXiv:2602.08092v1 Announce Type: new

Abstract: Contemporary AI alignment strategies rely on a fragile premise: that human feedback, while noisy, remains a fundamentally truthful signal. In this paper, we identify this assumption as Dogma 4 of Reinforcement Learning (RL). We demonstrate that while this dogma holds in static environments, it fails in social settings where evaluators may be sycophantic, lazy, or adversarial. We prove that under Dogma 4, standard RL agents suffer from what we call Objective Decoupling, a structural failure mode where the agent's learned objective permanently separates from the latent ground truth, guaranteeing convergence to misalignment. To resolve this, we propose Epistemic Source Alignment (ESA). Unlike standard robust methods that rely on statistical consensus (trusting the majority), ESA utilizes sparse safety axioms to judge the source of the feedback rather than the signal itself. We prove that this "judging the judges" mechanism guarantees convergence to the true objective, even when a majority of evaluators are biased. Empirically, we show that while traditional consensus methods fail under majority collusion, our approach successfully recovers the optimal policy.
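The contrast the abstract draws between trusting the majority and judging the source can be illustrated with a small toy. The sketch below is not the paper's ESA algorithm, which the abstract does not specify. It assumes two hypothetical evaluator types (`honest`, `sycophant`), a made-up list of true labels, and a hypothetical `AXIOMS` table standing in for sparse safety axioms: a few actions whose correct label is taken as known in advance. Each evaluator is scored by agreement with those axioms, and only evaluators that pass the (assumed) threshold are allowed to vote.

```python
from typing import Callable, Dict, List

# Hypothetical toy data: hidden true label per action
# (1 = acceptable, 0 = unacceptable). Not taken from the paper.
TRUE_LABELS: List[int] = [1, 0, 1, 0, 0, 1, 0, 1]

Evaluator = Callable[[int], int]


def honest(action: int) -> int:
    """Hypothetical evaluator that reports the true label."""
    return TRUE_LABELS[action]


def sycophant(action: int) -> int:
    """Hypothetical evaluator that approves every action."""
    return 1


# Majority collusion: three sycophants outvote two honest evaluators.
EVALUATORS: List[Evaluator] = [sycophant, sycophant, sycophant, honest, honest]

# Assumed stand-in for sparse safety axioms: two actions known to be unacceptable.
AXIOMS: Dict[int, int] = {1: 0, 4: 0}


def majority_vote(action: int, evaluators: List[Evaluator]) -> int:
    """Statistical consensus: return 1 only if a strict majority votes 1."""
    votes = [e(action) for e in evaluators]
    return int(sum(votes) * 2 > len(votes))


def trust(evaluator: Evaluator, axioms: Dict[int, int]) -> float:
    """Fraction of axiom actions on which the evaluator agrees with the axiom."""
    agree = sum(evaluator(a) == label for a, label in axioms.items())
    return agree / len(axioms)


def source_filtered_vote(
    action: int,
    evaluators: List[Evaluator],
    axioms: Dict[int, int],
    threshold: float = 1.0,  # assumed cutoff; a design choice of this sketch
) -> int:
    """Vote only among evaluators whose axiom agreement meets the threshold."""
    trusted = [e for e in evaluators if trust(e, axioms) >= threshold]
    # If nobody passes the check, fall back to the plain majority.
    return majority_vote(action, trusted or evaluators)


def accuracy(predict: Callable[[int], int]) -> float:
    """Share of actions whose predicted label matches the true label."""
    hits = sum(predict(a) == TRUE_LABELS[a] for a in range(len(TRUE_LABELS)))
    return hits / len(TRUE_LABELS)


if __name__ == "__main__":
    print(accuracy(lambda a: majority_vote(a, EVALUATORS)))
    print(accuracy(lambda a: source_filtered_vote(a, EVALUATORS, AXIOMS)))
```

In this toy, the plain majority approves every action and scores 0.5, while the axiom-filtered vote scores 1.0 because the sycophants fail both axiom checks. The example only shows why scoring sources can succeed where consensus fails. It says nothing about the convergence guarantees claimed in the abstract.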