arxiv_cs_lg February 10, 2026

Constraint-Aware Generative Auto-bidding via Pareto-Prioritized Regret Optimization

Translated: 2026/3/15 15:03:44

generative-models, recommender-systems, regret-optimization, constraint-satisfaction, reinforcement-learning

Original Content

arXiv:2602.08261v1
Announce Type: new

Abstract: Auto-bidding systems aim to maximize marketing value while satisfying strict efficiency constraints such as Target Cost-Per-Action (CPA). Although Decision Transformers provide powerful sequence modeling capabilities, applying them to this constrained setting encounters two challenges: 1) standard Return-to-Go conditioning causes state aliasing by neglecting the cost dimension, preventing precise resource pacing; and 2) standard regression forces the policy to mimic average historical behaviors, thereby limiting the capacity to optimize performance toward the constraint boundary. To address these challenges, we propose PRO-Bid, a constraint-aware generative auto-bidding framework based on two synergistic mechanisms: 1) Constraint-Decoupled Pareto Representation (CDPR) decomposes global constraints into recursive cost and value contexts to restore resource perception, while reweighting trajectories based on the Pareto frontier to focus on high-efficiency data; and 2) Counterfactual Regret Optimization (CRO) facilitates active improvement by utilizing a global outcome predictor to identify superior counterfactual actions. By treating these high-utility outcomes as weighted regression targets, the model transcends historical averages to approach the optimal constraint boundary. Extensive experiments on two public benchmarks and online A/B tests demonstrate that PRO-Bid achieves superior constraint satisfaction and value acquisition compared to state-of-the-art baselines.
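The abstract gives no formulas, so the following is only a minimal sketch of the two ideas behind CDPR, under stated assumptions: that the "recursive cost and value contexts" can be read as a per-step pair of remaining value and remaining budget, and that the Pareto frontier is taken over each trajectory's total value and total cost. The function names `remaining_contexts` and `pareto_front` are hypothetical and are not taken from PRO-Bid.

```python
from typing import List, Sequence, Tuple


def remaining_contexts(
    values: Sequence[float], costs: Sequence[float], budget: float
) -> List[Tuple[float, float]]:
    """Return per-step (value_to_go, budget_left) pairs for one trajectory.

    Assumption: value_to_go is the value still to be collected from the
    current step onward, and budget_left is the budget not yet spent before
    the current step. Both are updated recursively, one step at a time.
    """
    contexts = []
    value_to_go = sum(values)
    budget_left = budget
    for value, cost in zip(values, costs):
        contexts.append((value_to_go, budget_left))
        value_to_go -= value
        budget_left -= cost
    return contexts


def pareto_front(points: Sequence[Tuple[float, float]]) -> List[bool]:
    """Flag the (total_value, total_cost) points that no other point dominates.

    Assumption: a point is dominated when another point has value at least
    as high and cost at least as low, and is strictly better in one of them.
    """
    flags = []
    for i, (vi, ci) in enumerate(points):
        dominated = any(
            vj >= vi and cj <= ci and (vj > vi or cj < ci)
            for j, (vj, cj) in enumerate(points)
            if j != i
        )
        flags.append(not dominated)
    return flags
```

Two trajectories with the same remaining value but different remaining budget receive different contexts here, which is the state aliasing that value-only Return-to-Go conditioning cannot resolve. The frontier flags could then serve as, or be smoothed into, per-trajectory training weights; how PRO-Bid actually computes its weights is not stated in the abstract.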