arxiv_cs_cv 2026年2月10日

SpatialReward: 明示的な空間推論による、画像編集向けオンライン RL における認識ギャップの架橋

SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning

Translated: 2026/3/15 18:03:13

reinforcement-learning computer-vision image-editing artificial-intelligence evaluation-benchmarks

Japanese Translation

arXiv:2602.07458v1 Announce Type: new オンライン強化学習（RL）は複雑な画像編集に対する有望なアプローチですが、現状では、信頼性が高くきめ細かな報酬シグナルの不足によって制約されています。既存の評価器は、私たちが「Attention Collapse（注意崩壊）」と呼ぶ重大な認識ギャップにしばしば直面します。これは、モデルが画像間の比較を怠り、細部の特徴を捉え損ねる現象であり、不正確な認識と較正の取れていないスコアを招きます。これらの限界に対処するために、明示的な空間推論によって正確な検証を強制する報酬モデル SpatialReward を提案します。推論を予測された編集領域に結び付けることで、SpatialReward は意味的な判断をピクセルレベルの証拠に根拠付け、評価精度を大幅に向上させます。厳選された 260k 件の空間認識データセットで訓練された本モデルは、MMRB2 と EditReward-Bench で最先端（state-of-the-art）の性能を達成し、私たちが提案する MultiEditReward-Bench ではプロプライエタリな評価器を上回りました。さらに、SpatialReward はオンライン RL における頑健なシグナルとして機能し、GEdit-Bench 上で OmniGen2 を +0.90 向上させました。これは主要な識別モデルを上回り、GPT-4.1 による向上幅（+0.45）の 2 倍に相当します。これらの結果は、画像編集における効果的なアライメントを実現するうえで空間推論が不可欠であることを示しています。

Original Content

arXiv:2602.07458v1 Announce Type: new Abstract: Online Reinforcement Learning (RL) offers a promising avenue for complex image editing but is currently constrained by the scarcity of reliable and fine-grained reward signals. Existing evaluators frequently struggle with a critical perception gap we term "Attention Collapse," where models neglect cross-image comparisons and fail to capture fine-grained details, resulting in inaccurate perception and miscalibrated scores. To address these limitations, we propose SpatialReward, a reward model that enforces precise verification via explicit spatial reasoning. By anchoring reasoning to predicted edit regions, SpatialReward grounds semantic judgments in pixel-level evidence, significantly enhancing evaluative accuracy. Trained on a curated 260k spatial-aware dataset, our model achieves state-of-the-art performance on MMRB2 and EditReward-Bench, and outperforms proprietary evaluators on our proposed MultiEditReward-Bench. Furthermore, SpatialReward serves as a robust signal in online RL, boosting OmniGen2 by +0.90 on GEdit-Bench--surpassing the leading discriminative model and doubling the gain of GPT-4.1 (+0.45). These results demonstrate that spatial reasoning is essential for unlocking effective alignment in image editing.