arxiv_cs_cv April 24, 2026

Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

Translated: 2026/4/24 19:48:16

reinforcement-learning, gui-grounding, visual-critic, co-evolution, spatial-localization

Original Content

arXiv:2604.21268v1
Announce Type: cross

Abstract: Graphical User Interface (GUI) grounding requires mapping natural language instructions to precise pixel coordinates. However, due to visually homogeneous elements and dense layouts, models typically grasp semantic intent yet struggle with achieving precise localization. While scaling sampling attempts (Pass@k) reveals potential gains, static self-consistency strategies derived from geometric clustering often yield limited improvements, as the model's predictions tend to be spatially dispersed. In this paper, we propose replacing static consistency strategies with a learnable selection mechanism that selects the optimal target by critiquing its own proposals rendered on the screenshot. Given the significant disparity between the model's grounding and critiquing capabilities, we propose a co-evolving Propose-then-Critic framework. To jointly optimize these, we introduce a maturity-aware adaptive co-evolutionary reinforcement learning paradigm. This approach dynamically balances the training objectives of proposer and critic, where the diversity of the proposer's outputs enhances critic robustness, while the critic's maturing discrimination capability conversely unlocks the proposer's potential for extensive spatial exploration, fostering the mutual reinforcement and co-evolution of both capabilities, thereby ensuring generalizability to adapt to diverse and complex interface layouts. Extensive experiments over 6 benchmarks show that our method significantly enhances both grounding accuracy and critic reliability.
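
The contrast the abstract draws between static self-consistency and learned selection can be illustrated with a small Python sketch. It is not the authors' implementation: the function names (cluster_vote, critic_select, pass_at_k), the 30-pixel radius, and the toy scoring function are all assumptions made for illustration, and the "critic" here is a hand-written stand-in rather than a trained model. It only shows why a correct proposal among k samples (a Pass@k hit) can still be lost by geometric voting when predictions are spatially dispersed.

```python
from math import hypot
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def hit(point: Point, box: Box) -> bool:
    """True if the click point falls inside the target element's box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def pass_at_k(points: Sequence[Point], box: Box) -> bool:
    """Pass@k: at least one of the k sampled proposals hits the target."""
    return any(hit(p, box) for p in points)


def cluster_vote(points: Sequence[Point], radius: float = 30.0) -> Point:
    """Static self-consistency (assumed variant): centroid of the densest
    neighborhood, where a neighborhood is all points within `radius` of a
    proposal. Ties keep the first neighborhood found."""
    if not points:
        raise ValueError("need at least one proposal")
    best: List[Point] = []
    for cx, cy in points:
        near = [(x, y) for x, y in points if hypot(x - cx, y - cy) <= radius]
        if len(near) > len(best):
            best = near
    n = len(best)
    return (sum(x for x, _ in best) / n, sum(y for _, y in best) / n)


def critic_select(points: Sequence[Point],
                  score: Callable[[Point], float]) -> Point:
    """Learned selection, reduced to its interface: pick the proposal the
    critic scores highest. `score` is a placeholder for a trained critic."""
    if not points:
        raise ValueError("need at least one proposal")
    return max(points, key=score)


if __name__ == "__main__":
    target: Box = (395.0, 105.0, 425.0, 135.0)
    # Dispersed proposals: two land near a look-alike element, one on target.
    proposals = [(100.0, 100.0), (110.0, 104.0), (410.0, 120.0), (250.0, 300.0)]

    # Toy critic (assumption): prefers points close to the target centre.
    def toy_score(p: Point) -> float:
        return -hypot(p[0] - 410.0, p[1] - 120.0)

    print("pass@k:", pass_at_k(proposals, target))
    print("cluster vote hits:", hit(cluster_vote(proposals), target))
    print("critic pick hits:", hit(critic_select(proposals, toy_score), target))
```

In this toy case the densest cluster sits on the wrong element, so the geometric vote misses even though one proposal is correct, while any scorer that ranks the correct proposal first recovers it. How the paper's critic is trained to produce such a ranking, and how its objective is balanced against the proposer's, is not specified in the abstract and is not modelled here.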