arxiv_cs_cv April 24, 2026

SGG-R$^{\rm 3}$: Toward End-to-End Unbiased Scene Graph Generation, Starting from Token Prediction

SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation

Translated: 2026/4/24 19:51:37

scene-graph-generation, multimodal-large-language-models, reinforcement-learning, chain-of-thought, supervised-fine-tuning

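Translated Abstract

arXiv:2603.07961v3 Announce Type: replace Abstract: Scene graph generation (SGG) structures a visual scene as a graph of objects and their relations. Although multimodal large language models (MLLMs) have advanced end-to-end SGG, current methods are limited by a lack of task-specific structured reasoning and by sparse, long-tailed relation distributions, and so produce incomplete scene graphs with low recall and biased predictions. To address these problems, we introduce SGG-R$^{\rm 3}$, a structured reasoning framework that integrates task-specific chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and reinforcement learning (RL) with group sequence policy optimization (GSPO), and that proceeds through three sequential stages to achieve end-to-end unbiased scene graph generation. In the SFT stage, we propose a relation augmentation strategy that uses an MLLM, refined by embedding similarity filtering, to alleviate relation sparsity. In the RL stage, a stage-aligned reward scheme then optimizes the procedural reasoning. Specifically, we propose a novel dual-granularity reward that integrates fine-grained and coarse-grained relation rewards, mitigating the long-tail problem through frequency-based adaptive weighting of predicates and improving relation coverage through semantic clustering. Experiments on two benchmarks show that SGG-R$^{\rm 3}$ outperforms existing methods, demonstrating the effectiveness and generalization of the framework.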

Original Content

arXiv:2603.07961v3 Announce Type: replace Abstract: Scene Graph Generation (SGG) structures visual scenes as graphs of objects and their relations. While Multimodal Large Language Models (MLLMs) have advanced end-to-end SGG, current methods are hindered by both a lack of task-specific structured reasoning and the challenges of sparse, long-tailed relation distributions, resulting in incomplete scene graphs characterized by low recall and biased predictions. To address these issues, we introduce SGG-R$^{\rm 3}$, a structured reasoning framework that integrates task-specific chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and reinforcement learning (RL) with group sequence policy optimization (GSPO), designed to engage in three sequential stages to achieve end-to-end unbiased scene graph generation. During the SFT phase, we propose a relation augmentation strategy by leveraging an MLLM and refined via embedding similarity filtering to alleviate relation sparsity. Subsequently, a stage-aligned reward scheme optimizes the procedural reasoning during RL. Specifically, we propose a novel dual-granularity reward which integrates fine-grained and coarse-grained relation rewards, simultaneously mitigating the long-tail issue via frequency-based adaptive weighting of predicates and improving relation coverage through semantic clustering. Experiments on two benchmarks show that SGG-R$^{\rm 3}$ achieves superior performance compared to existing methods, demonstrating the effectiveness and generalization of the framework.
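
The dual-granularity reward described in the abstract can be illustrated with a minimal sketch. The abstract does not give the reward formula, so everything concrete here is an assumption made for illustration: the inverse-frequency weighting with a `power` exponent, the `alpha` mixing coefficient, the hand-written `cluster_of` mapping standing in for semantic clustering, and the function names themselves. It is not the authors' implementation; it only shows how a fine-grained term (exact triplet match, weighted toward rare predicates) and a coarse-grained term (match at the predicate-cluster level) might be combined.

```python
def frequency_weights(predicate_counts, power=0.5):
    """Assumed weighting: rarer predicates get larger weights.

    Weights are proportional to (1 / count) ** power and rescaled so that
    their mean is 1. The exact form used in the paper is not stated in the
    abstract; this is a hypothetical stand-in.
    """
    raw = {p: (1.0 / c) ** power for p, c in predicate_counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {p: w / mean for p, w in raw.items()}


def dual_granularity_reward(predicted, ground_truth, weights, cluster_of, alpha=0.5):
    """Combine a fine-grained and a coarse-grained relation reward.

    predicted, ground_truth: sets of (subject, predicate, object) triplets.
    weights: predicate -> frequency-based weight (missing predicates use 1.0).
    cluster_of: predicate -> cluster label (missing predicates form their own cluster).
    alpha: hypothetical mixing coefficient between the two terms.
    """
    if not ground_truth:
        return 0.0

    # Fine-grained term: weighted recall over exact triplet matches,
    # so recovering a rare predicate contributes more than a frequent one.
    total_weight = sum(weights.get(p, 1.0) for _, p, _ in ground_truth)
    hit_weight = sum(
        weights.get(p, 1.0) for (s, p, o) in ground_truth if (s, p, o) in predicted
    )
    fine = hit_weight / total_weight

    # Coarse-grained term: recall after mapping predicates to their clusters,
    # so a semantically close predicate still earns partial credit.
    predicted_coarse = {(s, cluster_of.get(p, p), o) for (s, p, o) in predicted}
    coarse_hits = sum(
        1 for (s, p, o) in ground_truth if (s, cluster_of.get(p, p), o) in predicted_coarse
    )
    coarse = coarse_hits / len(ground_truth)

    return alpha * fine + (1.0 - alpha) * coarse


# Toy data (made up for illustration, not taken from any benchmark).
counts = {"on": 100, "standing on": 4, "holding": 25}
clusters = {"on": "support", "standing on": "support", "holding": "contact"}
w = frequency_weights(counts)

truth = {("man", "standing on", "board"), ("man", "holding", "cup")}
guess = {("man", "on", "board"), ("man", "holding", "cup")}
print(dual_granularity_reward(guess, truth, w, clusters))
```

In this toy case the guess uses the frequent predicate "on" where the ground truth has the rarer "standing on". The fine-grained term penalizes that miss heavily because of the rarity weight, while the coarse-grained term still gives credit because both predicates fall in the same assumed cluster.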