arxiv_cs_cv April 20, 2026

Find, Fix, Reason: Context Repair for Video Reasoning

Translated: 2026/4/20 10:46:22

video-reasoning, reinforcement-learning, context-repair, multi-modal, grpo

Original Content

arXiv:2604.16243v1 (announce type: new)

Abstract: Reinforcement learning has advanced video reasoning in large multi-modal models, yet dominant pipelines rely either on on-policy self-exploration, which plateaus at the model's knowledge boundary, or on hybrid replay, which mixes policies and demands careful regularization. Dynamic context methods zoom into focused evidence but often require curated pretraining and two-stage tuning, and their context remains bounded by a small model's capability. In contrast, larger models excel at instruction following and multi-modal understanding, can supply richer context to smaller models, and can rapidly zoom in on target regions via simple tools. Building on this capability, we introduce an observation-level intervention: a frozen, tool-integrated teacher identifies the missing spatiotemporal dependency and provides a minimal evidence patch (e.g., timestamps or regions) from the original video while the question remains unchanged. The student answers again with the added context, and training updates with a chosen-rollout scheme integrated into Group Relative Policy Optimization (GRPO). We further propose a Robust Improvement Reward (RIR) that aligns optimization with two goals: outcome validity through correct answers, and dependency alignment through rationales that reflect the cited evidence. Advantages are group-normalized across the batch, preserving on-policy exploration while directing it along causally meaningful directions with minimal changes to the training stack. Experiments on various related benchmarks show consistent accuracy gains and strong generalization. Web page and source code will be available at https://github.com/JethroJames/FFR.git.
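
The group-normalized advantage mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard GRPO form, in which each rollout's reward is standardized against the mean and standard deviation of its group. The abstract does not define RIR, so `toy_reward` and its `alignment_weight` parameter are hypothetical placeholders for illustration only, not the paper's reward.

```python
from statistics import mean, pstdev


def toy_reward(answer_correct, rationale_cites_evidence, alignment_weight=0.5):
    """Hypothetical placeholder reward (NOT the paper's RIR).

    Adds an outcome term (correct answer) and an alignment term
    (rationale reflects the cited evidence), weighted by an assumed constant.
    """
    return float(answer_correct) + alignment_weight * float(rationale_cites_evidence)


def group_normalized_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts (standard GRPO form).

    A_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for the same question: (answer correct?, rationale cites evidence?)
    rollouts = [(True, True), (True, False), (False, False), (False, False)]
    rewards = [toy_reward(c, e) for c, e in rollouts]
    for r, a in zip(rewards, group_normalized_advantages(rewards)):
        print(f"reward={r:.2f} advantage={a:+.3f}")
```

Because the advantages are centered on the group mean, rollouts that answer correctly and cite the evidence receive positive advantages while the remaining rollouts receive negative ones. When every rollout in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.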