arxiv_cs_cv February 10, 2026

Process-of-Thought Reasoning for Videos

Translated: 2026/3/15 18:05:41

video-understanding, process-of-thought, vision-language, reasoning, multi-step-reasoning

Original Content

arXiv:2602.07689v1 (Announce Type: new)

Abstract: Video understanding requires not only recognizing visual content but also performing temporally grounded, multi-step reasoning over long and noisy observations. We propose Process-of-Thought (PoT) Reasoning for Videos, a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence. The framework is designed to be model-agnostic and can be plugged into existing vision-language backbones, supporting both closed-book reasoning and evidence-augmented reasoning with external tools. We further introduce a unified representation for PoT traces that aligns intermediate decisions with temporal segments, which improves robustness to distractors and reduces hallucinated explanations. Extensive experiments on standard video reasoning tasks demonstrate that PoT consistently improves factual correctness and temporal grounding, while providing interpretable reasoning traces for diagnosis and downstream use.
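The abstract gives no implementation details, so the following is only an illustrative sketch of how the three interleaved stages and a segment-aligned trace might fit together. Every name here (`PoTStep`, `PoTTrace`, `run_pot`, the `toy_*` functions, and the caption table standing in for a video) is a hypothetical placeholder, not the paper's API. In a real system the three callables would presumably wrap a vision-language backbone.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical representation of a temporal segment: (start_sec, end_sec).
Segment = Tuple[float, float]


@dataclass
class PoTStep:
    """One trace entry: an intermediate decision tied to a temporal segment."""
    segment: Segment
    observation: str
    state: str  # running hypothesis after this step


@dataclass
class PoTTrace:
    question: str
    steps: List[PoTStep] = field(default_factory=list)
    answer: Optional[str] = None


def run_pot(
    question: str,
    options: List[str],
    select: Callable[[str, str, Set[Segment]], Optional[Segment]],
    update: Callable[[str, Segment], Tuple[str, str]],
    synthesize: Callable[[List[PoTStep], List[str]], Optional[str]],
    max_steps: int = 4,
) -> PoTTrace:
    """Interleave evidence selection and state updates, then synthesize an answer."""
    trace = PoTTrace(question)
    state = ""
    seen: Set[Segment] = set()
    for _ in range(max_steps):
        segment = select(question, state, seen)      # (i) temporal evidence selection
        if segment is None:
            break
        seen.add(segment)
        observation, state = update(state, segment)  # (ii) step-wise state update
        trace.steps.append(PoTStep(segment, observation, state))
    trace.answer = synthesize(trace.steps, options)  # (iii) constrained answer synthesis
    return trace


# Toy stand-ins (assumptions): a caption table replaces real video features.
CAPTIONS: Dict[Segment, str] = {
    (0.0, 5.0): "a person opens the fridge",
    (5.0, 10.0): "a dog runs past the window",  # distractor segment
    (10.0, 15.0): "the person pours milk into a glass",
}


def toy_select(question: str, state: str, seen: Set[Segment]) -> Optional[Segment]:
    # Pick the earliest unseen segment sharing a keyword (length > 3) with the question.
    keywords = {w for w in question.lower().rstrip("?").split() if len(w) > 3}
    for segment in sorted(CAPTIONS):
        if segment not in seen and keywords & set(CAPTIONS[segment].split()):
            return segment
    return None


def toy_update(state: str, segment: Segment) -> Tuple[str, str]:
    # Append the new observation to the running hypothesis.
    observation = CAPTIONS[segment]
    return observation, (state + "; " + observation) if state else observation


def toy_synthesize(steps: List[PoTStep], options: List[str]) -> Optional[str]:
    # Constrained synthesis: only return an option supported by a recorded observation.
    for option in options:
        if any(option in step.observation for step in steps):
            return option
    return None


trace = run_pot(
    "What does the person pour?",
    ["juice", "milk", "water"],
    toy_select,
    toy_update,
    toy_synthesize,
)
for step in trace.steps:
    print(step.segment, "->", step.observation)
print("answer:", trace.answer)
```

Because each step records the segment it relied on, the final answer can be traced back to specific time spans, which is the kind of traceability the abstract describes. The keyword-overlap selector is only a stand-in; it skips the distractor segment in this toy case but says nothing about how the paper's selection actually works.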