arxiv_cs_cv April 24, 2026

Building a Precise Video Language with Human-AI Oversight

Translated: 2026/4/24 19:46:09

video-language-models, video-captioning, human-ai-oversight, reinforcement-learning, computer-vision

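Translation

arXiv:2604.21718v1 Announce Type: new

Abstract: Video-language models (VLMs) reason about and learn the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes that enable scalable oversight and achieve precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, space, and camera dynamics, grounded in hundreds of carefully defined visual primitives developed together with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework in which trained experts critique model-generated pre-captions and revise them into improved post-captions. This division of labor improves annotation accuracy and efficiency by leaving text generation to the model, so that humans can concentrate on verification. In addition, the critiques and the preferences between pre- and post-captions provide rich supervision for improving an open-source model (Qwen3-VL) at caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that the precision, recall, and constructiveness of critiques, which our oversight framework ensures, directly govern downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-captioning large-scale professional video (e.g., films, commercials, games) and to fine-tuning video generation models such as Wan so that they better follow detailed prompts of up to 400 words, giving finer control over cinematography, including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on the project page: https://linzhiqiu.github.io/papers/chai/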

Original Content

arXiv:2604.21718v1 Announce Type: new

Abstract: Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/
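
As a rough illustration of how a single critique-and-revise record could feed the three training signals named in the abstract (caption generation via SFT, preference learning for DPO or reward modeling, and critique generation), the Python sketch below converts one record into each example type. The `ChaiRecord` schema, its field names, and the helper functions are hypothetical and chosen only for illustration; they are not the released data format or the authors' code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ChaiRecord:
    """One annotated clip. Hypothetical schema, not the released data format."""
    video_id: str
    pre_caption: str   # model-generated draft caption
    critique: str      # expert critique of the draft
    post_caption: str  # expert-revised caption


def to_sft_example(rec: ChaiRecord) -> Dict[str, str]:
    """Caption generation: the revised caption is the training target."""
    return {"video_id": rec.video_id, "target": rec.post_caption}


def to_critique_example(rec: ChaiRecord) -> Dict[str, str]:
    """Critique generation: given the draft, the target is the expert critique."""
    return {
        "video_id": rec.video_id,
        "input": rec.pre_caption,
        "target": rec.critique,
    }


def to_preference_pair(rec: ChaiRecord) -> Dict[str, str]:
    """DPO / reward modeling: the post-caption is preferred over the pre-caption."""
    return {
        "video_id": rec.video_id,
        "chosen": rec.post_caption,
        "rejected": rec.pre_caption,
    }


def build_preference_pairs(records: List[ChaiRecord]) -> List[Dict[str, str]]:
    """Keep only records where the expert actually changed the draft (assumed filter)."""
    return [
        to_preference_pair(r)
        for r in records
        if r.post_caption.strip() != r.pre_caption.strip()
    ]


if __name__ == "__main__":
    # Toy records invented for this example.
    records = [
        ChaiRecord(
            video_id="clip_001",
            pre_caption="A man walks down a street.",
            critique="The draft omits the leftward camera pan.",
            post_caption="A man walks down a street as the camera pans left.",
        ),
        ChaiRecord(
            video_id="clip_002",
            pre_caption="A dog runs across a field.",
            critique="No errors found.",
            post_caption="A dog runs across a field.",
        ),
    ]
    for pair in build_preference_pairs(records):
        print(pair["video_id"], "->", pair["chosen"])
```

Dropping records whose revision equals the draft is an assumption of this sketch: an unchanged caption offers no chosen/rejected contrast, although it could still serve as an SFT target.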