arxiv_cs_cv 2026/2/10

Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition

Translated: 2026/3/16 14:05:04

multimodal-language-models, video-understanding, in-context-learning, machine-learning, benchmark-design

Japanese Translation

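arXiv:2602.08439v1 Announce Type: new Abstract: Although the video understanding capabilities of recent Multimodal Large Language Models (MLLMs) are improving rapidly, existing video benchmarks tend to evaluate models on their static internal knowledge rather than on their ability to learn and adapt from dynamic, novel contexts given only a few examples. To bridge this gap, this paper presents a new task, Demo-driven Video In-Context Learning, which focuses on learning from in-context demonstrations to answer questions about target videos. It also proposes Demo-ICL-Bench, a challenging benchmark for evaluating demo-driven video in-context learning capabilities. Demo-ICL-Bench is built from 1,200 instructional YouTube videos with associated questions, from which two types of demonstrations are derived: (i) text demonstrations that summarize the video subtitles and (ii) video demonstrations, which are the corresponding instructional videos. To tackle this new task effectively, the paper develops Demo-ICL, an MLLM trained with a two-stage strategy, video-supervised fine-tuning and information-assisted direct preference optimization, to improve its ability to learn from in-context examples. Extensive experiments with state-of-the-art MLLMs confirm the difficulty of Demo-ICL-Bench, demonstrate the effectiveness of Demo-ICL, and point to future research directions.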

Original Content

arXiv:2602.08439v1 Announce Type: new Abstract: Despite the growing video understanding capabilities of recent Multimodal Large Language Models (MLLMs), existing video benchmarks primarily assess understanding based on models' static, internal knowledge, rather than their ability to learn and adapt from dynamic, novel contexts from few examples. To bridge this gap, we present Demo-driven Video In-Context Learning, a novel task focused on learning from in-context demonstrations to answer questions about the target videos. Alongside this, we propose Demo-ICL-Bench, a challenging benchmark designed to evaluate demo-driven video in-context learning capabilities. Demo-ICL-Bench is constructed from 1200 instructional YouTube videos with associated questions, from which two types of demonstrations are derived: (i) summarizing video subtitles for text demonstration; and (ii) corresponding instructional videos as video demonstrations. To effectively tackle this new challenge, we develop Demo-ICL, an MLLM with a two-stage training strategy: video-supervised fine-tuning and information-assisted direct preference optimization, jointly enhancing the model's ability to learn from in-context examples. Extensive experiments with state-of-the-art MLLMs confirm the difficulty of Demo-ICL-Bench, demonstrate the effectiveness of Demo-ICL, and thereby unveil future research directions.
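
The abstract names direct preference optimization as the second training stage but does not say how the "information-assisted" variant differs from the standard formulation. For orientation only, here is a minimal sketch of the standard per-example DPO loss, which may or may not match what Demo-ICL uses. The function name `dpo_loss`, its parameter names, and the default `beta` are illustrative assumptions, not taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-example DPO loss (assumed form, not the paper's variant).

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    # Implicit reward of each response: how much more likely the policy
    # makes it compared with the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp

    # Scaled margin between the preferred and the rejected response.
    z = beta * (chosen_reward - rejected_reward)

    # loss = -log(sigmoid(z)) = log(1 + exp(-z)), computed so that
    # exp() never receives a large positive argument.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

The loss shrinks as the policy raises the preferred answer's likelihood relative to the rejected one, measured against the reference model. In a video in-context setting, the preferred and rejected answers would presumably both be conditioned on the demonstration and the target video, but the abstract does not spell this out.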