arxiv_cs_cv February 10, 2026

D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning

Translated: 2026/3/15 19:03:06

d-orca, omni-modal-llm, audio-visual-captions, speech-recognition, reinforcement-learning

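Japanese Translation (in English)

arXiv:2602.07960v1 Announce Type: new

Abstract: Spoken dialogue is a major source of information in video, so pinpointing who said what, and when, is essential to deep video understanding. We present D-ORCA, a dialogue-centric omni-modal large language model optimized for robust audio-visual captioning. We also assemble DVD, a large-scale, high-quality bilingual dataset containing nearly 40,000 multi-party dialogue videos for training and 2,000 videos for evaluation, in English and Mandarin. The dataset fills a gap in the open-source ecosystem. To secure fine-grained caption accuracy, we adopt group relative policy optimization with three new reward functions, which score speaker attribution accuracy, overall speech content accuracy, and agreement of sentence-level temporal boundaries. The rewards are derived from evaluation metrics in wide use in speech processing and, to our knowledge, this is the first time they have been applied as reinforcement learning objectives for audio-visual captioning. Extensive experiments show that D-ORCA substantially outperforms existing open-source models in speaker identification, speech recognition, and temporal grounding. Notably, although it has only 8 billion parameters, D-ORCA performs on par with Qwen3-Omni on several general-purpose audio-visual understanding benchmarks. Demos are available at https://d-orca-llm.github.io/. The code, data, and checkpoints will be available at https://github.com/WeChatCV/D-ORCA/.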

Original Content

arXiv:2602.07960v1 Announce Type: new

Abstract: Spoken dialogue is a primary source of information in videos; therefore, accurately identifying who spoke what and when is essential for deep video understanding. We introduce D-ORCA, a dialogue-centric omni-modal large language model optimized for robust audio-visual captioning. We further curate DVD, a large-scale, high-quality bilingual dataset comprising nearly 40,000 multi-party dialogue videos for training and 2,000 videos for evaluation in English and Mandarin, addressing a critical gap in the open-source ecosystem. To ensure fine-grained captioning accuracy, we adopt group relative policy optimization with three novel reward functions that assess speaker attribution accuracy, global speech content accuracy, and sentence-level temporal boundary alignment. These rewards are derived from evaluation metrics widely used in speech processing and, to our knowledge, are applied for the first time as reinforcement learning objectives for audio-visual captioning. Extensive experiments demonstrate that D-ORCA substantially outperforms existing open-source models in speaker identification, speech recognition, and temporal grounding. Notably, despite having only 8 billion parameters, D-ORCA achieves performance competitive with Qwen3-Omni across several general-purpose audio-visual understanding benchmarks. Demos are available at https://d-orca-llm.github.io/. Our code, data, and checkpoints will be available at https://github.com/WeChatCV/D-ORCA/.
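
The abstract does not give the exact form of the three rewards, so the following is only a minimal sketch of how metric-based rewards might feed a group-relative update. It assumes a speech content reward of 1 minus word error rate, a temporal reward based on interval intersection over union, and the commonly used group-relative normalization (reward minus group mean, divided by group standard deviation). All function names are hypothetical, the speaker attribution reward is omitted, and none of this is taken from the D-ORCA code.

```python
from statistics import mean, pstdev


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(substitution, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1] / len(ref)


def content_reward(reference: str, hypothesis: str) -> float:
    """Assumed reward shape: 1 - WER, floored at 0."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """Intersection over union of two (start, end) intervals in seconds."""
    intersection = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union > 0 else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of captions sampled for the same video."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: three sampled transcripts for one reference utterance.
reference = "hello how are you"
samples = ["hello how are you", "hello who are you", "how you"]
rewards = [content_reward(reference, s) for s in samples]
advantages = group_relative_advantages(rewards)
```

In this sketch the caption with the lowest word error rate receives a positive advantage and the worst one a negative advantage, which is the signal a group-relative policy update would use to favor more accurate captions. How the three rewards are weighted or combined is not stated in the abstract and is left open here.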