arxiv_cs_cv February 10, 2026

CAE-AV: Improving Audio-Visual Learning through Cross-Modal Interaction

CAE-AV: Improving Audio-Visual Learning via Cross-modal Interactive Enrichment

Translated: 2026/3/16 14:04:18

cae-av, audio-visual-learning, cross-modal, visual-attention, machine-learning

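Japanese Translation (rendered in English)

arXiv:2602.08309v1 Announce Type: new Abstract: Audio-visual learning suffers from modality misalignment caused by off-screen sources and background clutter, and current methods amplify irrelevant regions or moments, which leads to unstable training and degraded quality. To address this problem, we propose the Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning. CAE-AV uses two complementary modules to mitigate modality misalignment: Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE). CASTE dynamically balances spatial and temporal relations by evaluating frame-level audio-visual agreement, so that key information is captured from preceding and subsequent frames even when misalignment occurs. CASE injects cross-modal semantic guidance at selected spatio-temporal positions and uses high-level semantic cues to further reduce misalignment. In addition, we design lightweight objectives (caption-to-modality InfoNCE, visual-audio consistency, and entropy regularization) to guide token selection and strengthen cross-modal semantic alignment. With frozen backbones, CAE-AV achieves state-of-the-art performance on the AVE, AVVP, AVS, and AVQA benchmarks, and qualitative analysis further confirms its robustness to audio-visual misalignment.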

Original Content

arXiv:2602.08309v1 Announce Type: new Abstract: Audio-visual learning suffers from modality misalignment caused by off-screen sources and background clutter, and current methods usually amplify irrelevant regions or moments, leading to unstable training and degraded representation quality. To address this challenge, we propose a novel Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning, which uses two complementary modules to relieve audio-visual misalignment: Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE). CASTE dynamically balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level semantic cues to further alleviate misalignment. In addition, we design lightweight objectives (caption-to-modality InfoNCE, visual-audio consistency, and entropy regularization) to guide token selection and strengthen cross-modal semantic alignment. With frozen backbones, CAE-AV achieves state-of-the-art performance on the AVE, AVVP, AVS, and AVQA benchmarks, and qualitative analyses further validate its robustness against audio-visual misalignment.
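
The caption-to-modality InfoNCE and entropy regularization objectives named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulations, so the code below uses the standard batch-wise InfoNCE form and a plain Shannon entropy over selection weights. The function names, the temperature of 0.07, and the toy embeddings are all assumptions made for illustration, not CAE-AV's implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def info_nce(anchors, positives, temperature=0.07):
    """Standard batch-wise InfoNCE (hypothetical helper, not from the paper).

    anchors[i] (e.g. a caption embedding) should match positives[i]
    (e.g. an audio or visual embedding); every other positive in the
    batch serves as a negative. The temperature is an assumed default.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        loss -= log_softmax(logits)[i]
    return loss / len(a)


def mean_entropy(weight_rows, eps=1e-12):
    """Mean Shannon entropy of token-selection distributions.

    Each row is assumed to be a probability distribution over tokens.
    """
    total = 0.0
    for row in weight_rows:
        total -= sum(w * math.log(w + eps) for w in row)
    return total / len(weight_rows)


# Toy data: two caption embeddings and two modality embeddings.
captions = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

print("InfoNCE, matched pairs:", info_nce(captions, matched))
print("InfoNCE, swapped pairs:", info_nce(captions, swapped))
print("Entropy, uniform selection:", mean_entropy([[0.5, 0.5]]))
print("Entropy, peaked selection:", mean_entropy([[1.0, 0.0]]))
```

In this toy setup, matched caption/modality pairs give a loss near zero and swapped pairs give a large loss, while a peaked selection has lower entropy than a uniform one. Whether the entropy term is minimized or maximized, and how the objectives are weighted against each other, is not stated in the abstract.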