arxiv_cs_cv April 20, 2026

Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions

Translated: 2026/4/20 10:42:58

embodied-ai, screen-view, emotion-understanding, multimodal-ml, rosalia

Original Content

arXiv:2604.15823v1
Announce Type: new

Abstract: Embodied robotic agents often perceive movies through an egocentric screen-view interface rather than native cinematic footage, introducing domain shifts such as viewpoint distortion, scale variation, illumination changes, and environmental interference. However, existing research on movie emotion understanding is almost exclusively conducted on cinematic footage, limiting cross-domain generalization to real-world viewing scenarios. To bridge this gap, we introduce EgoScreen-Emotion (ESE), the first benchmark dataset for egocentric screen-view movie emotion understanding. ESE contains 224 movie trailers captured under controlled egocentric screen-view conditions, producing 28,667 temporally aligned key-frames annotated by multiple raters with a confidence-aware multi-label protocol to address emotional ambiguity. We further build a multimodal long-context emotion reasoning framework that models temporal visual evidence, narrative summaries, compressed historical context, and audio cues. Cross-domain experiments reveal a severe domain gap: models trained on cinematic footage drop from 27.99 to 16.69 Macro-F1 when evaluated on realistic egocentric screen-view observations. Training on ESE substantially improves robustness under realistic viewing conditions. Our approach achieves competitive performance compared with strong closed-source multimodal models, highlighting the importance of domain-specific data and long-context multimodal reasoning.
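To put the reported cross-domain gap in concrete terms, the sketch below computes Macro-F1 for multi-label predictions and the size of the drop from 27.99 to 16.69. It rests on assumptions: the abstract does not say how Macro-F1 is computed, so this uses the standard definition (the unweighted mean of per-label F1), and the function names, the binary label layout, and the toy data are hypothetical, not taken from the paper.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-label F1 over binary multi-label matrices.

    Assumed layout (hypothetical): rows are key-frames, columns are emotion labels.
    A label with no true and no predicted positives contributes F1 = 0 here.
    """
    n_labels = len(y_true[0])
    f1_scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n_labels


def relative_drop(source_score: float, target_score: float) -> float:
    """Fraction of the source-domain score lost on the target domain."""
    return (source_score - target_score) / source_score


# Toy data (made up): 4 key-frames, 3 emotion labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
toy_score = macro_f1(y_true, y_pred)

# Figures reported in the abstract: cinematic footage vs. egocentric screen view.
absolute_gap = round(27.99 - 16.69, 2)      # Macro-F1 points
relative_gap = relative_drop(27.99, 16.69)  # fraction of the original score
```

With the abstract's figures, the gap is 11.30 Macro-F1 points, or roughly 40% of the score obtained on cinematic footage.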