arxiv_cs_cv 2026/2/10

Weak to Strong: VLM-Based Pseudo-Labeling as a Weakly Supervised Training Strategy in Multimodal Video-based Hidden Emotion Understanding Tasks

Translated: 2026/3/15 19:04:07

vlm, weak-supervision, multimodal, emotion-recognition, arxiv

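Translation (rendered in English)

arXiv:2602.08057v1
Announce Type: new

Abstract: To address the problem of automatically recognizing "hidden emotions" in video, this paper proposes a multimodal weakly supervised framework and achieves the best results on the iMiGUE tennis-interview dataset. First, YOLO 11x detects and crops human portraits frame by frame, and DINOv2-Base extracts visual features from the cropped regions. Next, combining Chain-of-Thought and Reflection prompting (CoT + Reflection), Gemini 2.5 Pro automatically generates pseudo-labels and reasoning texts, which serve as weak supervision for downstream models. OpenPose then produces 137-dimensional key-point sequences, to which inter-frame offset features are added. The usual graph neural network backbone is simplified to an MLP, which efficiently models the spatiotemporal relationships of the three key-point streams. An ultra-long-sequence Transformer encodes the image and key-point sequences independently, and their representations are concatenated with BERT-encoded interview transcripts. Each modality is first pre-trained separately and then fine-tuned jointly, with pseudo-labeled samples merged into the training set for further gains. Experiments show that, even under severe class imbalance, the proposed approach raises accuracy from below 0.6 in prior work to above 0.69, establishing a new public benchmark. The study also confirms that an "MLP-ified" key-point backbone can match or even exceed GCN-based counterparts on this task.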
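
The inter-frame offset step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the offsets are computed, so the code below assumes they are first-order differences between consecutive frames, with an all-zero offset for the first frame. The function name `add_interframe_offsets` and the flat per-frame layout are hypothetical and not taken from the paper.

```python
from typing import List

Frame = List[float]  # flattened key-point coordinates for one frame


def add_interframe_offsets(frames: List[Frame]) -> List[Frame]:
    """Append first-order temporal differences to each frame's features.

    Assumption (not stated in the abstract): the "offset" of frame t is
    frame[t] - frame[t-1], and the first frame gets an all-zero offset.
    """
    if not frames:
        return []
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all frames must have the same number of features")

    augmented = []
    previous = frames[0]  # comparing frame 0 with itself yields zeros
    for current in frames:
        offset = [c - p for c, p in zip(current, previous)]
        augmented.append(list(current) + offset)  # position + motion
        previous = current
    return augmented
```

Under this assumption each frame's feature vector doubles in length: the first half holds the key-point positions and the second half holds their frame-to-frame motion.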

Original Content

arXiv:2602.08057v1
Announce Type: new

Abstract: To tackle the automatic recognition of "concealed emotions" in videos, this paper proposes a multimodal weak-supervision framework and achieves state-of-the-art results on the iMiGUE tennis-interview dataset. First, YOLO 11x detects and crops human portraits frame-by-frame, and DINOv2-Base extracts visual features from the cropped regions. Next, by integrating Chain-of-Thought and Reflection prompting (CoT + Reflection), Gemini 2.5 Pro automatically generates pseudo-labels and reasoning texts that serve as weak supervision for downstream models. Subsequently, OpenPose produces 137-dimensional key-point sequences, augmented with inter-frame offset features; the usual graph neural network backbone is simplified to an MLP to efficiently model the spatiotemporal relationships of the three key-point streams. An ultra-long-sequence Transformer independently encodes both the image and key-point sequences, and their representations are concatenated with BERT-encoded interview transcripts. Each modality is first pre-trained in isolation, then fine-tuned jointly, with pseudo-labeled samples merged into the training set for further gains. Experiments demonstrate that, despite severe class imbalance, the proposed approach lifts accuracy from under 0.6 in prior work to over 0.69, establishing a new public benchmark. The study also validates that an "MLP-ified" key-point backbone can match, or even surpass, GCN-based counterparts in this task.
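
The final training step, in which pseudo-labeled samples are merged into the training set, might look roughly like the sketch below. The abstract gives no detail on how the merge is done, so everything here is an assumption for illustration: the `Sample` record, the `merge_training_sets` helper, and the rule that a human-annotated label overrides a VLM pseudo-label for the same clip are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Sample:
    clip_id: str     # hypothetical identifier for one video clip
    label: int       # class index
    is_pseudo: bool  # True when the label was generated by the VLM


def merge_training_sets(
    gold: Iterable[Sample], pseudo: Iterable[Sample]
) -> List[Sample]:
    """Combine human-labeled and pseudo-labeled samples into one set.

    Assumption: when a clip appears in both sources, the human label wins.
    The abstract only states that pseudo-labeled samples are merged in.
    """
    merged: Dict[str, Sample] = {}
    for sample in pseudo:
        merged[sample.clip_id] = sample
    for sample in gold:
        merged[sample.clip_id] = sample  # overrides any pseudo-label
    # Sort for a deterministic order regardless of input order.
    return sorted(merged.values(), key=lambda s: s.clip_id)
```

Keeping the `is_pseudo` flag on each record leaves room to weight or filter VLM-labeled samples later, although the abstract does not say whether the authors do so.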