arxiv_cs_cv February 10, 2026

MS-Mix: An Innovation for Multimodal Sentiment Analysis That Unlocks the Power of Mixup

MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis

Translated: 2026/3/15 15:00:42

multimodal-sentiment-analysis, mixup, deep-learning, augmentation, emotion-recognition

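Japanese Translation (rendered in English)

arXiv:2510.11579v2 Announce Type: replace Abstract: Multimodal sentiment analysis (MSA) aims to identify and interpret human emotions by integrating information from different data sources such as text, video, and audio. Deep learning models have advanced in neural network architecture design, but they remain constrained by the scarcity of annotated multimodal data. Mixup-based augmentation improves generalization in unimodal tasks, but applying it directly to MSA raises a serious problem: without a sentiment-aware mixing mechanism, random mixing often amplifies label ambiguity and produces semantic inconsistency. To overcome these issues, we propose MS-Mix, an adaptive, sentiment-aware augmentation framework that automatically optimizes sample mixing in multimodal settings. The key components of MS-Mix are: (1) a Sentiment-Aware Sample Selection (SASS) strategy that prevents the semantic confusion caused by mixing samples with contradictory emotions; (2) a Sentiment Intensity Guided (SIG) module that uses multi-head self-attention to dynamically compute modality-specific mixing ratios based on each modality's emotional intensity; and (3) a Sentiment Alignment Loss (SAL) that aligns prediction distributions across modalities and incorporates a Kullback-Leibler-based loss as an additional regularization term to jointly train the sentiment intensity predictor and the backbone network. Experiments on three benchmark datasets with six state-of-the-art backbones show that MS-Mix consistently outperforms existing methods, establishing a new standard for robust multimodal sentiment augmentation. The source code is available at: https://github.com/HongyuZhu-s/MS-Mix.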

Original Content

arXiv:2510.11579v2 Announce Type: replace Abstract: Multimodal Sentiment Analysis (MSA) aims to identify and interpret human emotions by integrating information from heterogeneous data sources such as text, video, and audio. While deep learning models have advanced in network architecture design, they remain heavily limited by scarce multimodal annotated data. Although Mixup-based augmentation improves generalization in unimodal tasks, its direct application to MSA introduces critical challenges: random mixing often amplifies label ambiguity and semantic inconsistency due to the lack of emotion-aware mixing mechanisms. To overcome these issues, we propose MS-Mix, an adaptive, emotion-sensitive augmentation framework that automatically optimizes sample mixing in multimodal settings. The key components of MS-Mix include: (1) a Sentiment-Aware Sample Selection (SASS) strategy that effectively prevents semantic confusion caused by mixing samples with contradictory emotions; (2) a Sentiment Intensity Guided (SIG) module that uses multi-head self-attention to compute modality-specific mixing ratios dynamically based on their respective emotional intensities; and (3) a Sentiment Alignment Loss (SAL) that aligns the prediction distributions across modalities and incorporates a Kullback-Leibler-based loss as an additional regularization term to train the emotion intensity predictor and the backbone network jointly. Extensive experiments on three benchmark datasets with six state-of-the-art backbones confirm that MS-Mix consistently outperforms existing methods, establishing a new standard for robust multimodal sentiment augmentation. The source code is available at: https://github.com/HongyuZhu-s/MS-Mix.
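The abstract names three components without giving formulas, so the following NumPy sketch only illustrates the general shape of the idea and is not the authors' implementation (see the linked repository for that). Every function name and criterion below is hypothetical. Partner selection is approximated by "same sentiment polarity". The per-modality mixing ratio is a simple intensity fraction rather than the multi-head self-attention module the abstract describes. The mixed label uses the mean ratio across modalities. The alignment term is a plain KL divergence between two modalities' softmax outputs.

```python
import numpy as np


def select_partners(labels, rng):
    """Pick a random partner with the same sentiment polarity for each sample.

    Hypothetical stand-in for sentiment-aware selection: samples whose labels
    have opposite signs are never paired. A sample with no compatible partner
    is paired with itself, which leaves it unmixed.
    """
    labels = np.asarray(labels, dtype=float)
    partners = np.arange(len(labels))
    for i, y in enumerate(labels):
        candidates = np.flatnonzero(np.sign(labels) == np.sign(y))
        candidates = candidates[candidates != i]
        if candidates.size:
            partners[i] = rng.choice(candidates)
    return partners


def mixing_ratios(intensity_a, intensity_b, eps=1e-8):
    """Ratio in [0, 1] per sample: the more intense sample gets more weight.

    Assumed rule for illustration; intensities must be non-negative.
    """
    return (intensity_a + eps) / (intensity_a + intensity_b + 2 * eps)


def mix_batch(features, labels, intensities, rng):
    """Mix each modality with its own ratio; mix labels with the mean ratio.

    features:    dict modality -> array of shape (batch, dim)
    intensities: dict modality -> array of shape (batch,), non-negative
    """
    labels = np.asarray(labels, dtype=float)
    partners = select_partners(labels, rng)
    mixed, ratios = {}, []
    for name, x in features.items():
        lam = mixing_ratios(intensities[name], intensities[name][partners])
        mixed[name] = lam[:, None] * x + (1.0 - lam[:, None]) * x[partners]
        ratios.append(lam)
    lam_label = np.mean(ratios, axis=0)
    mixed_labels = lam_label * labels + (1.0 - lam_label) * labels[partners]
    return mixed, mixed_labels, partners


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kl_alignment(logits_p, logits_q, eps=1e-12):
    """Mean KL(p || q) between two modalities' predicted distributions."""
    p, q = softmax(logits_p), softmax(logits_q)
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))


# Toy batch: two positive and two negative samples, two modalities.
rng = np.random.default_rng(0)
labels = np.array([2.0, 1.5, -1.0, -2.5])
features = {"text": rng.normal(size=(4, 8)), "audio": rng.normal(size=(4, 5))}
intensities = {name: np.abs(rng.normal(size=4)) for name in features}
mixed, mixed_labels, partners = mix_batch(features, labels, intensities, rng)
```

Because partners always share a polarity in this sketch, each mixed label is a convex combination of two same-sign scores and keeps that sign, which is the kind of label ambiguity the abstract says random mixing introduces. In a training loop, a term such as `kl_alignment(text_logits, audio_logits)` would be added to the task loss with a weighting coefficient; the weight and the choice of modality pairs are not specified in the abstract.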