arxiv_cs_ai February 10, 2026

PTS-SNN: A Prompt-Tuned Temporal Shift Spiking Neural Networks for Efficient Speech Emotion Recognition

Translated: 2026/3/7 9:56:47

ser snn emotion-recognition

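Translated Abstract

Speech Emotion Recognition (SER) is widely used in human-computer interaction, but the high computational cost of conventional models prevents their deployment on resource-constrained edge devices. Spiking Neural Networks (SNNs) are an energy-efficient alternative thanks to their event-driven nature, yet they are difficult to combine with continuous Self-Supervised Learning (SSL) representations: because of distribution mismatch, high-dynamic-range embeddings degrade the information coding capacity of threshold-based neurons. To address this, we propose Prompt-Tuned Spiking Neural Networks (PTS-SNN), a parameter-efficient neuromorphic adaptation framework that aligns frozen SSL backbones with spiking dynamics. Specifically, a Temporal Shift Spiking Encoder captures local temporal dependencies through parameter-free channel shifts and establishes a stable feature basis. To bridge the domain gap, we devise a Context-Aware Membrane Potential Calibration strategy. This mechanism uses a Spiking Sparse Linear Attention module to aggregate global semantic context into learnable soft prompts, which dynamically regulate the bias voltages of Parametric Leaky Integrate-and-Fire (PLIF) neurons. The regulation centers the heterogeneous input distribution within the responsive firing range, which mitigates functional silence and saturation. Extensive experiments on five multilingual datasets, including IEMOCAP, CASIA, and EMODB, show that PTS-SNN achieves 73.34% accuracy on IEMOCAP, comparable to competitive Artificial Neural Networks (ANNs), while requiring only 1.19M trainable parameters and 0.35 mJ of inference energy per sample.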

Original Content

arXiv:2602.08240v1 Announce Type: new Abstract: Speech Emotion Recognition (SER) is widely deployed in Human-Computer Interaction, yet the high computational cost of conventional models hinders their implementation on resource-constrained edge devices. Spiking Neural Networks (SNNs) offer an energy-efficient alternative due to their event-driven nature; however, their integration with continuous Self-Supervised Learning (SSL) representations is fundamentally challenged by distribution mismatch, where high-dynamic-range embeddings degrade the information coding capacity of threshold-based neurons. To resolve this, we propose Prompt-Tuned Spiking Neural Networks (PTS-SNN), a parameter-efficient neuromorphic adaptation framework that aligns frozen SSL backbones with spiking dynamics. Specifically, we introduce a Temporal Shift Spiking Encoder to capture local temporal dependencies via parameter-free channel shifts, establishing a stable feature basis. To bridge the domain gap, we devise a Context-Aware Membrane Potential Calibration strategy. This mechanism leverages a Spiking Sparse Linear Attention module to aggregate global semantic context into learnable soft prompts, which dynamically regulate the bias voltages of Parametric Leaky Integrate-and-Fire (PLIF) neurons. This regulation effectively centers the heterogeneous input distribution within the responsive firing range, mitigating functional silence or saturation. Extensive experiments on five multilingual datasets (e.g., IEMOCAP, CASIA, EMODB) demonstrate that PTS-SNN achieves 73.34% accuracy on IEMOCAP, comparable to competitive Artificial Neural Networks (ANNs), while requiring only 1.19M trainable parameters and 0.35 mJ inference energy per sample.
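
The abstract names two mechanisms without giving their equations: parameter-free channel shifts along time, and a bias voltage that keeps PLIF neurons inside their responsive firing range. The sketch below is one possible reading of both ideas in plain Python, not the authors' implementation. The shift layout (`fold_div`, one frame back and one frame forward), the PLIF update with `1/tau = sigmoid(w)` and a hard reset, and the helper `mean_centering_bias` (a hand-written stand-in for the learned, attention-derived soft prompt) are all assumptions made for illustration.

```python
import math


def temporal_shift(frames, fold_div=4):
    """Shift a fraction of channels along time; no learnable parameters.

    frames: list of T frames, each a list of C floats.
    Assumed layout (not stated in the abstract): the first C // fold_div
    channels read from the previous frame, the next C // fold_div read from
    the following frame, and the rest stay in place. Frames outside the
    sequence are treated as zeros.
    """
    T, C = len(frames), len(frames[0])
    fold = C // fold_div
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            if c < fold:
                src = t - 1          # channel sees the past frame
            elif c < 2 * fold:
                src = t + 1          # channel sees the next frame
            else:
                src = t              # channel is left untouched
            if 0 <= src < T:
                out[t][c] = frames[src][c]
    return out


def plif_spikes(inputs, bias=0.0, w=0.0, v_th=1.0, v_reset=0.0):
    """Simulate one PLIF neuron and return its 0/1 spike train.

    Assumed update: v <- v + k * ((x + bias) - (v - v_reset)),
    with k = sigmoid(w) playing the role of the learnable 1/tau.
    The neuron fires when v >= v_th and is then hard-reset to v_reset.
    """
    k = 1.0 / (1.0 + math.exp(-w))
    v = v_reset
    spikes = []
    for x in inputs:
        v = v + k * ((x + bias) - (v - v_reset))
        if v >= v_th:
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
    return spikes


def mean_centering_bias(inputs, target=1.5):
    """Hypothetical stand-in for a prompt: move the input mean to `target`."""
    return target - sum(inputs) / len(inputs)
```

With these assumptions, a constant input of 0.25 over ten steps never reaches the threshold (functional silence), a constant input of 5.0 fires on every step (saturation), and passing `bias=mean_centering_bias(inputs)` to `plif_spikes` brings both cases to the same intermediate rate of five spikes in ten steps.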