arxiv_cs_ai April 20, 2026

Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding

Translated: 2026/4/20 11:18:13

language-models, medical-audio, post-training-alignment, clinical-intelligence, health-monitoring

Original Content

arXiv:2512.04847v2 Announce Type: replace-cross

Abstract: Pre-trained audio models excel at detecting acoustic patterns in auscultation sounds but often fail to grasp their clinical significance, limiting their use and performance in diagnostic tasks. To bridge this gap, we introduce AcuLa (Audio-Clinical Understanding via Language Alignment), a lightweight post-training framework that instills semantic understanding into any audio encoder by aligning it with a medical language model, which acts as a "semantic teacher." To enable alignment at scale, we construct a large-scale dataset by leveraging off-the-shelf large language models to translate the rich, structured metadata accompanying existing audio recordings into coherent clinical reports. Our alignment strategy combines a representation-level contrastive objective with self-supervised modeling, ensuring that the model learns clinical semantics while preserving fine-grained temporal cues. AcuLa achieves state-of-the-art results across 18 diverse cardio-respiratory tasks from 10 different datasets, improving the mean AUROC on classification benchmarks from 0.68 to 0.79 and, on the most challenging COVID-19 cough detection task, boosting the AUROC from 0.55 to 0.89. Our work demonstrates that this audio-language alignment transforms purely acoustic models into clinically aware diagnostic tools, establishing a novel paradigm for enhancing physiological understanding in audio-based health monitoring.
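The alignment strategy in the abstract pairs a representation-level contrastive objective with a self-supervised modeling term. The abstract does not give the exact loss functions, so the sketch below is only an illustration under stated assumptions: the contrastive term is written as a symmetric InfoNCE-style loss over paired audio and report embeddings, and the self-supervised term is written as a mean squared error over masked audio frames. The function names, the `temperature` of 0.07, and the `weight` parameter are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumption, not the paper's stated loss).

    audio_emb[i] and text_emb[i] are treated as a matched recording/report pair;
    every other combination in the batch serves as a negative.
    """
    audio = [l2_normalize(v) for v in audio_emb]
    text = [l2_normalize(v) for v in text_emb]
    logits = [
        [sum(a * t for a, t in zip(a_vec, t_vec)) / temperature for t_vec in text]
        for a_vec in audio
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    audio_to_text = diagonal_cross_entropy(logits)
    text_to_audio = diagonal_cross_entropy(logits_transposed)
    return 0.5 * (audio_to_text + text_to_audio)


def masked_reconstruction_loss(target_frames, predicted_frames, mask):
    """Mean squared error over masked frames only (hypothetical stand-in for
    the self-supervised modeling term)."""
    errors = [
        (p - t) ** 2
        for t_frame, p_frame, is_masked in zip(target_frames, predicted_frames, mask)
        if is_masked
        for t, p in zip(t_frame, p_frame)
    ]
    return sum(errors) / len(errors) if errors else 0.0


def alignment_loss(audio_emb, text_emb, target_frames, predicted_frames, mask, weight=1.0):
    """Combined objective: contrastive term plus a weighted self-supervised term."""
    return contrastive_loss(audio_emb, text_emb) + weight * masked_reconstruction_loss(
        target_frames, predicted_frames, mask
    )
```

In this sketch the contrastive term pulls each recording's embedding toward the embedding of its own clinical report, while the reconstruction term is computed on frame-level audio features, which is one plausible way to keep fine-grained temporal cues from being discarded during alignment.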