arxiv_cs_cv February 10, 2026

Task-Conditioned Probing Reveals Brain-Alignment Patterns in Instruction-Tuned Multimodal LLMs

Translated: 2026/3/15 17:02:40

multimodal-llm brain-science instruction-tuning fMRI neural-encoding

Original Content

arXiv:2506.08277v2. Abstract: Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment than unimodal models. More recently, instruction-tuned multimodal (IT) models have been shown to generate task-specific representations that align strongly with brain activity, yet most prior evaluations focus on unimodal stimuli or on non-instruction-tuned models under multimodal stimuli. We still lack a clear understanding of whether instruction tuning is associated with IT-MLLMs organizing their representations around functional task demands, or whether those representations simply reflect surface semantics. To address this, we estimate brain alignment by predicting fMRI responses recorded during naturalistic movie watching (video with audio) from MLLM representations. Using instruction-specific embeddings from six video and two audio IT-MLLMs, across 13 video task instructions, we find that instruction-tuned video MLLMs significantly outperform in-context learning (ICL) multimodal models (~9%), non-instruction-tuned multimodal models (~15%), and unimodal baselines (~20%). Our evaluation of MLLMs across video and audio tasks, together with language-guided probing, yields distinct task-specific MLLM representations that vary across brain regions. We also find that ICL models show strong semantic organization (r=0.78), while IT models show weak coupling to instruction-text semantics (r=0.14), consistent with task-conditioned subspaces associated with higher brain alignment. These findings are consistent with an association between task-specific instructions and stronger brain-MLLM alignment, and open new avenues for mapping joint information processing in both systems. We make the code publicly available [https://github.com/subbareddy248/mllm_videos].
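The abstract describes estimating brain alignment by predicting voxel-wise fMRI responses from model representations. The sketch below illustrates that general idea on synthetic data only. It assumes a ridge-regression encoder and a per-voxel Pearson correlation score, which are common choices in voxel-wise encoding work but are not stated in the abstract. All function names, the regularization strength, and the train/test split are hypothetical and are not taken from the authors' repository.

```python
import numpy as np


def fit_ridge(X_train, Y_train, alpha=1.0):
    """Closed-form ridge weights: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X_train.shape[1]
    gram = X_train.T @ X_train + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X_train.T @ Y_train)


def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation computed separately for each voxel (column)."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    # Voxels with zero variance get NaN instead of a division error.
    return (yt * yp).sum(axis=0) / np.where(denom == 0, np.nan, denom)


def brain_alignment(X, Y, alpha=1.0, train_frac=0.8):
    """Fit on the first part of the time series, score on the held-out rest.

    X: (n_timepoints, n_features) model embeddings (assumed time-aligned).
    Y: (n_timepoints, n_voxels) fMRI responses.
    Returns one held-out correlation per voxel.
    """
    n_train = int(len(X) * train_frac)
    X_tr, X_te = X[:n_train], X[n_train:]
    Y_tr, Y_te = Y[:n_train], Y[n_train:]
    # Standardize features using training statistics only.
    mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0) + 1e-8
    X_tr, X_te = (X_tr - mu) / sd, (X_te - mu) / sd
    y_mu = Y_tr.mean(axis=0)
    W = fit_ridge(X_tr, Y_tr - y_mu, alpha)
    return voxelwise_pearson(Y_te, X_te @ W + y_mu)


def make_synthetic(n_trs=400, n_features=16, n_voxels=50, noise=0.5, seed=0):
    """Synthetic stand-in for embeddings and voxel responses (not real data)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_trs, n_features))
    W_true = rng.standard_normal((n_features, n_voxels))
    Y = X @ W_true + noise * rng.standard_normal((n_trs, n_voxels))
    return X, Y


if __name__ == "__main__":
    X, Y = make_synthetic()
    # Unrelated features act as a baseline that should align poorly.
    X_rand = np.random.default_rng(1).standard_normal(X.shape)
    print("informative features:", np.nanmean(brain_alignment(X, Y)))
    print("unrelated features:  ", np.nanmean(brain_alignment(X_rand, Y)))
```

In this toy setup, comparing mean held-out correlation across feature sets mirrors, in spirit, how one might compare embeddings from different model families. A real analysis would also need to handle hemodynamic delay, cross-validated choice of `alpha`, and statistical testing, none of which are sketched here.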