arxiv_cs_ai April 24, 2026

Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores

Translated: 2026/4/24 20:33:28

musical-score-understanding, large-language-models, vision-language-models, music-notation, multimodal-reasoning

Original Content

arXiv:2511.20697v4 Announce Type: replace-cross

Abstract: Understanding complete musical scores entails integrated reasoning over pitch, rhythm, harmony, and large-scale structure, yet the ability of Large Language Models and Vision-Language Models to interpret full musical notation remains insufficiently examined. We introduce Musical Score Understanding Benchmark (MSU-Bench), a human-curated benchmark for score-level musical understanding across textual (ABC notation) and visual (PDF) modalities. MSU-Bench contains 1,800 generative question-answer pairs from works by Bach, Beethoven, Chopin, Debussy, and others, organised into four levels of increasing difficulty, ranging from onset information to texture and form. Evaluations of more than fifteen state-of-the-art models, in both zero-shot and fine-tuned settings, reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness. Fine-tuning substantially improves results across modalities while preserving general knowledge, positioning MSU-Bench as a robust foundation for future research in multimodal reasoning. The benchmark and code are available at https://github.com/Congren-Dai/MSU-Bench.
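
For readers unfamiliar with the textual modality mentioned in the abstract, the following is a minimal sketch of what an ABC-notation score looks like and how its header fields can be separated from the music body. The tune, the function name `split_abc`, and the parsing approach are illustrative assumptions; they are not taken from MSU-Bench or its code, and real scores by the composers named above would be far longer and use many more ABC features.

```python
import re

# Hypothetical example tune for illustration only; not from MSU-Bench.
ABC_TUNE = """X:1
T:Example Scale
M:4/4
L:1/4
K:C
C D E F | G A B c |
"""

# An ABC header field is a single letter, a colon, then its value.
HEADER_RE = re.compile(r"^([A-Za-z]):\s*(.*)$")


def split_abc(abc_text):
    """Split an ABC tune into header fields and body (music) lines."""
    headers = {}
    body = []
    in_body = False
    for line in abc_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADER_RE.match(line)
        if match and not in_body:
            key, value = match.groups()
            headers[key] = value
            # In ABC, the K: (key) field closes the tune header.
            if key == "K":
                in_body = True
        else:
            body.append(line)
    return headers, body


headers, body = split_abc(ABC_TUNE)
print(headers)
print(body)
```

Here `M:` gives the meter, `L:` the default note length, and `K:` the key, which is the kind of basic information a model must read correctly before it can reason about harmony, texture, or form.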