arxiv_cs_ai April 24, 2026

Revisiting Content-Based Music Recommendation: Efficient Feature Aggregation from Large-Scale Music Models

Translated: 2026/4/24 20:19:27

music-recommendation, multimodal-learning, self-supervised-learning, audio-processing, arxiv-papers

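Translation (rendered in English from the Japanese)

Music Recommendation Systems (MRSs) are a cornerstone of modern streaming platforms. Existing recommendation models rely mainly on collaborative filtering across both the recall and ranking stages; they fail to fully exploit the intrinsic characteristics of audio, and their performance is suboptimal, particularly in cold-start scenarios. In addition, existing music recommendation datasets lack rich multimodal information such as raw audio signals and descriptive textual metadata. Current recommender system evaluation frameworks are also inadequate, because they neither fully leverage multimodal information nor support a diverse range of algorithms, especially multimodal methods. To address these limitations, we propose TASTE, a comprehensive dataset and benchmarking framework designed for music recommendation, with the aim of highlighting the role of multimodal information. Our dataset integrates both audio and text modalities. Leveraging recent large-scale self-supervised music encoders, we show that the extracted audio representations have substantial value in recommendation tasks, including candidate recall and CTR. We also introduce the MuQ-token method, which enables more efficient integration of multi-layer audio features. This method consistently outperforms other feature fusion techniques across various settings. Overall, our results validate the effectiveness of content-driven approaches and provide a highly effective and reusable multimodal foundation for future research. The code is available at: https://github.com/zreach/TASTE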

Original Content

arXiv:2604.20847v1 Announce Type: cross

Abstract: Music Recommendation Systems (MRSs) are a cornerstone of modern streaming platforms. Existing recommendation models, spanning both recall and ranking stages, predominantly rely on collaborative filtering, which fails to exploit the intrinsic characteristics of audio and consequently leads to suboptimal performance, particularly in cold-start scenarios. However, existing music recommendation datasets often lack rich multimodal information, such as raw audio signals and descriptive textual metadata. Moreover, current recommender system evaluation frameworks remain inadequate, as they neither fully leverage multimodal information nor support a diverse range of algorithms, especially multimodal methods. To address these limitations, we propose TASTE, a comprehensive dataset and benchmarking framework designed to highlight the role of multimodal information in music recommendation. Our dataset integrates both audio and textual modalities. By leveraging recent large-scale self-supervised music encoders, we demonstrate the substantial value of the extracted audio representations across recommendation tasks, including candidate recall and CTR. In addition, we introduce the MuQ-token method, which enables more efficient integration of multi-layer audio features. This method consistently outperforms other feature integration techniques across various settings. Overall, our results not only validate the effectiveness of content-driven approaches but also provide a highly effective and reusable multimodal foundation for future research. Code is available at https://github.com/zreach/TASTE
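
The abstract does not describe how MuQ-token works, so the following is not that method. It is a minimal sketch of one generic way to combine multi-layer audio features, a softmax-weighted sum over encoder layers, followed by content-based candidate recall using cosine similarity. All names (aggregate_layers, recall_top_k, layer_logits) and the toy vectors are hypothetical; in a trained model the layer weights would be learned, and the embeddings would come from a music encoder rather than being written by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_embeddings, layer_logits):
    """Softmax-weighted sum of per-layer track embeddings.

    layer_embeddings: list of L vectors (one per encoder layer), each of length D.
    layer_logits: list of L scalars (assumed learned elsewhere; fixed here).
    """
    if len(layer_embeddings) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")
    weights = softmax(layer_logits)
    dim = len(layer_embeddings[0])
    return [
        sum(w * layer[d] for w, layer in zip(weights, layer_embeddings))
        for d in range(dim)
    ]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_top_k(query, catalog, k):
    """Return the k track ids whose embeddings are most similar to the query."""
    ranked = sorted(catalog, key=lambda tid: cosine(query, catalog[tid]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy example: two encoder layers, two-dimensional embeddings.
    track = aggregate_layers([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
    catalog = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
    print(track, recall_top_k([1.0, 0.0], catalog, 2))
```

Because the ranking depends only on audio-derived vectors, a track with no interaction history can still be retrieved, which is the cold-start situation the abstract highlights.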