arxiv_cs_ai April 24, 2026

Deep Interest Mining with Cross-Modal Alignment for SemanticID Generation in Generative Recommendation

generative-recommendation, semantic-id-generation, vision-language-models, reinforcement-learning, cross-modal-alignment

Original Content

arXiv:2604.20861v1
Announce Type: cross

Abstract: Generative Recommendation (GR) has demonstrated remarkable performance under the next-token prediction paradigm, relying on Semantic IDs (SIDs) to compress trillion-scale data into learnable vocabulary sequences. However, existing methods suffer from three critical limitations: (1) Information Degradation: the two-stage compression pipeline causes semantic loss and information degradation, with no posterior mechanism to distinguish high-quality from low-quality SIDs; (2) Semantic Degradation: cascaded quantization discards key semantic information from the original multimodal features, because the embedding generation and quantization stages are not jointly optimized toward a unified objective; (3) Modality Distortion: quantizers fail to properly align text and image modalities, causing feature misalignment even when upstream networks have aligned them. To address these challenges, we propose a novel framework integrating three key innovations: Deep Contextual Interest Mining (DCIM), Cross-Modal Semantic Alignment (CMSA), and Quality-Aware Reinforcement Mechanism (QARM). First, we leverage Vision-Language Models (VLMs) to align non-textual modalities into a unified text-based semantic space, mitigating modality distortion. Second, we introduce a deep interest mining mechanism that captures high-level semantic information implicitly present in advertising contexts, encouraging SIDs to preserve critical contextual information through reconstruction-based supervision. Third, we employ a reinforcement learning framework with quality-aware rewards to encourage semantically rich SIDs while suppressing low-quality ones in the posterior stage. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art SID generation methods, achieving superior performance on multiple benchmarks. Ablation studies further validate the effectiveness of each proposed component.
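The abstract refers to cascaded quantization without giving implementation details. For readers unfamiliar with the idea, the following is a minimal sketch of generic residual quantization, in which each level quantizes whatever the previous levels left unexplained and the chosen codeword indices form the SID. It is not the authors' method: the function names (`nearest`, `encode_sid`, `decode_sid`, `make_codebooks`), the random codebooks, and the use of the final residual error as a rough proxy for SID quality are all assumptions made for illustration.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def encode_sid(embedding, codebooks):
    """Quantize an embedding level by level; return (sid, leftover squared error)."""
    residual = list(embedding)
    sid = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        sid.append(idx)
        # Subtract the chosen codeword so the next level sees only what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    # Squared norm of the final residual: the information the SID fails to capture.
    loss = sum(r * r for r in residual)
    return tuple(sid), loss


def decode_sid(sid, codebooks):
    """Reconstruct an approximate embedding by summing the selected codewords."""
    recon = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(sid, codebooks):
        recon = [a + b for a, b in zip(recon, codebook[idx])]
    return recon


def make_codebooks(levels, size, dim, seed=0):
    """Hypothetical random codebooks; a real system would learn these from data."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0 / (level + 1)) for _ in range(dim)] for _ in range(size)]
        for level in range(levels)
    ]


codebooks = make_codebooks(levels=3, size=8, dim=4)
item_embedding = [0.5, -1.2, 0.3, 0.8]  # stand-in for a multimodal item embedding
sid, loss = encode_sid(item_embedding, codebooks)
recon = decode_sid(sid, codebooks)
print("SID:", sid, "reconstruction error:", loss)
```

Under these assumptions, a large leftover error marks an SID that has discarded much of the original embedding, which is one simple way to picture the "low-quality SID" the abstract describes. The reward design actually used by QARM is not specified in the abstract.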