arxiv_cs_cv, February 10, 2026

LUCID-SAE: Learning Unified Vision-Language Sparse Codes for Interpretable Concept Discovery

Tags: lucid-sae, sparse-autoencoder, multimodal-learning, interpretable-ai, vision-language-models

Original Content

arXiv:2602.07311v1
Announce Type: new

Abstract: Sparse autoencoders (SAEs) offer a natural path toward comparable explanations across different representation spaces. However, current SAEs are trained separately for each modality, producing dictionaries whose features are not directly understandable and whose explanations do not transfer across domains. We introduce LUCID (Learning Unified vision-language sparse Codes for Interpretable concept Discovery), a unified vision-language sparse autoencoder that learns a shared latent dictionary for image patch and text token representations while reserving private capacity for modality-specific details. We achieve feature alignment without labels by coupling the shared codes with a learned optimal transport matching objective. LUCID yields interpretable shared features that support patch-level grounding, establish cross-modal neuron correspondence, and improve robustness to the concept clustering problem in similarity-based evaluation. Leveraging these alignment properties, we develop an automated dictionary interpretation pipeline based on term clustering that requires no manual inspection. Our analysis reveals that LUCID's shared features capture diverse semantic categories beyond objects, including actions, attributes, and abstract concepts, demonstrating a comprehensive approach to interpretable multimodal representations.
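The abstract describes two ideas that a small example may make more concrete: a sparse code split into shared and private parts, and an optimal transport matching between the shared codes of image patches and text tokens. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the sparsity mechanism, the cost function, or how the transport objective is learned, so the TopK sparsification, the cosine cost, the fixed entropic Sinkhorn solver, and every function name and number here are hypothetical choices made for illustration.

```python
import math

def top_k_sparsify(values, k):
    """Assumed sparsity rule: keep the k largest non-negative activations, zero the rest."""
    relu = [max(v, 0.0) for v in values]
    keep = set(sorted(range(len(relu)), key=lambda i: relu[i], reverse=True)[:k])
    return [relu[i] if i in keep else 0.0 for i in range(len(relu))]

def split_code(code, n_shared):
    """Hypothetical layout: the first n_shared units are shared, the rest are private."""
    return code[:n_shared], code[n_shared:]

def cosine_cost(a, b):
    """Assumed transport cost: 1 - cosine similarity between two shared codes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def sinkhorn_plan(cost, epsilon=0.1, iterations=200):
    """Entropic optimal transport between uniform marginals via Sinkhorn iterations."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def matching_loss(patch_shared, token_shared):
    """Transport cost of softly matching patch codes to token codes (no labels used)."""
    cost = [[cosine_cost(p, t) for t in token_shared] for p in patch_shared]
    plan = sinkhorn_plan(cost)
    loss = sum(plan[i][j] * cost[i][j]
               for i in range(len(cost)) for j in range(len(cost[0])))
    return loss, plan

# Toy pre-activations: 4 shared units followed by 2 private units (made-up numbers).
N_SHARED = 4
patch_preacts = [[0.9, 0.1, -0.3, 0.2, 0.5, 0.0],   # a patch that fires shared unit 0
                 [0.0, 0.2, 0.8, -0.1, 0.0, 0.4]]   # a patch that fires shared unit 2
token_preacts = [[0.7, 0.0, 0.1, 0.0, 0.3, 0.0],    # a token that fires shared unit 0
                 [0.1, 0.0, 0.6, 0.0, 0.0, 0.2]]    # a token that fires shared unit 2

# Sparsify the shared slice only; the private slice is left out of the matching.
patch_shared = [top_k_sparsify(split_code(p, N_SHARED)[0], k=1) for p in patch_preacts]
token_shared = [top_k_sparsify(split_code(t, N_SHARED)[0], k=1) for t in token_preacts]

loss, plan = matching_loss(patch_shared, token_shared)
```

In this toy setup each patch shares its active unit with exactly one token, so the transport plan concentrates its mass on those pairs and the matching loss is close to zero. In a trainable model a term like this would presumably be added to the reconstruction loss so that gradients pull corresponding patch and token codes onto the same shared dictionary units; that coupling is an assumption here, since the abstract gives no details.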