arxiv_cs_cv February 10, 2026

LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

Translated: 2026/3/15 16:07:45

latentlens, vision-language-model, interpretable-ai, visual-tokens, llm-vision

Original Content

arXiv:2602.00462v2 Announce Type: replace

Abstract: Transforming a large language model (LLM) into a Vision-Language Model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens works by encoding a large text corpus and storing contextualized token representations for each token in that corpus. Visual token representations are then compared to their contextualized textual representations, with the top-k nearest neighbor representations providing descriptions of the visual token. We evaluate this method on 10 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens instead, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans compared to individual tokens. More broadly, our findings contribute new evidence on the alignment between vision and language representations, opening up new directions for analyzing latent representations.
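
The retrieval step described in the abstract (compare a visual token representation against stored contextualized text representations and keep the top-k nearest neighbors) can be illustrated with a toy sketch. This is not the authors' implementation. The abstract does not name a similarity measure, so cosine similarity is an assumption here. The names `top_k_descriptions`, `TOY_BANK`, and `TOY_VISUAL_TOKEN`, the bracketed-token context strings, and the 3-dimensional vectors are all hypothetical, hand-made values rather than outputs of any model.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_descriptions(
    visual_vec: Vector,
    bank: Sequence[Tuple[str, Vector]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k bank entries most similar to visual_vec, best first.

    Each bank entry pairs a text context with the (assumed) contextualized
    representation of one token in that context.
    """
    scored = [(cosine_similarity(visual_vec, vec), text) for text, vec in bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(text, score) for score, text in scored[:k]]


# Hypothetical bank: in a real setting these vectors would come from running
# an LLM over a text corpus and saving hidden states at a chosen layer.
TOY_BANK = [
    ("a [dog] running on the beach", [0.9, 0.1, 0.0]),
    ("the [dog] barked loudly", [0.8, 0.2, 0.1]),
    ("a red [car] parked outside", [0.0, 0.9, 0.3]),
    ("the stock [market] fell", [0.1, 0.0, 0.95]),
]

# Hypothetical hidden state of one visual token at the same layer.
TOY_VISUAL_TOKEN = [0.85, 0.15, 0.05]

neighbours = top_k_descriptions(TOY_VISUAL_TOKEN, TOY_BANK, k=2)
for text, score in neighbours:
    print(f"{score:.3f}  {text}")
```

In this toy setup both retrieved neighbors are "dog" contexts, which mirrors the abstract's point that a retrieved token in context can say more about a visual token than a single vocabulary item. At corpus scale, the linear scan would presumably be replaced by a vector index, which is also an assumption and not something the abstract specifies.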