arxiv_cs_cv April 24, 2026

Latent Denoising Improves Visual Alignment in Large Multimodal Models

latent-denoising, large-multimodal-models, visual-alignment, neural-architecture, multimodal-understanding

Original Content

arXiv:2604.21343v1 Announce Type: new

Abstract: Large Multimodal Models (LMMs) such as LLaVA are typically trained with an autoregressive language modeling objective, providing only indirect supervision to visual tokens. This often yields weak internal visual representations and brittle behavior under distribution shift. Inspired by recent progress on latent denoising for learning high-quality visual tokenizers, we show that the same principle provides an effective form of visual supervision for improving internal visual feature alignment and multimodal understanding in LMMs. We propose a latent denoising framework that corrupts projected visual tokens using a saliency-aware mixture of masking and Gaussian noising. The LMM is trained to denoise these corrupted tokens by recovering clean teacher patch features from hidden states at a selected intermediate LLM layer using a decoder. To prevent representation collapse, our framework also preserves the teacher's intra-image similarity structure and applies intra-image contrastive patch distillation. During inference, corruption and auxiliary heads are disabled, introducing no additional inference-time overhead. Across a broad suite of standard multimodal benchmarks, our method consistently improves visual understanding and reasoning over strong baselines, and yields clear gains on compositional robustness benchmarks (e.g., NaturalBench). Moreover, under ImageNet-C-style non-adversarial common corruptions applied to benchmark images, our method maintains higher accuracy and exhibits reduced degradation at both moderate and severe corruption levels. Our code is available at https://github.com/dhruvashp/latent-denoising-for-lmms.
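The training-time pieces named in the abstract (saliency-aware corruption, a denoising target, similarity-structure preservation, and intra-image contrastive distillation) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The abstract does not specify how saliency is computed, how masked tokens are represented, or the exact loss forms, so everything below is an assumption made for illustration: the function names are hypothetical, salient patches are assumed more likely to be corrupted, masked tokens are zeroed, the denoising and similarity terms use mean squared error, and the contrastive term is an InfoNCE-style loss. Here `pred` stands in for the decoder output computed from intermediate LLM hidden states.

```python
import numpy as np


def corrupt_tokens(tokens, saliency, rng, mask_ratio=0.3, noise_ratio=0.3, sigma=0.5):
    """Corrupt (N, D) visual tokens with a saliency-weighted mix of masking and noise.

    Assumption: patches with higher saliency are more likely to be corrupted.
    Returns the corrupted tokens plus boolean arrays marking masked / noised patches.
    """
    n, d = tokens.shape
    probs = (saliency + 1e-8) / (saliency + 1e-8).sum()
    n_mask = int(round(mask_ratio * n))
    n_noise = int(round(noise_ratio * n))
    # Sample distinct patches, so no patch is both masked and noised.
    chosen = rng.choice(n, size=n_mask + n_noise, replace=False, p=probs)
    masked = np.zeros(n, dtype=bool)
    noised = np.zeros(n, dtype=bool)
    masked[chosen[:n_mask]] = True
    noised[chosen[n_mask:]] = True
    out = tokens.copy()
    out[masked] = 0.0  # assumption: a zero vector plays the role of the mask token
    out[noised] += sigma * rng.standard_normal((n_noise, d))
    return out, masked, noised


def _unit(x):
    """L2-normalize each row."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def denoising_loss(pred, teacher):
    """Mean squared error between predicted and clean teacher patch features."""
    return float(np.mean((pred - teacher) ** 2))


def similarity_structure_loss(pred, teacher):
    """Match the patch-to-patch cosine similarity matrices within one image."""
    s_pred = _unit(pred) @ _unit(pred).T
    s_teacher = _unit(teacher) @ _unit(teacher).T
    return float(np.mean((s_pred - s_teacher) ** 2))


def contrastive_patch_loss(pred, teacher, temperature=0.1):
    """InfoNCE-style loss: patch i should match teacher patch i, not other patches."""
    logits = _unit(pred) @ _unit(teacher).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.standard_normal((16, 8))    # stand-in for clean teacher patch features
    saliency = rng.random(16) + 0.1           # stand-in for per-patch saliency scores
    corrupted, masked, noised = corrupt_tokens(teacher, saliency, rng)
    # In a real model, `corrupted` would pass through the LLM and a decoder;
    # here it is used directly as a placeholder prediction.
    total = (denoising_loss(corrupted, teacher)
             + similarity_structure_loss(corrupted, teacher)
             + contrastive_patch_loss(corrupted, teacher))
    print("masked:", int(masked.sum()), "noised:", int(noised.sum()), "loss:", total)
```

In this sketch the three loss terms are simply summed; any weighting between them is another unspecified detail. Consistent with the abstract's description of inference, none of these functions would be called at inference time, since corruption and the auxiliary heads are disabled.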