arxiv_cs_cv February 10, 2026

UniLIP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing

unilip, clip, multimodal-ai, visual-generation, vision-language-model

Original Content

arXiv:2507.23278v3 Announce Type: replace

Abstract: In this paper, we propose UniLIP, a unified framework that adapts CLIP for multimodal understanding, generation and editing. Although CLIP excels at understanding, it lacks reconstruction abilities required to be a unified visual encoder. However, previous CLIP-based unified methods fail to balance understanding and reconstruction, leading to semantic degradation or inconsistent reconstructions. In contrast, we introduce a novel two-stage training scheme with a self-distillation strategy that progressively endows CLIP with high-fidelity reconstruction abilities while preserving its original comprehension performance. For enhanced reasoning and consistency in generation and editing, we further develop a dual-condition architecture built upon the MetaQuery framework. Our architecture jointly utilizes multimodal hidden states for rich contextual details and learnable query embeddings to harness the powerful reasoning abilities of Multimodal Large Language Models (MLLMs). Leveraging advanced image representation and architectural design, UniLIP demonstrates superior instruction following and edit fidelity. With only 1B and 3B parameters, UniLIP can outperform larger unified models such as BAGEL (7B) and Uniworld-V1 (12B), achieving state-of-the-art performance of 0.90 on GenEval, 0.63 on WISE, and 3.94 on ImgEdit. These results demonstrate that UniLIP successfully expands the application of CLIP, establishing its continuous features to not only serve as the optimal choice for understanding tasks but also achieve highly competitive performance in generation and editing tasks. Code and models are available at https://github.com/nnnth/UniLIP.
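
The abstract does not spell out the self-distillation objective. As a rough illustration only, one common way to train an encoder for reconstruction while keeping it close to its frozen original is to add a feature-matching term to the reconstruction loss. Everything in the sketch below is an assumption made for illustration and is not taken from the paper or its repository: the function names, the use of mean squared error for both terms, and the `distill_weight` parameter.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def self_distillation_loss(
    reconstructed_pixels: Sequence[float],
    target_pixels: Sequence[float],
    student_features: Sequence[float],
    teacher_features: Sequence[float],
    distill_weight: float = 1.0,
) -> float:
    """Hypothetical combined objective (an assumption, not the paper's loss).

    reconstruction term: how well the decoded image matches the input image
    distillation term:   how far the adapted (student) encoder's features
                         have drifted from the frozen original (teacher)
    """
    reconstruction = mse(reconstructed_pixels, target_pixels)
    distillation = mse(student_features, teacher_features)
    return reconstruction + distill_weight * distillation


# Toy values: reconstruction = 0.25, distillation = 2.0, weight = 0.5
loss = self_distillation_loss(
    reconstructed_pixels=[0.5, 0.5],
    target_pixels=[1.0, 0.0],
    student_features=[1.0, 2.0],
    teacher_features=[1.0, 4.0],
    distill_weight=0.5,
)
print(loss)  # 1.25
```

In this toy setup, raising `distill_weight` penalizes feature drift more heavily, which is the kind of trade-off between reconstruction and preserved understanding that the abstract describes in words.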