arxiv_cs_cv February 10, 2026

MUFASA: A Multi-Layer Framework for Slot Attention

Translated: 2026/3/15 18:04:05

mufasa, slot-attention, unsupervised-object-centric-learning, vision-transformer, object-segmentation

Original Content

arXiv:2602.07544v1 Announce Type: new

Abstract: Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence with only minor inference overhead.
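
The idea of computing slot attention on several encoder layers and then fusing the results can be illustrated with a small NumPy sketch. This is not the MUFASA implementation: the abstract does not specify the fusion strategy, so plain averaging is used here purely as a placeholder. The slot attention step is also simplified (no learned projections, GRU, or MLP), random arrays stand in for ViT features, and the same initial slots are shared across layers so that slot indices line up before averaging. All function names and sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_attention(features, slots, num_iters=3, eps=1e-8):
    """Simplified slot attention (assumption: no learned projections, GRU, or MLP).

    features: (N, D) patch tokens from one encoder layer.
    slots:    (K, D) initial slot vectors.
    Returns the refined slots (K, D) and the last attention map (K, N).
    """
    dim = features.shape[1]
    for _ in range(num_iters):
        logits = slots @ features.T / np.sqrt(dim)  # (K, N) similarity scores
        attn = softmax(logits, axis=0)  # slots compete for each token
        # Normalize per slot so the update is a weighted mean over tokens.
        weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
        slots = weights @ features  # (K, D)
    return slots, attn


def multi_layer_slots(layer_features, init_slots):
    """Run slot attention on each layer, then fuse by averaging.

    Averaging is a placeholder assumption, not the paper's fusion strategy.
    """
    per_layer = [slot_attention(f, init_slots) for f in layer_features]
    fused_slots = np.mean([s for s, _ in per_layer], axis=0)
    fused_attn = np.mean([a for _, a in per_layer], axis=0)
    return fused_slots, fused_attn


# Random stand-ins for token features from three ViT layers.
rng = np.random.default_rng(0)
num_tokens, dim, num_slots = 16, 8, 4
layer_features = [rng.normal(size=(num_tokens, dim)) for _ in range(3)]
init_slots = rng.normal(size=(num_slots, dim))

fused_slots, fused_attn = multi_layer_slots(layer_features, init_slots)
masks = fused_attn.argmax(axis=0)  # hard slot assignment per token
```

Here `masks` assigns each token to the slot with the highest fused attention, which is how a segmentation map is commonly read off slot attention weights. A learned fusion module would replace the two `np.mean` calls.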