arxiv_cs_cv February 10, 2026

Vanilla Group Equivariant Vision Transformer: Simple and Effective

Translated: 2026/3/15 19:04:02

vision-transformer, equivariance, computer-vision, inductive-bias, attention-mechanism

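Japanese Translation (English rendering)

arXiv:2602.08047v1 Announce Type: new Abstract: Designing equivariant Vision Transformers (ViTs) by incorporating symmetry priors as inductive biases has become a promising route to better performance. However, existing equivariant ViTs can struggle to balance performance against equivariance across the diverse modules of a ViT, particularly when harmonizing patch embedding with the self-attention mechanism. To address this, we propose a simple framework that systematically makes the key ViT components equivariant, including patch embedding, self-attention, positional encoding, and down/up-sampling, and thereby builds ViTs with guaranteed equivariance. The resulting architecture is theoretically grounded and practically versatile, serving as a plug-and-play replacement that scales smoothly to Swin Transformers. Extensive experiments show that our equivariant ViTs consistently improve performance and data efficiency across many vision tasks.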

Original Content

arXiv:2602.08047v1 Announce Type: new Abstract: Incorporating symmetry priors as inductive biases to design equivariant Vision Transformers (ViTs) has emerged as a promising avenue for enhancing their performance. However, existing equivariant ViTs often struggle to balance performance with equivariance, primarily due to the challenge of achieving holistic equivariant modifications across the diverse modules in ViTs, particularly in harmonizing the self-attention mechanism with patch embedding. To address this, we propose a straightforward framework that systematically renders key ViT components, including patch embedding, self-attention, positional encodings, and down/up-sampling, equivariant, thereby constructing ViTs with guaranteed equivariance. The resulting architecture serves as a plug-and-play replacement that is both theoretically grounded and practically versatile, scaling seamlessly even to Swin Transformers. Extensive experiments demonstrate that our equivariant ViTs consistently improve performance and data efficiency across a wide spectrum of vision tasks.
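
To make "equivariant patch embedding" concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not say which symmetry group or construction is used, so the sketch assumes the group of 90-degree rotations (C4) and a standard lifting construction, in which the same patch kernel is applied in all four orientations. The names `rot90`, `patch_embed`, `lifted_patch_embed`, and `act_on_features` are hypothetical. The check at the end verifies that rotating the input image has the same effect as rotating each output grid and cyclically shifting the orientation index.

```python
import random

def rot90(m):
    """Rotate a square grid 90 degrees counterclockwise."""
    n = len(m)
    return [[m[j][n - 1 - i] for j in range(n)] for i in range(n)]

def patch_embed(img, kernel):
    """Non-overlapping patch embedding: one dot product per p x p patch."""
    p = len(kernel)
    m = len(img) // p  # number of patches per side
    return [[sum(img[a * p + u][b * p + v] * kernel[u][v]
                 for u in range(p) for v in range(p))
             for b in range(m)] for a in range(m)]

def lifted_patch_embed(img, kernel):
    """Assumed C4 lifting: embed with the kernel in all four orientations.

    Returns a list of four grids; grids[k] uses the kernel rotated k times.
    """
    grids = []
    k = kernel
    for _ in range(4):
        grids.append(patch_embed(img, k))
        k = rot90(k)
    return grids

def act_on_features(grids):
    """How a 90-degree rotation acts on lifted features:
    rotate each grid spatially and shift the orientation index by one."""
    return [rot90(grids[(k - 1) % 4]) for k in range(4)]

# Integer-valued toy data so the comparison below is exact.
rng = random.Random(0)
img = [[rng.randint(-5, 5) for _ in range(8)] for _ in range(8)]
kernel = [[rng.randint(-3, 3) for _ in range(4)] for _ in range(4)]

# Equivariance: embedding the rotated image equals acting on the embedding.
lhs = lifted_patch_embed(rot90(img), kernel)
rhs = act_on_features(lifted_patch_embed(img, kernel))
equivariant = lhs == rhs
print("lifted patch embedding is C4-equivariant:", equivariant)
```

A plain `patch_embed` with a single fixed kernel generally lacks this property: with the image `[[1, 0], [0, 0]]` and the kernel `[[1, 0], [0, 0]]` it returns `[[1]]`, but for the rotated image `[[0, 0], [1, 0]]` it returns `[[0]]`. Later modules (self-attention, positional encodings, down/up-sampling) would each need an analogous property for the whole network to stay equivariant, which is the problem the abstract describes.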