arxiv_cs_cv February 10, 2026

ViT-5: Vision Transformers for The Mid-2020s

Translated: 2026/3/15 19:04:27

vision-transformers, deep-learning, computer-vision, diffusion-models, arxiv

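Translation (from the Japanese)

arXiv:2602.08071v1 Announce Type: new

This work presents a systematic investigation into modernizing Vision Transformer (ViT) backbones by leveraging the architectural advances of the past five years. While preserving the canonical Attention-FFN structure, we refine the model component by component: normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens. These updates form a new generation of Vision Transformers, which we call ViT-5. Extensive experiments show that ViT-5 consistently outperforms state-of-the-art plain Vision Transformers on both understanding and generation benchmarks. On ImageNet-1k classification, ViT-5-Base reaches 84.2% top-1 accuracy under comparable compute, exceeding DeiT-III-Base at 83.8%. ViT-5 also serves as a stronger backbone for generative modeling: when plugged into an SiT diffusion framework, it achieves an FID of 1.84, versus 2.06 with a vanilla ViT backbone. Beyond headline metrics, ViT-5 exhibits improved representation learning and favorable spatial reasoning behavior, and it transfers reliably across tasks. With a design aligned with contemporary foundation-model practice, ViT-5 offers a simple drop-in upgrade over vanilla ViT for mid-2020s vision backbones.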

Original Content

arXiv:2602.08071v1 Announce Type: new

Abstract: This work presents a systematic investigation into modernizing Vision Transformer backbones by leveraging architectural advancements from the past five years. While preserving the canonical Attention-FFN structure, we conduct a component-wise refinement involving normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens. These updates form a new generation of Vision Transformers, which we call ViT-5. Extensive experiments demonstrate that ViT-5 consistently outperforms state-of-the-art plain Vision Transformers across both understanding and generation benchmarks. On ImageNet-1k classification, ViT-5-Base reaches 84.2% top-1 accuracy under comparable compute, exceeding DeiT-III-Base at 83.8%. ViT-5 also serves as a stronger backbone for generative modeling: when plugged into an SiT diffusion framework, it achieves 1.84 FID versus 2.06 with a vanilla ViT backbone. Beyond headline metrics, ViT-5 exhibits improved representation learning and favorable spatial reasoning behavior, and transfers reliably across tasks. With a design aligned with contemporary foundation-model practices, ViT-5 offers a simple drop-in upgrade over vanilla ViT for mid-2020s vision backbones.
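The two headline comparisons in the abstract use different scales (accuracy in percent, where higher is better, and FID, where lower is better). As a rough reading aid, the sketch below puts both on a common footing by computing absolute and relative changes. Only the four input numbers come from the abstract; the variable names, the `relative_reduction` helper, and the choice of relative reduction as a summary are illustrative assumptions, not something the paper reports.

```python
# Headline numbers quoted in the abstract.
vit5_top1, deit3_top1 = 84.2, 83.8  # ImageNet-1k top-1 accuracy (%), Base-size models
vit5_fid, vanilla_fid = 1.84, 2.06  # FID inside an SiT framework (lower is better)


def relative_reduction(new: float, old: float) -> float:
    """Fraction by which `new` is smaller than `old` (hypothetical helper)."""
    return (old - new) / old


# Absolute changes, rounded to the precision the abstract reports.
top1_gain = round(vit5_top1 - deit3_top1, 1)  # accuracy gain in percentage points
fid_drop = round(vanilla_fid - vit5_fid, 2)   # absolute FID reduction

# Relative changes: FID directly, and top-1 error (100 - accuracy) for classification.
fid_rel = relative_reduction(vit5_fid, vanilla_fid)
err_rel = relative_reduction(100 - vit5_top1, 100 - deit3_top1)

print(f"top-1 gain: +{top1_gain} points")            # +0.4 points
print(f"top-1 error reduction: {err_rel:.1%}")       # 2.5%
print(f"FID drop: {fid_drop} ({fid_rel:.1%} relative)")  # 0.22 (10.7% relative)
```

Read this way, the classification gain is modest (about 2.5% fewer top-1 errors), while the generative result is the larger relative change (roughly a 10.7% lower FID). Both figures are derived only from the abstract's numbers and say nothing about statistical significance or training cost.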