arxiv_cs_cv April 24, 2026

SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation

semantic-segmentation, vision-transformer, cross-attention, efficient-learning, deep-learning

Original Content

arXiv:2411.17061v2
Announce Type: replace

Abstract: The Vision Transformer (ViT) has achieved notable success in computer vision, with its variants widely validated across various downstream tasks, including semantic segmentation. However, as general-purpose visual encoders, ViT backbones often do not fully address the specific requirements of task decoders, highlighting opportunities for designing decoders optimized for efficient semantic segmentation. This paper proposes Strip Cross-Attention (SCASeg), an innovative decoder head specifically designed for semantic segmentation. Instead of relying on the conventional skip connections, we utilize lateral connections between encoder and decoder stages, leveraging encoder features as Queries in cross-attention modules. Additionally, we introduce a Cross-Layer Block (CLB) that integrates hierarchical feature maps from various encoder and decoder stages to form a unified representation for Keys and Values. The CLB also incorporates the local perceptual strengths of convolution, enabling SCASeg to capture both global and local context dependencies across multiple layers, thus enhancing feature interaction at different scales and improving overall efficiency. To further optimize computational efficiency, SCASeg compresses the channels of queries and keys into one dimension, creating strip-like patterns that reduce memory usage and increase inference speed compared to traditional vanilla cross-attention. Experiments show that SCASeg's adaptable decoder delivers competitive performance across various setups, outperforming leading segmentation architectures on benchmark datasets, including ADE20K, Cityscapes, COCO-Stuff 164k, and Pascal VOC2012, even under diverse computational constraints.
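The strip idea in the abstract (compressing the query and key channels to one dimension before computing attention) can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not specify how the compression is done, so the linear projections `w_q` and `w_k`, the use of unprojected key/value tokens as values, the softmax normalization, and all function names here are assumptions made for illustration only.

```python
import math

def project_to_strip(tokens, weights):
    """Compress each C-channel token to one scalar (assumed linear projection)."""
    return [sum(x * w for x, w in zip(token, weights)) for token in tokens]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def strip_cross_attention(enc_tokens, kv_tokens, w_q, w_k):
    """Cross-attention with 1-D query and key strips.

    enc_tokens: N x C encoder features, used as queries.
    kv_tokens:  M x C fused features, used as keys and (unprojected) values.
    """
    q_strip = project_to_strip(enc_tokens, w_q)  # N scalars
    k_strip = project_to_strip(kv_tokens, w_k)   # M scalars
    channels = len(kv_tokens[0])
    output = []
    for q in q_strip:
        # Each score is a single multiply instead of a C-dimensional dot product.
        attn = softmax([q * k for k in k_strip])
        output.append([sum(a * v[c] for a, v in zip(attn, kv_tokens))
                       for c in range(channels)])
    return output

def score_multiplies(n, m, c, strip):
    """Multiplications needed to build the N x M score matrix.

    The strip count includes the cost of the two channel projections.
    """
    return n * m + (n + m) * c if strip else n * m * c
```

Under these assumptions, with N = M = 1024 tokens and C = 64 channels, `score_multiplies` gives 67,108,864 multiplications for vanilla attention scores against 1,179,648 for the strip variant, which is the kind of saving the abstract attributes to the strip pattern. Actual memory and speed gains depend on details the abstract does not give.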