arxiv_cs_cv, February 10, 2026

SVD-ViT: Does SVD Make Vision Transformers Attend More to the Foreground?

Translated: 2026/3/15 16:08:02

svd, vision-transformer, attention-mechanism, deep-learning, object-detection

Original Content

arXiv:2602.02765v2 (announce type: replace)

Abstract: Vision Transformers (ViT) have been established as large-scale foundation models. However, because self-attention operates globally, they lack an explicit mechanism to distinguish foreground from background. As a result, ViT may learn unnecessary background features and artifacts, leading to degraded classification performance. To address this issue, we propose SVD-ViT, which leverages singular value decomposition (SVD) to prioritize the learning of foreground features. SVD-ViT consists of three components (the SPC module, SSVA, and ID-RSVD) and suppresses task-irrelevant factors such as background noise and artifacts by extracting and aggregating singular vectors that capture object foreground information. Experimental results demonstrate that our method improves classification accuracy and effectively learns informative foreground representations while reducing the impact of background noise.
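
The abstract does not specify how the SPC module, SSVA, or ID-RSVD work, so the following is only a generic sketch of the underlying idea: take the SVD of a patch-token feature matrix and keep the dominant singular directions. It is not the authors' method. The function name `top_k_singular_projection`, the synthetic token matrix, and the per-token energy score are all assumptions made for illustration, and the toy data is built so that the "foreground" tokens share one strong feature direction.

```python
from __future__ import annotations

import numpy as np


def top_k_singular_projection(
    tokens: np.ndarray, k: int
) -> tuple[np.ndarray, np.ndarray]:
    """Project token features onto their top-k right singular vectors.

    tokens: (num_tokens, dim) matrix of patch-token features.
    Returns (projected, scores): the rank-k reconstruction and, for each
    token, the fraction of its energy that lies in the top-k subspace.
    """
    # Thin SVD: tokens = U @ diag(S) @ Vt, singular values sorted descending.
    _, _, vt = np.linalg.svd(tokens, full_matrices=False)
    basis = vt[:k]                      # (k, dim) dominant feature directions
    coeffs = tokens @ basis.T           # (num_tokens, k) coordinates in the subspace
    projected = coeffs @ basis          # (num_tokens, dim) rank-k reconstruction
    energy = np.sum(tokens**2, axis=1)
    scores = np.sum(coeffs**2, axis=1) / np.maximum(energy, 1e-12)
    return projected, scores


# Hypothetical toy data: 4 "foreground" tokens share a strong direction,
# 12 "background" tokens are weak isotropic noise.
rng = np.random.default_rng(0)
dim = 8
direction = np.ones(dim) / np.sqrt(dim)
foreground = 5.0 * direction + 0.01 * rng.standard_normal((4, dim))
background = 0.1 * rng.standard_normal((12, dim))
tokens = np.vstack([foreground, background])

projected, scores = top_k_singular_projection(tokens, k=1)
print("mean foreground score:", scores[:4].mean())
print("mean background score:", scores[4:].mean())
```

In this toy setting the leading singular vector aligns with the shared foreground direction, so foreground tokens keep almost all of their energy after projection while background tokens lose most of theirs. Real ViT features give no such guarantee: whether dominant singular vectors correspond to foreground is the question the paper investigates.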