arxiv_cs_cv February 10, 2026

Revisiting Transformers with Insights from Image Filtering and Boosting

transformers, machine-learning, self-attention, image-processing, deep-learning

Original Content

arXiv:2506.10371v2
Announce Type: replace
Abstract: The self-attention mechanism, a cornerstone of Transformer-based state-of-the-art deep learning architectures, is largely heuristic-driven and fundamentally challenging to interpret. Establishing a robust theoretical foundation to explain its remarkable success and limitations has therefore become an increasingly prominent focus in recent research. Some notable directions have explored understanding self-attention through the lens of image denoising and nonparametric regression. While promising, existing frameworks still lack a deeper mechanistic interpretation of various architectural components that enhance self-attention, both in its original formulation and subsequent variants. In this work, we aim to advance this understanding by developing a unifying image processing framework, capable of explaining not only the self-attention computation itself but also the role of components such as positional encoding and residual connections, including numerous later variants. We also pinpoint potential distinctions between the two concepts building upon our framework, and make effort to close this gap. We introduce two independent architectural modifications within transformers. While our primary objective is interpretability, we empirically observe that image processing-inspired modifications can also lead to notably improved accuracy and robustness against data contamination and adversaries across language and vision tasks as well as better long sequence understanding.
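
The nonparametric-regression reading of self-attention mentioned in the abstract can be illustrated with a small sketch. This is not the paper's framework or code; it only shows the widely known identity that single-head softmax attention equals a row-normalized kernel smoother (a Nadaraya-Watson style estimator with an exponential dot-product kernel), which is the same weighted-averaging form used by similarity-based image denoising filters. The function names `softmax_attention` and `kernel_filter`, the toy sizes, and the random inputs are all assumptions made for illustration.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax_attention(Q, K, V):
    """Single-head scaled dot-product attention, no masking (illustrative)."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [dot(q, k) / math.sqrt(d) for k in K]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]  # each row of weights sums to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


def kernel_filter(Q, K, V):
    """Kernel smoother: weighted average of V with an exponential kernel."""
    d = len(Q[0])
    out = []
    for q in Q:
        kernel = [math.exp(dot(q, k) / math.sqrt(d)) for k in K]
        norm = sum(kernel)  # normalizing constant of the smoother
        out.append([sum(kv * v[j] for kv, v in zip(kernel, V)) / norm
                    for j in range(len(V[0]))])
    return out


# Hypothetical toy data: 5 tokens, key/query dimension 8, value dimension 3.
rng = random.Random(0)
Q = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
K = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
V = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(5)]

attn_out = softmax_attention(Q, K, V)
filt_out = kernel_filter(Q, K, V)

# Largest elementwise gap between the two computations.
max_abs_diff = max(abs(a - f)
                   for row_a, row_f in zip(attn_out, filt_out)
                   for a, f in zip(row_a, row_f))
```

The two functions differ only in where the normalization is applied, so `max_abs_diff` should be at floating-point rounding level. Components the abstract mentions beyond this step, such as positional encoding and residual connections, are not modeled here.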