arxiv_cs_cv February 10, 2026

Block-Recurrent Dynamics in Vision Transformers

Translated: 2026/3/15 16:06:40

vision-transformers, recurrent-neural-networks, depth-analysis, dynamical-systems, neural-interpretability

Japanese Translation

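arXiv:2512.19941v2 Announce Type: replace Abstract: Now that Vision Transformers (ViTs) are becoming the standard vision backbone, a mechanistic account of their computational phenomena is essential. Although architectural cues hint at dynamical structure, there is no established framework for interpreting Transformer depth as a well-characterized flow. In this paper we introduce the Block-Recurrent Hypothesis (BRH) and argue that trained ViTs have a block-recurrent depth structure: the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, analysis of between-layer representational similarity matrices suggests a small number of contiguous phases. To check whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs, which we call Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale experiments, we show that stochastic depth and training promote recurrent structure, and that this structure correlates with how accurately Raptor can be fit. Next, as an empirical existence proof for the BRH, we train a Raptor model that recovers 96% of DINOv2 ImageNet-1k linear probe accuracy with only 2 blocks at equivalent computational cost. Finally, we use the hypothesis to develop a research program of Dynamical Interpretability. We find i) directional convergence into class-dependent angular basins, with trajectories that self-correct under small perturbations; ii) token-specific dynamics, in which cls performs sharp late reorientations while patch tokens show strong late-stage coherence toward their mean direction; and iii) a collapse to low-rank updates in late depth, consistent with convergence to low-dimensional attractors. Taken together, a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that allows these models to be studied through principled dynamical systems analysis.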
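
One way to write the hypothesis down is sketched here. This is not the authors' own formulation: the symbols $f_i$, $g_j$, and $n_j$ are notation assumed for illustration, and the constraint that the repeat counts sum to $L$ is an assumption meant to reflect "equivalent computational cost". Let $f_1, \dots, f_L$ be the blocks of a trained ViT. The BRH then says there exist $k \ll L$ distinct blocks $g_1, \dots, g_k$ and repeat counts $n_1, \dots, n_k$ such that

$$
f_L \circ \cdots \circ f_1 \;\approx\; g_k^{\circ n_k} \circ \cdots \circ g_1^{\circ n_1},
\qquad \sum_{j=1}^{k} n_j = L,
$$

where $g_j^{\circ n_j}$ denotes $g_j$ composed with itself $n_j$ times. Each $g_j$ corresponds to one contiguous phase of depth.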

Original Content

arXiv:2512.19941v2 Announce Type: replace Abstract: As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale, we demonstrate that stochastic depth and training promote recurrent structure and subsequently correlate with our ability to accurately fit Raptor. We then provide an empirical existence proof for BRH by training a Raptor model to recover $96\%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent computational cost. Finally, we leverage our hypothesis to develop a program of Dynamical Interpretability. We find i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, ii) token-specific dynamics, where cls executes sharp late reorientations while patch tokens exhibit strong late-stage coherence toward their mean direction, and iii) a collapse to low rank updates in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
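
The block-recurrent idea described in the abstract can be illustrated with a toy sketch. This is not the authors' Raptor implementation. The residual `tanh` block, the equal repeat counts, and the use of cosine similarity as a layer-similarity measure are all assumptions made for illustration, and the names `make_block`, `run_block_recurrent`, and `layer_similarity` are hypothetical.

```python
import numpy as np


def make_block(dim, rng, scale=0.1):
    """Return a toy residual block x -> x + tanh(x @ W) with fixed random weights.

    Assumption: this stands in for a Transformer block; it has no attention.
    """
    W = rng.normal(scale=scale, size=(dim, dim))

    def block(x):
        return x + np.tanh(x @ W)

    return block


def run_block_recurrent(x, blocks, repeats):
    """Apply blocks[j] repeats[j] times in order; return the state after every step."""
    states = [x]
    for block, n in zip(blocks, repeats):
        for _ in range(n):
            x = block(x)
            states.append(x)
    return states


def layer_similarity(states):
    """Cosine similarity between flattened per-layer states.

    Assumption: a simple stand-in for a representational similarity matrix.
    """
    flat = np.stack([s.ravel() for s in states])
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    return flat @ flat.T


rng = np.random.default_rng(0)
tokens, dim = 5, 16
x0 = rng.normal(size=(tokens, dim))

blocks = [make_block(dim, rng), make_block(dim, rng)]  # k = 2 distinct blocks
repeats = [6, 6]                                        # total depth L = 12

states = run_block_recurrent(x0, blocks, repeats)
sim = layer_similarity(states)
print(sim.shape)  # (13, 13): the input plus 12 block applications
```

Here only two sets of weights are stored, yet the forward pass has the same depth as a 12-block model. In the setting the abstract describes, the surrogate blocks would be trained to match a pretrained ViT, not drawn at random as they are here.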