arxiv_cs_cv April 24, 2026

AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe

attentionbender video-generation diffusion-models cross-attention explainable-ai

Original Content

arXiv:2604.20936v1 Announce Type: cross

Abstract: We present AttentionBender, a tool that manipulates cross-attention in Video Diffusion Transformers to help artists probe the internal mechanics of black-box video generation. While generative outputs are increasingly realistic, prompt-only control limits artists' ability to build intuition for the model's material process or to work beyond its default tendencies. Using an autobiographical research-through-design approach, we built on Network Bending to design AttentionBender, which applies 2D transforms (rotation, scaling, translation, etc.) to cross-attention maps to modulate generation. We assess AttentionBender by visualizing 4,500+ video generations across prompts, operations, and layer targets. Our results suggest that cross-attention is highly entangled: targeted manipulations often resist clean, localized control, producing distributed distortions and glitch aesthetics over linear edits. AttentionBender contributes a tool that functions both as an Explainable AI-style probe of transformer attention mechanisms and as a creative technique for producing novel aesthetics beyond the model's learned representational space.
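
To make the idea of applying 2D transforms to a cross-attention map concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that the attention weights for a single text token have already been reshaped into an H x W grid for one frame, and the function name `transform_attention_map`, its parameters, the nearest-neighbour sampling, and the zero fill for out-of-range cells are all hypothetical choices made for illustration.

```python
import math


def transform_attention_map(attn, angle_deg=0.0, scale=1.0, shift=(0.0, 0.0)):
    """Rotate, scale and translate a 2D attention map (hypothetical helper).

    attn  : list of H rows, each a list of W non-negative floats, assumed to
            be the attention of every spatial position to one text token.
    shift : (dy, dx) translation in grid cells.
    Uses inverse mapping with nearest-neighbour sampling; output cells whose
    source falls outside the grid are set to 0.0.
    """
    h, w = len(attn), len(attn[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0  # transform about the grid centre
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    dy, dx = shift

    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Invert the forward transform: undo the translation first,
            # then the rotation and the scaling.
            ty, tx = y - cy - dy, x - cx - dx
            src_x = (cos_t * tx + sin_t * ty) / scale + cx
            src_y = (-sin_t * tx + cos_t * ty) / scale + cy
            iy, ix = round(src_y), round(src_x)
            if 0 <= iy < h and 0 <= ix < w:
                out[y][x] = attn[iy][ix]
    return out


# Toy 3x3 map with all attention mass in the top-left cell.
toy = [[1.0, 0.0, 0.0],
       [0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0]]
shifted = transform_attention_map(toy, shift=(0, 1))   # one cell to the right
rotated = transform_attention_map(toy, angle_deg=90)   # quarter turn
```

In an actual Video Diffusion Transformer the softmax-normalised weights would no longer sum to one across text tokens after such an edit. The abstract does not say whether or how the weights are renormalised, which layers are targeted, or how the time dimension is handled, so those steps are left out here.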