arxiv_cs_cv February 10, 2026

TwistNet-2D: Learning Second-Order Channel Interactions via Spiral Twisting for Texture Recognition

Translated: 2026/3/15 18:02:15

twistnet-2d, texture-recognition, channel-interaction, deep-learning, neural-networks

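Translation (from Japanese)

arXiv:2602.07262v1 Announce Type: new

Abstract: Second-order feature statistics are essential for texture recognition, but current methods face a fundamental tension. Bilinear pooling and Gram matrices capture global channel correlations yet collapse spatial structure, whereas self-attention handles spatial context through weighted aggregation rather than explicit pairwise feature interactions. This paper introduces TwistNet-2D, a lightweight module that computes local pairwise channel products under directional spatial shifts, encoding both where features co-occur and how they interact. Its core component, Spiral-Twisted Channel Interaction (STCI), shifts one feature map in a prescribed direction before element-wise channel multiplication, capturing the cross-position co-occurrence patterns characteristic of structured and periodic textures. By aggregating four directional heads with learned channel reweighting and injecting the result through a sigmoid-gated residual path, TwistNet adds only 3.5% more parameters and 2% more FLOPs than ResNet-18, yet consistently outperforms parameter-matched and much larger baselines, including ConvNeXt, Swin Transformer, and hybrid CNN-Transformer architectures, across four texture and fine-grained recognition benchmarks.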

Original Content

arXiv:2602.07262v1 Announce Type: new

Abstract: Second-order feature statistics are central to texture recognition, yet current methods face a fundamental tension: bilinear pooling and Gram matrices capture global channel correlations but collapse spatial structure, while self-attention models spatial context through weighted aggregation rather than explicit pairwise feature interactions. We introduce TwistNet-2D, a lightweight module that computes local pairwise channel products under directional spatial displacement, jointly encoding where features co-occur and how they interact. The core component, Spiral-Twisted Channel Interaction (STCI), shifts one feature map along a prescribed direction before element-wise channel multiplication, thereby capturing the cross-position co-occurrence patterns characteristic of structured and periodic textures. Aggregating four directional heads with learned channel reweighting and injecting the result through a sigmoid-gated residual path, TwistNet incurs only 3.5% additional parameters and 2% additional FLOPs over ResNet-18, yet consistently surpasses both parameter-matched and substantially larger baselines, including ConvNeXt, Swin Transformer, and hybrid CNN-Transformer architectures, across four texture and fine-grained recognition benchmarks.
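
The shift-then-multiply idea in the abstract can be illustrated with a toy, pure-Python sketch. This is not the authors' implementation. The four unit offsets in `DIRECTIONS`, the zero padding, and the helper names (`shift`, `shifted_product`, `total`, `gated_residual`) are all assumptions made for illustration, since the abstract does not specify shift sizes, padding, or how heads are combined. Learned channel reweighting is omitted, and the gate is a single hand-set scalar rather than a learned parameter.

```python
from math import exp

# Hypothetical (dy, dx) offsets for four directional heads; the abstract
# does not list the actual directions or shift sizes.
DIRECTIONS = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def shift(fmap, dy, dx):
    """Zero-padded shift (an assumption): out[y][x] = fmap[y - dy][x - dx]."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = fmap[sy][sx]
    return out


def shifted_product(chan_i, chan_j, dy, dx):
    """Position-wise product of chan_i with a shifted copy of chan_j."""
    moved = shift(chan_j, dy, dx)
    return [[a * b for a, b in zip(row_i, row_j)]
            for row_i, row_j in zip(chan_i, moved)]


def total(fmap):
    """Sum over all positions (used here only to summarize a map)."""
    return sum(sum(row) for row in fmap)


def gated_residual(x, update, gate_logit):
    """Return x + sigmoid(gate_logit) * update, element-wise."""
    gate = 1.0 / (1.0 + exp(-gate_logit))
    return [[a + gate * b for a, b in zip(row_x, row_u)]
            for row_x, row_u in zip(x, update)]


# Toy vertical-stripe texture: channel a fires on even columns,
# channel b on odd columns, so they never overlap at the same position.
a = [[1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0]]
b = [[0.0, 1.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0]]

# Per-direction summary of the a-b interaction.
responses = {d: total(shifted_product(a, b, *d)) for d in DIRECTIONS}
```

In this toy case the unshifted product of `a` and `b` is zero everywhere, while the two horizontal shifts give a nonzero response and the two vertical shifts give none. That is the kind of direction-dependent, cross-position co-occurrence the abstract attributes to structured and periodic textures, and a same-position statistic such as a Gram matrix entry would miss it here.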