arxiv_cs_cv April 24, 2026

Accelerating Vision Transformers with Adaptive Patch Sizes

Translated: 2026/4/24 19:50:07

vision-transformers, patch-embedding, machine-learning, computer-vision, neural-networks

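Translated Abstract

arXiv:2510.18091v2. Announce Type: replace. Summary: Vision Transformers (ViTs) split every input image into patches of one uniform size regardless of content, so high-resolution images produce very long input sequences. We propose Adaptive Patch Transformers (APT), which address this by using several different patch sizes within the same image. APT assigns larger patches to more homogeneous regions and smaller patches to more complex ones, which reduces the total number of input tokens. APT substantially speeds up ViT inference and training, raising throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance. It can also be applied to a previously fine-tuned ViT and can converge in as little as 1 epoch. In addition, it significantly reduces training and inference time on high-resolution dense visual tasks without loss of performance, with up to 30% faster training and inference in visual QA, object detection, and semantic segmentation.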

Original Content

arXiv:2510.18091v2. Announce Type: replace. Abstract: Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance, and can be applied to a previously fine-tuned ViT, converging in as little as 1 epoch. It also significantly reduces training and inference time without loss of performance in high-resolution dense visual tasks, achieving up to 30% faster training and inference in visual QA, object detection, and semantic segmentation.
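
The allocation idea in the abstract (large patches where the image is homogeneous, small patches where it is complex) can be illustrated with a minimal sketch. The abstract does not say how APT measures homogeneity, so the pixel-variance threshold and quadtree-style splitting below are assumptions made for illustration, not the authors' method. The names `adaptive_patches`, `uniform_token_count`, `min_patch`, `max_patch`, and `threshold` are hypothetical. The sketch only counts tokens; it does not show how patches of different sizes would be embedded into a common token dimension.

```python
from statistics import pvariance


def uniform_token_count(height, width, patch):
    """Token count for a standard ViT that uses one fixed patch size."""
    return (height // patch) * (width // patch)


def adaptive_patches(image, min_patch=2, max_patch=8, threshold=1e-3):
    """Return (row, col, size) patches for a 2D grayscale image (list of lists).

    Assumed criterion (not taken from the paper): a tile is kept whole when its
    pixel variance is at most `threshold`; otherwise it is split into four
    children until `min_patch` is reached. Height and width are assumed to be
    multiples of `max_patch`, and `max_patch` a power-of-two multiple of
    `min_patch`.
    """
    height, width = len(image), len(image[0])
    patches = []

    def split(row, col, size):
        pixels = [image[r][c]
                  for r in range(row, row + size)
                  for c in range(col, col + size)]
        if size > min_patch and pvariance(pixels) > threshold:
            half = size // 2
            for dr in (0, half):
                for dc in (0, half):
                    split(row + dr, col + dc, half)
        else:
            patches.append((row, col, size))

    # Start from a grid of the largest patches and refine only where needed.
    for row in range(0, height, max_patch):
        for col in range(0, width, max_patch):
            split(row, col, max_patch)
    return patches


# Toy 16x16 image: a checkerboard texture in the top-left 8x8 corner, flat elsewhere.
image = [[(r + c) % 2 if r < 8 and c < 8 else 0 for c in range(16)]
         for r in range(16)]

patches = adaptive_patches(image)
print(len(patches))                    # 19 adaptive tokens
print(uniform_token_count(16, 16, 2))  # 64 tokens with uniform 2x2 patches
```

In this toy case the three flat quadrants each stay a single 8x8 patch, while the textured quadrant is refined down to sixteen 2x2 patches, so the sequence shrinks from 64 tokens to 19. Since self-attention cost grows with sequence length, a reduction of this kind is the source of the throughput gains the abstract reports, although the actual savings depend on image content.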