arxiv_cs_cv February 10, 2026

Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning

Translated: 2026/3/15 16:08:27

vision-language-models, token-pruning, computer-vision, machine-learning, inference-optimization

Original Content

arXiv:2602.05809v2 Announce Type: replace

Abstract: Vision-language models (VLMs) often generate a very large number of visual tokens, which greatly increases inference latency and memory footprint. Training-free token pruning offers a practical remedy, but existing methods still struggle to balance local evidence and global context under aggressive compression. We propose Focus-Scan-Refine (FSR), a human-inspired, plug-and-play pruning framework that mimics how humans answer visual questions: focus on key evidence, then scan globally if needed, and refine the scanned context by aggregating relevant details. FSR first focuses on key evidence by combining visual importance with instruction relevance, avoiding the bias toward visually salient but query-irrelevant regions. It then scans for complementary context conditioned on the focused set, selecting tokens that are most different from the focused evidence. Finally, FSR refines the scanned context by aggregating nearby informative tokens into the scan anchors via similarity-based assignment and score-weighted merging, without increasing the token budget. Extensive experiments across multiple VLM backbones and vision-language benchmarks show that FSR consistently improves the accuracy-efficiency trade-off over existing state-of-the-art pruning methods. The source code is available at https://github.com/ILOT-code/FSR.
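The three stages described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that); it only shows one plausible reading of "focus, scan, refine". The function name `focus_scan_refine`, its parameters, the product of min-max-normalized scores used for the focus stage, and the use of cosine similarity throughout are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def _unit(x):
    """L2-normalize along the last axis (epsilon avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)


def _minmax(s):
    """Rescale a score vector to [0, 1]."""
    return (s - s.min()) / (s.max() - s.min() + 1e-8)


def focus_scan_refine(tokens, visual_score, text_emb, n_focus, n_scan):
    """Hypothetical sketch of a focus/scan/refine token pruner.

    tokens:       (N, D) visual token features
    visual_score: (N,) visual importance per token (assumed given)
    text_emb:     (D,) pooled instruction embedding (assumed given)
    Requires n_focus >= 1, n_scan >= 1 and n_focus + n_scan <= N.
    Returns an array of shape (n_focus + n_scan, D).
    """
    n = tokens.shape[0]
    unit = _unit(tokens)

    # Focus: combine visual importance with instruction relevance.
    # ASSUMPTION: relevance is cosine similarity to the text embedding,
    # and the two scores are combined by a simple product.
    relevance = unit @ _unit(text_emb)
    score = _minmax(visual_score) * _minmax(relevance)
    focus_idx = np.argsort(-score)[:n_focus]

    # Scan: among the remaining tokens, keep those least similar to
    # any focused token (i.e. most different from the focused evidence).
    rest = np.setdiff1d(np.arange(n), focus_idx)
    sim_to_focus = (unit[rest] @ unit[focus_idx].T).max(axis=1)
    scan_idx = rest[np.argsort(sim_to_focus)[:n_scan]]

    # Refine: assign every leftover token to its most similar scan anchor
    # and merge with score-proportional weights. The number of output
    # tokens stays at n_focus + n_scan.
    left = np.setdiff1d(rest, scan_idx)
    assign = (unit[left] @ unit[scan_idx].T).argmax(axis=1)
    w = score + 1e-8  # strictly positive merge weights
    merged = tokens[scan_idx] * w[scan_idx, None]
    total = w[scan_idx].copy()
    np.add.at(merged, assign, tokens[left] * w[left, None])
    np.add.at(total, assign, w[left])
    merged /= total[:, None]

    return np.concatenate([tokens[focus_idx], merged], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toks = rng.normal(size=(64, 16))
    pruned = focus_scan_refine(toks, rng.random(64), rng.normal(size=16), 8, 4)
    print(pruned.shape)
```

In this sketch the focused tokens are passed through unchanged, while each scan anchor becomes a weighted average of itself and the leftover tokens assigned to it. A real implementation would likely restrict merging to sufficiently similar or informative neighbours; that filtering is omitted here because the abstract does not describe how it is done.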