arxiv_cs_cv April 24, 2026

Sapiens2

Translated: 2026/4/24 19:45:44

sapiens2, computer-vision, transformers, self-supervised-learning, vision-language-models

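Translated Abstract

arXiv:2604.21681v1 Announce type: new. Abstract: We present Sapiens2, a family of high-resolution transformer models for human-centric vision, focused on generalization, versatility, and high-fidelity outputs. Model sizes range from 0.4 billion to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 brings substantial improvements over its predecessor in both pretraining and post-training. First, to learn features that capture low-level detail for dense prediction and high-level semantics for zero-shot or few-label settings, we combine masked image reconstruction with self-distilled contrastive objectives. Our evaluations show that this unified pretraining objective suits a wider range of downstream tasks. Second, on the data axis, we pretrain on a curated dataset of 1 billion high-quality human images and improve both the quality and the quantity of task annotations. Third, on the architecture side, we incorporate advances from frontier models that enable longer training schedules with improved stability. Our 4K models adopt windowed attention to handle longer spatial context and are pretrained with 2K output resolution. Sapiens2 sets a new state of the art, improving over the first generation on pose (+4 mAP), body-part segmentation (+24.3 mIoU), and normal estimation (45.6% lower angular error), and extends to new tasks such as pointmap and albedo estimation. Code: https://github.com/facebookresearch/sapiens2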

Original Content

arXiv:2604.21681v1 Announce Type: new Abstract: We present Sapiens2, a model family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Our model sizes range from 0.4 to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 substantially improves over its predecessor in both pretraining and post-training. First, to learn features that capture low-level details (for dense prediction) and high-level semantics (for zero-shot or few-label settings), we combine masked image reconstruction with self-distilled contrastive objectives. Our evaluations show that this unified pretraining objective is better suited for a wider range of downstream tasks. Second, along the data axis, we pretrain on a curated dataset of 1 billion high-quality human images and improve the quality and quantity of task annotations. Third, architecturally, we incorporate advances from frontier models that enable longer training schedules with improved stability. Our 4K models adopt windowed attention to reason over longer spatial context and are pretrained with 2K output resolution. Sapiens2 sets a new state-of-the-art and improves over the first generation on pose (+4 mAP), body-part segmentation (+24.3 mIoU), normal estimation (45.6% lower angular error) and extends to new tasks such as pointmap and albedo estimation. Code: https://github.com/facebookresearch/sapiens2
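
The abstract states that pretraining combines masked image reconstruction with self-distilled contrastive objectives, but it does not give the loss formulation. The sketch below is one plausible reading, not the authors' implementation: it sums a mean squared error over masked patches with a teacher-student cross-entropy of the kind used in DINO-style self-distillation, which stands in here for the unspecified objective. The function names, the temperatures (0.1 and 0.04), and the `weight` parameter are all assumptions made for illustration.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits scaled by 1/temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - peak - log_total for s in scaled]


def masked_reconstruction_loss(pred_patches, target_patches, mask):
    """Mean squared error averaged over masked patches only (mask[i] == 1)."""
    per_patch = [
        sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
        for pred, target, m in zip(pred_patches, target_patches, mask)
        if m
    ]
    return sum(per_patch) / len(per_patch) if per_patch else 0.0


def self_distillation_loss(student_logits, teacher_logits,
                           student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy of the student distribution under the teacher distribution.

    Assumption: a DINO-style stand-in; the temperatures are illustrative.
    """
    teacher = [math.exp(v) for v in log_softmax(teacher_logits, teacher_temp)]
    student_log = log_softmax(student_logits, student_temp)
    return -sum(t * s for t, s in zip(teacher, student_log))


def combined_loss(pred_patches, target_patches, mask,
                  student_logits, teacher_logits, weight=1.0):
    """Reconstruction term plus a weighted self-distillation term.

    Assumption: a plain weighted sum; the abstract does not say how the
    two objectives are balanced.
    """
    recon = masked_reconstruction_loss(pred_patches, target_patches, mask)
    distill = self_distillation_loss(student_logits, teacher_logits)
    return recon + weight * distill
```

In this sketch the reconstruction term only sees patches that were hidden from the encoder, which pushes features toward low-level detail, while the distillation term rewards the student for matching the teacher's distribution over a set of prototype logits, which pushes toward higher-level semantics. In a real training loop the teacher would typically be an exponential moving average of the student and both terms would be computed on tensors rather than Python lists.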