arxiv_cs_cv February 10, 2026

You Only Pose Once: A Minimalist's Detection Transformer for Monocular RGB Category-level 9D Multi-Object Pose Estimation

Translated: 2026/3/15 9:02:43

robotics, monocular-visual-odometry, multi-object-6d-estimation, arxiv, transformer

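Translation

arXiv:2508.14965v2 Announce Type: replace

Abstract: Accurately recovering the full 9-DoF pose of unseen instances from a single RGB image is a core challenge for robotics and automation. Many existing solutions still rely on pseudo-depth, CAD models, or multi-stage cascades that separate 2D detection from pose estimation. Motivated by the need for a simpler, RGB-only alternative that learns directly at the category level, we revisit a longstanding question: can object detection and 9-DoF pose estimation be unified with high performance, without any additional data? With our method, YOPO, a single-stage, query-based framework, we show that they can. YOPO treats category-level 9-DoF estimation as a natural extension of 2D detection, equipping a detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost. The model is trained end-to-end using only RGB images and category-level pose labels. Despite its minimalist design, YOPO sets a new state of the art (SOTA) on three benchmarks. On the REAL275 dataset, it achieves 79.6% $\mathrm{IoU}_{50}$ and 54.1% under the $10^\circ\,10\,\mathrm{cm}$ metric, surpassing prior RGB-only methods and substantially narrowing the gap to RGB-D systems. The code, models, and additional qualitative results are available at https://mikigom.github.io/YOPO-project-page.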

Original Content

arXiv:2508.14965v2 Announce Type: replace

Abstract: Accurately recovering the full 9-DoF pose of unseen instances within specific categories from a single RGB image remains a core challenge for robotics and automation. Most existing solutions still rely on pseudo-depth, CAD models, or multi-stage cascades that separate 2D detection from pose estimation. Motivated by the need for a simpler, RGB-only alternative that learns directly at the category level, we revisit a longstanding question: Can object detection and 9-DoF pose estimation be unified with high performance, without any additional data? We show that they can with our method, YOPO, a single-stage, query-based framework that treats category-level 9-DoF estimation as a natural extension of 2D detection. YOPO augments a transformer detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost. The model is trained end-to-end only with RGB images and category-level pose labels. Despite its minimalist design, YOPO sets a new state of the art on three benchmarks. On the REAL275 dataset, it achieves 79.6% $\mathrm{IoU}_{50}$ and 54.1% under the $10^\circ\,10\,\mathrm{cm}$ metric, surpassing prior RGB-only methods and closing much of the gap to RGB-D systems. The code, models, and additional qualitative results can be found on https://mikigom.github.io/YOPO-project-page.
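
For readers unfamiliar with the $10^\circ\,10\,\mathrm{cm}$ metric quoted in the abstract, the following is a minimal sketch of how such a threshold check is commonly computed for a single predicted pose. It is not taken from the YOPO code base. The function names are hypothetical, translations are assumed to be in metres, and the sketch ignores the special handling that benchmark evaluation scripts typically apply to rotationally symmetric objects.

```python
import math

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle, in degrees, between two 3x3 rotation matrices."""
    # trace(R_pred^T @ R_gt) equals the sum of elementwise products.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def translation_error_cm(t_pred, t_gt):
    """Euclidean distance between two translations (assumed metres), in cm."""
    return 100.0 * math.dist(t_pred, t_gt)

def within_10deg_10cm(R_pred, t_pred, R_gt, t_gt, max_deg=10.0, max_cm=10.0):
    """True if both the rotation and the translation error are under threshold."""
    return (rotation_error_deg(R_pred, R_gt) <= max_deg
            and translation_error_cm(t_pred, t_gt) <= max_cm)

def rot_z(deg):
    """Helper for the example: rotation by `deg` degrees about the z-axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# A prediction that is 5 degrees and 5 cm off passes the check;
# one that is 15 degrees off fails it.
ok = within_10deg_10cm(rot_z(5), [0.05, 0.0, 1.0], rot_z(0), [0.0, 0.0, 1.0])
bad = within_10deg_10cm(rot_z(15), [0.05, 0.0, 1.0], rot_z(0), [0.0, 0.0, 1.0])
```

Under this reading, a score such as the 54.1% reported on REAL275 corresponds to the share of evaluated instances for which a check of this kind succeeds. The exact evaluation protocol is defined by the benchmark, not by this sketch.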