arxiv_cs_cv April 20, 2026

SignX: Continuous Sign Recognition in Compact Pose-Rich Latent Space

Translated: 2026/4/20 10:49:32

sign-recognition, pose-estimation, video-understanding, latent-space, slr

Original Content

arXiv:2504.16315v4 Announce Type: replace

Abstract: The complexity of sign language (SL) data processing brings many challenges. Current approaches to sign recognition aim to translate RGB sign language videos, by way of pose information, into word-based ID glosses, which serve to uniquely identify signs. This paper proposes SignX, a novel framework for continuous sign language recognition (SLR) in a compact pose-rich latent space. First, we construct a unified latent representation that encodes heterogeneous pose formats (SMPLer-X, DWPose, MediaPipe, PrimeDepth, and Sapiens Segmentation) into a compact, information-dense space. Second, we train a ViT-based Video-to-Pose module to extract this latent representation directly from raw videos. Finally, we develop a temporal modeling and sequence refinement method that operates entirely in this latent space. This multi-stage design achieves end-to-end SLR while significantly reducing computational cost. Experimental results demonstrate that SignX achieves state-of-the-art (SOTA) accuracy on continuous SLR and translation tasks, delivering a nearly 50-fold speedup over pixel-space baselines.
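The data flow described in the abstract can be illustrated with a toy sketch. This is not the SignX implementation: the feature sizes in `POSE_DIMS`, the value of `LATENT_DIM`, and every function name below are hypothetical, a fixed random linear map stands in for the learned encoder, and a moving average stands in for the temporal model. The ViT-based Video-to-Pose stage is omitted entirely. The sketch only shows the shape of the idea: several pose formats are fused into one small vector per frame, and the temporal step then touches only those vectors.

```python
import random

# Hypothetical per-frame feature sizes for each pose source (illustrative only;
# the abstract does not state any dimensions).
POSE_DIMS = {
    "SMPLer-X": 6,
    "DWPose": 4,
    "MediaPipe": 4,
    "PrimeDepth": 3,
    "Sapiens Segmentation": 3,
}
LATENT_DIM = 8  # assumed size of the compact latent vector


def make_projection(in_dim, out_dim, seed=0):
    """Fixed random linear map standing in for a learned encoder."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def encode_frame(pose_features, projection):
    """Fuse the heterogeneous pose features of one frame into one latent vector."""
    fused = []
    for name in POSE_DIMS:  # fixed order keeps the fused layout stable
        fused.extend(pose_features[name])
    return [sum(w * x for w, x in zip(row, fused)) for row in projection]


def temporal_smooth(latents, window=3):
    """Temporal-model stand-in: moving average over time, on latents only."""
    half = window // 2
    smoothed = []
    for t in range(len(latents)):
        lo, hi = max(0, t - half), min(len(latents), t + half + 1)
        chunk = latents[lo:hi]
        smoothed.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return smoothed


def run_sketch(num_frames=5, seed=1):
    """Generate random pose features, encode each frame, then smooth over time."""
    rng = random.Random(seed)
    projection = make_projection(sum(POSE_DIMS.values()), LATENT_DIM)
    latents = []
    for _ in range(num_frames):
        frame = {name: [rng.random() for _ in range(dim)]
                 for name, dim in POSE_DIMS.items()}
        latents.append(encode_frame(frame, projection))
    return temporal_smooth(latents)
```

Calling `run_sketch()` returns one `LATENT_DIM`-sized vector per frame. In this toy setup the temporal step processes 8 numbers per frame rather than a full image, which is the intuition behind the reported savings, though the actual speedup figure comes from the paper's experiments and not from anything shown here.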