arxiv_cs_cv April 24, 2026

Perceiving Without Vision: 4D Human-Scene Understanding from Wearable IMUs

Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

Translated: 2026/4/24 19:47:32

imu, 4d-understanding, wearable-sensors, human-motion, large-language-models

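Translation (from Japanese)

arXiv:2604.21926v1 Announce Type: new Abstract: Understanding human activity and the surrounding environment usually depends on visual perception, but cameras raise persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision, which aims to reconstruct human motion and 3D scene layout purely from everyday wearable sensors. To this end, we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D takes data from a small number of inertial sensors, such as those in earbuds, smartwatches, and smartphones, and predicts detailed 4D human motion together with coarse scene structure. Experiments on diverse human-scene datasets show that IMU-to-4D produces more coherent and temporally stable results than state-of-the-art cascaded pipelines, suggesting that wearable motion sensors alone can support rich 4D understanding.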

Original Content

arXiv:2604.21926v1 Announce Type: new Abstract: Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision. Its goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. For this we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D uses data from a few inertial sensors from earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than SoTA cascaded pipelines, suggesting wearable motion sensors alone can support rich 4D understanding.