arxiv_cs_cv February 10, 2026

Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation

Translated: 2026/3/15 17:03:44

robotic-manipulation, multi-modal-sensing, imitation-learning, tactile-tracking, diffusion-policy

Original Content

arXiv:2512.09851v2
Announce Type: replace-cross

Abstract: Robotic manipulation requires both rich multimodal perception and effective learning frameworks to handle complex real-world tasks. See-through-skin (STS) sensors, which combine tactile and visual perception, offer promising sensing capabilities, while modern imitation learning provides powerful tools for policy acquisition. However, existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. Furthermore, integrating these rich multimodal signals into learning-based manipulation pipelines remains an open challenge. We introduce TacThru, an STS sensor that enables simultaneous visual perception and robust tactile signal extraction, and TacThru-UMI, an imitation learning framework that leverages these multimodal signals for manipulation. Our sensor features a fully transparent elastomer, persistent illumination, novel keyline markers, and efficient tracking, while our learning system integrates these signals through a Transformer-based Diffusion Policy. Experiments on five challenging real-world tasks show that TacThru-UMI achieves an average success rate of 85.5%, significantly outperforming the tactile policy baseline (66.3%) and the vision-only policy baseline (55.4%). The system excels in critical scenarios, including contact detection with thin and soft objects and precision manipulation requiring multimodal coordination. This work demonstrates that combining simultaneous multimodal perception with modern learning frameworks enables more precise, adaptable robotic manipulation.
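To make the reported margins concrete, the short sketch below computes the absolute gain (in percentage points) and the relative gain of TacThru-UMI over each baseline, using only the three average success rates quoted in the abstract. The dictionary keys and the helper name `gains_over_baselines` are illustrative assumptions, not names from the paper, and per-task results are not given here, so only the averages are used.

```python
# Average success rates (%) as quoted in the abstract.
# The labels are illustrative; the paper may name its baselines differently.
success = {
    "TacThru-UMI": 85.5,
    "tactile policy baseline": 66.3,
    "vision-only policy baseline": 55.4,
}


def gains_over_baselines(rates, system="TacThru-UMI"):
    """Return {baseline: (absolute gain in points, relative gain in %)}.

    `gains_over_baselines` is a hypothetical helper written for this example.
    """
    ours = rates[system]
    return {
        name: (round(ours - rate, 1), round(100 * (ours - rate) / rate, 1))
        for name, rate in rates.items()
        if name != system
    }


gains = gains_over_baselines(success)
for name, (points, relative) in gains.items():
    print(f"vs {name}: +{points} points ({relative}% relative)")
```

On these averages, the gap is 19.2 percentage points over the tactile policy baseline and 30.1 percentage points over the vision-only baseline.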