arxiv_cs_cv February 10, 2026

ManiVID-3D: Generalizable View-Invariant Reinforcement Learning for Robotic Manipulation via Disentangled 3D Representations

reinforcement-learning, robotic-manipulation, 3d-representation, view-invariance, computer-vision

Original Content

arXiv:2509.11125v2 Announce Type: replace-cross

Abstract: Deploying visual reinforcement learning (RL) policies in real-world manipulation is often hindered by camera viewpoint changes. A policy trained from a fixed front-facing camera may fail when the camera is shifted -- an unavoidable situation in real-world settings where sensor placement is hard to manage appropriately. Existing methods often rely on precise camera calibration or struggle with large perspective changes. To address these limitations, we propose ManiVID-3D, a novel 3D RL architecture designed for robotic manipulation, which learns view-invariant representations through self-supervised disentangled feature learning. The framework incorporates ViewNet, a lightweight yet effective module that automatically aligns point cloud observations from arbitrary viewpoints into a unified spatial coordinate system without the need for extrinsic calibration. Additionally, we develop an efficient GPU-accelerated batch rendering module capable of processing over 5000 frames per second, enabling large-scale training for 3D visual RL at unprecedented speeds. Extensive evaluation across 10 simulated and 5 real-world tasks demonstrates that our approach achieves a 40.6% higher success rate than state-of-the-art methods under viewpoint variations while using 80% fewer parameters. The system's robustness to severe perspective changes and strong sim-to-real performance highlight the effectiveness of learning geometrically consistent representations for scalable robotic manipulation in unstructured environments.
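
To make "aligning point cloud observations into a unified spatial coordinate system" concrete, the sketch below shows what such an alignment achieves geometrically. It is not the ViewNet module: the abstract states that ViewNet works without extrinsic calibration, whereas this illustration assumes the camera poses are known and simply applies them. The scene points, the two camera poses, and all function names are hypothetical and chosen only for illustration.

```python
import math

def rot_z(theta):
    """3x3 rotation matrix about the z-axis (theta in radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def transpose(R):
    """Transpose of a 3x3 matrix given as nested lists."""
    return [list(col) for col in zip(*R)]

def matvec(R, p):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(R[i][j] * p[j] for j in range(3)) for i in range(3)]

def world_to_camera(points, R, t):
    """Express world-frame points in a camera frame with pose (R, t):
    p_cam = R^T (p_world - t)."""
    Rt = transpose(R)
    return [matvec(Rt, [p[k] - t[k] for k in range(3)]) for p in points]

def camera_to_world(points, R, t):
    """Inverse map back to the shared frame: p_world = R p_cam + t."""
    return [[a + b for a, b in zip(matvec(R, p), t)] for p in points]

def max_abs_diff(P, Q):
    """Largest coordinate-wise difference between two point lists."""
    return max(abs(x - y) for p, q in zip(P, Q) for x, y in zip(p, q))

# Hypothetical scene (world frame) and two hypothetical camera poses.
scene = [[0.5, 0.0, 0.1], [0.4, 0.2, 0.3], [0.6, -0.1, 0.0]]
pose_a = (rot_z(0.0), [0.0, 0.0, 1.0])
pose_b = (rot_z(math.pi / 3), [0.3, -0.5, 1.2])

# The same scene looks different from each viewpoint...
obs_a = world_to_camera(scene, *pose_a)
obs_b = world_to_camera(scene, *pose_b)

# ...but coincides again once both observations are mapped to one frame.
aligned_a = camera_to_world(obs_a, *pose_a)
aligned_b = camera_to_world(obs_b, *pose_b)

print("raw observations differ by:", max_abs_diff(obs_a, obs_b))
print("aligned observations differ by:", max_abs_diff(aligned_a, aligned_b))
```

A policy that consumes the aligned clouds sees the same input regardless of where the camera sits, which is the property the abstract calls view invariance. According to the abstract, ViewNet reaches such a unified frame without the calibrated poses that this sketch takes as given.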