arxiv_cs_ai February 10, 2026

STEP: Warm-Started Visuomotor Policies with Spatiotemporal Consistency Prediction

Translated: 2026/3/7 13:27:17

diffusion-model visuomotor-control spatiotemporal-consistency

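Translated Summary

Diffusion policies have recently become a powerful paradigm for visuomotor control in robotic manipulation, because they can model the distribution of action sequences and capture its multimodality. Their iterative denoising, however, adds substantial inference latency, which limits the control frequency achievable in real-time closed-loop systems. Existing acceleration methods reduce the number of sampling steps, replace diffusion with direct prediction, or reuse previously generated actions, and they often fail to preserve action quality while keeping latency consistently low.

This work proposes STEP, a lightweight spatiotemporal consistency prediction mechanism. STEP constructs high-quality warm-start actions that are close in distribution to the target action and temporally consistent, without compromising the generative capability of the original diffusion policy. The authors also propose a velocity-aware perturbation injection mechanism, which adaptively modulates actuation excitation according to temporal action variation in order to prevent execution stalls, particularly in real-world tasks. A theoretical analysis shows that the proposed prediction induces a locally contractive mapping, which ensures that action errors converge during diffusion refinement.

The method is evaluated on nine simulated benchmarks and two real-world tasks. With 2 denoising steps, STEP achieves on average a 21.6% higher success rate than BRIDGER and a 27.5% higher success rate than DDIM, on the RoboMimic benchmark and the real-world tasks respectively. These results indicate that STEP consistently advances the Pareto frontier of inference latency versus success rate over existing methods.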

Original Content

arXiv:2602.08245v1 Announce Type: cross Abstract: Diffusion policies have recently emerged as a powerful paradigm for visuomotor control in robotic manipulation due to their ability to model the distribution of action sequences and capture multimodality. However, iterative denoising leads to substantial inference latency, limiting control frequency in real-time closed-loop systems. Existing acceleration methods either reduce sampling steps, bypass diffusion through direct prediction, or reuse past actions, but often struggle to jointly preserve action quality and achieve consistently low latency. In this work, we propose STEP, a lightweight spatiotemporal consistency prediction mechanism to construct high-quality warm-start actions that are both distributionally close to the target action and temporally consistent, without compromising the generative capability of the original diffusion policy. Then, we propose a velocity-aware perturbation injection mechanism that adaptively modulates actuation excitation based on temporal action variation to prevent execution stall especially for real-world tasks. We further provide a theoretical analysis showing that the proposed prediction induces a locally contractive mapping, ensuring convergence of action errors during diffusion refinement. We conduct extensive evaluations on nine simulated benchmarks and two real-world tasks. Notably, STEP with 2 steps can achieve an average 21.6% and 27.5% higher success rate than BRIDGER and DDIM on the RoboMimic benchmark and real-world tasks, respectively. These results demonstrate that STEP consistently advances the Pareto frontier of inference latency and success rate over existing methods.
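The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of warm-started refinement, not the authors' method. It assumes a hypothetical exponential rule for the velocity-aware perturbation (more noise when recent actions barely change) and uses a toy denoiser that halves the distance to a fixed target as a stand-in for a learned, locally contractive refinement step. All function names, parameters, and constants here are illustrative assumptions.

```python
import math
import random

# Assumption: the "target" action that the toy denoiser contracts toward.
TARGET = [0.0, 0.0]


def mean_step_speed(recent_actions):
    """Average Euclidean distance between consecutive actions (needs >= 2)."""
    pairs = zip(recent_actions, recent_actions[1:])
    dists = [math.dist(a, b) for a, b in pairs]
    return sum(dists) / len(dists)


def noise_scale(recent_actions, base=0.01, gain=0.05, tau=0.1):
    """Hypothetical velocity-aware rule: inject more perturbation when the
    recent actions barely change, less when they are already varying."""
    return base + gain * math.exp(-mean_step_speed(recent_actions) / tau)


def toy_denoise_step(action, k):
    """Toy stand-in for one learned denoising step: move halfway to TARGET,
    i.e. a contraction with factor 0.5. The step index k is unused here."""
    return [x + 0.5 * (t - x) for x, t in zip(action, TARGET)]


def refine(warm_action, denoise_step, sigma, num_steps=2, rng=None):
    """Perturb a warm-start action, then apply a few refinement steps."""
    rng = rng or random.Random(0)
    action = [x + sigma * rng.gauss(0.0, 1.0) for x in warm_action]
    for k in range(num_steps):
        action = denoise_step(action, k)
    return action


if __name__ == "__main__":
    history = [[0.10, 0.10], [0.10, 0.10], [0.10, 0.10]]  # stalled motion
    sigma = noise_scale(history)
    refined = refine(history[-1], toy_denoise_step, sigma, num_steps=2)
    print("sigma:", sigma)
    print("refined action:", refined)
```

In this toy setting each refinement step multiplies the error by 0.5, so two steps shrink the error of the perturbed warm start to a quarter. A warm start that is already close to the target therefore needs only a few steps, which is the intuition behind combining a good initial prediction with a short refinement schedule.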