arxiv_cs_ai April 24, 2026

CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

vla, robotics, generative-ai, flow-matching, reinforcement-learning

Original Content

arXiv:2604.21241v1 Announce Type: cross

Abstract: Vision-Language-Action (VLA) models often use intermediate representations to connect multimodal inputs with continuous control, yet spatial guidance is often injected implicitly through latent features. We propose CorridorVLA, which predicts sparse spatial anchors as incremental physical changes (e.g., $\Delta$-positions) and uses them to impose an explicit tolerance region in the training objective for action generation. The anchors define a corridor that guides a flow-matching action head: trajectories whose implied spatial evolution falls outside it receive corrective gradients, while minor deviations from contacts and execution noise are permitted. On the more challenging LIBERO-Plus benchmark, CorridorVLA yields consistent gains across both SmolVLA and GR00T, improving success rate by $3.4\%$ to $12.4\%$ over the corresponding baselines; notably, our GR00T-Corr variant reaches a success rate of $83.21\%$. These results indicate that action-aligned physical cues can provide direct and interpretable constraints for generative action policies, complementing spatial guidance encoded in visual or latent forms. Code is available at https://github.com/corridorVLA.
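
The abstract does not give the exact form of the corridor term, so the following is only a minimal sketch of one way a tolerance region around sparse anchors could enter a training objective. Everything in it is an assumption made for illustration: the function names (`implied_actions`, `cumulative_positions`, `corridor_penalty`), the squared-hinge form, the fixed `radius`, the treatment of actions as per-step position deltas, and the linear flow-matching path with noise at t = 0 and actions at t = 1. It is not the authors' implementation.

```python
import math


def implied_actions(x_t, v, t):
    """One-step estimate of the clean action chunk from a flow-matching state.

    Assumes the linear path x_t = (1 - t) * noise + t * actions, so the
    velocity target is (actions - noise) and actions = x_t + (1 - t) * v.
    """
    return [[xi + (1.0 - t) * vi for xi, vi in zip(x_row, v_row)]
            for x_row, v_row in zip(x_t, v)]


def cumulative_positions(delta_actions):
    """Integrate per-step position deltas into displacements from the start."""
    positions = []
    running = [0.0] * len(delta_actions[0])
    for step in delta_actions:
        running = [r + d for r, d in zip(running, step)]
        positions.append(list(running))
    return positions


def corridor_penalty(delta_actions, anchors, radius):
    """Squared hinge on the distance to each sparse anchor (hypothetical form).

    `anchors` maps a step index to the anchored displacement at that step.
    Points within `radius` of their anchor contribute zero loss, so small
    deviations are tolerated and only out-of-corridor points are penalized.
    """
    positions = cumulative_positions(delta_actions)
    total = 0.0
    for step, target in anchors.items():
        dist = math.dist(positions[step], target)
        total += max(0.0, dist - radius) ** 2
    return total / len(anchors)


# Four steps of +1 cm along x, with a single anchor at the final step.
chunk = [[0.01, 0.0, 0.0]] * 4
inside = corridor_penalty(chunk, {3: [0.04, 0.0, 0.0]}, radius=0.01)
outside = corridor_penalty(chunk, {3: [0.04, 0.05, 0.0]}, radius=0.01)
```

In this sketch `inside` is zero because the trajectory ends within the tolerance radius of its anchor, while `outside` is positive because the anchor lies 5 cm away. In a differentiable framework, a term like this would be added to the flow-matching loss so that only out-of-corridor trajectories receive a corrective gradient, which is the behavior the abstract describes.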