arxiv_cs_cv April 20, 2026

AEGIS: Anchor-Enforced Gradient Isolation for Knowledge-Preserving Vision-Language-Action Fine-Tuning

Translated: 2026/4/20 10:48:32

vision-language-models, robotics, gradient-isolation, vision-question-answering, continuous-learning

Japanese Translation

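arXiv:2604.16067v1 Announce Type: cross Abstract: Adapting pre-trained vision-language models (VLMs) to robot control requires injecting high-magnitude continuous gradients from a flow-matching action expert into a backbone trained only with cross-entropy (CE). This cross-modal gradient asymmetry, namely the spectral dimensionality mismatch between low-rank MSE regression gradients and the high-dimensional semantic manifold shaped by CE pre-training, rapidly and severely erodes the VLM's visual question answering (VQA) capability. Industry-standard defences either cut the gradient path entirely (stop gradient), giving up the rich continuous supervision, or adopt low-rank adapters (LoRA) that limit parameter capacity; these constrain only the rank of the updates, not their direction, and so still overwrite the pre-trained semantic manifold. We propose AEGIS (Anchor-Enforced Gradient Isolation System), a buffer-free, layer-wise orthogonal gradient projection framework that enables direct continuous MSE learning while preserving the pre-trained VQA semantic manifold. AEGIS pre-computes a static Gaussian reference anchor from masked VQA forward passes across all transformer layers, and at each training step builds a Wasserstein-2 transport penalty that produces an anchor restoration gradient. AEGIS separates the task gradient and the anchor gradient with a sequential dual backward pass and, for each transformer layer, applies a single Gram-Schmidt orthogonal projection that bends the task gradient away from the destructive direction while keeping its constructive content. On average the projection discards less than 1% of the gradient energy, yet it removes the cumulative activation drift that causes severe forgetting.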

Original Content

arXiv:2604.16067v1 Announce Type: cross Abstract: Adapting pre-trained vision-language models (VLMs) for robotic control requires injecting high-magnitude continuous gradients from a flow-matching action expert into a backbone trained exclusively with cross-entropy (CE). This cross-modal gradient asymmetry (the spectral dimensionality mismatch between low-rank MSE regression gradients and the high-dimensional semantic manifold sculpted by CE pre-training) causes rapid, severe erosion of the VLM's visual-question-answering (VQA) capability. Industry-standard defences either sever the gradient pathway entirely via stop gradient, discarding the rich continuous supervision, or restrict parameter capacity through low-rank adapters (LoRA) that constrain the rank of updates but not their direction, and thus still overwrite the pre-trained manifold. We introduce AEGIS (Anchor-Enforced Gradient Isolation System): a buffer-free, layer-wise orthogonal gradient projection framework that enables direct continuous MSE learning while preserving the pre-trained VQA manifold, without any co-training data or replay buffer. AEGIS pre-computes a static Gaussian reference anchor from masked VQA forward passes across all transformer layers, then at each training step constructs a Wasserstein-2 transport penalty that generates an anchor restoration gradient. A sequential dual-backward pass separates the task and anchor gradients; for each transformer layer, AEGIS applies a single Gram-Schmidt orthogonal projection that bends the task gradient away from the destructive direction while preserving its constructive content. The projection sheds less than 1% of gradient energy on average, yet eliminates the cumulative activation drift that drives severe forgetting.
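
The abstract does not give formulas, so the following is only a rough sketch of the two ingredients it names, not the authors' implementation. It assumes diagonal Gaussians for the anchor (for which the squared Wasserstein-2 distance has a simple closed form) and an unconditional Gram-Schmidt step that removes the task gradient's component along the anchor gradient; the paper may condition or scale the projection differently. All function names (`w2_squared_diag`, `project_task_gradient`, `energy_shed`) are hypothetical, and per-layer gradients are represented as flat lists of floats.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two flattened vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def w2_squared_diag(mu: Sequence[float], var: Sequence[float],
                    mu_ref: Sequence[float], var_ref: Sequence[float]) -> float:
    """Squared Wasserstein-2 distance between two diagonal Gaussians.

    Assumption: diagonal covariances, where the closed form is
    ||mu - mu_ref||^2 + sum_i (sqrt(var_i) - sqrt(var_ref_i))^2.
    """
    mean_term = sum((m - r) ** 2 for m, r in zip(mu, mu_ref))
    std_term = sum((math.sqrt(v) - math.sqrt(r)) ** 2
                   for v, r in zip(var, var_ref))
    return mean_term + std_term


def project_task_gradient(g_task: Sequence[float],
                          g_anchor: Sequence[float],
                          eps: float = 1e-12) -> List[float]:
    """Single Gram-Schmidt step for one layer.

    Returns g_task minus its component along g_anchor, so the result is
    orthogonal to g_anchor. Assumption: the projection is applied
    unconditionally; if g_anchor is (near) zero, g_task is returned as-is.
    """
    denom = dot(g_anchor, g_anchor)
    if denom < eps:
        return list(g_task)
    coeff = dot(g_task, g_anchor) / denom
    return [t - coeff * a for t, a in zip(g_task, g_anchor)]


def energy_shed(g_task: Sequence[float], g_proj: Sequence[float]) -> float:
    """Fraction of squared gradient norm removed by the projection."""
    total = dot(g_task, g_task)
    if total == 0.0:
        return 0.0
    return 1.0 - dot(g_proj, g_proj) / total


if __name__ == "__main__":
    # Toy per-layer gradients (made-up numbers, for illustration only).
    g_task = [3.0, 4.0]
    g_anchor = [1.0, 0.0]
    g_proj = project_task_gradient(g_task, g_anchor)
    print(g_proj, energy_shed(g_task, g_proj))
```

In this toy case the task gradient `[3.0, 4.0]` projected against the anchor gradient `[1.0, 0.0]` becomes `[0.0, 4.0]`. The shed fraction here is large because the two toy vectors overlap strongly; the abstract reports under 1% on average for real gradients. In a real training loop the two gradients would come from two separate backward passes (task loss and anchor penalty) and the projection would be repeated per transformer layer.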