arxiv_cs_ai April 24, 2026

GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR

geo-ra, rlvr, low-rank-adaptation, reinforcement-learning, model-fine-tuning

Original Content

arXiv:2601.09361v3 Announce Type: replace-cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a key paradigm for improving large-scale reasoning models. Unlike supervised fine-tuning (SFT), RLVR exhibits distinct optimization dynamics and is sensitive to the preservation of pre-trained geometric structures. However, existing parameter-efficient methods face key limitations in this regime. Low-rank adaptation methods, such as PiSSA, are primarily designed for SFT and do not account for the distinct optimization dynamics and geometric structures of RLVR. Conversely, directly fine-tuning the unstructured sparse parameter subspace favored by RLVR encounters efficiency bottlenecks on modern hardware. To address these challenges, we propose GeoRA (Geometry-Aware Low-Rank Adaptation), a low-rank adaptation method tailored for RLVR. Specifically, GeoRA exploits the anisotropic and compressible structure of the RL update subspace and extracts its principal directions via Singular Value Decomposition (SVD) to initialize low-rank adapters, while freezing the residual components as a structural anchor during training. This design preserves the pre-trained structure and enables efficient dense computation. Experiments on Qwen and Llama models from 1.5B to 32B parameters show that GeoRA consistently outperforms strong low-rank baselines across RLVR settings in mathematics, medicine, and coding, while showing stronger generalization and less forgetting on out-of-domain tasks.
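
The abstract's core mechanism (initialize low-rank adapters from principal SVD directions and freeze the residual) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not say which matrix GeoRA decomposes or how the RL update subspace is identified, so the sketch simply decomposes a weight matrix directly, in the generic style of principal-component adapter initialization. The function name `svd_init_adapter`, the matrix sizes, and the rank are all hypothetical choices made for illustration.

```python
import numpy as np


def svd_init_adapter(weight, rank):
    """Split `weight` into trainable low-rank factors and a frozen residual.

    Assumption: the SVD is applied to the weight matrix itself. GeoRA's
    actual choice of matrix (derived from the RL update subspace) is not
    specified in the abstract.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])
    a = u[:, :rank] * sqrt_s          # shape (out_dim, rank), trainable
    b = sqrt_s[:, None] * vt[:rank]   # shape (rank, in_dim), trainable
    residual = weight - a @ b         # frozen "anchor" component
    return a, b, residual


# Hypothetical 64 x 32 layer with a rank-4 adapter.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32))
a, b, residual = svd_init_adapter(w, rank=4)

# At initialization the adapted layer reproduces the original weights.
reconstructed = residual + a @ b
```

During training only `a` and `b` would receive gradient updates, while `residual` stays fixed, so the forward pass remains a dense matrix product rather than a sparse one.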