arxiv_cs_lg February 10, 2026

tLoRA: Efficient Multi-LoRA Training with Elastic Shared Super-Models

Translated: 2026/3/15 13:05:43

lora, llm-finetuning, distributed-training, gpu-utilization, machine-learning

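Japanese Translation (rendered in English)

As LoRA (Low-Rank Adaptation) becomes the standard approach for efficiently fine-tuning large language models, shared clusters increasingly run many concurrent LoRA training jobs over the same frozen backbone. Although recent advances make it possible to co-locate (batch) multiple adapters at serving time, efficiently co-locating heterogeneous LoRA adapters at training time poses its own challenges. Jobs differ in adapter rank, batch size, and resource allocation, and naive batching can cause synchronization stalls, communication overhead, and per-job slowdowns that are worse than running each job independently. We introduce tLoRA, a framework that enables efficient batch training of multiple LoRA jobs. tLoRA fuses adapters that share the same base model into an elastic shared super-model and uses existing distributed training frameworks to derive parallelism plans that share resources effectively. At the kernel level, tLoRA employs a fused LoRA kernel that adaptively reconstructs low-rank computation tiles and schedules rank-aware nano-batches to maximize the overlap of computation and communication across adapters. At the scheduling layer, tLoRA incorporates an online, residual-capacity-aware scheduler that groups jobs to maximize aggregate throughput. In evaluations on real-world cluster traces, tLoRA improved training throughput by 1.2-1.8x, job training completion time by 2.3-5.4x, and GPU utilization by 37%.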

Original Content

arXiv:2602.07263v1 Announce Type: new Abstract: As Low-Rank Adaptation (LoRA) becomes the standard approach for efficiently fine-tuning large language models (LLMs), shared clusters increasingly execute many concurrent LoRA training jobs over the same frozen backbone. While recent advances enable batching (co-locating) multiple adapters during serving, efficient training-time co-location of heterogeneous LoRA adapters presents unique challenges. Jobs often differ in adapter rank, batch size, and resource allocation, and naive batching can introduce synchronization stalls, communication overheads, and per-job slowdowns that are worse than executing independently. We introduce tLoRA, a framework that enables efficient batch training of multiple LoRA jobs. tLoRA fuses adapters that share the same base model into an elastic shared super-model, exploiting existing distributed training frameworks to derive parallelism plans that share resources effectively. At the kernel level, tLoRA employs a fused LoRA kernel that adaptively reconstructs low-rank computation tiles and schedules rank-aware nano-batches to maximize overlap between computation and communication across adapters. At the scheduling layer, tLoRA incorporates an online, residual-capacity-aware scheduler that adaptively groups jobs to maximize collective throughput. Evaluations using real-world cluster traces demonstrate that tLoRA improves training throughput by 1.2-1.8x, job training completion time by 2.3-5.4x, and GPU utilization by 37%.
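
The abstract does not specify how the online, residual-capacity-aware scheduler decides which jobs to group. As a rough illustration only, the sketch below swaps in a classic online best-fit heuristic: each arriving job joins the existing group with the least residual capacity that can still hold it, and a new group is opened otherwise. The names Job, Group, place_online and schedule, the normalized demand field, and the capacity of 1.0 are all hypothetical and are not taken from tLoRA.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    rank: int      # LoRA adapter rank (carried along for illustration only)
    demand: float  # hypothetical normalized resource demand of the job


@dataclass
class Group:
    capacity: float = 1.0  # hypothetical normalized capacity of one shared group
    jobs: list = field(default_factory=list)

    @property
    def residual(self) -> float:
        # Capacity left after accounting for every job already in the group.
        return self.capacity - sum(j.demand for j in self.jobs)


def place_online(groups, job, capacity=1.0, eps=1e-9):
    """Place one arriving job using best-fit on residual capacity.

    This is a stand-in heuristic, not the scheduler described in the paper.
    """
    if job.demand > capacity + eps:
        raise ValueError(f"job {job.name} exceeds group capacity")
    # Groups that can still hold the job (eps absorbs float rounding).
    candidates = [g for g in groups if g.residual + eps >= job.demand]
    if candidates:
        # Best fit: the tightest group that still has room.
        target = min(candidates, key=lambda g: g.residual)
    else:
        target = Group(capacity=capacity)
        groups.append(target)
    target.jobs.append(job)
    return target


def schedule(jobs, capacity=1.0):
    """Process jobs strictly in arrival order and return the resulting groups."""
    groups = []
    for job in jobs:
        place_online(groups, job, capacity)
    return groups


if __name__ == "__main__":
    arrivals = [
        Job("a", rank=8, demand=0.5),
        Job("b", rank=16, demand=0.25),
        Job("c", rank=64, demand=0.75),
        Job("d", rank=4, demand=0.25),
    ]
    for i, group in enumerate(schedule(arrivals)):
        print(i, [j.name for j in group.jobs], group.residual)
```

A real scheduler of this kind would presumably also weigh adapter rank, batch size, and the throughput effect of co-location, which this sketch ignores; it only shows the bookkeeping of tracking residual capacity per group as jobs arrive.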