arxiv_cs_lg February 10, 2026

V-ABFT: Variance-Based Adaptive Threshold for Fault-Tolerant Matrix Multiplication in Mixed-Precision Deep Learning

fault-tolerance, matrix-multiplication, deep-learning, mixed-precision, gemm

Original Content

arXiv:2602.08043v1 Announce Type: new

Abstract: Algorithm-Based Fault Tolerance (ABFT) is widely adopted to detect silent data corruptions (SDCs) in matrix multiplication, a cornerstone operation in deep learning systems. However, existing threshold determination methods face critical challenges: analytical bounds are overly conservative, while probabilistic approaches such as A-ABFT yield thresholds $160$--$4200\times$ larger than the actual rounding errors. We present V-ABFT, a variance-based adaptive threshold algorithm that achieves tighter error bounds by directly modeling the verification difference. By leveraging statistical variance estimation, V-ABFT reduces the threshold-to-actual-error ratio to approximately $7$--$20\times$ for FP32/FP64 and $48$--$158\times$ for BF16, a $6$--$48\times$ improvement over A-ABFT, while maintaining a zero false-positive rate across BF16, FP16, FP32, and FP64 precisions. Furthermore, we demonstrate that for fused-kernel ABFT implementations that verify before output quantization, low-precision GEMM can use FP32-level thresholds ($e_{\max} \approx 10^{-6}$), enabling $\sim 1000\times$ finer detection granularity than offline verification of low-precision output ($e_{\max} \approx 10^{-3}$). We reproduce A-ABFT's experimental setup and validate our implementation against the original paper's results. Our method has only $O(n)$ complexity, using max/min/mean statistics, compared with A-ABFT's $O(pn)$ for finding the $p$ largest values. Extensive experiments on synthetic data and real model weights (LLaMA-7B, GPT-2, ViT) demonstrate V-ABFT's effectiveness across diverse distributions. V-ABFT is platform-agnostic and has been integrated into fault-tolerant GEMM implementations on both NPUs and GPUs.
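To make the role of the threshold concrete, the following is a minimal sketch of classic column-checksum ABFT in pure Python, not the V-ABFT algorithm itself. It relies on the identity $e^T (AB) = (e^T A) B$: the column sums of $C$ are recomputed through a cheaper checksum path, and a column is flagged when the verification difference exceeds a threshold. The abstract does not give V-ABFT's variance-based formula, so the threshold here is an assumed, deliberately conservative analytical-style bound (the kind the abstract describes as overly conservative). All function names (`matmul`, `column_sums`, `conservative_threshold`, `abft_check`) and the injected fault size are hypothetical choices for illustration.

```python
import random
import sys


def matmul(A, B):
    """Plain GEMM on lists of lists: C = A @ B."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[t] * B[t][j] for t in range(inner)) for j in range(cols)]
            for row in A]


def column_sums(M):
    """Return e^T M, the vector of column sums of M."""
    return [sum(col) for col in zip(*M)]


def conservative_threshold(A, B):
    """ASSUMPTION: a crude worst-case rounding bound, NOT V-ABFT's formula.

    Both the direct path and the checksum path accumulate at most about
    (m + k) * eps * m * k * max|A| * max|B| of rounding error per column.
    """
    m, k = len(A), len(A[0])
    max_a = max(abs(x) for row in A for x in row)
    max_b = max(abs(x) for row in B for x in row)
    return 2 * (m + k) * sys.float_info.epsilon * m * k * max_a * max_b


def abft_check(A, B, C, threshold):
    """Return indices of columns of C whose checksum mismatch exceeds threshold."""
    checksum_a = column_sums(A)  # e^T A, length k
    expected = [sum(checksum_a[t] * B[t][j] for t in range(len(B)))
                for j in range(len(B[0]))]  # (e^T A) B
    actual = column_sums(C)  # e^T C
    return [j for j, (e, a) in enumerate(zip(expected, actual))
            if abs(e - a) > threshold]


random.seed(0)
A = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(8)]  # 8 x 6
B = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(6)]  # 6 x 5
C = matmul(A, B)

tau = conservative_threshold(A, B)
clean_flags = abft_check(A, B, C, tau)   # fault-free result

C[1][2] += 1e-6                          # simulated silent data corruption
faulty_flags = abft_check(A, B, C, tau)  # column 2 should be flagged

print("threshold:", tau)
print("flags before fault:", clean_flags)
print("flags after fault:", faulty_flags)
```

The threshold controls the trade-off the abstract discusses: a looser threshold avoids false positives but hides small corruptions, so any error smaller than `tau` passes undetected in this sketch. A tighter, data-dependent threshold such as the one V-ABFT proposes would shrink that blind spot.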