StoSignSGD Resolves SignSGD Divergence in Non-Smooth LLM Optimization
StoSignSGD fixes SignSGD convergence failures on non-smooth objectives via unbiased stochasticity, yielding theoretical guarantees and up to 2.14× speedup in FP8 LLM training.
StoSignSGD injects structural stochasticity into the sign operator while keeping updates unbiased, resolving SignSGD's non-convergence on objectives containing ReLUs, max-pooling, and mixture-of-experts routing (arXiv:2604.15416).
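To make the "unbiased sign operator" idea concrete, here is a minimal sketch of one standard randomized-sign construction: each coordinate is sent to ±1 with probabilities chosen so the expectation recovers the gradient up to a known scale. This is an illustration of the general technique, not necessarily the exact operator in the paper; the function name `stochastic_sign` and the clipping scale are assumptions.

```python
import numpy as np

def stochastic_sign(g, scale, rng):
    """Randomized sign operator: maps each coordinate to +1 or -1 so that
    E[stochastic_sign(g)] = g / scale, i.e. the signed update is unbiased
    up to a known scaling (coordinates are clipped to [-scale, scale])."""
    g = np.clip(g, -scale, scale)
    p_plus = 0.5 * (1.0 + g / scale)  # per-coordinate probability of +1
    return np.where(rng.random(g.shape) < p_plus, 1.0, -1.0)

# Unbiasedness check: the empirical mean of many signed updates
# approaches g / scale, unlike the deterministic sign(g).
rng = np.random.default_rng(0)
g = np.array([0.3, -0.7, 0.0, 0.9])
samples = np.stack([stochastic_sign(g, 1.0, rng) for _ in range(100_000)])
print(samples.mean(axis=0))  # close to [0.3, -0.7, 0.0, 0.9]
```

The contrast with plain SignSGD is the key point: `np.sign(g)` would return exactly `[1, -1, 0, 1]` every step, a biased direction that can diverge on non-smooth objectives, whereas the randomized version is correct in expectation.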
In online convex optimization, StoSignSGD matches the lower-bound convergence rate; in non-convex, non-smooth settings, its guarantees under generalized stationarity measures improve on the best-known complexity bounds by removing dimension-dependent factors (arXiv:2604.15416). Bernstein et al. (2018) established sign-based compression baselines but left open the divergence on non-smooth problems that is addressed here (arXiv:1802.04434).
Empirically, StoSignSGD stays stable in FP8 pretraining where AdamW fails, delivering 1.44×–2.14× speedups, and outperforms both AdamW and SignSGD on 7B-parameter LLM mathematical-reasoning fine-tuning (arXiv:2604.15416). The accompanying sign-conversion framework turns arbitrary optimizers into unbiased sign-based counterparts, and ablations confirm the contribution of each StoSignSGD component (arXiv:2604.15416).
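One plausible reading of such a sign-conversion framework is as a wrapper: take any base optimizer's proposed update and replace it with a randomized sign whose expectation is proportional to that update. The sketch below is a hypothetical illustration under that assumption; the wrapper name `to_unbiased_sign`, the clipping scale, and the base-update interface are not from the paper (arXiv:2604.15416).

```python
import numpy as np

def to_unbiased_sign(update_fn, scale, rng):
    """Hypothetical sign-conversion wrapper: given a base update rule
    u = update_fn(grad), emit per-coordinate +/-1 steps whose expectation
    equals u / scale, so the sign-based optimizer is unbiased in the
    direction of the original optimizer's update."""
    def signed_update(grad):
        u = np.clip(update_fn(grad), -scale, scale)
        p_plus = 0.5 * (1.0 + u / scale)  # P[+1]; E[step] = u / scale
        return np.where(rng.random(u.shape) < p_plus, 1.0, -1.0)
    return signed_update

# Example: convert plain gradient descent into a sign-based counterpart.
rng = np.random.default_rng(0)
base_gd = lambda grad: -0.1 * grad           # base update: -lr * grad
sign_gd = to_unbiased_sign(base_gd, 1.0, rng)
step = sign_gd(np.array([1.0, -2.0, 0.0]))   # entries are +1 or -1
```

The same wrapper applies unchanged to momentum or Adam-style base updates, since it only consumes the proposed step; that modularity is presumably what lets a single conversion recipe cover arbitrary optimizers.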
AXIOM: StoSignSGD removes the bias that caused SignSGD to diverge on ReLU and MoE layers, allowing stable FP8 training runs that cut LLM pretraining costs by up to half.
Sources (2)
- [1] StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models (https://arxiv.org/abs/2604.15416)
- [2] signSGD: Compressed Optimisation for Non-Convex Problems (https://arxiv.org/abs/1802.04434)