StoSignSGD Resolves SignSGD Divergence in Non-Smooth LLM Optimization
StoSignSGD fixes SignSGD convergence failures on non-smooth objectives via unbiased stochasticity, yielding theoretical guarantees and up to 2.14× speedup in FP8 LLM training.
StoSignSGD injects structural stochasticity into the sign operator while keeping updates unbiased, resolving SignSGD's non-convergence on objectives containing ReLUs, max-pooling, and mixture-of-experts routing (arXiv:2604.15416).
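To make the "unbiased sign operator" idea concrete, here is a minimal sketch of one standard randomized-sign construction: each coordinate is sent to ±1 with probabilities chosen so the expectation recovers the gradient up to a known scale. This is an illustration of the general technique, not necessarily the exact operator in the paper; the function name `stochastic_sign` and the clipping scale are assumptions.

```python
import numpy as np

def stochastic_sign(g, scale, rng):
    """Randomized sign operator: maps each coordinate to +1 or -1 so that
    E[stochastic_sign(g)] = g / scale, i.e. the signed update is unbiased
    up to a known scaling (coordinates are clipped to [-scale, scale])."""
    g = np.clip(g, -scale, scale)
    p_plus = 0.5 * (1.0 + g / scale)  # per-coordinate probability of +1
    return np.where(rng.random(g.shape) < p_plus, 1.0, -1.0)

# Unbiasedness check: the empirical mean of many signed updates
# approaches g / scale, unlike the deterministic sign(g).
rng = np.random.default_rng(0)
g = np.array([0.3, -0.7, 0.0, 0.9])
samples = np.stack([stochastic_sign(g, 1.0, rng) for _ in range(100_000)])
print(samples.mean(axis=0))  # close to [0.3, -0.7, 0.0, 0.9]
```

The contrast with plain SignSGD is the key point: `np.sign(g)` would return exactly `[1, -1, 0, 1]` every step, a biased direction that can diverge on non-smooth objectives, whereas the randomized version is correct in expectation.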
In online convex optimization, StoSignSGD matches the lower-bound convergence rate; in non-convex, non-smooth settings, its guarantees under generalized stationarity measures improve on the best-known complexity bounds by removing dimension-dependent factors (arXiv:2604.15416). Bernstein et al. (2018) established sign-based compression baselines but left open the divergence on non-smooth problems that is addressed here (arXiv:1802.04434).
Empirically, StoSignSGD stays stable in FP8 pretraining where AdamW fails, delivering 1.44×–2.14× speedups, and outperforms both AdamW and SignSGD on 7B-parameter LLM mathematical-reasoning fine-tuning (arXiv:2604.15416). The accompanying sign-conversion framework turns arbitrary optimizers into unbiased sign-based counterparts, and ablations confirm the contribution of each StoSignSGD component (arXiv:2604.15416).
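One plausible reading of such a sign-conversion framework is as a wrapper: take any base optimizer's proposed update and replace it with a randomized sign whose expectation is proportional to that update. The sketch below is a hypothetical illustration under that assumption; the wrapper name `to_unbiased_sign`, the clipping scale, and the base-update interface are not from the paper (arXiv:2604.15416).

```python
import numpy as np

def to_unbiased_sign(update_fn, scale, rng):
    """Hypothetical sign-conversion wrapper: given a base update rule
    u = update_fn(grad), emit per-coordinate +/-1 steps whose expectation
    equals u / scale, so the sign-based optimizer is unbiased in the
    direction of the original optimizer's update."""
    def signed_update(grad):
        u = np.clip(update_fn(grad), -scale, scale)
        p_plus = 0.5 * (1.0 + u / scale)  # P[+1]; E[step] = u / scale
        return np.where(rng.random(u.shape) < p_plus, 1.0, -1.0)
    return signed_update

# Example: convert plain gradient descent into a sign-based counterpart.
rng = np.random.default_rng(0)
base_gd = lambda grad: -0.1 * grad           # base update: -lr * grad
sign_gd = to_unbiased_sign(base_gd, 1.0, rng)
step = sign_gd(np.array([1.0, -2.0, 0.0]))   # entries are +1 or -1
```

The same wrapper applies unchanged to momentum or Adam-style base updates, since it only consumes the proposed step; that modularity is presumably what lets a single conversion recipe cover arbitrary optimizers.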
AXIOM: StoSignSGD removes the bias that caused SignSGD to diverge on ReLU and MoE layers, allowing stable FP8 training runs that cut LLM pretraining costs by up to half.
Sources (2)
- [1] StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models (https://arxiv.org/abs/2604.15416)
- [2] signSGD: Compressed Optimisation for Non-Convex Problems (https://arxiv.org/abs/1802.04434)