Subcritical Signal Propagation Identified in Normalization-Free Transformers
Averaged-partial-Jacobian-norm (APJN) theory extended to transformers predicts subcritical stretched-exponential growth when LayerNorm is replaced by tanh-like nonlinearities, explaining the initialization sensitivity of DyT and Derf.
New theoretical analysis using averaged partial Jacobian norms shows that replacing LayerNorm with tanh-like nonlinearities in transformers leads to subcritical signal propagation at initialization.
Alekseev et al. (2026) extend APJN analysis to bidirectional attention and permutation-symmetric token configurations by deriving recurrence relations for activation statistics and APJNs across layers (https://arxiv.org/abs/2604.11890). The theory predicts that attention modifies the asymptotic APJN behavior at large depth, and it matches empirical APJNs measured in deep vision transformers. This carries over the criticality picture established for deep feedforward networks (Schoenholz et al., https://arxiv.org/abs/1611.01232).
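The layer-by-layer APJN recurrence can be illustrated numerically. The sketch below is not the paper's transformer derivation: it estimates the partial Jacobian norm of a toy tanh MLP at initialization via the chain rule, J^{l,0} = W_l diag(φ′(h^{l-1})) J^{l-1,0}, which is the same quantity the APJN framework tracks (function name, defaults, and the MLP stand-in are illustrative assumptions).

```python
import numpy as np

def apjn_profile(depth=20, width=256, sigma_w=1.0, seed=0):
    """Estimate averaged partial Jacobian norms (APJNs) per layer in a
    toy tanh MLP at initialization -- a simplified stand-in for the
    transformer analysis. Returns ||dh^l/dh^0||_F^2 / width per layer."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)      # h^0: input preactivation
    J = np.eye(width)                   # partial Jacobian dh^l/dh^0
    norms = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
        # chain rule: dh^l/dh^0 = W diag(tanh'(h^{l-1})) dh^{l-1}/dh^0
        J = W @ (np.diag(1.0 - np.tanh(h) ** 2) @ J)
        h = W @ np.tanh(h)              # h^l = W phi(h^{l-1})
        norms.append(np.linalg.norm(J, "fro") ** 2 / width)
    return norms
```

Plotting `apjn_profile()` against depth on log axes is how one would distinguish the regimes the paper describes: power-law growth appears linear on a log-log plot, while stretched-exponential growth does not.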
Pre-LayerNorm transformers exhibit power-law APJN growth, whereas those with LayerNorm replaced by elementwise tanh-like nonlinearities display stretched-exponential APJN growth, indicating subcriticality. Applied to Dynamic Tanh (DyT) and Dynamic erf (Derf) transformers, the framework explains their sensitivity to initialization and optimization choices (Alekseev et al., https://arxiv.org/abs/2604.11890). Related NF-Net work demonstrated stable training without normalization via careful initialization (Brock et al., https://arxiv.org/abs/2102.06171).
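For concreteness, DyT replaces each LayerNorm with a learnable elementwise squashing, y = γ ⊙ tanh(αx) + β; a minimal sketch is below (the `alpha0` default and the assumption that Derf takes the same form with erf in place of tanh are illustrative, not taken from the paper). The α initialization is exactly the kind of hyperparameter the APJN analysis flags as stability-critical.

```python
import math
import numpy as np

class DyT:
    """Dynamic Tanh: elementwise LayerNorm replacement,
    y = gamma * tanh(alpha * x) + beta (all parameters learnable)."""
    def __init__(self, dim, alpha0=0.5):
        self.alpha = alpha0            # scalar; its init governs stability
        self.gamma = np.ones(dim)      # per-channel scale
        self.beta = np.zeros(dim)      # per-channel shift

    def __call__(self, x):
        return self.gamma * np.tanh(self.alpha * x) + self.beta

class Derf(DyT):
    """Dynamic erf variant (assumed analogous form with erf for tanh)."""
    def __call__(self, x):
        erf = np.vectorize(math.erf)
        return self.gamma * erf(self.alpha * x) + self.beta
```

Because both nonlinearities saturate, their derivatives shrink for large preactivations, which is the mechanism behind the subcritical (stretched-exponential) APJN behavior described above.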
These results address a core challenge in signal propagation for normalization-free architectures: removing LayerNorm cuts its computational overhead in next-generation models, but stability then hinges on carefully tuned initialization hyperparameters.
AXIOM: Normalization-free transformers operate in a subcritical regime with stretched-exponential APJN growth, requiring precise initialization tuning to maintain stable gradients in deep models and enabling more efficient next-generation architectures.
Sources (3)
- [1] Subcritical Signal Propagation at Initialization in Normalization-Free Transformers (https://arxiv.org/abs/2604.11890)
- [2] Deep Information Propagation (https://arxiv.org/abs/1611.01232)
- [3] High-Performance Large-Scale Image Recognition Without Normalization (https://arxiv.org/abs/2102.06171)