Dimensional Criticality Detects Grokking Transition Across MLPs and Transformers
The TDU-OFC probe shows that the effective cascade dimension crosses D=1 at grokking, in a task-dependent direction, providing an early macroscopic signal absent from prior coverage.
Research from Wang (arXiv:2604.16431) introduces the TDU-OFC probe, which converts gradient snapshots into avalanche statistics and extracts an effective cascade dimension D(t). D(t) crosses the Gaussian baseline D=1 precisely at the generalization transition, both in Transformers on modular addition and in MLPs on XOR.
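To make the idea of turning a gradient snapshot into avalanche statistics concrete, here is a minimal sketch of a generic Olami-Feder-Christensen-style relaxation on a 1D ring. It is not the TDU-OFC probe from the paper: the mapping from gradients to site stresses (`stress_from_gradients`), the ring topology, and the parameters `alpha`, `threshold`, and `n_drives` are all assumptions chosen for illustration.

```python
def stress_from_gradients(grads, threshold=1.0):
    """Map gradient magnitudes to initial site stresses in [0, threshold).

    This normalization is an assumption made for illustration only.
    """
    mags = [abs(g) for g in grads]
    scale = max(mags) or 1.0  # avoid division by zero for an all-zero snapshot
    return [0.99 * threshold * m / scale for m in mags]


def ofc_avalanche_sizes(stress, alpha=0.2, threshold=1.0, n_drives=200):
    """Run OFC-style drive/relax cycles and return one avalanche size per drive.

    alpha is the fraction of a toppling site's stress passed to each of its
    two ring neighbours; alpha < 0.5 makes the dynamics dissipative, which
    guarantees that every avalanche terminates.
    """
    if not 0.0 < alpha < 0.5:
        raise ValueError("alpha must lie in (0, 0.5) for a dissipative ring")
    s = list(stress)
    n = len(s)
    sizes = []
    for _ in range(n_drives):
        # Uniform drive: lift every site until the most loaded one reaches threshold.
        i_max = max(range(n), key=s.__getitem__)
        gap = threshold - s[i_max]
        s = [x + gap for x in s]
        s[i_max] = threshold  # guard against floating-point rounding
        active = [i for i in range(n) if s[i] >= threshold]
        size = 0
        while active:
            i = active.pop()
            if s[i] < threshold:
                continue  # already relaxed by an earlier toppling
            give = alpha * s[i]
            s[i] = 0.0
            size += 1  # avalanche size = number of topplings in this drive
            for j in ((i - 1) % n, (i + 1) % n):
                s[j] += give
                if s[j] >= threshold:
                    active.append(j)
        sizes.append(size)
    return sizes


# Hypothetical flattened gradient snapshot; real snapshots would be far larger.
snapshot = [0.3, -1.2, 0.7, 0.05, -0.9, 0.4, 1.1, -0.2]
sizes = ofc_avalanche_sizes(stress_from_gradients(snapshot), n_drives=50)
```

The resulting `sizes` list is the raw material for an avalanche-size distribution. How the paper maps such statistics to D(t) is not reproduced here.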
The original grokking paper (Power et al., arXiv:2201.02177) documented delayed generalization but missed the macroscopic avalanche signature and the divergence of D(t) that begins 100-200 epochs before the transition. Nanda et al. (arXiv:2301.05217) traced mechanistic circuits in modular addition yet overlooked dimensional criticality and the task-dependent crossing directions, which are consistent with attraction to a shared critical manifold rather than trivial residence at D≈1.
Negative controls show that ungrokked runs remain supercritical (D>1). Avalanche distributions exhibit heavy tails, with finite-size scaling that matches the extracted dimensional exponent. This connects grokking to self-organized criticality in training dynamics, patterns that parallel Olami-Feder-Christensen models and remain poorly mapped despite recurring across scaling regimes.
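One common way to quantify a heavy tail is the maximum-likelihood estimate of a power-law exponent, alpha = 1 + n / sum(ln(x_i / x_min)). The sketch below applies that continuous-power-law formula to a list of avalanche sizes. It is an illustration only: the paper's own fitting and finite-size-scaling procedure may differ, and the names `powerlaw_exponent_mle`, `sample_powerlaw`, and `x_min` are hypothetical.

```python
import math
import random


def powerlaw_exponent_mle(sizes, x_min=1.0):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x_i / x_min)).

    Only samples with x >= x_min are used. For integer avalanche sizes this
    continuous formula is an approximation.
    """
    tail = [x for x in sizes if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum <= 0.0:
        raise ValueError("need at least one sample strictly above x_min")
    return 1.0 + len(tail) / log_sum


def sample_powerlaw(alpha, n, x_min=1.0, seed=0):
    """Draw n synthetic samples from p(x) ~ x^(-alpha), x >= x_min, by inverse transform."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


# Sanity check on synthetic data with a known exponent.
synthetic = sample_powerlaw(alpha=2.5, n=20000)
alpha_hat = powerlaw_exponent_mle(synthetic)
```

On synthetic data the estimate should land close to the exponent used for sampling, which is a useful check before applying the estimator to measured avalanche sizes.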
AXIOM: The crossing of dimensional criticality at grokking indicates that sudden generalization behaves as a self-organized critical transition; the early divergence of D(t) offers a macroscopic predictor that could generalize across architectures where weight-based or loss-based signals fail.
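Given a per-epoch series of D(t) values, locating the crossing of the D=1 baseline and its direction is straightforward. The following sketch assumes D(t) has already been computed by some probe; the function name `find_crossings` and the example series are hypothetical, not values from the paper.

```python
def find_crossings(d_series, baseline=1.0):
    """Return (epoch_index, direction) for each crossing of the baseline.

    A "downward" crossing goes from above the baseline to at or below it,
    an "upward" crossing from below to at or above it.
    """
    crossings = []
    for t in range(1, len(d_series)):
        prev = d_series[t - 1] - baseline
        cur = d_series[t] - baseline
        if prev < 0 <= cur:
            crossings.append((t, "upward"))
        elif prev > 0 >= cur:
            crossings.append((t, "downward"))
    return crossings


# Made-up series that starts supercritical (D > 1) and drops below the baseline.
print(find_crossings([1.4, 1.2, 1.05, 0.97, 0.9]))  # [(3, 'downward')]
```

Reporting the direction alongside the epoch matters here because the crossing direction is described as task-dependent.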
Sources (3)
- [1] Dimensional Criticality at Grokking Across MLPs and Transformers (https://arxiv.org/abs/2604.16431)
- [2] Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets (https://arxiv.org/abs/2201.02177)
- [3] Progress Measures for Grokking via Mechanistic Interpretability (https://arxiv.org/abs/2301.05217)