First-Principles Breakdown of DL Efficiency Targets Compute-Bound Regimes
Analysis of scaling bottlenecks points to lasting demand for systems engineering as compute growth outpaces memory bandwidth; ad-hoc optimizations fail unless the bottleneck regime is identified first.
Deep learning performance decomposes into three components: compute, memory bandwidth, and overhead. Optimization is viable only after identifying which regime is active, using profiling and hardware counters (He, 2022). Primary measurements show FLOPS doubling every 2.2 years against 3.4 years for memory bandwidth, widening the utilization gap on accelerators (Sutton, 2019). Tensor Core specialization further restricts peak throughput to matrix multiplications: on A100-class GPUs, non-matmul kernels are capped at 19.5 TFLOPS versus 312 TFLOPS for matmuls (NVIDIA, 2020).
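To make the doubling-time figures concrete, the sketch below compounds the two rates quoted above and reports how far compute pulls ahead of memory bandwidth over a given horizon. It assumes both trends continue as clean exponentials, which is a simplification; the function names are illustrative and do not come from the cited sources.

```python
def growth_factor(years: float, doubling_time: float) -> float:
    """Multiplicative growth after `years` for a quantity with a fixed doubling time."""
    return 2.0 ** (years / doubling_time)


def compute_to_bandwidth_gap(
    years: float,
    flops_doubling: float = 2.2,      # years per FLOPS doubling, as quoted in the text
    bandwidth_doubling: float = 3.4,  # years per bandwidth doubling, as quoted in the text
) -> float:
    """Ratio by which FLOPS growth exceeds memory-bandwidth growth after `years`."""
    return growth_factor(years, flops_doubling) / growth_factor(years, bandwidth_doubling)


if __name__ == "__main__":
    for horizon in (5, 10, 20):
        gap = compute_to_bandwidth_gap(horizon)
        print(f"after {horizon:>2} years: FLOPS per unit of bandwidth grows {gap:.2f}x")
```

Under these assumptions the ratio of available FLOPS to available bandwidth roughly triples over a decade, so a kernel that was balanced at the start of the period would need about three times the arithmetic intensity to stay compute-bound.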
AXIOM: Sustained DL progress requires explicit roofline tracking; without it, added FLOPS remain stranded behind memory walls.
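One way roofline tracking could look in practice is sketched below: a kernel's arithmetic intensity (FLOPs per byte moved) is compared against the accelerator's ridge point to decide whether it is compute-bound or memory-bound. The 312 and 19.5 TFLOPS peaks come from the text; the 1.555 TB/s memory bandwidth is an assumption (the A100 40 GB figure) and should be replaced with the value for the actual device. The class and function names are hypothetical, and the model ignores the overhead regime, which a roofline does not capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    peak_flops: float     # peak throughput in FLOP/s
    mem_bandwidth: float  # peak memory bandwidth in bytes/s

    @property
    def ridge_point(self) -> float:
        """Arithmetic intensity (FLOP/byte) at which compute and memory limits meet."""
        return self.peak_flops / self.mem_bandwidth


def attainable_flops(acc: Accelerator, intensity: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth slope."""
    return min(acc.peak_flops, acc.mem_bandwidth * intensity)


def regime(acc: Accelerator, intensity: float) -> str:
    """Classify a kernel by comparing its intensity with the ridge point."""
    return "compute-bound" if intensity >= acc.ridge_point else "memory-bound"


def square_matmul_intensity(n: int, bytes_per_elem: int = 2) -> float:
    """Idealized n x n matmul: 2n^3 FLOPs, each of three matrices moved once."""
    return (2 * n**3) / (3 * n**2 * bytes_per_elem)


# Assumed bandwidth: 1.555e12 bytes/s (A100 40 GB); peaks are the figures from the text.
A100_TENSOR = Accelerator(peak_flops=312e12, mem_bandwidth=1.555e12)
A100_FP32 = Accelerator(peak_flops=19.5e12, mem_bandwidth=1.555e12)

if __name__ == "__main__":
    # FP32 elementwise add: 1 FLOP per 12 bytes (two 4-byte reads, one 4-byte write).
    elementwise = 1 / 12
    matmul = square_matmul_intensity(4096)
    print("elementwise add:", regime(A100_FP32, elementwise))
    print("4096^3 matmul:  ", regime(A100_TENSOR, matmul))
```

The elementwise case sits far below either ridge point, so its attainable throughput is set by bandwidth rather than by the FLOPS figure on the datasheet, which is the sense in which added FLOPS stay stranded.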
Sources (3)
- [1] Making Deep Learning Go Brrrr from First Principles (https://horace.io/brrr_intro.html)
- [2] The Bitter Lesson (https://www.incompleteideas.net/IncIdeas/BitterLesson.html)
- [3] NVIDIA A100 Tensor Core GPU Architecture (https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet.pdf)