Dispatch Overhead Limits Pruned ViT Wall-Clock Gains
Dispatch-aware ragged attention kernels cut overhead 1.5x in pruned ViTs, translating FLOP savings into up to 2.24x measured throughput where standard variable-length APIs mask gains.
Token pruning methods for Vision Transformers reduce quadratic attention FLOPs yet deliver limited wall-clock latency improvements when executed via FlashAttention-2 varlen or PyTorch NestedTensor SDPA. Primary source traces the gap to host-side dispatch overhead of 60-90 μs dominating at post-pruning sequence lengths ≤197 tokens while matrix arithmetic completes in single-digit microseconds (https://arxiv.org/abs/2604.15408).
New bidirectional Triton kernel achieves 40 μs dispatch floor, 1.5× lower than FlashAttention-2 varlen, enabling visible pruning savings in a complete pack-attend-unpack pipeline (https://arxiv.org/abs/2307.08691). It yields up to 2.24× end-to-end throughput across Threshold-L2, DynamicViT, EViT and ATS on DeiT-T/S/B while producing bit-exact outputs with maximum absolute logit difference below 0.007 (https://arxiv.org/abs/2106.02034).
Prior pruning literature quantifies FLOP and parameter reductions but omits measured dispatch bottlenecks and kernel dispatch paths that govern real deployment latency on current inference stacks; the synthesized results demonstrate that systems-level kernel optimizations directly determine whether theoretical pruning benefits appear in production throughput.
AXIOM: Dispatch-aware ragged attention kernels expose and remove a critical 60-90us overhead barrier that has prevented most ViT pruning methods from delivering real-world speedups on existing variable-length attention backends.
Sources (3)
- [1]Dispatch-Aware Ragged Attention for Pruned Vision Transformers(https://arxiv.org/abs/2604.15408)
- [2]FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning(https://arxiv.org/abs/2307.08691)
- [3]DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification(https://arxiv.org/abs/2106.02034)