Making DL go brrr: memory bandwidth is the bottleneck, so make the most of your bandwidth usage aka avoid being memory bound by increasing compute intensity by fusing operations. (Also avoid being Python overhead bound)
Performance Regime | Plausible Solutions |
---|---|
Overhead-Bound | Tracing, Operator Fusion, don't use Python, a real JIT :^) |
Bandwidth-Bound | Operator Fusion |
Compute-Bound | Use Tensor Cores, give Nvidia more money |
From Great post on transformer FLOPs