Refs

Rafi Witten course using Jax and TPUs
Great reference post on transformer FLOPs

Making DL go brrr: memory bandwidth is the bottleneck, so make the most of your bandwidth usage aka avoid being memory bound by increasing compute intensity by fusing operations. (Also avoid being Python overhead bound)

Performance Regime	Plausible Solutions
Overhead-Bound	Tracing, Operator Fusion, don't use Python, a real JIT :^)
Bandwidth-Bound	Operator Fusion
Compute-Bound	Use Tensor Cores, give Nvidia more money

Inference’s bottleneck is memory bandwidth

Thinking about efficiency

Training: 6PD (for the 3 matmuls in fwd + bwd) is our expected TFLOP. Compute 6PD/time for a single step and see what percent of device’s max TFLOP/s this is (e.g. TPU v4 has 275TFLOP/s)
Inference throughput is based on how many sequences you can parallel process per chip. E.g. device memory (TPU v4-8 has 32GB*4=128GB) = weights (7B params = 14GB) + 0.94GB per sequence’s KV (for Gemma) * num sequences. Solve for num sequences ~= 121. For margin of error, round down to 110 which totals 117GB not the full 128GB. So load 117GB to generate 110 tokens! (source)
Balancing sharding (source)
- Training: usu FSDP with a bit of tensor parallel (FSDP can scale out a lot, but at some point you’re consuming too many tokens for a single step)
- Inference: usu tensor parallel—FSDP loads all params over ICI, and in inference we’re already bottlenecked on loading all params from HBM so loading from ICI would be even slower

Params vs FLOPs

From Great post on transformer FLOPs