General resources
Dims: compute, data, params
Estimations
Scaling laws
Plots
Formula
Data
Training mem: 14-18x params (source)
For comparison, for a 3B-parameter model such as “t5-3b” (see the sketch after this list):
- A standard AdamW optimizer needs 24GB of GPU memory because it uses 8 bytes for each parameter (8*3 => 24GB).
- The Adafactor optimizer needs a bit more than 12GB: it uses slightly more than 4 bytes per parameter, so 4*3 GB plus some extra.
- An 8-bit BNB (bitsandbytes) quantized optimizer uses only 6GB (2*3) if all optimizer states are quantized.
(source)
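A minimal sketch of the arithmetic above, counting optimizer states only (weights, gradients, activations, and framework overhead are excluded, which is part of why the full training footprint is much larger). The bytes-per-parameter values are the rough figures from the bullets, not measured numbers:

```python
# Back-of-the-envelope optimizer-state memory for a ~3B-parameter model,
# reproducing the arithmetic in the bullets above. Weights, gradients,
# activations, and framework overhead are NOT included here.

PARAMS = 3e9  # e.g. "t5-3b"

# Approximate optimizer-state bytes per parameter (values quoted above).
OPTIMIZER_BYTES_PER_PARAM = {
    "AdamW (two fp32 states)": 8,
    "Adafactor": 4,            # "slightly more than 4" in practice
    "bitsandbytes 8-bit": 2,   # if all states are quantized
}

for name, bytes_per_param in OPTIMIZER_BYTES_PER_PARAM.items():
    gb = PARAMS * bytes_per_param / 1e9
    print(f"{name:25s} ~{gb:.0f} GB of optimizer state")
```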
Networking hardware: see Interconnects in GPU computing
GPT-NeoX achieves 150 TFLOP/s/A100 with normal attention and 180 TFLOP/s/A100 with Flash Attention. This is in line with other highly optimized libraries at scale; for example, Megatron-DS reports between 137 and 163 TFLOP/s/A100.
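For context, a rough utilization sketch against the A100's dense BF16/FP16 tensor-core peak of ~312 TFLOP/s (the peak figure is assumed from NVIDIA's published spec, not taken from the source above):

```python
# Rough hardware utilization implied by the reported TFLOP/s/A100 numbers,
# assuming an A100 dense BF16/FP16 tensor-core peak of ~312 TFLOP/s.

A100_PEAK_TFLOPS = 312

reported = {
    "GPT-NeoX, normal attention": 150,
    "GPT-NeoX, Flash Attention": 180,
    "Megatron-DS, low end": 137,
    "Megatron-DS, high end": 163,
}

for setup, tflops in reported.items():
    print(f"{setup:28s} {tflops / A100_PEAK_TFLOPS:.0%} of peak")
```

By this arithmetic, all of the reported figures land in roughly the 45-60%-of-peak range.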
Inference expense/hardware
Tokens