• General resources

    • https://blog.eleuther.ai/transformer-math/: training focused
    • https://kipp.ly/transformer-inference-arithmetic/: inference focused
    • https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-language-model-training-3b19c1f025e4: 6PD
    • https://github.com/stas00/ml-engineering/blob/master/insights/ai-battlefield.md
  • Dims: compute, data, params

  • Estimations

    • Params: $\approx 12\, n_{layer}\, d_m^2$ ($4 d_m^2$ per layer for the attention Q/K/V and output projections, plus $2 \times 4\, d_m^2$ for the widening MLP up/down projections)
    • FLOPs ($P$ = parameter count), ignoring the attention-score FLOPs (assumed small); see the estimator sketch after this list
      • Inference/forward: $2P$ per token (1 multiply + 1 add per parameter)
      • Backward: $4P$ per token, so training costs $6P$ per token total (fwd + bwd)
      • One epoch: $6PD$ ($D$ = number of training tokens)
      • Forward compute with activation recomputation (selective checkpointing): between $2PD$ (little or no recompute) and $4PD$ (full recompute, i.e. a second forward pass)
    • Memory
      • Model weights: usually $2P$ bytes with 2-byte (fp16/bf16) numbers; int8 inference gets this down to $P$ bytes
      • Inference: roughly the weights plus ~20% overhead (source)
      • Training (memory is usually the bottleneck)
        • Total = model weights + optimizer states + activations + gradients
        • See full details
      • Optimizers
        • For vanilla AdamW, $\text{memory}_{\text{optimizer}} = (12 \text{ bytes/param}) \cdot (\text{No. params})$
          • fp32 copy of parameters: 4 bytes/param
          • Momentum: 4 bytes/param
          • Variance: 4 bytes/param
        • For 8-bit optimizers like bitsandbytes, $\text{memory}_{\text{optimizer}} = (6 \text{ bytes/param}) \cdot (\text{No. params})$
          • fp32 copy of parameters: 4 bytes/param
          • Momentum: 1 byte/param
          • Variance: 1 byte/param
        • For SGD-like optimizers with momentum, $\text{memory}_{\text{optimizer}} = (8 \text{ bytes/param}) \cdot (\text{No. params})$
          • fp32 copy of parameters: 4 bytes/param
          • Momentum: 4 bytes/param
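
    The rules of thumb above fold into a small back-of-envelope calculator. A minimal Python sketch, assuming a GPT-style decoder, bf16 weights and gradients, and the bytes/param figures listed above; the 32-layer / 4096-dim config is just illustrative:

```python
def param_count(n_layer: int, d_model: int) -> int:
    """~12 * n_layer * d_model^2 (attention Q/K/V + output = 4 d^2, MLP up/down = 8 d^2)."""
    return 12 * n_layer * d_model**2

def train_flops(params: float, tokens: float) -> float:
    """6PD: 2PD forward + 4PD backward, assuming no activation recomputation."""
    return 6 * params * tokens

def train_memory_bytes(params: float, optimizer: str = "adamw") -> float:
    """Weights + gradients + optimizer states; ignores activation memory,
    which depends on batch size, sequence length, and checkpointing."""
    bytes_per_param = {
        "adamw": 2 + 2 + 12,       # bf16 weights + bf16 grads + (fp32 copy, momentum, variance)
        "adamw_8bit": 2 + 2 + 6,   # 8-bit momentum and variance
        "sgd_momentum": 2 + 2 + 8,
    }[optimizer]
    return bytes_per_param * params

P = param_count(n_layer=32, d_model=4096)              # ~6.4e9 params
print(f"params      ~ {P / 1e9:.1f} B")
print(f"train FLOPs ~ {train_flops(P, 1.4e12):.2e}")   # for 1.4T tokens
print(f"train mem   ~ {train_memory_bytes(P) / 1e9:.0f} GB (AdamW, excl. activations)")
```
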
  • Scaling laws

    • [2001.08361] Scaling Laws for Neural Language Models, OpenAI 2020
      • Plots

        (Figure 1 from the paper: test loss improves as a power law in compute, dataset size, and parameter count)

        • Right: train models of various sizes on a large dataset for a long time (so data and compute are not bottlenecks)
        • Middle: train a large model on datasets of various sizes with early stopping (so params and compute are not bottlenecks)
        • Left (the most involved): train models of various sizes on a large dataset and plot the loss curves over training, i.e. against compute
      • Formula

        $L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\alpha_N / \alpha_D} + \frac{D_c}{D} \right]^{\alpha_D}$

        • $L$: cross-entropy loss
        • $N$: number of model parameters, excluding embedding parameters
        • $D$: dataset size in number of tokens (byte-pair encoded)
        • $N_c, D_c$: fitted constants
        • If $D \to \infty$, the second term goes to 0 and we get a power law in $N$: with effectively unlimited data, training a larger model reduces the cross-entropy loss.
        • If $N \to \infty$, we get a power law in $D$: with a sufficiently large model, increasing the dataset size reduces the cross-entropy loss.
        • Since $\alpha_D$ (0.095) is larger than $\alpha_N$ (0.076), for the same compute budget it is more important to increase model size than dataset size: keeping the two terms balanced requires $D \propto N^{\alpha_N / \alpha_D} \approx N^{0.8}$, so data should grow more slowly than parameters (see the numerical sketch after this list).
        • As the compute budget increases, it is more important to increase the model size compared to dataset size or training time.
        • Larger models are more sample-efficient: they reach a given loss with fewer tokens / optimization steps
      • https://www.youtube.com/watch?v=UFem7xa3Q2Q

    • Chinchilla: A New AI Trend: Chinchilla (70B) Greatly Outperforms GPT-3 (175B) and Gopher (280B) | by Alberto Romero | Towards Data Science
      • https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla
      • Data, not size, is the currently active constraint on language modeling performance
      • The entire available quantity of data in highly specialized domains like code is woefully tiny
      • Chinchilla optimal models are optimized for reducing training expense, not inference efficiency https://news.ycombinator.com/item?id=34987435
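
    A small numerical sketch of the $L(N, D)$ fit, using the exponents and constants reported in the paper ($\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, $N_c \approx 8.8 \times 10^{13}$, $D_c \approx 5.4 \times 10^{13}$); the example sizes are arbitrary:

```python
# Kaplan et al. (2020) joint fit: L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D
ALPHA_N, ALPHA_D = 0.076, 0.095   # fitted exponents
N_C, D_C = 8.8e13, 5.4e13         # fitted constants

def kaplan_loss(n_params: float, n_tokens: float) -> float:
    """Predicted cross-entropy loss for N non-embedding params and D training tokens."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

# Limiting power laws: D -> inf gives L ~ (N_c/N)^alpha_N, N -> inf gives L ~ (D_c/D)^alpha_D
for n in (1e8, 1e9, 1e10):
    print(f"D=inf, N={n:.0e}: L ~ {(N_C / n) ** ALPHA_N:.2f}")
for d in (1e10, 1e11, 1e12):
    print(f"N=inf, D={d:.0e}: L ~ {(D_C / d) ** ALPHA_D:.2f}")

# Balancing the two terms requires D ~ N^(alpha_N/alpha_D) ~ N^0.8, i.e. grow N faster than D.
# At fixed N, 10x more data buys only a modest predicted improvement:
print(kaplan_loss(1e9, 1e10), kaplan_loss(1e9, 1e11))   # ~2.48 vs ~2.39
```
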
  • Data

    • GPT-3 was trained on 3e11 tokens ~ 2e11 words, so 100X more
  • Training memory: roughly 14-18 bytes per parameter (source)

    For comparison, for a 3B-parameter model, like “t5-3b”:

    • A standard AdamW optimizer will need 24GB of GPU memory because it uses 8 bytes for each parameter (8*3 => 24GB)
    • Adafactor optimizer will need more than 12GB. It uses slightly more than 4 bytes for each parameter, so 4*3 and then some extra.
    • 8bit BNB quantized optimizer will use only (2*3) 6GB if all optimizer states are quantized.

    (source)
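
    The same arithmetic as a tiny sketch (bytes/param are the figures quoted above; billions of params × bytes/param gives GB directly):

```python
# Optimizer-state memory only (excludes weights, gradients, activations).
# Note: the 8 bytes/param here counts Adam's two fp32 states; the fp32 master
# copy of the weights in the 12 bytes/param figure above would be extra.
PARAMS_B = 3  # e.g. t5-3b, in billions of parameters
for name, bytes_per_param in [("AdamW", 8), ("Adafactor", 4), ("8-bit BNB", 2)]:
    print(f"{name:>10}: ~{bytes_per_param * PARAMS_B} GB")
```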

  • Networking hardware: see Interconnects in GPU computing

  • GPT-NeoX achieves 150 TFLOP/s/A100 with normal attention and 180 TFLOP/s/A100 with Flash Attention. This is in line with other highly optimized libraries at scale, for example Megatron-DS reports between 137 and 163 TFLOP/s/A100.
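
    A quick utilization check on those throughput numbers, assuming the A100's ~312 TFLOP/s dense bf16/fp16 tensor-core peak:

```python
A100_PEAK_TFLOPS = 312  # dense bf16/fp16 peak, no structured sparsity
for name, achieved in [("GPT-NeoX, normal attention", 150),
                       ("GPT-NeoX, Flash Attention", 180),
                       ("Megatron-DS, low end", 137),
                       ("Megatron-DS, high end", 163)]:
    print(f"{name:>28}: {achieved / A100_PEAK_TFLOPS:.0%} of peak")
```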

  • Inference expense/hardware

    • The Q6_K-quantized version of Mistral 7B requires ~8GB of RAM (source)
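
    A rough sketch of where that figure comes from: weight memory ≈ params × bits-per-weight / 8, plus KV cache and runtime overhead. The ~6.56 bits/weight for Q6_K and the ~7.2B parameter count are approximations:

```python
def quantized_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the quantized weights alone, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

# ~5.9 GB of weights; KV cache + runtime overhead pushes the total toward ~8 GB
print(quantized_weight_gb(7.2e9, 6.56))
```
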
  • Tokens

    • Chinchilla is a 70B parameters model trained as a compute-optimal model with 1.4 trillion tokens (20 tokens per param)
    • GPT-3: 175B parameters. 300B tokens.
    • Phi-2: 2.7B parameters. 1400B tokens.
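
    The implied tokens-per-parameter ratios, as a quick sketch (the Chinchilla-optimal rule of thumb is ~20):

```python
for name, params_b, tokens_b in [("Chinchilla", 70, 1400),
                                 ("GPT-3", 175, 300),
                                 ("Phi-2", 2.7, 1400)]:
    print(f"{name:>10}: {tokens_b / params_b:.1f} tokens/param")
```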