• General resources

    • https://blog.eleuther.ai/transformer-math/: training focused
    • https://kipp.ly/transformer-inference-arithmetic/: inference focused
    • https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-language-model-training-3b19c1f025e4: 6PD
    • https://github.com/stas00/ml-engineering/blob/master/insights/ai-battlefield.md
  • Dims: compute, data, params

  • Estimations

    • Params: $\approx 12\, n_{layer}\, d_m^2$ ($4 d_m^2$ per layer for the attention Q/K/V and output projections, plus $2 \times 4\, d_m^2$ for the widening MLP up/down projections)
    • FLOPs ($P$ = parameter count), ignoring the attention-score FLOPs (assumed small); see the estimator sketch after this list
      • Inference/forward: $2P$ per token (1 multiply + 1 add per parameter)
      • Backward: $4P$ per token, so training costs $6P$ per token total (fwd + bwd)
      • One epoch: $6PD$ ($D$ = number of training tokens)
      • Forward compute with activation recomputation (selective checkpointing): between $2PD$ (little or no recompute) and $4PD$ (full recompute, i.e. a second forward pass)
    • Memory
      • Model weights: usually $2P$ bytes with 2-byte (fp16/bf16) numbers; int8 inference gets this down to $P$ bytes
      • Inference: roughly the weights plus ~20% overhead (source)
      • Training (memory is usually the bottleneck)
        • Total = model weights + optimizer states + activations + gradients
        • See full details
      • Optimizers
        • For vanilla AdamW, $\text{memory}_{\text{optimizer}} = (12 \text{ bytes/param}) \cdot (\text{No. params})$
          • fp32 copy of parameters: 4 bytes/param
          • Momentum: 4 bytes/param
          • Variance: 4 bytes/param
        • For 8-bit optimizers like bitsandbytes, $\text{memory}_{\text{optimizer}} = (6 \text{ bytes/param}) \cdot (\text{No. params})$
          • fp32 copy of parameters: 4 bytes/param
          • Momentum: 1 byte/param
          • Variance: 1 byte/param
        • For SGD-like optimizers with momentum, $\text{memory}_{\text{optimizer}} = (8 \text{ bytes/param}) \cdot (\text{No. params})$
          • fp32 copy of parameters: 4 bytes/param
          • Momentum: 4 bytes/param
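
    The rules of thumb above fold into a small back-of-envelope calculator. A minimal Python sketch, assuming a GPT-style decoder, bf16 weights and gradients, and the bytes/param figures listed above; the 32-layer / 4096-dim config is just illustrative:

```python
def param_count(n_layer: int, d_model: int) -> int:
    """~12 * n_layer * d_model^2 (attention Q/K/V + output = 4 d^2, MLP up/down = 8 d^2)."""
    return 12 * n_layer * d_model**2

def train_flops(params: float, tokens: float) -> float:
    """6PD: 2PD forward + 4PD backward, assuming no activation recomputation."""
    return 6 * params * tokens

def train_memory_bytes(params: float, optimizer: str = "adamw") -> float:
    """Weights + gradients + optimizer states; ignores activation memory,
    which depends on batch size, sequence length, and checkpointing."""
    bytes_per_param = {
        "adamw": 2 + 2 + 12,       # bf16 weights + bf16 grads + (fp32 copy, momentum, variance)
        "adamw_8bit": 2 + 2 + 6,   # 8-bit momentum and variance
        "sgd_momentum": 2 + 2 + 8,
    }[optimizer]
    return bytes_per_param * params

P = param_count(n_layer=32, d_model=4096)              # ~6.4e9 params
print(f"params      ~ {P / 1e9:.1f} B")
print(f"train FLOPs ~ {train_flops(P, 1.4e12):.2e}")   # for 1.4T tokens
print(f"train mem   ~ {train_memory_bytes(P) / 1e9:.0f} GB (AdamW, excl. activations)")
```
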
  • Scaling laws

    • [2001.08361] Scaling Laws for Neural Language Models, OpenAI 2020
      • Plots

        (Figure 1 from the paper: test loss improves as a power law in compute, dataset size, and parameter count)

        • Right: train models of various sizes on a large dataset for a long time (so data and compute are not bottlenecks)
        • Middle: train a large model on datasets of various sizes with early stopping (so params and compute are not bottlenecks)
        • Left (the most involved): train models of various sizes on a large dataset and plot the loss curves over training, i.e. against compute
      • Formula

        $L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\alpha_N / \alpha_D} + \frac{D_c}{D} \right]^{\alpha_D}$

        • $L$: cross-entropy loss
        • $N$: number of model parameters, excluding embedding parameters
        • $D$: dataset size in number of tokens (byte-pair encoded)
        • $N_c, D_c$: fitted constants
        • If $D \to \infty$, the second term goes to 0 and we get a power law in $N$: with effectively unlimited data, training a larger model reduces the cross-entropy loss.
        • If $N \to \infty$, we get a power law in $D$: with a sufficiently large model, increasing the dataset size reduces the cross-entropy loss.
        • Since $\alpha_D$ (0.095) is larger than $\alpha_N$ (0.076), for the same compute budget it is more important to increase model size than dataset size: keeping the two terms balanced requires $D \propto N^{\alpha_N / \alpha_D} \approx N^{0.8}$, so data should grow more slowly than parameters (see the numerical sketch after this list).
        • As the compute budget increases, it is more important to increase the model size compared to dataset size or training time.
        • Larger models are more sample-efficient: they reach a given loss with fewer tokens / optimization steps
      • https://www.youtube.com/watch?v=UFem7xa3Q2Q

    • Chinchilla: A New AI Trend: Chinchilla (70B) Greatly Outperforms GPT-3 (175B) and Gopher (280B) | by Alberto Romero | Towards Data Science
      • https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla
      • Data, not size, is the currently active constraint on language modeling performance
      • The entire available quantity of data in highly specialized domains like code is woefully tiny
      • Chinchilla optimal models are optimized for reducing training expense, not inference efficiency https://news.ycombinator.com/item?id=34987435
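
    A small numerical sketch of the $L(N, D)$ fit, using the exponents and constants reported in the paper ($\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, $N_c \approx 8.8 \times 10^{13}$, $D_c \approx 5.4 \times 10^{13}$); the example sizes are arbitrary:

```python
# Kaplan et al. (2020) joint fit: L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D
ALPHA_N, ALPHA_D = 0.076, 0.095   # fitted exponents
N_C, D_C = 8.8e13, 5.4e13         # fitted constants

def kaplan_loss(n_params: float, n_tokens: float) -> float:
    """Predicted cross-entropy loss for N non-embedding params and D training tokens."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

# Limiting power laws: D -> inf gives L ~ (N_c/N)^alpha_N, N -> inf gives L ~ (D_c/D)^alpha_D
for n in (1e8, 1e9, 1e10):
    print(f"D=inf, N={n:.0e}: L ~ {(N_C / n) ** ALPHA_N:.2f}")
for d in (1e10, 1e11, 1e12):
    print(f"N=inf, D={d:.0e}: L ~ {(D_C / d) ** ALPHA_D:.2f}")

# Balancing the two terms requires D ~ N^(alpha_N/alpha_D) ~ N^0.8, i.e. grow N faster than D.
# At fixed N, 10x more data buys only a modest predicted improvement:
print(kaplan_loss(1e9, 1e10), kaplan_loss(1e9, 1e11))   # ~2.48 vs ~2.39
```
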
  • Data

    • GPT-3 was trained on 3e11 tokens ~ 2e11 words, so 100X more
  • Training memory: roughly 14-18 bytes per parameter (source)

    For comparison, for a 3B-parameter model, like “t5-3b”:

    • A standard AdamW optimizer will need 24GB of GPU memory because it uses 8 bytes for each parameter (8*3 => 24GB)
    • Adafactor optimizer will need more than 12GB. It uses slightly more than 4 bytes for each parameter, so 4*3 and then some extra.
    • 8bit BNB quantized optimizer will use only (2*3) 6GB if all optimizer states are quantized.

    (source)
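
    The same arithmetic as a tiny sketch (bytes/param are the figures quoted above; billions of params × bytes/param gives GB directly):

```python
# Optimizer-state memory only (excludes weights, gradients, activations).
# Note: the 8 bytes/param here counts Adam's two fp32 states; the fp32 master
# copy of the weights in the 12 bytes/param figure above would be extra.
PARAMS_B = 3  # e.g. t5-3b, in billions of parameters
for name, bytes_per_param in [("AdamW", 8), ("Adafactor", 4), ("8-bit BNB", 2)]:
    print(f"{name:>10}: ~{bytes_per_param * PARAMS_B} GB")
```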

  • Networking hardware: see Interconnects in GPU computing

  • GPT-NeoX achieves 150 TFLOP/s/A100 with normal attention and 180 TFLOP/s/A100 with Flash Attention. This is in line with other highly optimized libraries at scale, for example Megatron-DS reports between 137 and 163 TFLOP/s/A100.
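
    A quick utilization check on those throughput numbers, assuming the A100's ~312 TFLOP/s dense bf16/fp16 tensor-core peak:

```python
A100_PEAK_TFLOPS = 312  # dense bf16/fp16 peak, no structured sparsity
for name, achieved in [("GPT-NeoX, normal attention", 150),
                       ("GPT-NeoX, Flash Attention", 180),
                       ("Megatron-DS, low end", 137),
                       ("Megatron-DS, high end", 163)]:
    print(f"{name:>28}: {achieved / A100_PEAK_TFLOPS:.0%} of peak")
```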

  • Inference expense/hardware

    • The Q6_K-quantized version of Mistral 7B requires ~8GB of RAM (source)
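
    A rough sketch of where that figure comes from: weight memory ≈ params × bits-per-weight / 8, plus KV cache and runtime overhead. The ~6.56 bits/weight for Q6_K and the ~7.2B parameter count are approximations:

```python
def quantized_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the quantized weights alone, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

# ~5.9 GB of weights; KV cache + runtime overhead pushes the total toward ~8 GB
print(quantized_weight_gb(7.2e9, 6.56))
```
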
  • Tokens

    • Chinchilla is a 70B parameters model trained as a compute-optimal model with 1.4 trillion tokens (20 tokens per param)
    • GPT-3: 175B parameters. 300B tokens.
    • Phi-2: 2.7B parameters. 1400B tokens.
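
    The implied tokens-per-parameter ratios, as a quick sketch (the Chinchilla-optimal rule of thumb is ~20):

```python
for name, params_b, tokens_b in [("Chinchilla", 70, 1400),
                                 ("GPT-3", 175, 300),
                                 ("Phi-2", 2.7, 1400)]:
    print(f"{name:>10}: {tokens_b / params_b:.1f} tokens/param")
```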