See also
General resources
DDP: simple, but requires the full model on each node (minimal sketch after this list)
Model parallelism: ambiguous term, refers to either pipeline or tensor parallelism, usually tensor
Pipeline parallelism: inter-layer
Tensor parallelism: intra-layer, chatty
3D parallelism: DP, MP, PP combined
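To make the DDP item above concrete, a minimal PyTorch sketch (assuming a torchrun launch; the module, sizes, and the script name ddp_min.py are made up for illustration):
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# launch with: torchrun --nproc_per_node=8 ddp_min.py
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).to(local_rank)  # full model copy on every rank
model = DDP(model, device_ids=[local_rank])         # hooks grad all-reduce into backward

x = torch.randn(32, 1024, device=local_rank)        # each rank loads different data
model(x).sum().backward()                           # grads are now averaged across ranks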
Maybe helpful:
ZeRO/FSDP: partitions optimizer state, weights, gradients
Sequence parallelism (e.g. in DS Ulysses): partition the input sequence across devices. One variant uses the ring self-attention algorithm, computing attention by passing keys/values around a ring so every device eventually sees the full sequence (Ulysses itself instead uses all-to-alls to re-partition by attention head); see the sketch after this list
Expert choice routing: experts pick their top tokens rather than tokens picking experts, so no aux load-balancing loss is needed (sketch below)
Sinkhorn routing https://arxiv.org/pdf/2202.01169.pdf
MoE review https://arxiv.org/abs/2209.01667
What if we didn't use load balancing? https://arxiv.org/pdf/2103.16716.pdf
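A sketch of the expert-choice idea from above (my own illustration, not code from any of the linked papers; names and shapes are made up):
import torch
import torch.nn.functional as F

def expert_choice_route(logits: torch.Tensor, capacity: int):
    # logits: [tokens, experts] gate scores.
    # Each expert picks its top-`capacity` tokens, so load is balanced by
    # construction and no auxiliary loss is needed. Sketch only: ignores
    # tokens that end up picked by no expert.
    scores = F.softmax(logits, dim=-1)                 # token-to-expert affinities
    weights, token_idx = scores.topk(capacity, dim=0)  # [capacity, experts]
    return weights, token_idx  # expert e processes tokens token_idx[:, e]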
MoE: mixture of experts
Switch Transformers: simplified top-1 routing
DeepSpeed: allows tensor parallelism within each expert ("expert slices"). Their Figure 3 shows how everything comes together. See also Megatron-DeepSpeed.
DeepSpeed TED: uses 2D parallelism (data, tensor) for the non-expert/attention parts and 3D (expert-data, expert, tensor) for the expert/MLP parts. Arranged so the DP and EP groups line up: e.g. with TP=2, EP=4, EDP=2 (hence DP = EP × EDP = 8, world size TP × DP = 16), each of the 2 TP ranks within a DP group maps onto the matching TP rank in each of the 4 EP groups of its EDP group.
Megablocks: key insight is to cast the MoE expert computation as a block-sparse matmul (avoids token dropping/padding)
Sequence parallelism/context parallelism: doesn't partition any model state (weights/optimizer), so it's closer to data parallelism (pure compute parallelism), but with the communication of tensor parallelism (all-to-all/ring exchanges are needed for attention)
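A single-process simulation of the ring self-attention idea from the sequence-parallelism item above (numpy, non-causal, no real communication; function name and shapes are made up, and a real implementation overlaps the ring send/recv with compute):
import numpy as np

def ring_attention_sim(Q, K, V, P):
    # Split the sequence across P "workers"; K/V chunks rotate around the
    # ring while each worker folds them into an online-softmax accumulator.
    d = Q.shape[1]
    Qs, Ks, Vs = (np.array_split(X, P) for X in (Q, K, V))
    out = []
    for i in range(P):                      # worker i owns Q chunk i
        q = Qs[i]
        m = np.full((len(q), 1), -np.inf)   # running row max
        l = np.zeros((len(q), 1))           # running softmax denominator
        o = np.zeros_like(q)                # running unnormalized output
        for step in range(P):
            j = (i + step) % P              # K/V chunk held after `step` hops
            s = q @ Ks[j].T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=1, keepdims=True))
            scale = np.exp(m - m_new)       # rescale old partial sums
            p = np.exp(s - m_new)
            l = l * scale + p.sum(axis=1, keepdims=True)
            o = o * scale + p @ Vs[j]
            m = m_new
        out.append(o / l)
    return np.vstack(out)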
How parallel groups work, worked out
World size must be the product of all parallel group sizes; data parallelism is often the free variable, auto-inferred from the others
Think of each parallel dimension as one axis of a multi-dimensional grid. Its size is the number of values along that axis, so the total world size is the full cross product of the axis sizes.
Confusing terminology: parallel size = group size = (qualified) world size (like “expert world size” just means “expert group size”)
So with 8 GPUs, a TP size of 2 means there's only TP=0 and TP=1. The actual TP groups might be {0,1}, {2,3}, {4,5}, {6,7}. If DP is the other dimension, then DP has size 4, with DP=0 through DP=3; the actual DP groups are {0,2,4,6} and {1,3,5,7}. Gradients are reduced within each DP group.
With DP size 1 and TP size 4, the DP groups are just the singletons {0}, {1}, {2}, {3}, and the single TP group is {0,1,2,3}. Every rank has DP index 0, so every rank receives the same data (only ranks at different DP indices would receive different data). Gradients are still reduced within each DP group, but the groups are singletons, so there are no reductions across GPUs.
If it helps, think of each parallel dimension independently, as if it were the only parallelism.
e.g. TP on its own: 0 1; DP on its own: 0 1 2 3
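The grid picture in code, as a toy check of the TP=2/DP=4 example above (my own illustration):
import numpy as np

dp_size, tp_size = 4, 2          # 8 GPUs as a (dp, tp) grid, tp fastest-varying
for rank in range(dp_size * tp_size):
    dp, tp = np.unravel_index(rank, (dp_size, tp_size))
    print(f"rank {rank} -> dp {dp}, tp {tp}")
# TP groups = rows:    [0,1], [2,3], [4,5], [6,7]
# DP groups = columns: [0,2,4,6], [1,3,5,7]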
With tensor and pipeline together, there are multiple possible placements, but since tensor parallelism is chattier than pipeline, you'll want to keep each tensor group on the same host (assuming adjacent rank numbers are more likely to land on the same host). With TP=2 and PP=4 you need at least 2 * 4 = 8 GPUs.
pipeline ->
0 1 2 3
4 5 6 7
(tensor/model dimension goes down)
or better for locality (each tensor pair adjacent, hence same host):
0 2 4 6
1 3 5 7
i.e., mapping from (pipeline parallel rank, tensor parallel rank) to GPU:
pp0 tp0 -> 0
pp0 tp1 -> 1
pp1 tp0 -> 2
pp1 tp1 -> 3
pp2 tp0 -> 4
...
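The two placements as formulas (toy code, my own; naive is the first grid above, local the locality-friendly one):
tp_size, pp_size = 2, 4

naive = lambda pp, tp: tp * pp_size + pp  # tp groups {0,4},{1,5},... span hosts
local = lambda pp, tp: pp * tp_size + tp  # tp groups {0,1},{2,3},... stay adjacent

for pp in range(pp_size):
    for tp in range(tp_size):
        print(f"pp{pp} tp{tp} -> naive {naive(pp, tp)}, local {local(pp, tp)}")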
With tensor and data, you can now have:
dp0 tp0: 0
dp0 tp1: 1
dp1 tp0: 2
dp1 tp1: 3
With tensor, pipeline, data, you can now have 16:
         pp0  pp1  pp2  pp3
dp0 tp0    0    4    8   12
dp0 tp1    1    5    9   13
dp1 tp0    2    6   10   14
dp1 tp1    3    7   11   15
How to determine the groups?
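One way, sketched in plain Python (my own sketch, not Megatron's code; assumes tp fastest-varying, then dp, then pp, which matches the tables above and reproduces the Megatron comment quoted further down):
def build_groups(tp_size, dp_size, pp_size):
    # rank = pp * (dp_size * tp_size) + dp * tp_size + tp
    rank = lambda pp, dp, tp: (pp * dp_size + dp) * tp_size + tp
    tp_groups = [[rank(pp, dp, tp) for tp in range(tp_size)]
                 for pp in range(pp_size) for dp in range(dp_size)]
    dp_groups = [[rank(pp, dp, tp) for dp in range(dp_size)]
                 for pp in range(pp_size) for tp in range(tp_size)]
    pp_groups = [[rank(pp, dp, tp) for pp in range(pp_size)]
                 for dp in range(dp_size) for tp in range(tp_size)]
    return tp_groups, dp_groups, pp_groups

# build_groups(2, 2, 4) gives exactly the 16-GPU groups in the Megatron comment below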
Now let’s add expert parallelism (and ignore PP). How does this work?
dp0 tp0 ep0 0
dp0 tp0 ep1 1
dp0 tp1 ep0 2
dp0 tp1 ep1 3
dp1 tp0 ep0 4
dp1 tp0 ep1 5
dp1 tp1 ep0 6
dp1 tp1 ep1 7
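The same enumeration generated mechanically (toy code; ep fastest-varying, then tp, then dp):
import itertools

dp_size, tp_size, ep_size = 2, 2, 2
dims = (range(dp_size), range(tp_size), range(ep_size))
for rank, (dp, tp, ep) in enumerate(itertools.product(*dims)):
    print(f"dp{dp} tp{tp} ep{ep} -> {rank}")  # itertools.product varies ep fastest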
With TED, their example:
tp=2 dp=2 dep=1 ep=2
dp=non-expert data parallelism
dep=expert data parallelism
dp0 tp0, dep0 ep0 tp0 0
dp0 tp1, dep0 ep0 tp1 1
dp1 tp0, dep0 ep1 tp0 2
dp1 tp1, dep0 ep1 tp1 3
With just dep=1, each expert has only a single replica, so there are no expert-gradient reductions across the network... so how does the router stay in sync?! (Presumably because the router/gate weights count as non-expert parameters, which are still all-reduced over the dp groups.)
What if we scaled this up? (tp=2, ep=2, dep=2, world size 8)
dp0 tp0, dep0 ep0 tp0 0
dp0 tp1, dep0 ep0 tp1 1
dp1 tp0, dep0 ep1 tp0 2
dp1 tp1, dep0 ep1 tp1 3
dp2 tp0, dep1 ep0 tp0 4
dp2 tp1, dep1 ep0 tp1 5
dp3 tp0, dep1 ep1 tp0 6
dp3 tp1, dep1 ep1 tp1 7
And further (tp=2, ep=4, dep=2, world size 16):
dp0 tp0, dep0 ep0 tp0 0
dp0 tp1, dep0 ep0 tp1 1
dp1 tp0, dep0 ep1 tp0 2
dp1 tp1, dep0 ep1 tp1 3
dp2 tp0, dep0 ep2 tp0 4
dp2 tp1, dep0 ep2 tp1 5
dp3 tp0, dep0 ep3 tp0 6
dp3 tp1, dep0 ep3 tp1 7
dp4 tp0, dep1 ep0 tp0 8
dp4 tp1, dep1 ep0 tp1 9
dp5 tp0, dep1 ep1 tp0 10
dp5 tp1, dep1 ep1 tp1 11
dp6 tp0, dep1 ep2 tp0 12
dp6 tp1, dep1 ep2 tp1 13
dp7 tp0, dep1 ep3 tp0 14
dp7 tp1, dep1 ep3 tp1 15
if ep=1 (so dep=dp):
dp0 → dep0 ep0
dp1 → dep1 ep0
if ep=2 (so dep=1):
dp0 → dep0 ep0
dp1 → dep0 ep1
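The ep/dep split in code (toy sketch of my own, tp omitted; it reproduces the group lists in the DeepSpeed docstring quoted further below):
world, ep_size = 16, 2

# consecutive ranks form an expert-parallel group (all-to-all token exchange);
# ranks at the same position across those groups form an expert-data-parallel
# group (all-reduce on expert grads)
ep_groups  = [list(range(i, i + ep_size)) for i in range(0, world, ep_size)]
edp_groups = [list(range(i, world, ep_size)) for i in range(ep_size)]
# ep_groups:  [0,1], [2,3], ..., [14,15]
# edp_groups: [0,2,4,...,14], [1,3,5,...,15]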
Authoritative comment from Megatron:
Let's say we have a total of 16 GPUs denoted by g0 ... g15 and we
use 2 GPUs to parallelize the model tensor, and 4 GPUs to parallelize
the model pipeline. The present function will
create 8 tensor model-parallel groups, 4 pipeline model-parallel groups
and 8 data-parallel groups as:
8 data_parallel groups:
[g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]
8 tensor model-parallel groups:
[g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]
4 pipeline model-parallel groups:
[g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]
Also this from the DS docs:
Given a total number of GPUs in our world size and a subset of GPUs in our expert-parallel world as follows.
WORLD_SIZE = 4
EP_WORLD_SIZE = 2
EXPERTS = [8]
The model code needs to use the deepspeed.moe.layer.MoE API as follows.
self.experts = deepspeed.moe.layer.MoE(hidden_size=input_dim, expert=ExpertModule(), num_experts=EXPERTS, ep_size=EP_WORLD_SIZE)
With the above two commands, the DeepSpeed runtime will be set to train an MoE model with a total of 8 experts on 4 GPUs in 4 experts/GPU mode. We call this the E + D mode as described earlier in the table.
…
[There's also this snippet, which just means that the expert-parallel size never exceeds the number of experts, i.e. expert parallelism parallelizes across different experts, never within a single expert!]
expert_parallel_size = min(world_size, args.num_experts)
Also this on MoE:
Example - E + D parallel
world_size = 16
expert_parallel_size = 2 # number of experts in same group
expert_data_parallel_group = [0,2,4,6,8,10,12,14], [1,3,5,7,9,11,13,15] - all reduce is only on MoE params
expert_parallel_group = [0, 1], [2,3], [4,5], [6,7], [8,9] - no all reduce, but all to all
data_parallel_group = [0,1,...,15] - all reduce is only on non-MoE
use_data_before_expert_parallel_ (bool): Use the D + E instead of E + D topology
Offload, ZeRO Offload