See also
General resources
DDP: simple, but requires a full copy of the model on every rank
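A minimal PyTorch DDP sketch (illustrative, not from the linked resources): every rank holds a full replica, each reads its own shard of data, and gradients are all-reduced during backward. The model, sizes, and loss are placeholders; assumes a multi-GPU launch via torchrun.

```python
# Launch with e.g. `torchrun --nproc_per_node=8 ddp_toy.py`.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    model = torch.nn.Linear(1024, 1024).cuda()   # every rank holds the full model
    model = DDP(model)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for _ in range(10):
        x = torch.randn(32, 1024, device="cuda") # each rank reads a different data shard
        loss = model(x).square().mean()
        loss.backward()                          # gradients are all-reduced here, in buckets
        opt.step()
        opt.zero_grad()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```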
Model parallelism: ambiguous term, refers to either pipeline or tensor parallelism, usually tensor
Thinking about efficiency
Pipeline parallelism: inter-layer
Tensor parallelism: intra-layer, chatty (both splits sketched below, after the 3D parallelism item)
Piece by piece (source)
3D parallelism: DP, MP, PP combined
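A single-process toy contrasting the two splits above (no real communication; layer sizes and the two-way split are made up): pipeline parallelism assigns whole layers to stages, while tensor parallelism shards one layer's weight matrix and stitches the partial outputs back together.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)
layers = [torch.nn.Linear(8, 8) for _ in range(4)]

# Pipeline parallelism: inter-layer split. Each stage owns whole layers and
# activations are sent from stage to stage (here just a local loop).
stages = [layers[:2], layers[2:]]
h = x
for stage in stages:
    for layer in stage:
        h = layer(h)
pipeline_out = h

# Tensor parallelism: intra-layer split of a single Linear(8, 8) across 2 "ranks".
# Each rank owns half the output rows of the weight; the partial outputs are
# concatenated, which on real hardware is an all-gather (hence "chatty").
w, b = layers[0].weight, layers[0].bias
w_shards, b_shards = w.chunk(2, dim=0), b.chunk(2, dim=0)
partials = [x @ ws.T + bs for ws, bs in zip(w_shards, b_shards)]
tp_out = torch.cat(partials, dim=-1)
assert torch.allclose(tp_out, layers[0](x), atol=1e-6)
```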
Maybe helpful:
ZeRO/FSDP: partitions optimizer state, gradients, and weights (FSDP sketch after the stage list below)
This is a form of data parallelism
Partitioning weights may interfere with MP/PP?
Animations (source)
Stage 1: optimizer state only
Stage 2: + gradients
Stage 3: + weights, very chatty?
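A minimal PyTorch FSDP sketch, assuming a torchrun launch; FSDP's default full sharding corresponds roughly to ZeRO stage 3 (parameters, gradients, and optimizer state all partitioned, with parameters gathered on the fly per module). The model and wrapping choices here are placeholders, not a tuned setup.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
    ).cuda()
    model = FSDP(model)                          # shards params/grads/optimizer state
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    x = torch.randn(8, 1024, device="cuda")
    model(x).sum().backward()                    # grads reduce-scattered instead of all-reduced
    opt.step()
    opt.zero_grad()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```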
Sequence parallelism aka context parallelism (e.g. in DeepSpeed Ulysses): partition the input sequence. (paper)
Doesn’t partition any model memory, so it’s closer to data parallelism (pure compute parallelism, chopping up inputs and activations), but with tensor-parallelism-like communication for attention (since all-to-all communication is needed)
Ring self-attention algorithm to compute attention by passing keys, then values, around a ring (toy simulation after the Ring attention quotes below)
Actually overcomputes (explainer); see the red (not-needed) parts in the figure below
Ring attention (excerpts from the paper contrasting it with prior ring self-attention work):
The use of a ring topology for computing self-attention has also been studied in prior work [21] but it incurs non-overlapped communication overheads similar to sequence parallelism, making it infeasible for large context sizes …
Prior work has also proposed leveraging a ring topology to compute self-attention [21], aiming to reduce communication costs. Our work differs by utilizing blockwise parallel transformers to substantially reduce memory costs. As we show in the next section, this enables zero-overhead scaling of context size during both training and inference and allows arbitrarily large context size. …
Prior work extends sequence parallelism for computing self-attention using a ring topology [21], which reduces the communication cost compared to standard sequence parallelism. However, overlapping communication with computation remains challenging due to the constraints of arithmetic intensity. The communication overheads render this approach infeasible for training and inference in large-context scenarios. Our work leverages blockwise parallel transformers to distribute blockwise attention and feedforward across devices and concurrently overlaps the communication of key-value blocks in a ring of hosts with the computation of query-key-value blocks and feedforward, reducing memory cost substantially and allowing device count times larger context size with zero overheads.
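A single-process simulation of the ring pass of key/value blocks described above (not the paper's code): each "rank" keeps its query block, KV blocks rotate one hop per step, and partial results are merged with an online softmax. Ring Attention additionally overlaps each hop's send/recv with blockwise compute; this sketch only shows the math (non-causal; function name and sizes are made up).

```python
import torch

def ring_attention_sim(q, k, v, n_ranks):
    """Rank r owns query block r; key/value blocks rotate around the ring."""
    qs, ks, vs = q.chunk(n_ranks), k.chunk(n_ranks), v.chunk(n_ranks)
    scale = q.shape[-1] ** 0.5
    outs = []
    for r in range(n_ranks):
        qi = qs[r]
        m = torch.full((qi.shape[0], 1), -float("inf"))   # running max of scores
        num = torch.zeros_like(qi)                        # running softmax numerator @ V
        den = torch.zeros(qi.shape[0], 1)                 # running softmax denominator
        for step in range(n_ranks):
            j = (r + step) % n_ranks                      # KV block held by rank r this step
            s = qi @ ks[j].T / scale                      # (blk_q, blk_k) attention scores
            m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
            p = torch.exp(s - m_new)
            rescale = torch.exp(m - m_new)
            num = num * rescale + p @ vs[j]
            den = den * rescale + p.sum(dim=-1, keepdim=True)
            m = m_new
        outs.append(num / den)
    return torch.cat(outs)

# Matches plain (non-causal) attention over the full sequence:
q, k, v = (torch.randn(16, 8) for _ in range(3))
ref = torch.softmax(q @ k.T / 8 ** 0.5, dim=-1) @ v
assert torch.allclose(ring_attention_sim(q, k, v, n_ranks=4), ref, atol=1e-5)
```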
MoE: mixture of experts (foundations); toy router sketched below
More on MoE routing schemes
MoE expert parallelism
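A toy top-2 router to make the routing idea concrete (names and sizes are made up; this is not any particular MoE implementation). With expert parallelism, the experts would live on different ranks and tokens would be exchanged via all-to-all instead of the local loop here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    """Token-level top-k routing: each token goes to its k highest-scoring experts,
    whose outputs are combined with the renormalized gate probabilities."""
    def __init__(self, d_model=64, n_experts=4, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)        # (tokens, n_experts)
        top_p, top_i = probs.topk(self.k, dim=-1)      # routing decision per token
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):      # with expert parallelism, each
            for slot in range(self.k):                 # expert lives on its own rank and
                mask = top_i[:, slot] == e             # tokens move via all-to-all instead
                if mask.any():
                    out[mask] += top_p[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = ToyMoE()
y = moe(torch.randn(10, 64))                           # (10, 64)
```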
How parallel groups work, worked out
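A sketch of how ranks could be carved into TP/DP/PP groups for a (pp, dp, tp) grid, assuming the common Megatron-style layout with tensor-parallel ranks fastest-varying; the sizes are arbitrary, and in a real job each inner list would become a torch.distributed group via dist.new_group(ranks).

```python
def parallel_groups(pp=2, dp=2, tp=4):
    def rank(p, d, t):
        return (p * dp + d) * tp + t                   # global rank of grid position (p, d, t)
    tp_groups = [[rank(p, d, t) for t in range(tp)] for p in range(pp) for d in range(dp)]
    dp_groups = [[rank(p, d, t) for d in range(dp)] for p in range(pp) for t in range(tp)]
    pp_groups = [[rank(p, d, t) for p in range(pp)] for d in range(dp) for t in range(tp)]
    return tp_groups, dp_groups, pp_groups

tp_g, dp_g, pp_g = parallel_groups()
print("TP groups:", tp_g)   # [[0, 1, 2, 3], [4, 5, 6, 7], ...] -- shard each layer together
print("DP groups:", dp_g)   # ranks holding the same shard of the same stage; grads all-reduced here
print("PP groups:", pp_g)   # ranks forming one pipeline; activations sent stage to stage
```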
Offload, ZeRO-Offload
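A conceptual sketch of optimizer-state offload (not ZeRO-Offload's actual implementation, which uses pinned memory, overlapped transfers, and a CPU-optimized Adam): parameters and gradients stay on the GPU while the Adam step and its state live on CPU copies, trading PCIe traffic for GPU memory. Assumes a CUDA device; model and sizes are placeholders.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()                # params and grads stay on GPU
# Optimizer works on CPU copies of the parameters, so Adam's moments live in host RAM.
cpu_params = [p.detach().cpu().requires_grad_() for p in model.parameters()]
opt = torch.optim.Adam(cpu_params, lr=1e-3)

def offloaded_step():
    for gpu_p, cpu_p in zip(model.parameters(), cpu_params):
        cpu_p.grad = gpu_p.grad.detach().to("cpu")        # grads GPU -> CPU
        gpu_p.grad = None                                 # free GPU gradient memory
    opt.step()                                            # Adam update + state stay on CPU
    with torch.no_grad():
        for gpu_p, cpu_p in zip(model.parameters(), cpu_params):
            gpu_p.copy_(cpu_p.to(gpu_p.device))           # updated weights CPU -> GPU

x = torch.randn(8, 1024, device="cuda")
model(x).sum().backward()
offloaded_step()
```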